SRE Lead

Chubb Insurance Hong Kong

Malaysia

On-site

MYR 90,000 - 120,000

Full time

2 days ago

Job summary

A global insurance firm is seeking a Senior Site Reliability Engineer in Malaysia to enhance the reliability and performance of their systems. The role involves developing automation tools, monitoring system health, and collaborating closely with development and operations teams. Candidates should have strong skills in Linux systems, programming, and experience with cloud platforms and monitoring tools. This full-time position offers a dynamic work environment focusing on scalability and resilience.

Qualifications

  • Strong knowledge of Linux/Unix systems and networking.
  • Proficiency in programming and automation languages such as Python, Ansible, PowerShell, .NET, and Java.
  • Experience with cloud platforms (e.g., Azure, AWS).
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Expertise in monitoring and observability tools (e.g., AppDynamics, Dynatrace, Grafana, ELK stack).
  • Understanding of CI/CD pipelines and automation frameworks.
  • Ability to work under pressure and handle critical incidents effectively.

Responsibilities

  • Design, build, and maintain scalable and reliable systems.
  • Develop and maintain automation tools for deployment and monitoring.
  • Set up and maintain monitoring tools to track system health and performance.
  • Work closely with development teams to ensure reliability in design.
  • Analyze system usage and plan for future capacity needs.

Skills

Linux/Unix systems
Python
Ansible
PowerShell
.NET
Java
Cloud platforms
Containerization
Kubernetes
Monitoring tools
CI/CD pipelines
Problem-solving
Communication skills
Distributed systems
Database systems
Incident management frameworks
Certifications in cloud technologies

Tools

AppDynamics
Dynatrace
Grafana
ELK stack

Job description

Our Platforms Team is at the forefront of innovation, creating technology solutions that empower multiple business lines across the organization. We are looking for a senior SRE to support our applications deployed across the globe.

As an SRE practitioner, you will work to improve the reliability, availability, and performance of systems and services. You will collaborate with development and operations teams to design, implement, and maintain scalable and resilient infrastructure. Your role will involve automating processes, monitoring systems, and responding to incidents to ensure seamless user experiences.

Key Responsibilities
System Reliability and Performance
  • Design, build, and maintain scalable and reliable systems.
  • Monitor system performance and proactively address bottlenecks or issues.
  • Implement strategies to increase system uptime.
Automation and Tooling
  • Develop and maintain automation tools for deployment, monitoring, and incident response.
  • Create scripts and workflows to reduce manual intervention and improve efficiency.
  • Respond to system outages and incidents, performing root cause analysis and implementing fixes.
  • Develop and maintain runbooks and documentation for incident response.
Monitoring and Observability
  • Set up and maintain monitoring tools to track system health and performance.
  • Define and measure Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
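As a rough illustration of what defining and measuring an SLI against an SLO can involve, the sketch below computes an availability SLI and the share of error budget left. The function names and the request counts are hypothetical and are not taken from the employer's tooling; it assumes a simple request-based availability measure.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests served successfully (the SLI)."""
    if total_requests == 0:
        # No traffic in the window: treat the service as fully available.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Return the unspent share of the error budget (negative if the SLO is breached)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # failure rate the SLO tolerates
    actual_failure = 1.0 - sli          # failure rate actually observed
    return 1.0 - actual_failure / allowed_failure


# Hypothetical window: 1,000,000 requests, 500 failures, 99.9% SLO.
sli = availability_sli(999_500, 1_000_000)
budget = error_budget_remaining(sli, 0.999)
print(f"SLI={sli:.4%}, error budget remaining={budget:.1%}")
```

A team would typically track the remaining budget over a rolling window and slow down risky changes as it approaches zero.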
Collaboration and Communication
  • Work closely with development teams to ensure systems are designed with reliability in mind.
  • Collaborate with operations teams to improve deployment processes and system management.
Capacity Planning and Scaling
  • Analyze system usage and plan for future capacity needs.
  • Implement solutions to handle traffic spikes and ensure scalability.
  • Identify areas for improvement in system architecture and processes.
  • Advocate for best practices in reliability engineering and DevOps.
Qualifications
  • Strong knowledge of Linux/Unix systems and networking.
  • Proficiency in programming and automation languages such as Python, Ansible, PowerShell, .NET, and Java.
  • Experience with cloud platforms (e.g., Azure, AWS).
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Expertise in monitoring and observability tools (e.g., AppDynamics, App Insights, Dynatrace, Grafana, ELK stack).
  • Understanding of CI/CD pipelines and automation frameworks.
  • Problem‑solving skills and ability to perform root cause analysis.
  • Excellent communication and collaboration skills.
  • Experience with distributed systems and microservices architecture.
  • Knowledge of database systems (SQL and NoSQL).
  • Familiarity with incident management frameworks (e.g., ITIL, SRE best practices).
  • Certifications in cloud technologies or DevOps tools.
  • Analytical mindset with a focus on reliability and scalability.
  • Passion for automation and reducing manual work.
  • Ability to work under pressure and handle critical incidents effectively.
  • Commitment to continuous learning and staying updated on industry trends.
Job Info
  • Job Identification: 26526
  • Job Schedule: Full time
  • Regular or Temporary: Regular
  • Job Category: Infrastructure Engineering
  • Business Unit: Malaysia
  • Legal Employer: Chubb Business Services Malaysia Sdn Bhd