SRE Lead

Chubb Insurance Hong Kong

Malaysia

On-site

MYR 90,000 - 120,000

Full time

2 days ago

Job summary

A global insurance firm is seeking a Senior Site Reliability Engineer in Malaysia to enhance the reliability and performance of their systems. The role involves developing automation tools, monitoring system health, and collaborating closely with development and operations teams. Candidates should have strong skills in Linux systems, programming, and experience with cloud platforms and monitoring tools. This full-time position offers a dynamic work environment focusing on scalability and resilience.

Qualifications

Strong knowledge of Linux/Unix systems and networking.
Proficiency in programming and automation tools such as Python, Ansible, PowerShell, .NET, and Java.
Experience with cloud platforms (e.g., Azure, AWS).
Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
Expertise in monitoring and observability tools (e.g., AppDynamics, Dynatrace, Grafana, ELK stack).
Understanding of CI/CD pipelines and automation frameworks.
Ability to work under pressure and handle critical incidents effectively.

Responsibilities

Design, build, and maintain scalable and reliable systems.
Develop and maintain automation tools for deployment and monitoring.
Set up and maintain monitoring tools to track system health and performance.
Work closely with development teams to ensure reliability in design.
Analyze system usage and plan for future capacity needs.

Skills

Linux/Unix systems

Python

Ansible

PowerShell

.NET

Java

Cloud platforms

Containerization

Kubernetes

Monitoring tools

CI/CD pipelines

Problem-solving

Communication skills

Distributed systems

Database systems

Incident management frameworks

Certifications in cloud technologies

Tools

AppDynamics

Dynatrace

Grafana

ELK stack

Our Platforms Team is at the forefront of innovation, creating technology solutions that empower multiple business lines across the organization. We are looking for a senior SRE to support our applications deployed across the globe.

As an SRE practitioner, you will work to improve the reliability, availability, and performance of systems and services. You will collaborate with development and operations teams to design, implement, and maintain scalable and resilient infrastructure. Your role will involve automating processes, monitoring systems, and responding to incidents to ensure seamless user experiences.

Key Responsibilities

System Reliability and Performance

Design, build, and maintain scalable and reliable systems.
Monitor system performance and proactively address bottlenecks or issues.
Implement strategies to improve system uptime and reduce downtime.

Automation and Tooling

Develop and maintain automation tools for deployment, monitoring, and incident response.
Create scripts and workflows to reduce manual intervention and improve efficiency.

Incident Management

Respond to system outages and incidents, performing root cause analysis and implementing fixes.
Develop and maintain runbooks and documentation for incident response.

Monitoring and Observability

Set up and maintain monitoring tools to track system health and performance.
Define and measure Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

Collaboration and Communication

Work closely with development teams to ensure systems are designed with reliability in mind.
Collaborate with operations teams to improve deployment processes and system management.

Capacity Planning and Scaling

Analyze system usage and plan for future capacity needs.
Implement solutions to handle traffic spikes and ensure scalability.

Continuous Improvement

Identify areas for improvement in system architecture and processes.
Advocate for best practices in reliability engineering and DevOps.

Qualifications

Strong knowledge of Linux/Unix systems and networking.
Proficiency in programming and automation tools such as Python, Ansible, PowerShell, .NET, and Java.
Experience with cloud platforms (e.g., Azure, AWS).
Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
Expertise in monitoring and observability tools (e.g., AppDynamics, App Insights, Dynatrace, Grafana, ELK stack).
Understanding of CI/CD pipelines and automation frameworks.
Problem‑solving skills and ability to perform root cause analysis.
Excellent communication and collaboration skills.
Experience with distributed systems and microservices architecture.
Knowledge of database systems (SQL and NoSQL).
Familiarity with incident management frameworks (e.g., ITIL, SRE best practices).
Certifications in cloud technologies or DevOps tools.
Analytical mindset with a focus on reliability and scalability.
Passion for automation and reducing manual work.
Ability to work under pressure and handle critical incidents effectively.
Commitment to continuous learning and staying updated on industry trends.

Job Info

Job Identification: 26526
Job Schedule: Full time
Regular or Temporary: Regular
Job Category: Infrastructure Engineering
Business Unit: Malaysia
Legal Employer: Chubb Business Services Malaysia Sdn Bhd
