Site Reliability Engineer - HM: Mukesh

NTT Data Singapore

Singapore

On-site

SGD 80,000 - 120,000

Full time

Today

Job summary

A prominent tech company in Singapore is looking for a Site Reliability Engineer to enhance the reliability and performance of their services. The ideal candidate will work closely with engineering teams to improve production services, drive observability initiatives, and implement best practices in SRE and Chaos Engineering. Requirements include familiarity with ITIL processes, leadership skills, scripting knowledge in Bash/Python, and a background in managing vendor and project relationships. This role offers a crucial opportunity to impact overall service health.

Qualifications

Good understanding of ITIL & SRE processes.
Leadership in working with application teams.
Ability to establish deployment standards.
Strong project management skills.
Agile and AWS certifications preferred.
Ability to create scripts for infra deployment.
Familiarity with SRE & Chaos Engineering principles.

Responsibilities

Improve availability, reliability, and performance of services.
Drive observability for applications.
Reduce operational toil.
Set up SLIs, SLOs, and error budgets.
Deploy SRE enablers/initiatives.

Skills

ITIL understanding

Leadership skills

People management

Vendor management

Project management

Agile methodology

Bash scripting

Python scripting

Interpersonal skills

Communication skills

Tools

AWS

About the job Site Reliability Engineer - HM: Mukesh

As a Site Reliability Engineer you will be filling a mission-critical role ensuring that our systems are healthy, monitored, automated, fault tolerant and designed to scale.

You will collaborate and work closely with engineering teams to continually improve our production services, facilitating fast delivery of new products, and reducing downtime.

Key Responsibilities:

Drive the Site Reliability Engineering agenda to improve the availability, reliability, and performance of services.
Drive observability for our applications.
Drive optimise-operate initiatives, for example, reducing operational toil.
Work with application teams to set up SLIs, SLOs, and error budgets for their applications.
Work with the enterprise team to deploy SRE enablers and initiatives.

Requirements:

Have a good understanding of ITIL and SRE processes and practices.
Have good leadership skills in working with application teams and service providers to define infrastructure deployment plans, cutover/migration strategies, and test plans.
Able to formulate and establish infrastructure deployment standards.
Good people management, vendor management, and project management skills.
Agile and AWS certifications preferred.
Able to create Bash/Python scripts for infra deployment.
Must be able to practice SRE and Chaos Engineering principles.
Understands key SRE concepts such as toil, SLIs, SLOs, error budgets, MTTD, MTTR, etc.
Strong, committed, and reliable team player, able to take direction but also willing to contribute to discussions on design and strategy.
Possess strong interpersonal and communication skills to build good working relationships with other technology teams through day-to-day support and project work.
