Site Reliability Engineer - HM: Mukesh

NTT Data Singapore

Singapore

On-site

SGD 80,000 - 120,000

Full time

Today

Job summary

A prominent tech company in Singapore is looking for a Site Reliability Engineer to enhance the reliability and performance of their services. The ideal candidate will work closely with engineering teams to improve production services, drive observability initiatives, and implement best practices in SRE and Chaos Engineering. Requirements include familiarity with ITIL processes, leadership skills, scripting knowledge in Bash/Python, and a background in managing vendor and project relationships. This role offers a crucial opportunity to impact overall service health.

Qualifications

  • Good understanding of ITIL & SRE processes.
  • Leadership in working with application teams.
  • Ability to establish deployment standards.
  • Strong project management skills.
  • Agile and AWS certifications preferred.
  • Ability to create scripts for infra deployment.
  • Familiarity with SRE & Chaos Engineering principles.

Responsibilities

  • Improve availability, reliability, and performance of services.
  • Drive observability for applications.
  • Reduce operational toil.
  • Set up SLIs, SLOs, and error budgets.
  • Deploy SRE enablers/initiatives.

Skills

ITIL understanding
Leadership skills
People management
Vendor management
Project management
Agile methodology
Bash scripting
Python scripting
Interpersonal skills
Communication skills

Tools

AWS

Job description

About the job: Site Reliability Engineer - HM: Mukesh

As a Site Reliability Engineer, you will fill a mission-critical role, ensuring that our systems are healthy, monitored, automated, fault-tolerant, and designed to scale.

You will work closely with engineering teams to continually improve our production services, enable fast delivery of new products, and reduce downtime.

Key Responsibilities:

  • Drive the Site Reliability Engineering agenda to improve the availability, reliability, and performance of services.
  • Drive observability for our applications.
  • Drive optimise-operate initiatives, for example, reducing operational toil.
  • Work with application teams to set up SLIs, SLOs, and error budgets for their applications.
  • Work with enterprise team in deploying SRE enablers/initiatives.
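
As an illustration of the SLI, SLO, and error budget work mentioned above, here is a minimal Python sketch of a request-based error budget. The `ErrorBudget` class, the 99.9% target, and the request counts are all hypothetical and are not taken from this posting; real budgets are usually computed from monitoring data over a rolling window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget over one measurement window (illustrative)."""

    slo_target: float     # assumed target, e.g. 0.999 = 99.9% of requests succeed
    total_requests: int   # requests served in the window
    failed_requests: int  # requests counted as "bad" by the SLI definition

    @property
    def sli(self) -> float:
        """Observed success ratio; an empty window is treated as fully healthy."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates for this request volume."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once it is overspent."""
        if self.allowed_failures == 0:
            return 1.0 if self.failed_requests == 0 else float("-inf")
        return 1 - self.failed_requests / self.allowed_failures


# Hypothetical numbers for a single window.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"SLI: {budget.sli:.5f}")
print(f"Budget remaining: {budget.remaining_fraction:.0%}")
```

A team might slow down risky releases when the remaining fraction gets close to zero; the exact policy is a choice each application team makes.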

Requirements:

  • Have a good understanding of ITIL and SRE processes and practices.
  • Have good leadership skills in working with application teams and service providers to define infrastructure deployment plans, cutover/migration strategies, and test plans.
  • Able to formulate and establish infrastructure deployment standards.
  • Good people management, vendor management, and project management skills.
  • Agile and AWS certifications preferred.
  • Able to create Bash/Python scripts for infrastructure deployment.
  • Must be able to practice SRE and Chaos Engineering principles.
  • Understands key SRE concepts such as Toil, SLI, SLO, Error Budgets, MTTD, MTTR, etc.
  • Strong, committed, and reliable team player, able to take direction but also willing to contribute to discussions on design and strategy.
  • Possess strong interpersonal and communication skills, with the ability to build good relationships with other technology teams through day-to-day support and project work.
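
For readers less familiar with MTTD and MTTR from the concepts listed above, the following Python sketch shows one common way to compute them from incident timestamps. The `Incident` record and the sample times are hypothetical. It assumes MTTD is measured from incident start to detection and MTTR from incident start to resolution; some teams measure MTTR from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recover: average of (resolved - started), by assumption."""
    seconds = [(i.resolved - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


# Hypothetical incident log.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 5),
             datetime(2024, 1, 5, 10, 35)),
    Incident(datetime(2024, 1, 9, 12, 0), datetime(2024, 1, 9, 12, 15),
             datetime(2024, 1, 9, 13, 25)),
]
print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```

Both functions raise `statistics.StatisticsError` on an empty list, which is a reasonable signal that there is nothing to average.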