Site Reliability Engineer (SRE) - Contract

NTT

Singapore

On-site

SGD 60,000 - 90,000

Full time

15 days ago

Job summary

A global technology service provider in Singapore seeks a professional to maintain and optimize application monitoring infrastructure, support migration to the latest OpenShift versions, and implement alerting systems for applications. The ideal candidate will have strong skills in Kubernetes, Prometheus, and multi-environment CI/CD pipelines. This role offers a dynamic development environment focused on observability culture and troubleshooting technologies.

Qualifications

  • Experience with open source-based application monitoring infrastructure.
  • Knowledge of application instrumentation libraries and frameworks.
  • Ability to optimize and troubleshoot Kubernetes/OpenShift platforms.

Responsibilities

  • Maintain application monitoring infrastructure and optimize solutions.
  • Support migration to the latest OpenShift versions.
  • Implement alerting infrastructure and define alert rules.

Skills

Development knowledge of Bash scripting
Java
Python
React
Angular
Elasticsearch
Prometheus

Tools

Kubernetes
OpenShift
ELK stack
Grafana
OpenTelemetry (OTel)
Jaeger
Zipkin

Must-have skills

  • Development knowledge of Bash scripting, Java, Python, React or Angular.
  • Working experience with Elasticsearch and Prometheus.

Job description
  • Maintain open source-based application monitoring infrastructure. Enhance, optimize, and migrate to new solutions if required.
  • Support application teams in migrating to the latest OpenShift versions, deploy stateful and stateless apps, and troubleshoot issues on Kubernetes/OpenShift platforms.
  • Work with application developers to implement application instrumentation libraries and frameworks.
  • Maintain the metrics data store using Prometheus. Perform administration and tuning, such as cardinality optimization and resource optimization.
  • Maintain distributed tracing infrastructure such as OpenTelemetry (OTel), Jaeger, and Zipkin. Perform administrative functions and tuning, such as sampling strategy. Troubleshoot distributed tracing in microservices.
  • Provide production support for enterprise logging platforms such as the ELK stack and the Grafana LGTM stack.
  • Implement alerting infrastructure and integrate it with PagerDuty, Microsoft Teams, and any other software that needs alert-based mitigation or action. Assist the application support team in defining alerting rules for enterprise business apps.
  • Deploy and administer visualization tools such as Grafana and Elastic. Create reusable dashboard templates and implement RBAC for the entire user base.
  • Educate the dev community and foster an observability culture. Assist developers in identifying golden signals, defining SLIs and SLOs for enterprise applications, and calculating error budgets, MTTD, and MTTR.
  • Troubleshoot infrastructure issues in the observability stack on Linux VMs and Kubernetes pods. Set up and secure reverse proxies, secure all application endpoints with TLS, and enable MFA, LDAPS, and OAuth as required.
  • Configure CI/CD pipelines for all monitoring infrastructure and services. Modify and extend existing pipelines to cater to multiple environments and regions.