Senior Site Reliability Engineer

T-Mobile

Hyderabad

On-site

INR 20,00,000 - 30,00,000

Full time

Today

Job summary

A leading telecommunications company in Hyderabad seeks an experienced Site Reliability Engineer to implement observability systems, optimize infrastructure, and mentor junior engineers. The ideal candidate will have strong experience in monitoring, CI/CD, and scripting, contributing to operational excellence in a dynamic environment.

Qualifications

  • 4-7 years in SRE, DevOps, platform, or operations engineering roles.
  • Strong hands-on experience in observability and monitoring.
  • Proficiency in scripting languages such as Python, Bash, or PowerShell.

Responsibilities

  • Implement and maintain observability and alerting systems.
  • Design and support telemetry pipelines and dashboards.
  • Improve CI/CD workflows and infrastructure automation.

Skills

Observability and monitoring
Scripting (Python, Bash)
CI/CD with GitLab
SQL and NoSQL systems
Kubernetes and container orchestration
Observability tools (Splunk, Grafana)

Education

Bachelor's degree in Computer Science or related field

Tools

GitLab
Splunk
Grafana
Prometheus
Docker

Job description

Responsibilities:

  • Implement and maintain observability, monitoring, and alerting systems for AI platforms and backend services.
  • Design and support telemetry pipelines, logging infrastructure, and dashboards (Splunk, Prometheus, Grafana, OpenTelemetry).
  • Define and monitor SLOs and SLIs covering latency, availability, and throughput.
  • Participate in on-call rotations, incident resolution, root cause analysis, and postmortems.
  • Improve CI/CD workflows and infrastructure automation using GitLab pipelines.
  • Optimize and scale infrastructure, including Kafka, RabbitMQ (RMQ), HAProxy, and distributed APIs.
  • Collaborate with engineering teams on governance, compliance, and secure operations.
  • Support capacity planning, cost analysis, and tuning for high-scale performance.
  • Automate repetitive tasks and reduce toil via scripting (Python, Bash, Java).
  • Contribute to runbooks, knowledge base articles, and SRE best practice documentation.
  • Mentor junior engineers and support a culture of operational excellence and reliability.

Requirements:

  • Bachelor's degree in Computer Science, Engineering, or a related technical field.
  • 4-7 years in SRE, DevOps, platform, or operations engineering roles.
  • Strong hands-on experience in observability, monitoring, and distributed systems troubleshooting.
  • Proficiency in scripting languages such as Python, Bash, or PowerShell.
  • CI/CD experience with GitLab and automation across deployment pipelines.
  • Solid understanding of SQL and NoSQL systems, including Oracle DB and MongoDB.
  • Familiarity with Kubernetes, container orchestration, and hybrid cloud (Azure, AWS, GCP, OCI).
  • Experience working in high-stakes, incident-driven environments.
  • Strong working knowledge of Splunk, Grafana, Prometheus, and other observability tools.
  • Understanding of AI/ML systems, inference APIs, and LLM infrastructure is a plus.
  • Experience in platform compliance, security enforcement, and regulated domains (finance preferred).

Must Have Skills:

  • Applications and microservices: Java, Spring Boot, and API and service design.
  • Any CI/CD tool: GitLab pipelines, test automation, GitHub Actions, Jenkins, or CircleCI.
  • App Platform: Docker and Containers (Kubernetes).