As part of the Site Reliability Engineering (SRE) team, you’ll contribute to designing, automating, and evolving mission-critical systems. You’ll combine deep systems expertise with modern software engineering practices to reduce operational toil and build resilient, self-healing services.
This is a high-impact role where your work directly affects the reliability of cloud services used by thousands of customers around the world.
Qualifications
Career Level - IC4
Responsibilities
What You’ll Do:
- Collaborate with SRE and development teams to ensure end-to-end reliability across a wide range of services and technology stacks.
- Design, write, and deploy software and automation tools that enhance availability, observability, and scalability.
- Own and evolve metrics, SLOs, SLAs, KPIs, and dashboards that track system health and customer experience.
- Build tooling to reduce manual operations and eliminate sources of toil.
- Improve CI/CD pipelines, deployment processes, and validation frameworks for reliability and efficiency.
- Review and influence architectural designs for distributed systems with a focus on resilience, performance, and fault tolerance.
- Lead and participate in post-incident reviews, capacity planning, and production-readiness assessments.
- Provide on-call support on a rotational basis (12-hour shifts, 7-day coverage).
What We’re Looking For:
- Advanced Linux systems administration
- Strong coding skills in Python (automation-focused)
- Intermediate experience with Bash/shell scripting
- Familiarity with networking principles and distributed systems behavior
- Basic to intermediate knowledge of relational (SQL) and NoSQL databases
- Understanding of unit testing and modern software engineering practices
- Experience with CI/CD pipelines and deployment automation
- Comfort working in Agile development environments
Nice to Have:
- Exposure to monitoring/observability tools (e.g., Prometheus, Grafana, New Relic)
- Experience building internal tools for operational efficiency
- Participation in SRE practices such as blameless postmortems, runbooks, and service design reviews