Overview
We are looking for an experienced Product Site Reliability Engineer (SRE) to help ensure the performance, scalability, and reliability of our customer-facing products and platforms. The SRE designs resilient systems, automates workflows, and builds observability into the product lifecycle to enable fast-paced innovation without compromising stability.
Responsibilities
System Reliability & Performance
- Ensure that availability, latency, scalability, and overall system health align with SLAs and SLOs.
- Continuously improve monitoring, alerting, and observability capabilities.
- Lead root cause analysis and conduct blameless postmortems.
- Develop and maintain incident response playbooks to reduce MTTD and MTTR.
Automation & Tooling
- Automate operational tasks to reduce manual work and improve efficiency.
- Build and maintain CI/CD pipelines and infrastructure as code (IaC) for seamless product delivery.
Collaboration with Product & Engineering
- Work closely with engineering teams to embed reliability into product design.
- Promote best practices such as chaos testing, capacity planning, and progressive deployment strategies (blue/green, canary releases).
- Define, measure, and track key reliability metrics (SLIs, SLOs, error budgets).
- Identify and implement infrastructure and architectural improvements to enhance system resilience.
Required Skills & Experience
Technical Skills
- Deep knowledge of cloud platforms (AWS, GCP, or Azure).
- Experience with containerization and orchestration (Docker, Kubernetes).
- Proficiency in Infrastructure as Code tools (Terraform, Ansible, or similar).
- Expertise in CI/CD tools (e.g., Jenkins, GitHub Actions, GitLab CI).
- Familiarity with observability and monitoring tools (Prometheus, Grafana, Datadog, New Relic).
- Strong scripting and programming skills (Python, Go, Bash, or similar).
- Understanding of distributed systems, networking, and database reliability (SQL/NoSQL).
Professional Skills
- 5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering.
- Strong analytical and problem-solving mindset.
- Excellent communication and collaboration skills across cross-functional teams.
- Demonstrated experience in incident management and conducting postmortems.
Job Function
- Engineering and Information Technology