We are seeking a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our systems. As a key member of the engineering team, you will be responsible for designing and implementing robust infrastructure, automating operations, and improving system resilience. You will collaborate closely with developers, DevOps, and security teams to create a highly available and fault-tolerant platform.
Responsibilities
- Ensure System Reliability & Performance: Monitor, troubleshoot, and optimize system performance across cloud and on-prem infrastructure.
- Automate Operations & Deployment: Develop CI/CD pipelines, infrastructure-as-code (IaC), and automated monitoring solutions.
- Incident Management & Troubleshooting: Respond to system incidents, conduct root cause analysis, and implement long-term fixes.
- Scalability & Capacity Planning: Design and implement solutions that scale with business growth and handle high traffic loads.
- Security & Compliance: Ensure systems follow security best practices, comply with regulatory requirements, and are protected against vulnerabilities.
- Observability & Monitoring: Implement logging, metrics, and alerting tools (e.g., Prometheus, Grafana, Datadog) to improve system visibility.
- Collaboration & Best Practices: Work with development teams to improve software reliability and establish best practices for high-availability systems.
Requirements
Technical Skills
- 3+ years of experience in Site Reliability Engineering, DevOps, or related fields.
- Strong expertise in cloud platforms, especially Google Cloud Platform (GCP), which we use.
- Proficiency in Kubernetes, Docker, and infrastructure-as-code (IaC) tools such as Terraform.
- Experience with CI/CD pipelines using Jenkins, GitHub Actions, Argo CD, or similar tools.
- Strong experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK, Datadog).
- Proficiency in scripting and automation (Python, Go, Bash, or similar).
- Experience with networking, load balancers, and security best practices.
Soft Skills
- Strong problem-solving and troubleshooting abilities.
- Excellent communication and collaboration skills.
- Ability to work in a fast-paced, high-availability environment.
- Experience leading reliability initiatives and mentoring junior engineers.
Preferred Qualifications
- Experience with service mesh (Istio, Linkerd).
- Familiarity with database reliability (PostgreSQL, MySQL, Redis, etc.).
- Previous experience in high-scale production environments (e.g., SaaS, fintech, e-commerce).