SRE Engineer (Azure)
We are looking for an Azure Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our cloud platforms. The SRE Engineer will architect, implement, and operate highly available systems with a strong emphasis on automation, observability, and security best practices.
The candidate will work closely with engineering and project teams to ensure our Azure services meet organizational objectives for performance, resilience, and cost-efficiency.
Responsibilities
- Demonstrate expertise in cloud reliability engineering, high-availability patterns, observability frameworks, and automation with a security-first mindset.
- Design, implement, and maintain SLOs, SLIs, monitoring dashboards, and automated alerting mechanisms across Azure services.
- Ensure reliability of mission-critical systems by implementing autoscaling, redundancy, failover, and resilient architectures.
- Develop automation using Terraform/Bicep, PowerShell, and Python to reduce operational toil and improve system reliability.
- Collaborate with engineering teams to support secure, reliable CI/CD pipelines and deployment processes.
- Conduct root cause analysis (RCA), implement corrective actions, and lead continuous improvement of reliability processes.
- Continuously monitor Azure resources and optimize performance, cost, and operational health based on best practices.
- Ensure all deployed workloads comply with cloud security baselines, network boundary controls, and governance frameworks (e.g., IM8, CIS, NIST).
- Improve infrastructure readiness through chaos engineering, failover tests, and resilience validation.
- Prepare operational runbooks, architecture documents, and technical guides for cloud reliability operations.
- Support Agile workflows and collaborate across teams to integrate operational excellence into the development lifecycle.
Qualifications & Work Experience
- Bachelor’s degree in Computer Science, Information Science, or an equivalent field.
- 4+ years of experience in a cloud reliability/SRE role with an emphasis on Azure.
- Strong understanding of Azure Monitor, Log Analytics, Application Insights, AKS, VNets, Load Balancers, and high-availability (HA) designs.
- Hands‑on experience with IaC tools such as Terraform, Bicep, or ARM templates.
- Strong scripting capabilities (PowerShell/Python).
- Experience with CI/CD pipelines (GitHub Actions, Azure DevOps).
- Solid understanding of cloud security controls, compliance frameworks, and incident management.
- Exceptional troubleshooting and problem‑solving skills.
Skills
- Incident & Problem Management
- Configuration & Change Management
- Observability and Reliability Engineering
- Strong communication & stakeholder engagement
- Ability to work effectively across technical teams