Job Description
Lead strategic initiatives to ensure the reliability, scalability, and performance of our cloud infrastructure and applications. This advanced role requires expertise in cloud technologies, strategic planning, and incident management to drive innovative solutions and operational excellence.
As a Cloud Site Reliability Engineer (CSRE), you will influence cloud reliability strategies, mentor junior engineers, and lead impactful projects. This position reports directly to the VP of Cloud Services and requires a proactive, collaborative approach to meet operational and strategic goals.
Responsibilities
- Lead the resolution of complex technical issues involving our client's products and Azure cloud environment.
- Design and implement operational enhancements to improve resiliency and system reliability.
- Conduct Root Cause Analysis (RCA) for high-severity incidents and lead initiatives to prevent recurrence.
- Represent the organization in external client escalation calls, providing guidance and solutions.
- Architect and optimize cloud infrastructure for performance, scalability, and cost-efficiency.
- Manage and scale container orchestration platforms such as AKS and OpenShift.
- Implement advanced monitoring solutions and integrate predictive analytics for proactive issue resolution.
- Develop automation strategies to streamline operations and incident responses.
- Maintain documentation of cloud architectures, operational processes, and incident response strategies.
- Mentor and coach junior engineers, fostering continuous learning and innovation.
- Drive strategic initiatives through collaboration with cross-functional teams.
Must Have
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 12+ years of experience in cloud support or operations.
- Expertise in Microsoft Azure or equivalent cloud platforms.
- Experience with container orchestration systems like AKS or OpenShift.
- Experience leading the management of automated deployment pipelines, including Azure DevOps.
- Proficiency with enterprise monitoring platforms (e.g., Azure Monitor Application Insights, Grafana) and predictive analytics tools.
- Advanced scripting skills in PowerShell, Python, or a similar language.
- Experience in incident management and defining SLAs for global environments.
- Knowledge of database management, especially PostgreSQL.
Nice to Have
- Advanced certifications in cloud platforms (e.g., Azure Solutions Architect Expert).
- Experience with ITSM tools like ServiceNow.
- Understanding of security and compliance in cloud environments.