Site Reliability Engineering Manager at General Motors
Job Description
As an SRE Manager, you will lead your team in setting priorities and ensuring alignment with organizational goals, while remaining deeply technical. Our managers are expected to contribute directly through coding, reviewing code, and mentoring engineers. Although hands-on technical work is not the primary focus of the role, the ability and willingness to engage in technical details, solve problems directly, and support your team's technical decisions are crucial. You will serve as a mentor, guide, and partner, helping engineers grow and ensuring the reliability and efficiency of their systems. We set a high standard for engineering managers who lead by example in both technical expertise and people leadership.
Required Experience:
- Automation and Reliability Improvements: Develop tools and software to automate operational processes, improve system reliability, and reduce manual intervention.
- Observability and Monitoring: Lead, implement, and improve monitoring and observability frameworks to enable proactive incident detection and resolution.
- Incident Response: Participate in an on-call rotation to diagnose, troubleshoot, and mitigate production incidents, ensuring minimal downtime and swift resolution.
- Collaboration with Development Teams: Work alongside developers to ensure the quality, scalability, and reliability of services, fostering a "You build it, you run it" culture.
- Service Level Management: Define and manage service level indicators (SLIs), objectives (SLOs), and agreements (SLAs) to set and track reliability expectations.
- Engineering for Reliability: Have a strong understanding of common application reliability patterns and experience implementing them.
- Failure Analysis and Post-Incident Reviews: Conduct deep-dive analyses of incidents, collaborate on reviews, and champion a culture of continuous improvement.
- Cost Efficiency: Evaluate system performance and advocate for optimizations that reduce infrastructure costs while maintaining reliability.
Skills and Qualifications:
- Programming Skills: Proficiency in at least one programming language (e.g., Python, Go, Java) and familiarity with multiple ecosystems.
- Systems Knowledge: Solid understanding of operating systems, networking, distributed systems, databases, and storage architectures.
- System Fundamentals: Deep understanding of how code runs on hardware, including OS, algorithms, and data structures, with the ability to troubleshoot and optimize code.
- Incident Management: Experience handling production incidents, root cause analysis, and complex system failures.
- Communication and Collaboration: Strong skills in explaining technical concepts to diverse stakeholders and fostering shared ownership.
- Automation Focus: Proven experience automating manual processes, building deployment pipelines, or managing configuration systems.
Preferred Experience:
- Experience with cloud platforms (AWS, GCP, Azure).
- Familiarity with container orchestration systems like Kubernetes.
- Experience managing or developing distributed systems.
- Prior experience with Java in production environments.
This role is remote, and the successful candidate may be based anywhere in the UK, without the need to report to a GM worksite unless directed. GM will provide immigration sponsorship for this role.