Site Reliability Engineering Manager

General Motors

United States

Remote

USD 120,000 - 160,000

Full time

3 days ago

Job summary

A leading company in the automotive sector is seeking an SRE Engineering Manager who will lead a team in enhancing system reliability and efficiency. This role requires a blend of technical expertise and people leadership, focusing on automation, incident response, and collaboration with development teams. The ideal candidate will have strong programming skills and a commitment to continuous improvement.

Qualifications

  • Proficiency in at least one programming language (e.g., Python, Go, Java).
  • Experience handling production incidents, including root cause analysis.

Responsibilities

  • Develop tools to automate operational processes and improve system reliability.
  • Lead and improve monitoring and observability frameworks.
  • Participate in on-call rotation to mitigate production incidents.

Skills

Programming Skills
Incident Management
Communication and Collaboration
Automation Focus

Job description

As an SRE Engineering Manager, you will be expected not only to lead your team in setting priorities and ensuring alignment with organizational goals but also to be deeply technical. We expect our managers to contribute directly through coding, reviewing code, and mentoring engineers. While it's unlikely that you'll spend the majority of your time coding, the capability and willingness to dive into technical details, solve problems hands-on, and support your team's technical decisions are crucial. You'll be a mentor, guide, and partner, helping engineers grow and ensuring the reliability and efficiency of the systems they work on. We believe in setting a high bar for engineering managers who can lead by example in both technical expertise and people leadership.

Required Experience:

  • Automation and Reliability Improvements: Develop tools and software to automate operational processes, improve system reliability, and reduce manual intervention.
  • Observability and Monitoring: Lead, implement, and improve monitoring and observability frameworks, enabling proactive detection and resolution of incidents.
  • Incident Response: Participate in an on-call rotation to diagnose, troubleshoot, and mitigate production incidents, ensuring minimal downtime and swift resolution.
  • Collaboration with Development Teams: Work alongside developers to ensure the quality, scalability, and reliability of our services. Practice shared ownership of services in production, fostering a "You build it, you run it" culture.
  • Service Level Management: Manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) so that reliability expectations are clear and met.
  • Engineering for Reliability: Strong understanding of common application reliability patterns, with hands-on experience implementing them.
  • Failure Analysis and Post-Incident Reviews: Conduct deep-dive analyses of incidents and collaborate on post-incident reviews to derive learnings and prevent recurrence. Champion a culture of continuous improvement.
  • Cost Efficiency: Evaluate system performance and advocate for optimizations that reduce infrastructure costs while maintaining service reliability.

Skills and Qualifications:

  • Programming Skills: Proficiency in at least one programming language (e.g., Python, Go, Java) and familiarity with multiple language ecosystems.
  • Systems Knowledge: Solid understanding of operating systems, networking, distributed systems, databases, and storage architectures.
  • Strong Understanding of System Fundamentals: Deep understanding of how code runs on underlying hardware, including operating systems, algorithms, and data structures. Ability to optimize or troubleshoot code by understanding its execution and the impact on system resources.
  • Incident Management: Experience handling production incidents, including root cause analysis, mitigation, and working through complex system failures.
  • Communication and Collaboration: Strong communication skills, with an ability to explain technical concepts to both engineering and business stakeholders. Commitment to collaborative problem-solving and shared ownership of services.
  • Automation Focus: Proven experience in automating manual processes, building deployment pipelines, or managing configuration systems.