
Software Engineering Manager, Site Reliability, Cloud Incident Response

Google

City Of London

On-site

GBP 70,000 - 90,000

Full time

Yesterday

Job summary

A leading tech company in London is seeking a Software Engineering Manager for Site Reliability to enhance the reliability and performance of Google Cloud services. The role demands a blend of software development and technical leadership experience, focusing on incident response and tooling. Candidates should have a strong background in software development, particularly in Python, and experience leading teams. This position offers an opportunity to manage complex scale challenges unique to cloud infrastructure.

Qualifications

  • 8 years of experience with software development, including experience in leadership roles.
  • Experience with cloud services, telemetry systems, and incident response.

Responsibilities

  • Participate in on-call rotation supporting Critical Incident Response for GCP.
  • Define and escalate risks in Cloud, and improve issue detection.

Skills

Software development in Python
Cloud services
Team leadership
Automation

Education

Bachelor's degree in a relevant field
Master's degree or PhD preferred
Job description
Minimum qualifications
  • Bachelor's degree or equivalent practical experience.
  • 8 years of experience with software development in one or more programming languages (e.g., Python, C, C++, Java, JavaScript).
  • 3 years of experience in a technical leadership role overseeing projects, with 2 years of experience in a people management or supervision/team leadership role.
  • Experience with cloud services, telemetry systems and incident response.
Preferred qualifications
  • Master's degree or PhD in Computer Science, or a related technical field.
  • Experience as a cloud customer.
About the job

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services (both our internally critical and our externally visible systems) have reliability and uptime appropriate to customers' needs, along with a fast rate of improvement. Additionally, SREs will keep an ever-watchful eye on our systems' capacity and performance.

Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you'll have the opportunity to manage the complex challenges of scale which are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.

The Cloud Incident Response Team supports the responders, tooling, and outcomes for Google Cloud Platform (GCP) major incidents. The team collaborates across GCP products, customer-facing teams, and a wide range of stakeholders. In this role, you will help coordinate, mitigate, or resolve issues across all of GCP.

Google Cloud accelerates every organization's ability to digitally transform its business and industry. We deliver enterprise-grade solutions that leverage Google's cutting-edge technology, and tools that help developers build more sustainably. Customers in more than 200 countries and territories turn to Google Cloud as their trusted partner to enable growth and solve their most critical business problems.

Responsibilities
  • Participate in on-call rotation supporting Critical Incident Response for GCP.
  • Focus on high-quality customer outcomes and continuous collaboration across GCP teams.
  • Create Incident Management at Google (IMAG) training and processes for the incident management lifecycle in partnership with Cloud SRE Tech Leads and the Cloud Support leadership team.
  • Build systems and tooling to support the team, enhance visibility, improve issue detection, and facilitate communication with customers, stakeholders, and other customer-facing teams.
  • Define and escalate risks in Cloud, and reduce incident probabilities with strategic and pragmatic approaches as needed.