An established industry player is seeking a Site Reliability Engineering Manager to oversee the technology platforms that power their website. This role involves ensuring high availability and performance standards while leading a team to migrate services to Google Cloud. The ideal candidate will have extensive experience in managing technical operations and a strong understanding of DevOps principles. You will drive continuous improvement initiatives and collaborate closely with engineering teams to enhance service reliability. If you thrive in high-pressure environments and are passionate about technology, this opportunity is perfect for you.
The Platform and Reliability Engineering Team are responsible for the technology platforms and services that underpin the Rightmove website, ensuring it is available, secure and performing to a world-class standard. We strive to deliver annual availability of at least 99.99% (less than 5 mins downtime a month).
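For a sense of what that target allows, the arithmetic can be sketched as below. This is an illustrative calculation only: it assumes a 30-day month and a 365-day year, and the function name `downtime_budget_minutes` is hypothetical rather than anything from Rightmove's tooling.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the downtime (in minutes) permitted by an availability target.

    availability_pct: target availability as a percentage, e.g. 99.99
    period_days: length of the measurement period in days (assumed value)
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


# Assumed period lengths: 30-day month, 365-day year.
monthly = downtime_budget_minutes(99.99, 30)
yearly = downtime_budget_minutes(99.99, 365)

print(f"Monthly budget: {monthly:.2f} minutes")  # 4.32
print(f"Yearly budget: {yearly:.2f} minutes")    # 52.56
```

Under these assumptions the budget comes to roughly 4.3 minutes a month, consistent with the "less than 5 mins" figure, or about 52.6 minutes across a year.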
The Site Reliability Engineering Manager’s focus is to ensure their teams maintain our datacentre and cloud website infrastructure, safely migrate services to Google Cloud, and enable others to easily manage the reliability of production services across the Rightmove Website Estate.
A typical week as the Site Reliability Engineering Manager might involve:
·Ensuring the right people, process and tooling are in place to maintain a healthy, resilient, and secure datacentre and cloud website platform.
·Creating and managing technical plans for the migration of applications and infrastructure to Google Cloud.
·Developing cloud engineering and operations skills within your teams.
·Working through the supplier due diligence process for support contract renewals to ensure key components are kept in support.
·Working with engineering managers, product owners, and engineers to optimise and improve service health.
·Identifying, planning and implementing improvements to the incident management process.
·Reducing handoffs or improving flow/lead times within development teams by providing operational/infrastructure support for the platform.
We’re looking for someone who:
·Has previous experience managing engineers who build and run website infrastructure and web services, as well as experience running website technical operations.
·Is highly operationally aware, understanding what it takes to maintain a healthy website infrastructure and services.
·Is an experienced manager who understands how to get the best out of their people and teams.
·Has excellent judgement and can instil this in engineers, leading them to the best outcomes on technical decisions and architecture whilst enabling their development.
·Is happy to dive deep into technical discussions with their team and can surface risks and issues relating to projects.
·Is able to keep calm and work effectively in high-pressure situations.
·Has experience migrating infrastructure and web services from datacentres to cloud.
·Has deep experience and understanding of DevOps and SRE principles and practices.
·Always pushes for continuous improvement and has strong attention to detail.
Relevant technology we use:
·F5, Juniper, Arbor
·VMware, HP 3PAR
·Google Cloud Platform
·Google Kubernetes Engine with Anthos Service Mesh
·Confluent Cloud
·Incident.io
·GitLab
·Jira, Confluence, Slack, Teams
·Elastic APM, Kibana
·Eggplant Monitoring, Xymon
·Java, Node, Python, JavaScript, Go