Job Overview
As Lead Site Reliability Engineer, you will lead the SRE team in applying software engineering principles and practices to infrastructure and operations problems, designing and implementing solutions that automate and improve the availability, scalability, and efficiency of our systems. You will also collaborate with other engineering teams, project teams, and other stakeholders to deliver high-quality products and services that meet our customers’ needs and expectations.
You’ll be exposed to unique challenges assisting with the maintenance and stability of Reach infrastructure and services, and have the opportunity to contribute to internal and external projects. We believe in choosing the right tools for the job, and support creativity in solving problems.
Key Focus Areas
You will primarily be responsible for:
- Team Management and Growth:
- Foster the professional development of the SRE team through mentorship, one-on-one sessions, and skill-building opportunities.
- Collaboration:
- Work closely with cross-functional teams, including development and operations, to implement best practices and foster a culture of collaboration and innovation.
- Infrastructure reliability and performance:
- Monitor, measure, and improve the reliability and performance of our systems.
- Identify and address bottlenecks, optimize system performance, and implement strategies for scaling infrastructure to meet growing demands.
- Perform maintenance, upgrades, and security updates.
- Automation and tooling:
- Design and develop software and scripts that automate and streamline infrastructure and operations tasks.
- Assist other teams with deploying and updating their applications and services.
- Administration:
- Administer our infrastructure accounts and critical services, and provide strategic oversight of our hosting infrastructure and vendor relationships. Own the hosting and billing lifecycle, from monitoring and analysis to implementing cost-optimisation strategies, to ensure financial efficiency and predictability across our platforms.
- Innovation:
- Research and evaluate new technologies and methodologies that can enhance our systems and processes, and build proofs of concept and prototypes to demonstrate their feasibility and value.
- Data Management and Security:
- Ensure data, security, and infrastructure policies and best practices are adhered to. Work with Legal and Projects teams to develop and enforce policies and procedures for data collection, storage, and access that comply with data privacy regulations. Implement and monitor security measures to protect sensitive health information, and manage data backups and disaster recovery.
As a lead in the Engineering department, you will contribute to key focus areas in the following ways:
Technical Leadership
- Lead on architectural design and actively manage technical risk/debt against project goals.
- Describe, analyse, and make a convincing case for major technical tradeoffs and decisions.
Thought Leadership
- Present at relevant conferences, webinars and other opportunities to showcase Reach.
Workflow
- Take ownership of team and technical documentation.
- Ensure your team adopts and follows process and workflow best practices.
Delivery
- Take responsibility for risks in your team’s work.
- Take initiative to identify problems and propose solutions to resolve them.
Strategy
- Provide input into the organisation's technology strategy.
- Play an active role in meeting engineering team KPIs.
Ways of Working
- Suggest and implement improvements to current ways of working.
- Promote consistent adoption of best practices (if you implement a change or improvement, help others do the same).
Communication (Internal)
- Share ideas, decisions and plans effectively within Engineering and across the organisation.
- Build relationships with other leads and heads across the organisation to support strong cross-functional teams.
Communication (External)
- Take ownership of external engagements and respond timeously to stakeholders.
Team Management
- Ensure you’re delegating the right things effectively, so that you can work at the right level.
- Identify and support opportunities for growth within your team.
Partnerships & Growth
- Lead on technical proposals and concept notes.
- Pursue opportunities for new partnerships and services.
People Operations
- Proactively highlight gaps and skills needed in Engineering.
- Draft job descriptions and drive recruitment for relevant roles.
Responsibilities and Duties
- Lead a team of Site Reliability Engineers and Information Security Officers, providing mentorship, guidance, and technical expertise.
- Establish and enforce SRE best practices to improve system reliability and operational efficiency.
- Collaborate with development teams to design, implement, and maintain scalable and reliable infrastructure.
- Develop and implement incident response plans, ensuring timely resolution of system outages and performance issues.
- Conduct performance reviews, set goals, and facilitate professional development for team members.
- Drive the implementation of automation tools, software, and processes that improve the infrastructure and operational efficiency of our systems, and ensure they follow best practices.
- Monitor system health, analyse trends, and implement proactive measures to prevent incidents.
- Advise on, and where appropriate contribute to, new or emerging technologies that may be relevant to Reach.
- Work closely with the Head of Engineering and other Engineering Leads to ensure alignment within the engineering department.
- Design and develop tools and software that automate and improve the infrastructure and operation of our systems, and ensure they follow best practices.
- Perform code reviews, testing, debugging, and troubleshooting of the software and tools developed by the SRE team, and assist other engineering teams with the same.
- Suggest and implement improvements to current ways of working and processes, including closing gaps in those processes, where relevant to the current and future success of the SRE team and Reach as a whole.
Qualifications
- An honours degree in Computer Science or Engineering, or equivalent experience.
- 8+ years of experience as a senior site reliability engineer, senior software engineer, or system administrator, working with large-scale, distributed, and cloud-based systems.
- 4+ years of experience as a team lead, manager, or mentor, leading and developing site reliability engineers or software engineers.
Skills and Experience Required
- Proficient in one or more programming languages, such as Python, Go, Java, or C++.
- Proficient in one or more scripting languages, such as Bash, Perl, or Ruby.
- Proficient in one or more cloud platforms, such as AWS, Azure, or GCP.
- Proficient in one or more UNIX-like operating systems.
- Proficient in one or more configuration management and deployment tools, such as Ansible, Chef, Puppet, or Terraform.
- Proficient in one or more monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
- Proficient in one or more container and orchestration tools, such as Docker or Kubernetes.
- Proficient in one or more web servers and proxies, such as Apache, Nginx, or Envoy.
- Proficient in one or more databases and data stores, such as MySQL, PostgreSQL, MongoDB, or Redis.
- Proficient in one or more version control and collaboration tools, such as Git.
- Knowledgeable in the concepts and principles of site reliability engineering, such as SLIs, SLOs, error budgets, incident management, postmortems, and blameless culture.
- Knowledgeable in the concepts and principles of software engineering, such as design patterns, code quality, testing, debugging, and documentation.
- Knowledgeable in the concepts and principles of performance engineering, such as profiling, benchmarking, load testing, and capacity planning.
- Knowledgeable in the concepts and principles of distributed computing, such as concurrency, parallelism, synchronisation, and consensus.
- Excellent communication and collaboration skills, and ability to work effectively in a cross-functional and remote team environment.
- Excellent problem-solving and analytical skills, and ability to troubleshoot and resolve complex issues in a timely and efficient manner.
- Excellent learning and innovation skills, and ability to research and evaluate new technologies and methodologies.