Site Reliability Engineer

Tribal Group

Remote

GBP 50,000 - 70,000

Full time

Today

Job summary

A leading EdTech business is looking for a Site Reliability Engineer to design, build, and operate large-scale systems with a focus on reliability and automation. You will maintain production systems, support deployments, enhance automation tools, and analyze system performance metrics. The ideal candidate has strong experience with AWS or Azure, Linux, Apache, and PHP, along with excellent communication skills. This is a full-time, fully remote, UK-based role, offering a chance to work with innovative software solutions in the education sector.

Qualifications

  • Strong experience with AWS or Azure environments.
  • Solid knowledge of Linux, Apache, and PHP in a production context.
  • Familiarity with automation/configuration tools like Ansible.
  • Experience with monitoring and logging platforms such as DataDog or New Relic.
  • Understanding of database fundamentals.
  • Hands-on troubleshooting and problem-solving skills.
  • Customer-facing experience with incident or service management tools.
  • Strong communication skills, with the ability to translate technical details clearly.

Responsibilities

  • Maintain and improve production systems for reliability.
  • Support application deployment to production environments.
  • Build or enhance automation tools.
  • Implement and manage observability tools.
  • Analyze logs and metrics to improve reliability.
  • Support incident response and conduct root-cause analysis.
  • Collaborate with engineering and customer teams.

Skills

AWS environments
Linux
Apache
PHP
Ansible
DataDog
New Relic
SQL Server
Oracle
Python
PowerShell
Bash

Tools

RemedyForce
ServiceNow
Azure Monitor
Azure DevOps

Job description

As a Site Reliability Engineer, you'll design, build, and operate large-scale systems with an emphasis on reliability, efficiency, and automation. You'll work across deployment, monitoring, and incident response to ensure our platforms stay healthy and our customers experience uninterrupted service.

Responsibilities
  • Maintaining and improving production systems for availability, latency, and scalability
  • Supporting application deployment and configuration to production environments
  • Building or enhancing automation tools (Ansible, scripts, utilities)
  • Implementing and managing observability tools such as DataDog or New Relic
  • Analyzing logs and metrics to identify trends and improve reliability
  • Supporting incident response and performing root‑cause analysis
  • Collaborating closely with engineering and customer teams to deliver proactive, preventative support
  • Participating in on‑call and out‑of‑hours rotations in line with Tribal's On‑Call Policy

This is a full‑time, fully remote UK‑based role, with occasional national travel for team collaboration or customer engagements.

Qualifications
  • Strong experience with AWS (or Azure) environments
  • Solid knowledge of Linux, Apache, and PHP in a production context
  • Familiarity with automation/configuration tools such as Ansible
  • Experience with monitoring and logging platforms (e.g. DataDog, New Relic, Azure Monitor)
  • Good understanding of database fundamentals (SQL Server / Oracle)
  • Hands‑on troubleshooting and problem‑solving skills
  • Customer‑facing experience with incident or service management tools (RemedyForce, ServiceNow)
  • Strong written and verbal communication skills, able to translate technical details clearly

Nice‑to‑have
  • Experience coding or scripting (Python, PowerShell, or Bash)
  • Understanding of CI/CD pipelines (Azure DevOps or similar)
  • ITIL Foundation or cloud certifications (AWS SysOps Administrator, AWS Solutions Architect)

Tribal is a leading EdTech business providing market‑leading software solutions to the global education market. We research, develop, and deliver the products, services, and solutions that education institutions worldwide rely on to support their core mission: educating students, delivering exceptional learning experiences, and achieving successful outcomes.

Our Platform Engineering function is at the heart of this, ensuring our systems are designed and maintained to the highest standards of reliability and security. As part of the SRE & Operations team, you'll play a key role in delivering Tribal's products through the public cloud as SaaS services across AWS and Azure.
