Overview
Design and deploy scalable, secure, and fault-tolerant cloud environments across AWS, Azure, or GCP, optimising for performance, availability, and cost efficiency.
We are seeking a highly skilled and motivated Site Reliability Engineer to join our team. The ideal candidate will have a good understanding of engineering principles and a broad understanding of full-stack software technologies, with hands-on expertise in application development and tooling within a secure/on-prem environment, combined with a passion for applying best practices.
Responsibilities
- Enterprise Cloud Migrations: Lead migrations of legacy systems (e.g. lift and shift, re-architecture) to the cloud with minimal downtime.
- Automation & Infrastructure as Code (IaC): Use Terraform, CloudFormation, Ansible, or similar tools to automate cloud resource provisioning, CI/CD pipeline deployments, and configuration management.
- Security & Compliance Oversight: Implement IAM, encryption, VPC/NSG policies and ensure compliance with standards (e.g. GDPR, ISO 27001, SOC 2) across cloud environments.
- Monitoring, Optimisation & Cost Governance: Continuously monitor workloads using tools such as CloudWatch, Prometheus, and Datadog; drive performance tuning and cost optimisation (rightsizing, reserved instances, auto scaling).
- Disaster Recovery & Business Continuity Planning: Develop and test backup/DR strategies, restore drills, and self-healing infrastructure to ensure reliability and uptime.
- Collaboration & Knowledge Sharing: Work closely with DevOps, development, security and operations teams; prepare architecture/design documents, network diagrams, runbooks and training materials.
- Client-site Engagement: The position requires team members to work on the client site to ensure reliability and availability of critical systems.
Qualifications
- Cloud Platforms: Hands-on experience with AWS, Azure, or Google Cloud Platform.
- Infrastructure Automation: Proficiency with Terraform, CloudFormation, Ansible or equivalent IaC tools.
- Containerisation & Orchestration: Experience deploying and managing Docker containers and Kubernetes clusters (EKS, AKS, GKE or on-prem).
- Programming / Scripting: Competent in Python, Bash, PowerShell or similar, for automation and tooling.
- Networking & Storage: Strong understanding of VPC architecture, subnets, firewalls, load balancers, and storage tiers.
- DevOps & CI/CD: Experience building pipelines with Jenkins, GitLab CI/CD, GitHub Actions or Azure DevOps.
- Security & Compliance: Experience implementing and monitoring IAM, encryption, audit logging, network isolation, and compliance frameworks.
- Monitoring & Optimisation Tools: Familiarity with CloudWatch, Grafana, Datadog, Prometheus, ELK or similar.
- Other skills: English; GitLab; Kubernetes; Cloud Native Development.
- Applicants must be sole UK nationals and already hold HMG HLC clearance.