We are seeking candidates to join an on-site team responsible for maintaining and supporting a managed cross-domain service built on a wide range of technologies, platforms, and tools. The team applies site reliability engineering (SRE) principles to continuously monitor, verify, and improve service performance.
Responsibilities:
- Build and Deploy Code Across Multiple Project Teams:
  - Maintain and administer CI pipelines that build artefacts using Java, Maven, and NPM.
  - Configure and execute component and service acceptance test suites using Maven and Cypress.
  - Deploy and configure services using Terraform and Ansible on target platforms including OpenShift, RHEL/CentOS, and Docker.
  - Configure and deploy third-party appliances and software services.
- Verification & Monitoring:
  - Monitor performance and availability using InfluxDB and Grafana.
  - Set up automated alerts to proactively detect issues before they escalate.
  - Review logs and respond to unexpected system behaviours in real time.
- Support and Troubleshooting:
  - Provide second- and third-line support, addressing business-critical issues.
  - Escalate problems and coordinate incident response.
  - Conduct root cause analysis and implement solutions to prevent recurrence.
  - Apply rapid and safe changes in response to evolving service requirements.
- Business-as-Usual Maintenance:
  - Employ automation tools to reduce manual effort (toil).
  - Perform regular database housekeeping.
  - Conduct OS-level health checks and patch management.
  - Support data centre operations across multiple physical locations.
Key Skills:
- Background in a managed service environment, with a focus on service delivery and customer outcomes.
- Application development experience with Java or similar languages.
- Strong analytical thinking and problem-solving skills.
- Ability to communicate effectively across technical and non-technical teams.
- Ability to prioritise and adapt quickly, especially during high-pressure incidents.
- Proficiency with Git for version control.
Desirable Skills:
- Experience deploying and managing microservice-based architectures.
- Familiarity with asynchronous messaging protocols and platforms (e.g., AMQP).
- Hands-on experience with Terraform, Ansible, and other Infrastructure-as-Code tools.
- Knowledge of S3-compatible object storage solutions.
- Experience working with RDBMS platforms such as Oracle.