Overview
Security Cleared Site Reliability Engineer - Contract Outside IR35 - 3 months+ - Hybrid
We are seeking a Lead Operations/Site Reliability Engineer to take ownership of day-to-day operations across a legacy technology estate. The role will focus on maintaining service stability, ensuring operational readiness, and leading the response to incidents and outages. The Lead Operations/Site Reliability Engineer will play a pivotal role during the transition phase by embedding operational standards, improving monitoring and support processes, and enabling knowledge transfer into ongoing service delivery teams.
Responsibilities
- Lead daily operational support of legacy systems, ensuring availability, performance, and resilience.
- Manage incident, problem, and change activities in line with ITIL and enterprise service standards.
- Proactively monitor and tune infrastructure, applications, messaging, and scheduling platforms.
- Act as the escalation point for critical incidents, coordinating technical resources to achieve rapid resolution.
- Lead root cause analysis and service improvement initiatives.
- Define and maintain runbooks, standard operating procedures, and operational documentation.
- Ensure backup, recovery, and disaster recovery processes are operationally tested and aligned to business needs.
- Oversee job scheduling, batch management, and automation activities (e.g., Tivoli Scheduler).
- Collaborate with Infrastructure, Development, and Architecture teams to support upgrades, migrations, and modernisation efforts.
- Mentor operations engineers and manage knowledge transfer from discovery into business-as-usual operations.
Competencies
- Technical background in Java, AWS and Kubernetes
- Customer engagement management, with strong leadership and coordination skills across technical and non-technical stakeholders.
- Excellent analytical and diagnostic abilities, with a structured approach to discovery and documentation. Skilled in documenting processes, monitoring metrics, and reporting on operational health.
- Excellent communication and documentation skills for effective knowledge capture and handover, including clear communication in high-pressure incident management situations.
- Ability to work both in deep technical detail and at a higher-level architectural/system view.
- Analytical and detail-oriented, with a continuous improvement mindset.
- Strong incident management skills: resilient under pressure and effective at prioritising competing demands.
Experience
- Ability to prioritise effectively in a complex, multi-system environment.
- Technical skills in Java, AWS and Kubernetes
- Current and active SC Clearance
- Proven track record in managing/supporting enterprise-scale services.
- Experience of working in a multi-vendor environment, where co-ordination, triage, and joint working are essential for operational activities.
- Familiarity with ITIL-aligned service management.
- Experience building knowledge transfer (KT) libraries.
- Previous involvement in system migrations, re-platforming, or legacy modernisation programmes highly desirable.
- Background in high-availability, disaster recovery, and enterprise integration patterns.