The Role:
This position is for an SRE Problem and Knowledge Management Team Lead within the Site Reliability Engineering and Governance (SRE & Governance) department, an enabling group.
This role is expected to strategically lead incident retrospective/problem management operations and other SRE activities related to maintenance management, including availability, performance, change management, monitoring, capacity planning, and solutions derived from emergency response.
The Team Lead must ensure that retrospective activities are effectively orchestrated and carried out, promoting a blameless culture in accordance with SRE principles.
Responsibilities:
- Mentor the team in facilitating and conducting root cause analysis (RCA) activities from start to finish.
- Lead facilitation for high-severity incidents, liaising with senior management and providing regular updates.
- Present findings and action plans at RCA Forum, Tech Risk Forum, and other senior management meetings.
- Rapidly absorb and effectively apply new technology.
- Communicate clearly with both technical and non-technical colleagues.
- Work to high standards within agreed timescales.
- Perform any other reasonable tasks as requested by supervisors or senior management.
- Manage resources to ensure effective problem management activities.
- Provide platforms and channels to keep stakeholders updated on retrospectives and RCA activities.
- Demonstrate authority during problem management calls.
- Serve as the point of contact for high-severity incidents, from retrospective calls to Management Report documentation and publication.
- Take accountability for initiatives to enhance SRE practices based on retrospectives.
- Collaborate with Engineering Teams within SRE and with Lines of Business (LOBs) on preventive enabling activities.
Requirements:
- Minimum 15 years of experience in process improvement/RCA, leading discussions as a problem manager or incident commander, preferably in Technology & Operations.
- Experience with JIRA, Confluence, Jenkins, Nexus, SonarQube, Bitbucket, S3, and cloud computing.
- Good exposure to logging and monitoring tools such as Dynatrace, Prometheus, Grafana, and ELK.
- Deep understanding of Incident & Problem Management functions and activities, including hardware- and software-related issues.
- Ability to work with stakeholders and command centers in troubleshooting, escalating, and resolving critical site incidents.
- Ability to identify recurring issues and collaborate with cloud, infrastructure, product development, vendors, and other stakeholders to investigate and resolve their causes.
- Ability to maintain accurate incident documentation, including impact, timelines, and mitigation steps.
- Strong verbal and written communication skills, especially for documentation.
- At least 10 years of software development, technical support, or operations experience.
- Basic knowledge of Linux, AIX, Solaris, and Windows.
- Exposure to enterprise databases like Oracle, SQL Server, MariaDB, MongoDB, and Sybase.
- Knowledge of systems, multi-tier applications, and network troubleshooting.
- Awareness of Public/Private/Hybrid cloud solutions.