Incident Manager

Airtel Africa

City Of London

Hybrid

GBP 65,000 - 85,000

Full time

Job summary

A leading telecommunications company in the City of London seeks an experienced Incident Manager to oversee technical incident response. The role requires deep AWS expertise, strong analytical skills in cloud-native environments, and excellent collaboration skills. You will lead cross-functional teams, monitor service health, and manage incidents to keep business impact minimal. Competitive compensation is offered, with potential benefits.

Qualifications

  • Proven experience in Incident Management or similar roles.
  • Strong understanding of web architecture and microservices.
  • Deep hands-on experience with AWS Cloud.

Responsibilities

  • Serve as the primary technical point of contact during incidents.
  • Lead and coordinate cross-functional incident response teams.
  • Monitor service health using tools like CloudWatch and Grafana.

Skills

AWS Cloud Services
Incident Management
Linux/Unix Administration
Agile Methodologies
Scripting (Python, Bash)
Monitoring Tools (CloudWatch, Grafana)
Debugging Skills
Leadership

Tools

Jira
Confluence
Git
Jenkins

Job description

Responsibilities

  • Serve as the primary technical point of contact during critical incidents, ensuring rapid resolution and minimal business impact.
  • Lead and coordinate cross-functional teams (engineering, support, operations) during incident response, including root cause analysis, mitigation strategies, and post-mortem reviews.
  • Monitor service health using tools such as CloudWatch, OpenSearch, Kibana, and Grafana, and proactively identify potential issues before they impact customers.
  • Troubleshoot and debug production issues in web architecture, microservices, and cloud environments.
  • Manage and maintain system reliability by implementing best practices in observability, monitoring, and alerting.
  • Collaborate closely with Software Development, Infrastructure, and Operations teams to improve incident response processes and system resilience.
  • Manage incidents related to AWS services such as EC2, S3, RDS, DynamoDB, Aurora, Redis, Memcache, Kafka, SNS, SQS, OpenSearch, and Elasticsearch.
  • Use Agile tools (Jira, Confluence) to track incident tickets, document resolutions, and maintain a clear audit trail.
  • Oversee system and application deployments, supporting automation pipelines (Jenkins, Git).
  • Perform Linux/Unix administration tasks as needed during incident investigation and resolution.
  • Continuously update and refine incident response playbooks, runbooks, and SOPs.
  • Provide regular incident reports to leadership, including root cause analysis and long-term corrective actions.

Requirements

  • Proven experience as an Incident Manager, Site Reliability Engineer (SRE), or Technical Operations Lead in cloud-native and microservices-based environments.
  • Strong understanding of web architecture and microservices development principles.
  • Deep hands-on experience with AWS Cloud Services: Compute (EC2, Lambda), Storage (S3), Databases (DynamoDB, RDS, Aurora), Messaging (Kafka, SNS, SQS), Caching (Redis, Memcache), Search (OpenSearch, Elasticsearch).
  • Expertise in Agile tools: Jira, Confluence, Git, Jenkins.
  • Strong Linux / Unix system administration skills, including troubleshooting and performance tuning.
  • Strong analytical skills with expertise in debugging complex distributed system issues.
  • Experience with monitoring and observability tools like CloudWatch, Grafana, Nagios, and Kibana.
  • Excellent communication and leadership skills to manage cross-functional incident response teams.
  • Experience in writing detailed post-incident reports and driving continuous improvement.
  • Strong scripting skills (Python, Bash, or similar) to automate diagnostic or remediation tasks.