City of London
Hybrid
GBP 65,000 - 85,000
Full time
Job summary
A leading telecommunications company in the City of London seeks an experienced Incident Manager to oversee technical incident responses. The role requires deep AWS expertise, strong analytical abilities in cloud-native environments, and excellent collaboration skills. You will lead cross-functional teams, monitor service health, and manage incidents, ensuring minimal business impact. Competitive compensation and potential benefits offered.
Qualifications
- Proven experience in Incident Management or similar roles.
- Strong understanding of web architecture and microservices.
- Deep hands-on experience with AWS Cloud.
Responsibilities
- Serve as the primary technical point of contact during incidents.
- Lead and coordinate cross-functional incident response teams.
- Monitor service health using tools like CloudWatch and Grafana.
Skills
AWS Cloud Services
Incident Management
Linux/Unix Administration
Agile Methodologies
Scripting (Python, Bash)
Monitoring Tools (CloudWatch, Grafana)
Debugging Skills
Leadership
Tools
Jira
Confluence
Git
Jenkins
Responsibilities
- Serve as the primary technical point of contact during critical incidents, ensuring rapid resolution and minimal business impact.
- Lead and coordinate cross-functional teams (engineering, support, operations) during incident response, including root cause analysis, mitigation strategies, and post-mortem reviews.
- Monitor service health using tools such as CloudWatch, OpenSearch, Kibana, and Grafana, and proactively identify potential issues before they impact customers.
- Troubleshoot and debug production issues in web architecture, microservices, and cloud environments.
- Manage and maintain system reliability by implementing best practices in observability, monitoring, and alerting.
- Collaborate closely with Software Development, Infrastructure, and Operations teams to improve incident response processes and system resilience.
- Manage incidents related to AWS services such as EC2, S3, RDS, DynamoDB, Aurora, Redis, Memcached, Kafka, SNS, SQS, OpenSearch, and Elasticsearch.
- Use Agile tools (Jira, Confluence) to track incident tickets, document resolutions, and maintain a clear audit trail.
- Oversee system and application deployments, supporting automation pipelines (Jenkins, Git).
- Perform Linux/Unix administration tasks as needed during incident investigation and resolution.
- Continuously update and refine incident response playbooks, runbooks, and SOPs.
- Provide regular incident reports to leadership, including root cause analysis and long-term corrective actions.
Requirements
- Proven experience as an Incident Manager, Site Reliability Engineer (SRE), or Technical Operations Lead in cloud-native and microservices-based environments.
- Strong understanding of web architecture and microservices development principles.
- Deep hands-on experience with AWS Cloud Services: Compute (EC2, Lambda), Storage (S3), Databases (DynamoDB, RDS, Aurora), Messaging (Kafka, SNS, SQS), Caching (Redis, Memcached), Search (OpenSearch, Elasticsearch).
- Expertise in Agile tools: Jira, Confluence, Git, Jenkins.
- Strong Linux/Unix system administration skills, including troubleshooting and performance tuning.
- Strong analytical skills with expertise in debugging complex distributed system issues.
- Experience with monitoring and observability tools like CloudWatch, Grafana, Nagios, and Kibana.
- Excellent communication and leadership skills to manage cross-functional incident response teams.
- Experience in writing detailed post-incident reports and driving continuous improvement.
- Strong scripting skills (Python, Bash, or similar) to automate diagnostic or remediation tasks.