About the Role
The Availability Manager is responsible for ensuring that IT services consistently meet the availability and reliability targets defined in Service Level Agreements (SLAs) and Operational Level Agreements (OLAs). The role leads the design, measurement, analysis, reporting, and continual improvement of service availability across the full service lifecycle, working closely with Incident, Problem, Change, and Operations teams. This position directly supports business continuity, operational resilience, and regulatory compliance.
Responsibilities
- Availability Management (Core)
- Define and maintain availability requirements for all critical services in alignment with business needs.
- Establish and manage availability targets (SLA / OLA).
- Own the Availability Management Framework and process in accordance with ITIL 4 practices.
- Develop and maintain Availability Plans for business-critical services.
- Ensure availability is designed into new services (Service Design & Service Introduction).
- KPI, SLA & Performance Management
- Define, implement, and govern availability KPIs (e.g., uptime %, MTBF, MTTR, service reliability index).
- Build and maintain service availability dashboards and executive reports.
- Track SLA compliance and perform deviation and trend analysis.
- Provide monthly, quarterly, and annual availability reports to stakeholders.
- Identify chronic availability risks and services at risk of SLA breach.
- Incident & Problem Management Integration
- Act as a senior stakeholder in Major Incidents related to service outages.
- Ensure post-incident reviews (PIRs) include availability impact assessment.
- Work with Problem Management to drive permanent fixes for recurring availability issues.
- Validate root cause analysis quality and corrective action effectiveness.
- Risk, Resilience & Continuity
- Identify availability risks and single points of failure (SPOFs).
- Work with IT Service Continuity Management (ITSCM) to ensure DR/BCP alignment with availability objectives.
- Validate redundancy, failover, monitoring coverage, and recovery capabilities.
- Support operational resilience and regulatory audits (e.g., by financial regulators).
- Review architecture designs for high-availability, fault tolerance, and maintainability.
- Define availability design standards and patterns (active-active, active-passive, clustering, etc.).
- Participate in Change Advisory Board (CAB) for high-risk changes impacting availability.
- Approve availability criteria for go-live readiness.
- Monitoring & Observability Alignment
- Define service availability monitoring requirements (synthetic, real-user, infrastructure, application).
- Ensure accurate service health modeling and dependency mapping.
- Validate alerting thresholds and event correlation effectiveness.
- Work with Command Center / NOC / Observability teams to ensure proactive detection.
Qualifications
- Bachelor’s degree in IT, Computer Science, Engineering, or related field
- ITIL 4 certification (Foundation mandatory; Managing Professional preferred)
- Additional certifications are advantageous:
  - ISO 20000
  - ISO 22301 (Business Continuity)
  - SRE / Cloud / Architecture certifications
Required Skills
- Technical & Process
- Strong knowledge of ITIL 4 practices, especially:
  - Availability Management
  - Change Enablement
- Service monitoring and observability concepts
- SLA modeling and KPI engineering
- High-availability architecture principles
- Disaster recovery and resilience design
- Data analysis and reporting (Excel, BI tools, dashboards)
- Analytical & Business
- Root cause analysis
- Trend analysis and forecasting
- Executive reporting
- Soft Skills
- Structured communication
- Crisis management participation
- Negotiation with technical teams and business owners
- Documentation and governance mindset
- Tools & Technologies (Typical)
- Monitoring / Observability platforms (Elastic, Dynatrace, Splunk, AppDynamics, etc.)
- CMDB and service mapping tools
- Reporting tools (Excel, Power BI, Tableau)
- Incident & problem tracking systems
Pay range and compensation package
10–15+ years of experience in IT operations or service management. Experience in regulated or large-scale environments is preferred.