Responsibilities:
Operational Support
- Lead and coordinate Level 2 support operations for mission-critical applications and infrastructure
- Provide troubleshooting and diagnostics for incidents escalated from Level 1
- Ensure adherence to SLAs and system availability targets
Application Support
- Lead the resolution of application incidents escalated from Level 1; perform root cause analysis and apply workarounds where possible
- Lead the monitoring of application logs and integration points such as REST APIs, message queues and file-based transfers
- Lead liaison with Level 3 to resolve complex application issues and escalate bugs or enhancement requests
- Lead the support and maintenance of job schedulers, interface configurations and integration points
- Lead the documentation of known issues, resolution procedures and rollback steps in the knowledge base
Incident & Problem Management
- Act as incident manager for P1/P2 issues
- Coordinate resolution and communications
- Perform root cause analysis and recommend permanent fixes
- Escalate unresolved issues that require software coding to Level 3 or engineering teams
Change Management
- Perform operational impact assessment
- Serve on the Change Advisory Board (CAB) to review and approve changes
- Carry out pre-change preparation, such as reviewing the Change Request and Release Plan
- Supervise post-change production verification
- Documentation update and knowledge transfer
- Post-change review and feedback
Patch Management
- Perform patch management readiness
- Stakeholder coordination and team coordination
- System Readiness and Post-Patch Validation
- Documentation update and knowledge transfer
- Compliance and audit readiness
Documentation and Compliance
- Operational documentation: SOPs, incident response checklists, RCA, PIR, monitoring and alert guidebook
- Configuration and infrastructure documentation: system configuration baselines, application dependency maps, and environment inventories such as hosts, services and accounts
- Knowledge base articles for Level 2 enablement and faster resolution, e.g. known errors and fixes, frequent how-to guides, script repositories, lessons learned
- Knowledge Management
Configuration Management
- Validate the accuracy of configurations
- Maintain readiness of operational documentation
- Perform audits to confirm configuration compliance
- CMDB asset verification
- Change-linked configuration tracking
- Ensure environment consistency across DEV, IVVQ, ISO-PROD, UAT and PROD
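For illustration only, an environment consistency check of the kind described above could be sketched as follows. The environment snapshots, configuration keys and the `find_drift` helper are hypothetical and not taken from any actual system; the sketch simply compares each environment against a PROD baseline and lists the differences.

```python
# Hypothetical configuration snapshots keyed by environment name.
# The keys and values are illustrative assumptions, not real settings.
SNAPSHOTS = {
    "DEV": {"app_version": "2.4.1", "tls": "1.2", "pool_size": "20"},
    "UAT": {"app_version": "2.4.1", "tls": "1.3", "pool_size": "50"},
    "PROD": {"app_version": "2.4.0", "tls": "1.3", "pool_size": "50"},
}


def find_drift(snapshots, baseline="PROD"):
    """Return (environment, key, env_value, baseline_value) for every mismatch."""
    reference = snapshots[baseline]
    drift = []
    for env, config in sorted(snapshots.items()):
        if env == baseline:
            continue
        # Check keys present in either environment so missing keys are reported too.
        for key in sorted(set(config) | set(reference)):
            env_value = config.get(key)
            base_value = reference.get(key)
            if env_value != base_value:
                drift.append((env, key, env_value, base_value))
    return drift


if __name__ == "__main__":
    for env, key, env_value, base_value in find_drift(SNAPSHOTS):
        print(f"{env}: {key} = {env_value!r} (PROD has {base_value!r})")
```

In practice some differences between environments are intentional, so a check like this would normally be paired with an agreed list of keys that are allowed to differ.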
Testing and Verification
- Ensure operational readiness testing is completed before production rollout
- Coordinate post-change verification
- Perform regression and sanity tests following patching or upgrades, in UAT and PROD
- Participation in user acceptance testing
Knowledge Management
- Documentation of resolution
- Knowledge Base Contribution
- Validation of knowledge
- Subject Matter Expertise Sharing
Root Cause Analysis
- Gather logs and system metrics from the time of failure
- Reproduce issues in a controlled environment to understand the conditions under which they occur
- Determine the scope and severity in terms of the systems affected, downtime duration and business impact
- Narrow down the possible sources of the failure
- Use diagnostic tools to analyse application behaviour
- Correlate events to reconstruct the sequence leading up to the failure and identify dependencies
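As a rough illustration of the event correlation step, the sketch below merges events from several sources into one timeline covering the minutes before a failure. The event records, source names and the `build_timeline` helper are hypothetical; real log formats and tooling will differ.

```python
from datetime import datetime, timedelta

# Hypothetical events collected from separate sources as
# (source, ISO 8601 timestamp, message). Values are illustrative only.
EVENTS = [
    ("app", "2024-01-10T02:14:55", "HTTP 503 returned to clients"),
    ("queue", "2024-01-10T02:13:40", "consumer lag above threshold"),
    ("db", "2024-01-10T02:12:05", "connection pool exhausted"),
    ("app", "2024-01-10T01:30:00", "scheduled job completed"),
]


def build_timeline(events, failure_time, window_minutes=10):
    """Return events within the window before the failure, oldest first."""
    failure = datetime.fromisoformat(failure_time)
    start = failure - timedelta(minutes=window_minutes)
    in_window = []
    for source, timestamp, message in events:
        when = datetime.fromisoformat(timestamp)
        if start <= when <= failure:
            in_window.append((when, source, message))
    # Tuples sort by timestamp first, which yields chronological order.
    return sorted(in_window)


if __name__ == "__main__":
    for when, source, message in build_timeline(EVENTS, "2024-01-10T02:15:00"):
        print(f"{when.isoformat()} [{source}] {message}")
```

Ordering events this way only shows what happened first; deciding which event actually caused the failure still requires the dependency analysis described above.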
Leadership
- Supervision and provision of guidance to Level 2 engineers for change requests and service requests
- Lead and manage day-to-day operations of the Level 2 support
- Track and report the Level 2 key performance indicators such as resolution rate, mean time to resolve and system availability
- Process and quality improvement: document known issues, troubleshooting steps and standard operating procedures, and propose improvements to incident handling
- Identify tools and systems to streamline Level 2 support operations
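As an illustration of the key performance indicators named above, the sketch below computes resolution rate, mean time to resolve and availability from sample data. The ticket records and function names are hypothetical, and the formulas are common simple definitions; an organisation's SLA may define these measures differently.

```python
from datetime import datetime, timedelta

# Hypothetical ticket records; "resolved" is None while a ticket is still open.
TICKETS = [
    {"opened": "2024-03-01T08:00:00", "resolved": "2024-03-01T10:00:00"},
    {"opened": "2024-03-02T09:00:00", "resolved": "2024-03-02T13:00:00"},
    {"opened": "2024-03-03T14:00:00", "resolved": None},
]


def resolution_rate(tickets):
    """Fraction of tickets resolved (assumes at least one ticket)."""
    resolved = sum(1 for t in tickets if t["resolved"] is not None)
    return resolved / len(tickets)


def mean_time_to_resolve(tickets):
    """Average open-to-resolved duration (assumes at least one resolved ticket)."""
    durations = [
        datetime.fromisoformat(t["resolved"]) - datetime.fromisoformat(t["opened"])
        for t in tickets
        if t["resolved"] is not None
    ]
    return sum(durations, timedelta()) / len(durations)


def availability(period, downtime):
    """Fraction of the reporting period during which the system was up."""
    return (period - downtime) / period


if __name__ == "__main__":
    print(f"Resolution rate: {resolution_rate(TICKETS):.1%}")
    print(f"Mean time to resolve: {mean_time_to_resolve(TICKETS)}")
    uptime = availability(timedelta(days=30), timedelta(hours=3))
    print(f"Availability: {uptime:.3%}")
```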
Requirements:
Education and Experience
- Bachelor's degree in Information Technology, Computer Science, Engineering, or a closely related discipline
- At least 5 years of Level 2 support experience in mission-critical 24x7 production environments, preferably in the public sector
- At least 2 years in a team lead or supervisory role, coordinating tasks and mentoring junior engineers
- Proven experience in handling P1/P2 incidents, managing post-incident reviews (PIRs) and root cause analysis
- Certification in Red Hat Enterprise Linux or Kubernetes preferred
Knowledge/Skills
- Operating Systems: RHEL (90%) and Windows Server (10%)
- Networking Fundamentals
- Middleware & Infrastructure (Web Server: Nginx; App Servers: Kubernetes with containers (Docker + Spring Boot))
- Message Queues (IBM MQ, Kafka)
- Java, C#, MQTT, Golang
- Database (SQL Server, PostgreSQL)
- ITIL/ITSM Process Knowledge
- Security Awareness
- DR and HA concepts
- Leadership & Coordination
- Communication & Collaboration
- Operational Governance