VP, Problem & Knowledge Management Specialist, SRE & Governance, Group...

DBS Bank Limited

Singapore

On-site

SGD 120,000 - 160,000

Full time

Yesterday

Job summary

A leading financial institution in Singapore is seeking a Team Lead for the SRE Problem and Knowledge Management team. This role requires at least 15 years of experience in process improvement and incident management. The ideal candidate will mentor the team, lead the facilitation of high-severity incidents, and collaborate with engineering teams. Proficiency in tools like JIRA and Confluence, along with familiarity with cloud computing, is essential. Strong communication skills and accountability in problem management are key to this position.

Qualifications

  • Minimum 15 years of experience in process improvement and root cause analysis.
  • Strong exposure to incident and problem management functions.
  • Minimum 10 years of software development or technical support experience.

Responsibilities

  • Mentor the team in root cause analysis activities.
  • Lead facilitation for high-severity incidents.
  • Communicate effectively with technical and non-technical colleagues.

Skills

Process improvement
Root cause analysis
Incident management
Cloud Computing
Strong communication skills

Tools

JIRA
Confluence
Jenkins
Dynatrace
Prometheus
Grafana
Oracle
SQL Server
MongoDB

Job description

The Role

This position is for an SRE Problem and Knowledge Management Team Lead within the enabling group, Site Reliability Engineering and Governance (SRE & Governance) department.

This role is expected to strategically lead incident retrospective and problem management operations, and to support other SRE activities related to maintenance management, including availability, performance, change management, monitoring, capacity planning and the solutions derived from emergency response.

The Team Lead ensures that retrospective activities are orchestrated and carried out effectively while promoting a blameless culture in accordance with SRE principles.

Responsibilities
  • Mentor the team in the seamless facilitation & conduct of root cause analysis (RCA) activities from end to end
  • Lead the facilitation of high-severity incidents, liaising with top/ senior management and keeping them updated
  • Act as the prime focal point for presenting at the RCA Forum, Tech Risk Forum and other senior management meetings to report updates on retrospective findings & action plans
  • Absorb new technology rapidly & apply effectively
  • Communicate well with technical & non-technical colleagues
  • Work to a high standard within agreed timescales
  • Undertake any other tasks or duties that are reasonable & requested by the supervisor or a member of the senior management team.
  • Manage resources to ensure problem management activities are carried out effectively and efficiently
  • Provide platforms and channels to keep stakeholders updated on the results of retrospectives and RCA activities
  • Demonstrate authority in problem management calls
  • Act as the point of contact for assigned higher-severity incidents (from incident retrospective calls all the way up to Management Report (MR) documentation and publishing)
  • Take accountability for SRE enhancement initiatives arising from retrospectives
  • Collaborate with Engineering Teams within SRE and with LOBs on enabling activities as part of the preventive measures

Requirements
  • Minimum 15 years of process improvement/ root cause analysis (RCA) exposure & involvement leading discussions as a problem manager or incident commander, preferably in the Technology & Operations space
  • Experience with JIRA, Confluence, Jenkins, Nexus, SonarQube, Bitbucket, S3 and cloud computing
  • Good exposure to logging & monitoring tools like Dynatrace, Prometheus, Grafana, ELG/ELK
  • In depth understanding of Incident & Problem Management functions & activities (i.e. Hardware- & Software-related incident & problem management)
  • Work with stakeholders & command centre in troubleshooting, escalating & solutioning critical site incidents
  • Identify recurring system/ application issues & work with cloud team, infra teams, product development, vendors & other stakeholders in investigating & resolving the cause
  • Maintain accurate documentation of incidents including impact details, timelines, steps taken for mitigation/resolution.
  • Strong verbal & written communication skills, particularly effective documentation skills
  • Minimum 10 years of software development, technical support or operations experience
  • Basic knowledge of Linux, AIX, Solaris and Windows
  • Exposure to enterprise databases, e.g. Oracle, SQL Server, MariaDB, MongoDB & Sybase
  • Knowledge in systems & multi-tier application & network troubleshooting
  • Essential knowledge & awareness of Public/Private/Hybrid cloud solutions.