AVP/Senior Associate, Platform SRE Engineer, SRE & Governance, Group Technology

DBS

Singapore

On-site

SGD 100,000 - 130,000

Full time

4 days ago

Job summary

A leading bank in Singapore seeks a Platform SRE Engineer to enhance application performance and platform reliability. Responsibilities include developing monitoring strategies, automating processes, and collaborating with stakeholders on performance metrics. Ideal candidates have at least 10 years of IT experience, strong scripting skills, and knowledge of observability tools like AppDynamics and Grafana. This role offers a chance to drive continuous improvement and efficiency within the team.

Qualifications

10+ years of IT work experience.
Strong experience in site reliability engineering (SRE) practices.
Deep knowledge of scripting for efficiency and scalability.

Responsibilities

Develop monitoring guidelines for various applications.
Implement observability tools and standards.
Automate routine tasks to improve efficiency.

Skills

AppDynamics

ELK Stack

Grafana

Python scripting

Unix/Linux

CI/CD automation (Jenkins, Ansible)

Education

University graduate in computer science or related field

Tools

AppDynamics

Grafana

Confluent Kafka

Prometheus

Job Description - AVP/Senior Associate, Platform SRE Engineer, SRE & Governance, Group Technology (250000DV)

Job Description

The Role:

We are looking for a Platform SRE Engineer with experience working on enterprise-level data engineering, analytics, and observability applications. The SRE engineer will be responsible for ensuring high availability of the platform services and for making continuous improvements to increase the platform’s efficiency and resiliency. The SRE engineer will also develop automation to remove toil and increase the team’s productivity.

Responsibilities:

Develop monitoring and onboarding guidelines for various applications using the observability platform stack, ensuring accurate monitoring and data collection.
Implement observability standards, best practices, operations, and processes for the enterprise in AppDynamics and other observability tools.
Automate routine tasks and reporting processes using APIs and scripting, reducing manual effort and improving efficiency in AppDynamics and other observability tools.
Identify and resolve performance issues through detailed analysis of transaction traces, application logs, and system metrics.
Collaborate with stakeholders to define performance metrics and monitoring requirements aligned with business goals.
Contribute to internal knowledge bases, create documentation, and share insights with the team to promote a culture of learning and collaboration.
Design and implement monitoring solutions to track application performance, identify bottlenecks, support capacity planning, and optimise system efficiency.
Develop custom dashboards and reports to provide actionable insights and drive decision-making processes.
Collaborate with development and operations teams to integrate the observability platform stack with CI/CD pipelines and other DevOps tools.
Configure and fine-tune alerts to proactively detect and address performance issues before they impact end-users.
Continuously review and enhance monitoring processes and methodologies to improve efficiency and effectiveness.
Work with application teams to develop long-term monitoring strategies that align with business goals and technology roadmaps.
Create data retention policies and access controls (RBAC) to manage user permissions.
Perform application maintenance, including patching and upgrading controller versions and agents, and ensure EOS/EOL requirements are met.

Requirements:

University graduate (computer science or related field) with good experience working with contemporary technologies and scripting languages.
Min 10 years of IT work experience.
Working knowledge of AppDynamics, ELK Stack, Grafana, and OpenTelemetry (OTel).
In-depth experience in Unix/Linux/Shell/Python scripting, with attention to quality, scalability, and extensibility.
Experience in quickly triaging and troubleshooting application problems in monitoring tools, using techniques such as transaction snapshots, diagnostic sessions, and data collectors.
Knowledgeable and experienced in SRE (Site Reliability Engineering) practices covering monitoring, observability, performance management, automation, and resiliency.
Knowledge of Confluent Kafka, Prometheus, and other APM tools (Dynatrace, Datadog, New Relic, Splunk) is a plus.
Knowledge of AI/ML capabilities to automate root cause analysis (RCA) and shorten MTTR when issues arise.
Good understanding of network routing, load balancing, and networking protocols; a base knowledge of TCP/IP, with an understanding of HTTP and DNS.
Ability to contribute to discussions on design and strategy.
Good problem diagnosis and creative problem-solving skills.
Experience with automation and CI/CD tools (Jenkins, Ansible).
