This role is for one of Weekday's clients
Min Experience: 5 years
Location: Singapore
Job Type: Full-time
As a Datadog L3 Engineer, you will play a critical role in designing, implementing, and operating advanced observability solutions for complex, large-scale technology environments. Based in Singapore, this full-time role is ideal for a highly skilled professional with deep hands-on experience in monitoring, logging, metrics, and Real User Monitoring (RUM). You will act as a subject matter expert for Datadog, supporting mission-critical systems, driving operational excellence, and ensuring high availability, performance, and reliability across infrastructure and applications. This role requires strong collaboration with engineering, DevOps, and operations teams, along with a solid understanding of ITIL practices and modern cloud-native tooling.
Key Responsibilities
- Design, configure, and maintain end-to-end observability solutions using Datadog, including logs, metrics, traces, and RUM for distributed systems
- Act as an L3 escalation point for complex monitoring, performance, and availability issues, performing deep root cause analysis and remediation
- Implement and optimize log management pipelines, dashboards, alerts, and service-level indicators and objectives (SLIs/SLOs) to improve system visibility and reliability
- Lead the setup and tuning of infrastructure and application monitoring across containerized and cloud environments
- Build and manage monitoring infrastructure as code using Terraform, ensuring consistency, scalability, and repeatability
- Support Docker-based platforms by monitoring container health, performance, and resource utilization
- Integrate Datadog with CI/CD pipelines and cloud services to enable proactive detection of issues
- Collaborate with DevOps, SRE, and application teams to define observability standards and best practices
- Ensure adherence to ITIL processes for incident, problem, and change management
- Create and maintain detailed documentation, runbooks, and operational guides
- Continuously evaluate system performance trends and recommend improvements to enhance stability and user experience
- Mentor junior engineers and provide technical guidance on observability and monitoring strategies
What Makes You a Great Fit
- At least 5 years of hands-on experience in monitoring, observability, or site reliability engineering roles
- Strong expertise with Datadog, including logs, metrics, dashboards, alerts, and Real User Monitoring (RUM)
- Proven experience using Terraform to manage infrastructure and monitoring configurations
- Solid hands-on knowledge of Docker and container-based environments
- Strong understanding of ITIL processes and experience working in structured operational environments
- Ability to troubleshoot complex, large-scale production issues with a methodical and analytical approach
- Experience working with cloud platforms and modern DevOps toolchains
- Excellent communication skills, with the ability to collaborate across technical and non-technical teams
- A proactive mindset with a strong focus on automation, reliability, and continuous improvement
- Comfortable working in fast-paced, high-availability environments with ownership and accountability