Location: Richmond Hill, Ontario
About the Team
Our client’s platform engineering group operates with a Site Reliability Engineering (SRE) mindset, committed to delivering highly reliable, scalable, and performant systems across a public cloud infrastructure. The team specializes in enhancing system transparency, enabling deep diagnostics, and ensuring seamless collaboration between development and operations. Shared ownership, proactive problem-solving, and continuous improvement are at the core of everything they do.
The Opportunity
Our client is looking for a Senior Software Engineer with a strong background in application development and a passion for observability and system reliability. This hybrid role blends hands-on development with reliability engineering. You’ll work closely with existing microservices in Node.js and/or Java to enhance instrumentation and build out scalable observability frameworks that support modern containerized workloads on Kubernetes.
What You’ll Be Doing
- Create Observability Frameworks: Design and implement tools that make it easier to embed metrics, logs, and traces into applications.
- Enhance Application Monitoring: Analyze and improve the instrumentation of Node.js and Java services using Elastic APM to capture performance data and operational context.
- Define and Evangelize SRE Best Practices: Collaborate with engineers to define meaningful SLIs, SLOs, and KPIs, integrating them into ongoing development workflows.
- Monitoring Systems Architecture: Build and maintain scalable observability platforms using Elastic APM, InfluxDB, and Prometheus.
- Performance Analysis: Use system metrics, performance test data, and application code insights to diagnose bottlenecks and suggest optimizations.
- Incident Response & Resolution: Serve as a go-to expert during incidents, leveraging observability tools to identify root causes and propose fixes.
- Postmortems & Continuous Improvement: Lead structured reviews after incidents, recommending and implementing system improvements to avoid recurrence.
- Mentorship & Cultural Impact: Promote observability-first thinking across engineering teams by mentoring peers and embedding SRE practices into the development culture.
What You’ll Need to Succeed
- Bachelor’s degree in Computer Science, Software Engineering, or a related discipline
- 5+ years of hands-on software development experience in Node.js and/or Java
- Professional experience with Docker and Kubernetes
- Proficiency in object-oriented programming and an understanding of the HTTP protocol and RESTful APIs
- Familiarity with both SQL and NoSQL databases
- Experience working in Linux/Unix environments and writing scripts
- Strong debugging, analytical, and collaboration skills
- Exposure to modern JavaScript frameworks (React, Angular, Vue, ExtJS, etc.)
- Solid grasp of software architecture, testing strategies, and performance monitoring principles
Nice-to-Haves
- Practical experience with Elastic APM, OpenTelemetry, or similar observability tools
- Experience building REST APIs using Spring Boot or Node.js
- Understanding of performance tuning and system capacity planning
- Exposure to testing tools such as Selenium, JUnit, Mockito, Mocha
- Familiarity with Oracle databases, PL/SQL, or servlet-based Java frameworks (Spring MVC, Struts, etc.)
- Web server experience with Apache, Tomcat, or Nginx
- Experience in SRE-focused roles, including development and maintenance of monitoring platforms
- Background with Infrastructure as Code (Terraform, etc.)
- Experience working in public cloud environments (AWS, Azure, or GCP)