Principal Site Reliability Engineer - Remote

Bright Horizons

United States

Remote

USD 120,000 - 180,000

Full time

11 days ago

Job summary

A leading company is seeking a Principal Site Reliability Engineer to ensure the reliability and scalability of its digital infrastructure. This remote role involves enhancing system performance, implementing monitoring solutions, and driving incident management practices. The ideal candidate will collaborate with cross-functional teams to foster a culture of innovation and continuous improvement.

Qualifications

  • Deep understanding of Cloud technologies and Distributed Systems.
  • Experience in Automation/Scripting and Observability.
  • Proven ability in incident management and troubleshooting.

Responsibilities

  • Contribute to reliability, scalability, and availability of digital infrastructure.
  • Implement monitoring and incident management practices.
  • Drive automation solutions to enhance efficiency.

Skills

Cloud technologies
Distributed Systems
Automation/Scripting
Observability
Software Engineering
DevOps

Tools

Dynatrace
Ansible
Terraform

Job description

The Principal Site Reliability Engineer (Principal SRE) plays a pivotal role in ensuring the seamless and reliable operation of the organization's digital infrastructure. This highly technical position will enhance the performance, scalability, and reliability of the organization's complex systems and applications. It will reduce time to detect and restore systems, increase uptime, and improve incident response by applying best practices in automation, monitoring, and incident management. The role requires a deep understanding of Cloud technologies, Distributed Systems, Automation/Scripting, Observability, Software Engineering, and DevOps, along with a proactive approach to preventing and mitigating potential issues. It reports to the Director of Site Reliability Engineering and will help foster a culture of innovation, continuous improvement, and collaboration within the team to meet the organization's evolving needs and deliver a superior digital experience to users.

This is a Remote position available in the United States.

Responsibilities

  1. Reliability and Scalability: Contribute significantly to the reliability, scalability, and availability of Bright Horizons' digital infrastructure by enforcing best practices of redundancy and resiliency across applications and infrastructure.
  2. Observability: Implement robust infrastructure, application, and digital-experience monitoring in Dynatrace, our enterprise-wide APM tool. Proactively identify potential issues, analyze system performance, and facilitate quick response to incidents. Create dashboards, alerts, and automated workflows for use by Operations or Application teams.
  3. Incident Management: Drive troubleshooting of critical incidents by developing a deep understanding of our enterprise architecture across all 7 OSI layers. Utilize monitoring and alerting to ensure timely incident resolution. Track KPIs like MTTD/MTTR and identify opportunities for improvement. Conduct post-mortems to identify root causes and implement preventive measures.
  4. Automation and Efficiency: Drive the development and implementation of automation solutions to streamline processes, reduce manual interventions, and enhance efficiency of Product, Engineering, and SRE teams.
  5. Tools Ownership: Own Observability tools and create a roadmap to expand and consolidate them. Provide a comprehensive view of cross-functional areas like SRE, DevOps, Application Support, Monitoring, Incident Management, Infrastructure, and Enterprise Architecture.
  6. Collaboration: Collaborate with cross-functional teams to drive a unified approach to site reliability, optimizing work and improving time-to-market. Foster strong relationships to implement an SRE culture aligned with organizational goals.
  7. Infrastructure Roadmap and System Capacity Planning: Work with Infrastructure and Architecture teams to design and implement scaling roadmaps for server and serverless architectures using Containers and IaC tools like Ansible and Terraform. Conduct disaster recovery and failure testing to improve resiliency. Perform capacity planning for current and future demands.
