Sr. Site Reliability Engineer (SRE)

tsworks

United States

Remote

USD 100,000 - 125,000

Full time

30+ days ago

Job summary

An established industry player is seeking a Site Reliability Engineer to architect and maintain high availability infrastructure that supports critical applications. This role involves leading the implementation of Infrastructure as Code using AWS CDK, optimizing automation for deployments, and enhancing CI/CD pipelines. The ideal candidate will have a strong background in cloud computing, particularly AWS, and experience with Kubernetes and container orchestration. Join a forward-thinking team that values innovation and collaboration, where you can mentor junior engineers and drive technical excellence in a dynamic environment.

Qualifications

  • Expertise in AWS and cloud computing with strong infrastructure design knowledge.
  • Extensive experience with CI/CD tools and automation practices.

Responsibilities

  • Design and maintain scalable infrastructure for business-critical applications.
  • Lead incident response and root cause analysis for infrastructure issues.

Skills

Site Reliability Engineering
DevOps
AWS
Kubernetes
CI/CD
Python
Bash
Troubleshooting
Problem-solving
Leadership

Experience

6-10+ years of experience in Site Reliability Engineering or DevOps

Tools

AWS CDK
Terraform
Ansible
Datadog
Prometheus
Grafana
SonarQube

Job description

Role & responsibilities

  • Architect, design, and maintain highly available, scalable, and resilient infrastructure to support business-critical applications.
  • Lead the implementation and management of Infrastructure as Code (IaC) using AWS CDK, ensuring infrastructure is automated, repeatable, and secure.
  • Develop and optimize automation for deployments, configuration management, and infrastructure provisioning across cloud (AWS) and container orchestration platforms (Kubernetes, EKS, ECS).
  • Enhance and maintain CI/CD pipelines, ensuring smooth and automated application and infrastructure deployments.
  • Design and implement monitoring and observability solutions using tools such as Datadog, Prometheus, and Grafana to proactively identify and resolve performance bottlenecks and failures.
  • Collaborate with development teams to ensure infrastructure aligns with application requirements and follows best practices for performance, security, and cost efficiency.
  • Lead incident response and root cause analysis efforts, ensuring high levels of service availability and quick resolution of infrastructure issues.
  • Continuously improve infrastructure performance, scalability, and reliability through best practices, automation, and innovation.
  • Mentor and coach junior engineers, sharing knowledge, best practices, and expertise in site reliability engineering.
  • Stay up to date with trends and advancements in cloud computing, containerization, and DevOps methodologies to drive improvements in our technology stack.

Preferred candidate profile

  • 6-10+ years of experience in Site Reliability Engineering, DevOps, or a related field.
  • Expertise in cloud computing, particularly AWS, with deep knowledge of infrastructure design and best practices.
  • Experience with multi-cloud environments, including Azure and GCP, is highly desirable.
  • Proficiency with AWS CDK is essential, with additional experience in Terraform and Ansible considered a strong advantage.
  • Strong experience with Kubernetes and container orchestration platforms (EKS, ECS), including deploying, scaling, and managing workloads.
  • Extensive experience with CI/CD tools and practices, with hands-on expertise in automating infrastructure (EKS, ALB, NLB, Route 53, WAF, Network components) and application deployments.
  • Advanced scripting and programming skills (Python, Bash, or similar) for automation and infrastructure management.
  • In-depth knowledge of monitoring, logging, and observability tools (Datadog, Prometheus, Grafana, ELK, etc.).
  • Knowledge of Content Delivery Networks (CDNs) for optimizing application performance and scalability is preferred.
  • Strong troubleshooting and problem-solving skills, with a proactive approach to incident management and root cause analysis.
  • Strong application knowledge, including building and deploying Java Spring Boot and Angular applications.
  • Experience in setting up unit tests and code quality tools, such as SonarQube, to ensure robust application development.
  • Proven ability to work independently and lead initiatives while collaborating with cross-functional teams.
  • Excellent communication and leadership skills, with experience mentoring junior engineers and driving technical excellence.
