Senior Site Reliability Engineer

Talent Groups

United States

On-site

USD 100,000 - 720,000

Full time

3 days ago

Job summary

A leading company is seeking a Senior Site Reliability Engineer to manage Kubernetes environments, ensuring high availability and performance of their platform. The successful candidate will architect observability solutions, mentor junior engineers, and implement best practices for reliability and incident response.

Benefits

Medical insurance
Vision insurance
401(k)

Qualifications

  • Experience building and managing Kubernetes at scale.
  • Expertise in observability solutions spanning metrics, logs, and traces.
  • Proficiency in GitOps using tools like Helm, ArgoCD, or Flux.

Responsibilities

  • Ensure platform reliability and observability across Kubernetes environments.
  • Manage the full lifecycle of Kubernetes clusters: upgrades, migrations, and autoscaling.
  • Lead capacity planning and incident management processes.

Skills

Kubernetes Management
Observability Solutions
Linux Networking Fundamentals
GitOps Workflows
Service Mesh Technologies

Tools

Prometheus
Grafana
OpenTelemetry
Helm
Terraform

Job description

Our client seeks a Senior Site Reliability Engineer to lead the design, implementation, and operation of a Kubernetes environment. The role focuses on platform reliability with deep observability, and on building a dynamic, policy-driven traffic enforcement layer within a Kubernetes-hosted SASE platform. The ideal candidate will have experience building and managing Kubernetes at scale and on-prem.

Job Type: Direct Hire

This position is not eligible for visa sponsorship. No Corp to Corp or 3rd party agencies.

Responsibilities:

  • Ensure the SASE platform delivers secure, performant network experiences, maintaining reliability, observability, and real-time traffic control while upholding engineering standards across distributed infrastructure.
  • Manage mission-critical Kubernetes environments supporting distributed SaaS infrastructure across multiple regions and failure domains.
  • Architect and implement observability solutions across the full platform stack.
  • Integrate OpenStack infrastructure beneath Kubernetes, with expertise in its networking, storage, and orchestration layers.
  • Own platform stability, performance, and SLA attainment.
  • Lead capacity planning and scaling to ensure consistent performance under variable loads.
  • Establish incident management processes, on-call rotations, automated remediation, and post-incident analysis.
  • Conduct disaster-recovery exercises and failure simulations to validate resilience.
  • Build an observability stack offering actionable insights across platform, network, and application layers.
  • Implement golden-signals monitoring with alert thresholds and dashboards for health and customer impact.
  • Deploy distributed tracing to pinpoint performance bottlenecks and optimize critical paths.
  • Manage the full lifecycle of Kubernetes clusters: upgrades, migrations, resiliency testing, and autoscaling.
  • Automate cluster operations via GitOps using Helm, ArgoCD, Flux, and Terraform.
  • Optimize Kubernetes resource allocation and apply FinOps practices for cost control.
  • Design a Kubernetes-native traffic control plane for per-tenant and per-device session/bandwidth limits.
  • Architect enforcement logic via custom CRDs/controllers integrated with Prometheus and OpenTelemetry.
  • Implement and troubleshoot service-mesh technologies for advanced routing, security, and observability.
  • Develop real-time response workflows using Prometheus rule evaluation, webhook triggers, and automated tc/iptables enforcement.
  • Contribute to cloud-native architecture decisions and technical strategy.
  • Mentor junior engineers and evangelize Kubernetes best practices.

Skills and Experience:

  • Must have experience building and managing Kubernetes at scale on-prem or in datacenters.
  • Expertise in building observability solutions spanning metrics, logs, and traces.
  • Advanced mastery of Kubernetes controller patterns, custom resources, and operators (Kubebuilder preferred).
  • Deep knowledge of CNI plugins (Cilium) and dynamic network-policy enforcement.
  • Strong Linux networking fundamentals including tc, nftables, conntrack, iptables, and WireGuard.
  • Proficiency with Prometheus, Grafana, OpenTelemetry, Jaeger, and Loki.
  • Fluent in Go for Kubernetes controller development; familiarity with Python/Bash for scripting.
  • Skilled in GitOps workflows using Helm, Terraform, ArgoCD, or Flux.
  • Familiarity with network security and overlay architectures, including VXLAN, IPsec, and SDN routing.
  • Experience implementing and troubleshooting service meshes (Istio, Linkerd).
  • Hands-on experience running Kubernetes clusters atop OpenStack (Nova, Neutron, Ceph).
  • Knowledge of multi-cluster management tools (Fleet, Cluster API, Rancher).
  • Competency with chaos engineering frameworks (Chaos Mesh, Litmus).

Seniority level: Not Applicable
Employment type: Full-time
Job function: Information Technology
Industries: IT Services and IT Consulting


