Senior Site Reliability Engineer

Talent Groups

McKinney (TX)

Hybrid

USD 120,000 - 160,000

Full time

3 days ago

Job summary

A leading company is looking for a Senior Site Reliability Engineer to oversee Kubernetes infrastructure in a high-performance multi-tenant SaaS environment. The role involves extensive hands-on work with Kubernetes internals, automation, and performance optimization. Candidates should have proven expertise in Kubernetes management, software development with Go, and building observability stacks.

Benefits

Medical insurance

Vision insurance

401(k)

Qualifications

Expertise in managing on-prem Kubernetes clusters.
Experience with Kubernetes internals and Go programming.
Background in Linux engineering.

Responsibilities

Architect and manage on-prem Kubernetes clusters.
Build observability stacks using monitoring tools.
Implement Kubernetes-native traffic enforcement.

Skills

Production-grade Kubernetes expertise

Linux engineering

Observability stacks

Python

Bash

CNI plugins

OpenStack

Service mesh

Location: Hybrid - McKinney, TX 75070

Type: Full-Time, Direct Hire. Applicants must be authorized to work in the United States without the need for current or future visa sponsorship. At this time, we are unable to consider candidates who require sponsorship.

We’re looking for a Senior Site Reliability Engineer (SRE) to architect, build, and own a Kubernetes-based infrastructure platform powering a high-performance, real-time, multi-tenant SaaS environment. This role is centered around on-premises Kubernetes in data center environments, with a strong focus on traffic enforcement, observability, and reliability at scale.

You’ll join a forward-thinking engineering team responsible for developing control systems, building traffic-routing logic, and maintaining a resilient cloud-native platform from the metal up. You’ll be hands-on in Kubernetes internals, networking, and automation—playing a critical role in ensuring reliability, performance, and visibility across the platform.

What You'll Do

Own the architecture, operations, and lifecycle of on-prem Kubernetes clusters in high-scale production environments.
Build and maintain observability stacks using tools like Prometheus, Grafana, OpenTelemetry, Jaeger, and Loki to provide actionable insight and proactive alerting.
Implement and optimize Kubernetes-native traffic enforcement across multi-tenant SaaS workloads, including per-tenant fairness and routing enforcement.
Work directly in Go to extend Kubernetes functionality via CRDs, operators, or controllers.
Operate, maintain, and scale OpenStack clusters across global data centers.
Lead SRE best practices: from automated remediation and capacity planning to disaster recovery and performance optimization.
Design and manage advanced CNI configurations (Cilium, overlay networks, etc.) and modern SDN/NFV patterns.
Collaborate with software engineering teams to ensure seamless integration between infrastructure and application layers.

Must-Have Qualifications

Production-grade Kubernetes expertise—you’ve architected, deployed, and managed your own clusters on-prem (not just EKS/GKE/AKS).
Hands-on experience with Kubernetes internals, including CRDs, controllers, and cluster APIs.
Fluency in Go, with experience developing Kubernetes operators or integrations.
Strong Linux engineering background, with deep command-line and troubleshooting expertise.
Proven success building observability stacks (Prometheus, Grafana, OpenTelemetry, etc.).
Experience with CNI plugins (e.g., Cilium, Calico, overlay networks) and container networking.
Working knowledge of Python and Bash for scripting and automation.
Experience with OpenStack (Nova, Neutron) and Ceph storage, and their integration with Kubernetes environments.
Familiarity with Helm, Terraform, ArgoCD, or Flux for Kubernetes GitOps and infrastructure automation.
Exposure to service mesh tools like Istio or Linkerd.

Seniority level
Mid-Senior level

Employment type
Full-time

Job function
Information Technology

Industries
IT Services and IT Consulting, IT System Custom Software Development, and Computer and Network Security
