Our client seeks a Senior Site Reliability Engineer to lead the design, implementation, and operation of a Kubernetes environment. The role focuses on platform reliability backed by deep observability, and on building a dynamic, policy-driven traffic enforcement layer within a Kubernetes-hosted SASE platform. The ideal candidate has built and managed Kubernetes at scale in on-prem environments.
Job Type: Direct Hire
This position is not eligible for visa sponsorship. No Corp-to-Corp arrangements or third-party agencies.
Responsibilities:
- Ensure the SASE platform delivers secure, performant network experiences, maintaining reliability, observability, and real-time traffic control while upholding engineering standards across a distributed infrastructure.
- Manage mission-critical Kubernetes environments supporting distributed SaaS infrastructure across multiple regions and failure domains.
- Architect and implement observability solutions across the full platform stack.
- Integrate OpenStack infrastructure beneath Kubernetes, with expertise in its networking, storage, and orchestration layers.
- Own platform stability, performance, and SLA attainment.
- Lead capacity planning and scaling to ensure consistent performance under variable loads.
- Establish incident management processes, on-call rotations, automated remediation, and post-incident analysis.
- Conduct disaster-recovery exercises and failure simulations to validate resilience.
- Build an observability stack offering actionable insights across platform, network, and application layers.
- Implement golden-signals monitoring with alert thresholds and dashboards for health and customer impact.
- Deploy distributed tracing to pinpoint performance bottlenecks and optimize critical paths.
- Manage the full lifecycle of Kubernetes clusters: upgrades, migrations, resiliency testing, and autoscaling.
- Automate cluster operations via GitOps using Helm, ArgoCD, Flux, and Terraform.
- Optimize Kubernetes resource allocation and apply FinOps practices for cost control.
- Design a Kubernetes-native traffic control plane for per-tenant and per-device session/bandwidth limits.
- Architect enforcement logic via custom CRDs/controllers integrated with Prometheus and OpenTelemetry.
- Implement and troubleshoot service-mesh technologies for advanced routing, security, and observability.
- Develop real-time response workflows using Prometheus rule evaluation, webhook triggers, and automated tc/iptables enforcement.
- Contribute to cloud-native architecture decisions and technical strategy.
- Mentor junior engineers and evangelize Kubernetes best practices.
Skills and Experience:
- Must have experience building and managing Kubernetes at scale on-prem or in datacenters.
- Expertise in building observability solutions spanning metrics, logs, and traces.
- Advanced mastery of Kubernetes controller patterns, custom resources, and operators (Kubebuilder preferred).
- Deep knowledge of CNI plugins (Cilium) and dynamic network-policy enforcement.
- Strong Linux networking fundamentals including tc, nftables, conntrack, iptables, and WireGuard.
- Proficiency with Prometheus, Grafana, OpenTelemetry, Jaeger, and Loki.
- Fluency in Go for Kubernetes controller development; familiarity with Python and Bash for scripting.
- Skilled in GitOps workflows using Helm, Terraform, ArgoCD, or Flux.
- Familiarity with network security and overlay architectures, including VXLAN, IPsec, and SDN routing.
- Experience implementing and troubleshooting service meshes (Istio, Linkerd).
- Hands-on experience with Kubernetes clusters running atop OpenStack (Nova, Neutron, Ceph).
- Knowledge of multi-cluster management tools (Fleet, Cluster API, Rancher).
- Competency with chaos engineering frameworks (Chaos Mesh, Litmus).
Seniority level: Not Applicable
Job function: Information Technology
Industries: IT Services and IT Consulting
Benefits (inferred from the description for this job):
- Medical insurance
- Vision insurance
- 401(k)