Senior Site Reliability Engineer

Rackspace Technology

United States

Remote

USD 80,000 - 130,000

Full time

4 days ago

Job summary

An innovative technology firm is seeking talented individuals to join their Professional Services Center of Excellence. This role focuses on solving complex business problems by enhancing application performance monitoring. You will work with cutting-edge tools, including Datadog and New Relic, to create exceptional customer experiences. The position requires collaboration with development teams to implement robust observability solutions, ensuring system reliability and performance. Join a company recognized for its commitment to diversity and employee satisfaction, where your contributions will shape the future of technology and customer success.

Qualifications

3+ years experience in AWS EKS and Azure AKS infrastructure.
Scripting experience with Python, Go, Bash, and AWS CLI tools.

Responsibilities

Implement Observability solutions and maintain scalable systems.
Develop monitoring tools, alerts, and dashboards for system health.

Skills

AWS EKS

Azure AKS

Terraform

Kafka

SaaS environments

SRE

Prometheus

Grafana

Datadog

GitOps

Python

Bash

AWS CLI

Disaster recovery strategies

Tools

Kubernetes

ELK

Rackspace is building up its Professional Services Center of Excellence on Application Performance Monitoring Suites.

If you enjoy solving complex business problems and want to help build the next generation of modern applications for our customers, join us! You will help customers understand the connections between application performance, user experience, and business outcomes, and create amazing customer experiences through modern approaches to SRE and Observability using Datadog, New Relic, AppDynamics, or Dynatrace.

Rackspace enables businesses to accelerate digital transformation through our innovative data, integration solutions, and tools that help them fix problems quickly, maintain complex systems, and improve code. We believe Datadog, AppDynamics, or New Relic will be significant contributors to our work, and we seek talented, creative, and thoughtful individuals to shape Observability Engineering for our customers.

You Will:

Work with customers and implement Observability solutions
Build and maintain scalable systems and robust automation supporting engineering goals
Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
Proactively gather and analyze metric and log data to perform anomaly detection, performance tuning, capacity planning, and fault isolation
Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security, and performance standards
Document and share solutions collaboratively with team members
Maintain a deep understanding of the customer’s business and technical environment
Identify performance bottlenecks and anomalous system behavior, and resolve the root causes of service issues

You Need to Have:

3+ years of experience designing, building, and maintaining AWS EKS and Azure AKS infrastructure with Terraform
3+ years' experience with Kafka in large-scale environments with hundreds of terabytes to petabytes of data from numerous endpoints
Experience designing, building, and maintaining SaaS environments for 3+ years
3+ years as an SRE within a large team, with solid experience with Prometheus, Grafana, Datadog, ELK, etc.
3+ years building and running Kubernetes clusters with expertise in scaling, operators, and troubleshooting
Experience with observability (monitoring, logging, tracing, metrics) for 3+ years
Experience with GitOps CI/CD processes for 3+ years
Scripting experience with Python, Go, Bash, and AWS CLI tools for 3+ years
Knowledge of security operations, policies, infrastructure, key management, and encryption at rest and in transit for 3+ years
Experience implementing and maintaining disaster recovery strategies (MySQL, Zookeeper, etc.) for 3+ years

About Rackspace Technology

We are multicloud solutions experts, combining our expertise with leading technologies across applications, data, and security to deliver end-to-end solutions. We have a proven record of advising customers, designing scalable solutions, and optimizing returns. Named a best place to work repeatedly by Fortune, Forbes, and Glassdoor, we attract and develop world-class talent. Join us to embrace technology, empower customers, and deliver the future.
