
Senior SRE (Site Reliability Engineer) - Remote

SailPoint

United States

Remote

USD 100,000 - 140,000

Full time

Today

Job summary

SailPoint, a leader in identity security, seeks a Senior Site Reliability Engineer to join their Identity Security Cloud team. This role involves enhancing system reliability, scalability, and performance, working closely with development teams to influence design and architecture. Ideal candidates will have 5+ years of SRE experience, familiarity with cloud technologies, and strong problem-solving skills.

Benefits

Equal Opportunity Employer
Remote work options available
Alternative application methods for individuals with disabilities

Qualifications

  • 5+ years of experience in Site Reliability Engineering.
  • Strong understanding of cloud platforms (AWS, GCP, or Azure).
  • Proficient in at least one scripting language (e.g., Python, Bash, Go).

Responsibilities

  • Ensure system reliability and performance through operational metrics.
  • Collaborate with teams to plan capacity and scale services.
  • Implement automation to improve operational efficiency.

Skills

Problem Solving
Collaboration
Troubleshooting
Automation
Reliability Engineering

Education

Bachelor's Degree in Computer Science

Tools

Terraform
Docker
Kubernetes
Prometheus
Grafana

Job description

SailPoint is the leader in identity security for the cloud enterprise. Our identity security solutions secure and enable thousands of companies worldwide, giving our customers unmatched visibility into the entirety of their digital workforce, ensuring workers have the right access to do their job – no more, no less.

We are seeking a highly motivated and experienced Senior Site Reliability Engineer (SRE) to join an Identity Security Cloud software development team. This is an embedded role, meaning you will be a full member of the development team, working closely with software engineers, infrastructure platform services, engineering managers, and other stakeholders to ensure the reliability, scalability, and performance of teams’ services. You will be responsible for leveraging the infrastructure, tooling, and processes that support our applications in dev and production. This role offers a unique opportunity to directly influence the design and architecture of our systems from a reliability and performance perspective.

Responsibilities:

Work with development teams and service owners at the intersection of development and operations to solve performance issues and ensure system scalability.

  • Reliability Engineering: Design, develop, and implement solutions to improve the reliability, availability, performance, and scalability of our systems. Work with technical leaders and infrastructure platform services to develop alerts and dashboards.

  • Operational Excellence: Own and improve key operational metrics (SLIs, SLOs, Error Budgets, monitoring and alerting) for team-related services and drive continuous improvement through post-incident reviews and blameless postmortems of non-functional issues. Develop and maintain comprehensive monitoring and alerting to proactively identify and resolve issues. Create and maintain dashboards, conducting ongoing reviews to identify and close gaps. Improve operational processes and team practices by working with technical leaders and NOC teams.

  • Capacity Planning: Collaborate with technical leads, DevOps/SRE and infra teams to forecast capacity needs and ensure sufficient resources are available to support growth.

  • Performance Optimization: Collaborate with performance SMEs to identify and address production performance bottlenecks through profiling, tuning, and optimization of services and infrastructure.

  • Automation: Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.

  • Collaboration: Work closely with Software, Performance and Test Engineers to influence system design and architecture for operability and reliability.

  • Documentation: Review and contribute to clear and concise documentation for systems, processes, runbooks, and procedures.

  • On-Call: Participate in a 24/7 on-call rotation to gain subject matter expertise in the domain.

  • Incident Management: Lead the incident postmortem efforts, working with the SMEs to ensure timely compilation of reports and to help drive completion of post-incident actions.

  • Troubleshooting skills: Excellent diagnostic and problem-solving skills, with the ability to analyze complex systems and data.

Qualifications:

  • Bachelor’s degree in computer science, a related field, or equivalent practical experience.

  • Proven 5+ years of SRE experience.

  • Strong understanding of SRE principles and practices.

  • Experience with cloud platforms (AWS, GCP, or Azure).

  • Proficiency in at least one scripting language (e.g., Python, Bash, Go).

  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, Honeycomb, OpenSearch).

  • Coding experience beyond simple scripts in a programming language such as Go, Java, or Python, to support reliability engineering work and to evaluate and identify where service code can be optimized for reliability.

  • Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).

  • Understanding of network protocols and security best practices.

  • Familiarity with DevOps culture and practices and experience with CI/CD toolchains (Jenkins, ArgoCD, SpaceLift).

  • Experience with Incident Response tools and processes (PagerDuty).

  • Experience with Infrastructure as Code (Terraform, Helm).

  • Strong problem-solving and troubleshooting skills.

  • Excellent communication and collaboration skills.

  • Ability to work independently and as part of a team to achieve the SRE agenda.

Preferred Qualifications:

  • Technology experience: Kafka, relational databases, performance tuning (JVM, Go)

  • Experience with Grafana k6 (continuous performance testing tool)

In the first 30 days you will:

  • Meet the team and understand its mission and vision

  • Gain clarity on various roles and expectations

  • Complete development environment setup

  • Read guides and documentation, and complete mandatory training

  • Learn company processes and benefits

By 6 months you should:

  • Understand team goals and OKRs for the quarter and beyond

  • Complete initial analysis and implementation of SRE team assignments

  • Be comfortable with tools, systems and processes used on a day-to-day basis

  • Complete project work, both supervised and unsupervised

SailPoint is an equal opportunity employer and we welcome all qualified candidates to apply to join our team. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other category protected by applicable law.

Alternative methods of applying for employment are available to individuals unable to submit an application through this site because of a disability. Contact hr@sailpoint.com or mail to 11120 Four Points Dr, Suite 100, Austin, TX 78726, to discuss reasonable accommodations.
