Staff Site Reliability Engineer

Primer

United States

On-site

USD 180,000 - 230,000

Full time

26 days ago

Job summary

An established industry player is looking for a Staff Site Reliability Engineer to join their Infrastructure team. This role is crucial for designing, building, and maintaining systems that ensure reliability and performance. You will collaborate with cross-functional teams to define service level objectives and implement best practices for observability and incident management. The company values a diverse and inclusive team, offering competitive compensation, comprehensive benefits, and a flexible work environment. If you are passionate about operational excellence and want to make a significant impact, this opportunity is for you.

Benefits

Flexible vacation policy
Wellness Days
100% paid leave for parents
Full medical coverage
Dental coverage
Vision coverage
Fertility benefits
Mental health coverage
Gympass+ Membership
401(k) plan

Qualifications

  • 10+ years in production systems engineering or SRE roles.
  • Experience with observability tools and microservices architectures.

Responsibilities

  • Design and maintain fault-tolerant systems for continuous availability.
  • Drive automation and tooling to enhance operational efficiency.

Skills

Production systems engineering
Linux systems administration
Bash/Linux scripting
Observability tools
Microservices architectures
Kubernetes
CI/CD pipelines
Programming languages (Python, Go)
Cloud networking

Tools

Datadog
New Relic
Prometheus
ELK

Job description

Primer exists to make the world a safer place. We do this by providing trusted decision-ready AI to the world's most critical organizations. Our software enables leaders, operators, and analysts to better understand the changing world around us in real time and make informed decisions when the stakes are high. Primer has offices in San Francisco, Pasadena, CA and Arlington, VA. For more information, please visit https://primer.ai/

As a Staff Site Reliability Engineer, you will be a key member of our Infrastructure team, dedicated to designing, building, and maintaining fault-tolerant systems. You will collaborate with Product and Engineering teams to define and meet service level objectives (SLOs), implement and enhance observability, and contribute to the evolution of our Engineering practices. Your expertise in observability, capacity planning, automation, and incident management will be critical to sustaining our mission-critical operations and delivering a seamless experience for developers and customers alike.

Role Responsibilities - How You Will Make an Impact

  • Architect, Build, and Scale: Design our production solutions for continuous availability and scalability.
  • Uphold Reliability Standards: Define and review Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets. Work with engineering teams to ensure new services or features meet reliability and performance targets.
  • Drive Automation & Tooling: Develop tools, frameworks, and platforms to streamline repetitive tasks (e.g., monitoring, incident response). Write software that improves reliability and security (e.g., automated testing, canary deployments).
  • Incident Management & Postmortems: Participate in on-call (Livesite) rotations; lead and coordinate incident responses. Conduct thorough post-incident reviews, share learnings, and implement improvements to mitigate future occurrences.
  • Observability Best Practices: Develop and maintain best-in-class monitoring, logging, and alerting systems to provide actionable insights into the health of infrastructure and services. Advise teams on instrumentation best practices, ensuring comprehensive coverage of critical paths and dependencies.
  • Cross-Functional Collaboration: Work closely with product managers, software engineers, and security teams to deliver end-to-end solutions with reliability built in.

Technical Skills - Need to Have (Required)

  • 10+ years of experience in production systems engineering, SRE, or DevOps roles supporting large-scale, mission-critical platforms
  • 10+ years of experience with Linux systems administration and Bash/Linux scripting
  • 5+ years of experience with observability tools (monitoring, logging, tracing) such as Datadog, New Relic, Prometheus, ELK, or similar
  • 5+ years of experience with microservices architectures, Kubernetes, and CI/CD pipelines
  • 2+ years of experience in at least one programming language (e.g., Python, Go) with a strong focus on building automation and tooling
  • Solid understanding of cloud networking (e.g., mesh networking, TCP/IP, DNS, load balancing, VPNs)

Nice to Have:

  • Experience building or running distributed systems that include GPU-heavy workloads or LLMs
  • Strong knowledge of the AWS platform with experience in cost optimization and capacity planning
  • Track record of leading incident response efforts and conducting detailed postmortems
  • Security awareness and familiarity with secure coding, encryption, and compliance best practices
  • Excellent communication skills, with the ability to explain complex topics to both technical and non-technical audiences

The annual cash compensation range for this position is US $180,000 to US $230,000. Final compensation will be determined based on experience and skills and may vary from the range listed above.

Primer works closely with the U.S. defense and intelligence establishment. Any offer of employment is conditioned on an applicant or employee being able to meet any applicable government contract requirements. The company may rescind any offer of employment to an applicant or terminate an employee if the applicant or employee is unable to perform the functions of the position in compliance with applicable government contracts or if an applicant or employee makes a false attestation of compliance.

What We Offer

We are a Series D-funded company with investors including Addition, USIT, Lux Capital, Amplify Partners, Addition Capital, Bloomberg Beta, and others. We are intentional about building a diverse and inclusive team of subject matter experts to better advocate for the needs of our users.

We care a lot about our work and about the well-being of our team. We encourage everyone to work at a sustainable pace, and we offer a flexible vacation policy, Wellness Days, and 100% paid leave for parents of growing families.

We offer competitive compensation and comprehensive benefits. These include full medical, dental, and vision coverage, fertility benefits through Carrot, on-demand mental health coverage with Headspace Care+, Gympass+ Membership via Wellhub, One Medical Membership, a 401(k) plan, remote work stipends, and a monthly internet allowance.

Primer is proud to be an Equal Employment Opportunity and Affirmative Action employer. We do not discriminate based upon race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics. Please see the United States Department of Labor's EEO poster and EEO poster supplement for additional information.

If you need assistance or accommodation due to a disability, you may contact us at info@primer.com.

Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.
