Staff Site Reliability Engineer

Crisis Text Line

United States

Remote

USD 120,000 - 160,000

Full time

Yesterday

Job summary

Crisis Text Line is seeking a Staff Site Reliability Engineer to enhance platform reliability and support engineering teams through infrastructure management and automation. In this fully remote role, you will help lead a team, develop scalable AWS solutions, and foster a collaborative culture, contributing directly to the mission of supporting texters and volunteers.

Benefits

20 paid holidays including federal holidays
Flexible paid time off
Medical, dental, and vision benefits at no cost to employees
403B retirement plan with 3% contribution
12 weeks paid parental leave
Student loan repayment after 2 years
Monthly stipends for mental health and internet service

Qualifications

  • Extensive experience in infrastructure management and Site Reliability Engineering.
  • Proficient in AWS and infrastructure as code tools like Terraform.
  • Strong scripting and automation skills with a focus on DevOps practices.

Responsibilities

  • Design, implement, and maintain highly available AWS infrastructure.
  • Lead and mentor a team of SREs, ensuring best practices and continuous improvement.
  • Automate tasks to enhance engineer productivity and streamline workflows.

Skills

AWS
Infrastructure as Code
DevOps
Containerization
Automation
Observability
Scripting
Security Best Practices
Problem-solving
Communication

Education

Bachelor's degree in Computer Science or related field
Master's degree (preferred)

Tools

Terraform
CloudFormation
Docker
Kubernetes
GitHub Actions
Datadog

Job description

This is a remote position.

Role

Job Summary: As a Staff Site Reliability Engineer (SRE), reporting to the Senior Engineering Manager of SRE/Infrastructure, you will be a key technical leader ensuring the reliability, scalability, and security of our platform. In this role, you will play a strategic part in architecting, building, and maintaining the tooling that empowers our software engineering teams and managing the infrastructure that supports our staff and volunteers in delivering the Crisis Text Line service. You will collaborate closely with developers to drive performance optimization, implement best practices, and ensure a secure environment. With a significant focus on enhancing engineer productivity through automation and streamlined workflows, you’ll directly contribute to our mission of supporting texters, volunteers, and staff. This position requires extensive experience in infrastructure management, automation, and Site Reliability Engineering (SRE) practices.

Key Responsibilities

  • Help lead and mentor a team of 5 SREs, fostering a collaborative and innovative work environment.
  • Work closely with the 3 staff in TechOps/Security to enforce security best practices across the infrastructure and development processes.
  • Design, implement, and maintain our highly available and scalable AWS infrastructure that powers our service.
  • Collaborate with developers to optimize application performance and reliability.
  • Develop and maintain monitoring, logging, and alerting systems to ensure system health and performance.
  • Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
  • Respond to and resolve incidents, minimizing downtime and ensuring quick recovery.
  • Support and encourage a diversity of backgrounds, voices, and perspectives on the engineering team.
  • Proactively communicate expectations, progress, and issues to engineers, product managers, and other colleagues with clarity and kindness, delivering and receiving feedback respectfully.
  • Spread knowledge, provide mentorship, and promote technical best practices.
  • Learn both independently and from your colleagues, stretch yourself, and grow as an engineer and teammate.
  • Write and review high-quality, easy-to-read, and testable code that follows best practices.
  • Manage time successfully by focusing on priorities, delivering on deadlines, and asking for help when stuck.
  • Provide engineering input and estimate work during both refinement and architecture design.
  • Participate in retrospectives and post-mortems to improve our processes and operations.
  • Conduct regular security audits and vulnerability assessments, addressing any identified issues.
  • Stay up-to-date with industry trends and emerging technologies, recommending and implementing improvements as needed.

Qualifications

  • Bachelor's degree in Computer Science, Engineering, or related field (Master's degree preferred).
  • Proven experience as a Staff SRE or in a similar SRE role, with experience in observability and chaos engineering and a strong focus on infrastructure and DevOps in a software delivery capacity.
  • Experience maintaining the reliability of online SaaS/PaaS with a 24/7 schedule.
  • Proficiency in AWS and infrastructure as code (e.g., Terraform, CloudFormation).
  • Strong scripting and automation skills and in-depth knowledge of containerization and orchestration (e.g., Docker, Kubernetes).
  • Proven experience in implementing CI/CD pipelines and tools (GitHub Actions) and observability tools (Datadog).
  • A commitment to ethical practices, data privacy, and security.
  • Solid understanding of network protocols, security principles, and best practices.
  • Excellent problem-solving skills and the ability to work under pressure.
  • Ability to learn quickly and manage your time successfully by focusing on priorities, delivering on deadlines, and asking for help when needed.
  • Strong communication skills, with the ability to collaborate effectively with cross-functional teams.
  • Demonstrates an understanding of essential computer science principles and how to apply them to solve problems. This includes basic data structures, control structures, and functions.

Preferred Qualifications

  • Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
  • Experience implementing Failure Injection / Chaos Engineering practices.
  • Cloud Solution Architect certifications or completed training (e.g., AWS Cloud Practitioner Essentials and/or AWS Certified Solutions Architect - Associate), or equivalents for GCP or Azure.
  • Strong experience with AWS Solution Architecture across various web applications and APIs, Databricks, and AI/ML workloads.
  • Knowledge of compliance and regulatory standards (e.g., GDPR, HIPAA, ISO 27001, SOC2, etc.).
  • Experience in a non-profit or mission-driven organization.

Benefits

Crisis Text Line employee benefits are thoughtfully designed using an equity lens, acknowledging that we are all unique human beings with individual life circumstances that require flexibility and support.

Benefits Include

  • 20 paid holidays including:
    • Federal holidays like Juneteenth and Labor Day
    • Election day
    • Holiday break from Dec 24 through January 1
    • 2 renewal days
    • 2 floating holidays
  • Flexible paid time off, including:
    • 15 vacation days
    • 3 personal days
    • 7 sick days
  • Medical, dental, and vision benefits for the staff member and family at no cost to the employee
  • 403B retirement plan (the nonprofit equivalent of a 401K): 3% contribution by Crisis Text Line to support building financial wellness, regardless of personal contribution
  • 12 weeks paid parental leave (after 6 months of employment)
  • Student loan repayment (after 2 years of continuous full time service)
  • Stipends/Allowances
    • Mental health (Monthly)
    • Internet Service (Monthly)
    • Student Loan repayment (Monthly, after 2 years of service)
    • Professional Development (Annually)
(Benefits are only for US-based employees. International benefits may differ.)

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology and Engineering
Industries: Non-profit Organizations

