Backblaze is the object storage leader in the open cloud movement, fueling customer success with cloud storage built purposefully to unlock budgets, unburden administrators, and unleash innovators. Together with our partners, we’re helping customers break free from the restrictive, overpriced legacy solutions that hold them back and blaze forward with the full power of the open cloud in their hands.
Founded in 2007, we scaled the business with less than $3 million in outside funding until 2021, when we completed a traditional IPO on the Nasdaq stock exchange. Today, Backblaze generates over $125M in revenue and is the leading specialized storage cloud, managing over three billion gigabytes of data storage for 500K+ customers in 175+ countries, including businesses, developers, IT professionals, and individuals.
We are seeking a seasoned Sr. Manager, Site Reliability Engineering (SRE) to join us and lead a global team of engineers responsible for the performance, availability, and reliability of our distributed services and infrastructure. This leader will drive SRE strategy, implement operational excellence frameworks, and partner with engineering and product teams to ensure customer-facing platforms meet or exceed SLAs.
The Sr. Manager, SRE will balance hands-on technical leadership with strategic management, guiding the team in automation, observability, incident management, and service scalability while mentoring future leaders.
Key Responsibilities:
Leadership & Strategy
- Build, lead, and mentor a team of SREs across multiple regions and time zones.
- Define the long-term vision and roadmap for SRE, aligning with organizational objectives.
- Partner with product and engineering to ensure reliability is embedded in design, development, and operations.
Operational Excellence
- Own the end-to-end reliability of critical customer-facing services.
- Establish and maintain SLOs, SLIs, and error budgets to measure and enforce service quality.
- Drive root cause analysis and problem management for major incidents, ensuring long-term fixes are prioritized.
- Champion adoption of ITIL/OSS processes (incident, change, problem, and capacity management).
Automation & Tooling
- Expand automation in deployment, monitoring, testing, and incident response to reduce toil.
- Oversee observability platforms (e.g., Catchpoint, Grafana, Moogsoft/BigPanda, Prometheus, Datadog).
- Ensure robust configuration, capacity, and change management practices.
Cross-Functional Collaboration
- Partner with Network Engineering, DevOps, NOC, and Product Engineering on scalable, resilient architecture.
- Support business continuity, disaster recovery, and compliance requirements.
- Engage with vendors and service providers to manage SLAs and performance outcomes.
People Development
- Hire, coach, and develop engineers and managers, creating strong career paths within SRE.
- Foster a culture of reliability, accountability, and continuous improvement.
- Lead succession planning and leadership pipeline development.
Qualifications:
Education & Experience
- Bachelor’s degree in Computer Science, Engineering, or a related field (Master’s preferred).
- 10+ years in infrastructure, reliability, or operations engineering roles.
- 5+ years in people leadership with experience managing managers and global teams.
Technical Skills
- Deep expertise in Linux operating systems (administration, performance tuning, troubleshooting, security hardening).
- Strong knowledge of distributed systems, cloud platforms (AWS, GCP, Azure, private cloud), and networking fundamentals.
- Solid background in observability, monitoring, logging, and alerting frameworks.
- Proficiency with automation (Python, Go, Ansible, Terraform, CI/CD pipelines).
- Familiarity with containers (Kubernetes, Docker) and microservices architectures.
- Strong understanding of ITIL/OSS frameworks, SLO/error budget practices, and incident management at scale.
Leadership Skills
- Proven ability to manage large-scale, high-availability environments.
- Strong communication skills with executive presence; able to translate technical topics into business outcomes.
- Demonstrated success in building and maturing high-performing SRE/operations teams.
Preferred Attributes
- Experience in a service provider, CDN, or large-scale SaaS environment.
- Familiarity with compliance and regulatory frameworks (SOC 2, ISO 27001, GDPR).
- Track record of driving cultural transformation toward reliability-first principles.
What We Offer
- Competitive compensation and benefits package.
- Opportunity to shape the future of global reliability engineering at scale.
- Collaborative culture with strong support for innovation and career growth.
We are committed to fostering a workforce where all employees feel a sense of belonging regardless of race, ethnicity, nationality, gender, sexual orientation, age, religion, socio-economic status, ability, veteran status, and education. We believe that our dedication to cultivating a diverse workspace not only allows us to better serve our customers in over 175 countries, but further reinforces our commitment to doing the right thing. We are proud to be an Equal Opportunity Employer.
The base pay range for this position is $175,000 - $215,000.