SRE Manager

Fortinet

Sunnyvale, CA

On-site

USD 120,000 - 160,000

Full time

8 days ago

Job summary

An established industry player is seeking a highly skilled Site Reliability Engineering (SRE) Manager to lead a dynamic team focused on building scalable and secure infrastructure. This role involves enhancing system reliability, automating operations, and driving best practices in infrastructure management. The ideal candidate will possess a strong background in software engineering and cloud infrastructure, with a passion for problem-solving and operational excellence. Join a collaborative environment where you can make a significant impact on the reliability and performance of systems serving thousands of B2B customers worldwide.

Qualifications

7+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering roles.
Extensive experience with Kubernetes and cloud-managed services.
Strong cross-team communication skills and leadership experience.

Responsibilities

Lead and mentor a team of Site Reliability Engineers.
Develop strategies to improve system reliability and automation.
Manage cloud-based infrastructure and ensure best practices.

Skills

Site Reliability Engineering

DevOps

Software Engineering

Infrastructure as Code

Kubernetes

AWS

Python

Golang

Observability Tools

Education

Bachelor's or Master's degree in Computer Science, Engineering, or a related field

Tools

Terraform

Atlantis

ArgoCD

Prometheus

Grafana

Splunk

ELK Stack

At Fortinet, we strive to provide a supportive, collaborative environment where people are empowered to do the best work of their careers.

Our team members enjoy solving complex problems and obsess over getting the details right. We love what we do and are proud of our work to secure clouds and container environments for thousands of B2B customers worldwide.

We are looking for a highly skilled Site Reliability Engineering (SRE) Manager to lead our SRE team in building scalable, reliable, and secure infrastructure that ensures the highest levels of availability and performance.

Job Summary:

As an SRE Manager, you will be responsible for leading a team of Site Reliability Engineers who design, build, and maintain resilient systems. You will play a critical role in enhancing system reliability, improving incident response, automating operations, and driving best practices in infrastructure management. The ideal candidate will have a strong background in software engineering, cloud infrastructure, and operational excellence.

Key Responsibilities:

Lead, mentor, and grow a team of Site Reliability Engineers.

Develop and implement strategies to improve system reliability, observability, and automation.

Establish and maintain SLIs, SLOs, and SLAs to ensure high availability and performance.

Drive incident response, root cause analysis, and postmortem processes.

Collaborate with software engineering teams to improve application architecture and resiliency.

Manage cloud-based infrastructure (AWS) and ensure best practices for security and scalability.

Collaborate with cross-functional teams, including developers, security, and product teams.

Stay updated with industry trends and introduce new tools and methodologies to enhance reliability and efficiency.

Required Qualifications:

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

7+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering roles.

3+ years of experience in a leadership or managerial role within an SRE or DevOps team.

Extensive experience with Infrastructure as Code (Terraform, etc.), as well as supporting tooling (Atlantis, ArgoCD, etc.).

Extensive experience with Kubernetes and supporting tooling (Helm, operators, etc.).

Extensive experience with a variety of cloud-managed services and providers.
- AWS: EKS, EC2, S3, RDS, Secrets Manager, etc.

Experience building production-quality cloud infrastructure that enables reliable and rapid deployment of microservices with effective monitoring and built-in high availability and/or fault tolerance.

Strong cross-team communication skills.

Experience with the building blocks of large-scale systems, including load balancing, distributed/cloud computing, containers, instrumentation, and monitoring.

Knowledge of cloud networking, including VPC configuration and cross-cloud connectivity.

Familiarity with one or more programming languages (Python, Golang, etc.).

Deep understanding of observability tools (Prometheus, Grafana, Splunk, ELK Stack).

Excellent communication and collaboration abilities.
