
(AP686) - Lead SRE Engineer

Stuart

Alicante

Remote

EUR 60.000 - 90.000

Full-time

Posted 3 days ago

Job summary

A leading logistics company is seeking a Lead Site Reliability Engineer to build a robust SRE team focused on maximizing platform reliability. The role involves guiding technical operations, collaborating on observability, and promoting chaos engineering practices. The ideal candidate has more than 5 years of experience in mission-critical environments and a strong background in software engineering. The position can be worked remotely from anywhere in Spain.

Benefits

Family-friendly work-life balance
Remote work and flexible hours
Ticket Restaurant (€11 per day)
Unlimited Udemy access
Workshops through Stuart Academy

Experience

  • 5+ years of experience in a similar role dealing with mission-critical services.
  • Experience leading complex projects and teams.
  • Fluency in English, written and spoken.

Responsibilities

  • Lead the SRE team and manage software reliability issues.
  • Participate in hiring and team culture development.
  • Define and track SLOs and SLAs while driving observability.
  • Contribute to incident management and chaos engineering practices.

Skills

Cloud environments
Kubernetes
Linux troubleshooting
Terraform
Chaos engineering
Automation
Communication skills
Proactive problem-solving

Education

Background in Systems or Software Engineering

Job description

Stuart (DPD Group) is a sustainable last-mile logistics company that connects retailers and e-merchants with a fleet of geolocated couriers across Europe.

Our Mission

We aim to build the future of logistics for a more sustainable world: shared, efficient, and reliable. We are committed to creating a new standard for urban deliveries that meets environmental and social challenges while offering a premium delivery experience that blends speed, flexibility, and convenience.

Our motto: "Make every delivery a moment all of us can truly celebrate!" More than 3,000 leading brands partner with us across Restaurants, Grocery, Retail & Luxury, eCommerce, and Professional Services to deliver all kinds of goods at the tap of a button. Stuart is a diverse and inclusive company with 700+ employees of 90+ nationalities working across France, Italy, Poland, Portugal, Spain, and the U.K.

It’s the right moment to make an impact on millions of people as home delivery services hit record highs. You can help us fulfill our vision.

We are looking for a Lead Site Reliability Engineer to be a technical leader for our SRE team. You will guide the team technically, helping the platform become more robust, handle failures gracefully, and detect issues early through automation, proper alarming, and chaos engineering.

The SRE mission is to maximize platform reliability by reducing the number and severity of incidents. This involves effective monitoring with meaningful alert thresholds, automated remediation, and injecting controlled failures to test disaster recovery scenarios. SREs act as stewards of reliability, providing tools and documentation for other engineering teams.

The SRE team is new at Stuart, so you will have the opportunity to shape its growth. You will be part of the Infrastructure department within the Reliability area, alongside the Engineering Support, Cloud Engineering, Security, and IT teams.

What will I be doing?

  • Leading the team technically and serving as the go-to person for software reliability issues.
  • Participating in departmental initiatives such as hiring, community talks, and process definition to foster team culture and growth.
  • Helping engineering teams build reliable, observable, and performant products.
  • Driving the definition and tracking of SLOs and SLAs via SLIs.
  • Designing, implementing, and guiding adoption of Stuart’s observability stack.
  • Contributing to system reliability and performance improvements.
  • Creating playbooks for alarms and automating responses to minimize manual intervention.
  • Documenting best practices and sharing knowledge.
  • Collaborating on incident management with the Engineering Support team.
  • Leading post-mortem analyses and follow-up actions.
  • Advancing chaos engineering practices within the organization.

What do we need from you?

  • 5+ years of experience in a similar role within mission-critical, always-up services.
  • Background in Systems or Software Engineering.
  • Passion for automation and reducing repetitive tasks.
  • Proven experience leading complex projects.
  • Expertise in troubleshooting Linux and networking issues.
  • Experience with complex Terraform codebases; bonus if you’ve written providers.
  • Strong knowledge of cloud environments and Kubernetes, especially AWS & EKS.
  • Experience with chaos engineering practices.
  • Excellent teaching, documentation, and communication skills.
  • A proactive attitude toward identifying and resolving issues.
  • Fluent in English, both written and spoken.

Don’t worry if you don’t meet every item; what matters is the core experience and mindset.

The stuff you wanna know

  • Family-friendly work-life balance with remote work and flexible hours.
  • Option to work remotely anywhere in Spain.
  • Ticket Restaurant (€11 daily), unlimited Udemy access, Stuart Academy workshops, and more.

The original listing can be found on Kit Empleo: J-18808-Ljbffr
