Stuart (DPD Group) is a sustainable last-mile logistics company connecting retailers and e-merchants to a fleet of geolocalised couriers across several countries in Europe.
Our Mission
- We are an impact-driven company aiming to build a more sustainable future for logistics: shared, efficient, and reliable. We strive to set new standards for urban deliveries that address environmental and social challenges while providing a premium delivery experience that is fast, flexible, and convenient.
Our motto: “Make every delivery a moment all of us can truly celebrate!”
Over 3000 leading brands across Restaurants, Grocery, Retail & Luxury, eCommerce, and Professional Services partner with us to deliver goods seamlessly. Stuart is a diverse and inclusive company with 700+ employees from 90+ nationalities working across France, Italy, Poland, Portugal, Spain, and the U.K.
With the surge in home delivery services, now is the perfect time for us to make a significant impact. You can help us realize this vision.
We are looking for a Lead Site Reliability Engineer to be the technical leader of our SRE team and to strengthen our platform’s robustness, failure handling, and early issue detection through automation, proper alarming, and chaos engineering.
The SRE mission is to maximize platform reliability by reducing the number of incidents and their severity. This involves monitoring services effectively, setting meaningful alarm thresholds, and automating remediation tasks. Reliability is further strengthened by introducing controlled errors (chaos engineering) and testing disaster recovery scenarios. SREs serve as stewards of reliability, providing the necessary technical and documentation tools for other engineering teams.
The SRE team is newly formed at Stuart, offering you the opportunity to influence its growth. You will be part of the Infrastructure department’s Reliability area, alongside the Engineering Support team, Cloud Engineering, Security, and IT.
What will I be doing?
- Leading the team as the go-to expert on software reliability.
- Participating in hiring and community talks, defining team processes, and fostering team culture and growth.
- Helping engineering teams build reliable, observable, and high-performance products.
- Driving and assisting other teams in setting and tracking SLOs and SLAs via SLIs.
- Designing, implementing, and guiding adoption of Stuart’s observability stack.
- Contributing to system reliability and performance improvements.
- Writing and automating playbooks for alarms to minimize manual intervention.
- Documenting best practices and sharing knowledge.
- Collaborating on incident management with the Engineering Support team.
- Leading post-mortem analyses and follow-up actions.
- Advancing chaos engineering initiatives.
What do we need from you?
- 5+ years of experience in a similar role within mission-critical, always-up services.
- Background in Systems or Software Engineering.
- Passion for automation to eliminate repetitive tasks.
- Proven experience leading complex projects.
- Expertise in troubleshooting Linux and networking issues.
- Experience with complex Terraform codebases; bonus if you have written a provider.
- Strong knowledge of AWS, EKS, Kubernetes, and cloud environments.
- Experience with chaos engineering practices.
- Enjoyment in teaching, documenting, and sharing best practices.
- Proactive attitude to identify and resolve issues.
- Fluency in English, both written and spoken.
We understand you may not meet every criterion, but sharing this gives you an idea of our ideal candidate profile.
The stuff you wanna know
- Family-friendly work-life balance with remote work and flexible hours.
- Option to work remotely anywhere in Spain.
- Ticket Restaurant benefit (€11 daily).
- Unlimited Udemy access for learning and development.
- Stuart Academy with regular workshops and classes.