
Senior Site Reliability Engineer

Avaloq AG

Zürich

On-site

CHF 100’000 - 130’000

Full-time

Posted today

Be among the first applicants

Summary

A leading technology firm in Switzerland is seeking a Senior Site Reliability Engineer to design and manage observability stacks, develop reliability automation, and ensure seamless operations across multi-cloud environments. The ideal candidate is experienced with tools like Prometheus and Grafana and possesses strong skills in incident response and disaster recovery. This role is vital for maintaining the reliability of cloud-native banking platforms.

Qualifications

Experience in designing observability stacks.
Strong background in reliability automation and self-healing systems.
Ability to define SLIs and SLOs for reliability improvements.

Responsibilities

Design, implement, and manage observability stacks.
Develop reliability automation to proactively detect issues.
Collaborate with teams on resilient architectures and load balancing.

Skills

Observability stacks (metrics, logs, traces)

Reliability automation

CI/CD pipelines

Disaster recovery solutions

Cloud-native platforms

Tools

Prometheus

Grafana

OpenTelemetry

AWS

Azure

GCP

Join our Technology R&D Lab as a Senior Site Reliability Engineer and help shape the operational foundation of a new generation of cloud-native, composable banking platforms. You will design and evolve the systems, automation, and practices that keep our SaaS products reliable, observable, and secure as we scale globally.

You will work closely with Platform, Security, and Product teams to embed reliability and performance into every layer of the stack, ensuring our ability to deploy rapidly and repeatedly so that innovation, performance and stability move in lockstep.

Your key tasks

Design, implement, and manage observability stacks (metrics, logs, traces) using tools like Prometheus, Grafana, and OpenTelemetry
Develop reliability automation and self-healing systems to detect and remediate issues proactively
Establish and monitor SLIs, SLOs, and error budgets to drive data-informed reliability improvements
Collaborate with engineering teams to design resilient architectures, load balancing, and capacity management strategies
Optimize CI/CD pipelines and deployment automation to reduce operational toil and risk
Lead incident response and post-mortem practices to drive continuous learning and system hardening
Define cloud-agnostic reliability standards supporting multi-cloud operations across AWS, Azure, and GCP
Design and implement robust disaster recovery solutions ensuring seamless 24x7 operations
Partner with Security and Compliance teams to ensure operational processes align with regulatory expectations (PCI DSS, SOC 2, GDPR)
Contribute to a culture of shared ownership, mentoring engineers on reliability, scalability, and observability practices
