Senior Site Reliability Engineer

Avaloq AG

Zürich

On-site

CHF 100’000 - 130’000

Full-time

Today

Summary

A leading technology firm in Switzerland is seeking a Senior Site Reliability Engineer to design and manage observability stacks, develop reliability automation, and ensure seamless operations across multi-cloud environments. The ideal candidate is experienced with tools like Prometheus and Grafana and possesses strong skills in incident response and disaster recovery. This role is vital for maintaining the reliability of cloud-native banking platforms.

Qualifications

  • Experience in designing observability stacks.
  • Strong background in reliability automation and self-healing systems.
  • Ability to define SLIs and SLOs for reliability improvements.

Responsibilities

  • Design, implement, and manage observability stacks.
  • Develop reliability automation to proactively detect issues.
  • Collaborate with teams on resilient architectures and load balancing.

Skills

Observability stacks (metrics, logs, traces)
Reliability automation
CI/CD pipelines
Disaster recovery solutions
Cloud-native platforms

Tools

Prometheus
Grafana
OpenTelemetry
AWS
Azure
GCP

Job description

Join our Technology R&D Lab as a Senior Site Reliability Engineer and help shape the operational foundation of a new generation of cloud-native, composable banking platforms. You will design and evolve the systems, automation, and practices that keep our SaaS products reliable, observable, and secure as we scale globally.

You will work closely with Platform, Security, and Product teams to embed reliability and performance into every layer of the stack, so that we can deploy rapidly and repeatedly while innovation, performance, and stability move in lockstep.

Your key tasks
  • Design, implement, and manage observability stacks (metrics, logs, traces) using tools like Prometheus, Grafana, and OpenTelemetry
  • Develop reliability automation and self-healing systems to detect and remediate issues proactively
  • Establish and monitor SLIs, SLOs, and error budgets to drive data-informed reliability improvements
  • Collaborate with engineering teams to design resilient architectures, load balancing, and capacity management strategies
  • Optimize CI/CD pipelines and deployment automation to reduce operational toil and risk
  • Lead incident response and post-mortem practices to drive continuous learning and system hardening
  • Define cloud-agnostic reliability standards supporting multi-cloud operations across AWS, Azure, and GCP
  • Design and implement robust disaster recovery solutions ensuring seamless 24x7 operations
  • Partner with Security and Compliance teams to ensure operational processes align with regulatory expectations (PCI DSS, SOC 2, GDPR)
  • Contribute to a culture of shared ownership, mentoring engineers on reliability, scalability, and observability practices