Senior Site Reliability Engineer - Neocloud Provider

Hamilton Barnes Associates Limited

Remote

EUR 110,000 - 130,000

Full-time

Posted 14 days ago

Summary

A leading AI cloud provider in Germany is seeking a Senior Site Reliability Engineer to architect and maintain fault-tolerant distributed systems for high-performance GPU workloads. The ideal candidate has strong Linux debugging skills, proficiency in Terraform and Kubernetes, and experience with Slurm job monitoring. This role offers a competitive salary of up to €130,000 gross per year, along with a bonus scheme and company share scheme.

Benefits

Bonus Scheme

Company share scheme

Qualifications

Strong Linux debugging expertise, including network and system-call tracing.
Proficiency with Terraform and Kubernetes, including network policies and scheduling.
Experience with Slurm job monitoring and core configuration.
Solid Python or Go skills, focusing on async/error handling.
Ability to automate workflows and troubleshoot distributed systems.

Tasks

Architect and maintain reliable, fault-tolerant distributed systems.
Build and automate deployment, monitoring, capacity planning, and incident-response workflows.
Develop, optimise, and maintain CI/CD pipelines.
Drive incident response and improve system observability.
Partner with teams to optimise service performance and support regional expansion.

Skills

Linux debugging expertise

Proficiency with Terraform

Proficiency with Kubernetes

Experience with Slurm job monitoring

Solid Python skills

Solid Go skills

Automating workflows and troubleshooting

Do you want to join a leading next-generation AI cloud provider as a Senior Site Reliability Engineer?

You will be joining a Neocloud that is building one of the most advanced GPU and high-performance computing platforms in Europe.

The role offers the chance to help design and maintain the reliability, scale and performance of a growing cloud platform with real engineering challenges.

You will collaborate with highly skilled teams across software, hardware, networking & AI infrastructure, with the autonomy to influence technical direction and build systems that support large-scale compute workloads.

If you are interested in this opportunity and want to learn more, get in touch today.

Responsibilities

Architect and maintain reliable, fault-tolerant, large-scale distributed systems for high-performance GPU and compute workloads.
Build and automate deployment, failover, monitoring, capacity planning, and incident-response workflows.
Develop, optimise, and maintain CI/CD pipelines to enable safe, rapid, and repeatable software delivery.
Drive incident response and root-cause analysis while improving system observability, performance, and long-term stability.
Partner with backend, hardware, and networking teams to optimise service performance, support regional expansion, scale compute clusters, and participate in on-call rotations.

Required Skills & Experience

Strong Linux debugging expertise, including network and system-call tracing.
Proficiency with Terraform and Kubernetes (network policies, scheduling, taints/tolerations).
Experience with Slurm job monitoring and core configuration.
Solid Python or Go skills, covering async/error handling, environment management, and common system/HTTP tooling.
Ability to automate workflows and troubleshoot distributed systems using CLI tools, logs, and scripting.

Salary & Benefits

Up to €130,000 gross per year
Bonus scheme
Company share scheme
