Senior Site Reliability Engineer (SRE)

Hundertserver

Berlin

Remote

EUR 50,000 - 90,000

Full-time

Posted 16 days ago

Summary

An established industry player is seeking a Site Reliability Engineer to ensure the stable and secure operation of modern cloud platforms. This role emphasizes automation, incident response, and collaboration with various teams to enhance system performance and reliability. You'll be at the forefront of technology, implementing innovative solutions while enjoying flexible working hours in a remote-first culture. Join a dynamic team that values ownership, trust, and direct impact on customer experiences, and contribute to a culture of continuous improvement and excellence.

Benefits

Flexible working hours

Remote-first culture

Real development opportunities

Ownership and trust in work

Hands-on mentality

Qualifications

Expertise in Linux, Kubernetes, and cloud platforms is essential.
Proficiency in monitoring tools and Infrastructure-as-Code practices.

Responsibilities

Ensure platform availability and stability according to SLOs.
Automate infrastructure provisioning and maintenance using IaC.
Collaborate closely with development and support teams.

Skills

Linux Expertise

Kubernetes Knowledge

Cloud Platforms (AWS, Azure, GCP)

Monitoring Stacks (Prometheus, Grafana, ELK)

Infrastructure-as-Code (Terraform, Ansible)

Scripting Skills (Bash, Python, Go)

Proactive Troubleshooting

Excellent Communication Skills

Education

Relevant Technical Certifications (e.g., CKA, AWS DevOps)

Tools

Terraform

Ansible

Prometheus

Grafana

ELK Stack

Kubernetes

Join our Hundertserver Team!

Your Hundertserver mission:

As a Site Reliability Engineer (SRE) at Hundertserver, you are responsible for the stable, high-performing, and secure operation of modern cloud platforms. Through automation, monitoring, SLAs, and incident response, you ensure that our systems not only run but continuously improve. You work closely with customers, development, and infrastructure teams, bring clarity to complex operational issues, and create sustainable solutions: hands-on, pragmatic, and with a high degree of ownership.

Key Responsibilities
Availability & Stability
• Ensuring platform availability according to defined SLOs / SLAs
• Analyzing and resolving incidents & performance issues (including on-call duties)
• Building and maintaining robust alerting, logging, and monitoring setups
• Root cause analysis & implementation of preventive measures

Automation & Infrastructure
• Automating provisioning, scaling, and maintenance (IaC with Terraform, Ansible, etc.)
• Operating and enhancing Kubernetes environments (cloud & on-prem)
• Developing and maintaining self-healing and auto-scaling mechanisms
• Creating and maintaining runbooks & playbooks

Monitoring, Observability & Performance
• End-to-end monitoring with tools like Prometheus, Grafana, Loki, ELK
• Defining and managing SLIs and SLOs so the platform is steered by data
• Performing performance analyses (workloads, traffic, databases) and ongoing optimization
• Setting up & maintaining distributed tracing and logging systems

Security & Operational Hygiene
• Implementing and enforcing security standards (least privilege, TLS, secrets management)
• Regular health checks, updates, and patching
• Ensuring availability through established backup & disaster recovery processes

Collaboration & Consulting
• Close collaboration with development, support, and platform teams
• Consulting customers on operating models, platform metrics & architectural decisions
• Training internal teams on topics such as monitoring, SRE basics & troubleshooting

What You Should Bring
Technical Profile
• Linux expertise (Debian, Ubuntu, RHEL)
• Deep knowledge of Kubernetes – clusters, ingress, operators, Helm, etc.
• Experience with cloud platforms (AWS, Azure, GCP)
• Strong expertise in monitoring stacks (Prometheus, Grafana, Loki, ELK)
• Proficiency in Infrastructure-as-Code (Terraform, Ansible, Puppet)
• Scripting and automation skills (Bash, Python, Go)
• Familiarity with logging, tracing & incident management processes

Soft Skills & Working Style
• Proactive troubleshooting & a strong sense of quality
• Structured, analytical thinking – solution-oriented and pragmatic
• Excellent communication skills (with customers, developers, and operations)
• Focus on sustainability & automation rather than firefighting
• Willingness to participate in on-call rotations (standby, SLA windows)

Nice to Have
• Certifications such as CKA / CKS / AWS DevOps or equivalent
• Experience with GitOps, ArgoCD, or Policy-as-Code
• Knowledge of FinOps / cost optimization in cloud platforms

What You Can Expect at Hundertserver
• Real development – in technology, methodology & culture
• Modern platforms & tools – with room for your own ideas
• Ownership & trust – we work in partnership, not through hierarchy
• Flexible working hours & a remote-first culture
• Hands-on mentality & direct customer impact

Are you up for it?
