An established industry player is seeking a Site Reliability Engineer to ensure the stable and secure operation of modern cloud platforms. This role emphasizes automation, incident response, and collaboration with various teams to enhance system performance and reliability. You'll be at the forefront of technology, implementing innovative solutions while enjoying flexible working hours in a remote-first culture. Join a dynamic team that values ownership, trust, and direct impact on customer experiences, and contribute to a culture of continuous improvement and excellence.
As a Site Reliability Engineer (SRE) at Hundertserver, you are responsible for the stable, high-performing, and secure operation of modern cloud platforms. Through automation, monitoring, SLAs, and incident response, you ensure that our systems not only run but continuously improve. You work closely with customers, development, and infrastructure teams, bring clarity to complex operational issues, and create sustainable solutions: hands-on, pragmatic, and with a high degree of ownership.
Key Responsibilities
Availability & Stability
• Ensuring platform availability according to defined SLOs / SLAs
• Analyzing and resolving incidents & performance issues (including on-call duties)
• Building and maintaining robust alerting, logging, and monitoring setups
• Root cause analysis & implementation of preventive measures
Automation & Infrastructure
• Automating provisioning, scaling, and maintenance (IaC with Terraform, Ansible, etc.)
• Operating and enhancing Kubernetes environments (cloud & on-prem)
• Developing and maintaining self-healing and auto-scaling mechanisms
• Creating and maintaining runbooks & playbooks
Monitoring, Observability & Performance
• End-to-end monitoring with tools like Prometheus, Grafana, Loki, ELK
• Setting up and managing SLIs and SLOs to steer the platform based on data
• Performing performance analyses (workloads, traffic, databases) and ongoing optimization
• Setting up & maintaining distributed tracing and logging systems
Security & Operational Hygiene
• Implementing and enforcing security standards (least privilege, TLS, secrets management)
• Regular health checks, updates, and patching
• Ensuring availability through established backup & disaster recovery processes
Collaboration & Consulting
• Close collaboration with development, support, and platform teams
• Consulting customers on operating models, platform metrics & architectural decisions
• Training internal teams on topics such as monitoring, SRE basics & troubleshooting
What You Should Bring
Technical Profile
• Linux expertise (Debian, Ubuntu, RHEL)
• Deep knowledge of Kubernetes – clusters, ingress, operators, Helm, etc.
• Experience with cloud platforms (AWS, Azure, GCP)
• Strong expertise in monitoring stacks (Prometheus, Grafana, Loki, ELK)
• Proficiency in Infrastructure-as-Code (Terraform, Ansible, Puppet)
• Scripting and automation skills (Bash, Python, Go)
• Familiarity with logging, tracing & incident management processes
Soft Skills & Working Style
• Proactive troubleshooting & a strong commitment to quality
• Structured, analytical thinking – solution-oriented and pragmatic
• Excellent communication skills (with customers, developers, and operations)
• Focus on sustainability & automation rather than firefighting
• Willingness to participate in on-call rotations (standby, SLA windows)
Nice to Have
• Certifications such as CKA / CKS / AWS DevOps or equivalent
• Experience with GitOps, ArgoCD, or Policy-as-Code
• Knowledge of FinOps / cost optimization in cloud platforms
What You Can Expect at Hundertserver
• Real professional growth in technology, methodology & culture
• Modern platforms & tools – with room for your own ideas
• Ownership & trust – we work in partnership, not through hierarchy
• Flexible working hours & a remote-first culture
• Hands-on mentality & direct customer impact