Site Reliability Engineer

Review ALL

Porto Alegre

Remote

BRL 120.000 - 160.000

Full-time

Posted yesterday

Job summary

A leading cloud infrastructure firm in Brazil is seeking a Senior Site Reliability Engineer (SRE) to improve the reliability of its global bare metal cloud platform. Responsibilities include improving platform performance, designing automation tools, and implementing observability systems. Candidates need advanced Linux and Kubernetes skills, proficiency in at least one scripting language, and experience with incident response. The role offers competitive compensation and opportunities for technical and career growth.

Benefits

  • Contractor engagement (PJ)
  • Paid Time Off
  • Competitive compensation package
  • Wellness benefit (Wellhub / Gympass equivalent)
  • Annual performance-based bonus
  • Flexible working hours
  • Opportunities for technical and career growth

Responsibilities

  • Continuously improve platform reliability and performance.
  • Design, build, and maintain tools to automate operational workflows and incident response.
  • Implement and enhance observability systems (monitoring, alerting, tracing).
  • Collaborate with engineering and platform teams to design scalable and resilient systems.
  • Participate in on-call rotations and lead post-incident reviews with a learning-focused approach.
  • Develop and document operational playbooks and processes.
  • Contribute to defining SLOs / SLIs and driving reliability metrics across teams.

Skills

  • Fluent verbal and written English communication skills
  • Advanced experience with Linux / Unix in production environments
  • Hands-on experience with Kubernetes and container orchestration
  • Proficiency with IaC tools (e.g., Terraform, Ansible)
  • Experience with observability stacks (Prometheus, Grafana, Loki, ELK, etc.)
  • Proficiency with scripting / programming languages such as Bash, Python, Go, or Ruby
  • Working knowledge of Git and CI / CD pipelines
  • Experience with incident response and root cause analysis
  • Knowledge of cloud-native reliability and security best practices

Job description

About the Company

This company operates a global computing platform that enables businesses to programmatically deploy single-tenant Bare Metal instances across multiple regions worldwide.

They are a team of passionate engineers working at the intersection of hardware, software, and network infrastructure, building the fastest, most developer‑centric single‑tenant cloud infrastructure on the market. If you share this passion, this role offers the opportunity to help shape the future of internet‑scale infrastructure.

This position is being managed in partnership with an external recruitment consultancy supporting the company throughout the hiring process.

Summary

The Reliability team is responsible for the health and resilience of the infrastructure powering a global bare metal cloud platform. As a Senior Site Reliability Engineer (SRE), you’ll focus on building reliable, observable, and self‑healing systems at scale.

SREs here operate at the intersection of software engineering and infrastructure — designing tools that automate operations, improve incident response, and enhance observability, ensuring the platform delivers high performance and reliability to customers worldwide.

This role is ideal for engineers passionate about reliability, automation, distributed systems, and bringing cloud‑like experiences to bare metal environments.
