Senior Site Reliability Engineer (SRE)

Avra

São Paulo

Remote

BRL 80,000 - 120,000

Full-time

Posted yesterday

Job summary

A leading company in deep tech AI is seeking a Senior Site Reliability Engineer to enhance its infrastructure. You will ensure reliability, scalability, and security while working with advanced technologies in a remote-first environment. Join a dynamic team and make a significant impact on AI-powered business intelligence in Latin America.

Benefits

Unlimited Vacation
National Health Plan
Generous Parental Leave
Equity Participation

Qualifications

  • 5+ years in Site Reliability Engineering or DevOps roles.
  • 3 years of hands-on Kubernetes experience in production.

Responsibilities

  • Design and implement fault-tolerant systems across multi-cloud environments.
  • Develop monitoring and alerting systems to ensure high uptime.

Skills

Kubernetes
Problem-Solving
Collaboration

Education

Bachelor's Degree

Tools

Docker
Terraform
Prometheus
Grafana

Job description

About Avra

Avra is a deep tech data intelligence platform powered by foundational AI that translates the complexity of SMEs into strategic decisions for large enterprises. We develop our own foundational models from the ground up—without relying on third-party solutions—to deliver innovative insights that empower some of the leading banks and fintechs across Latin America. Founded in 2024 by Bruno Alano (ex-OpenAI) and Viviane Meister, our team brings together expertise from NVIDIA, Palantir, Google, and more to drive real impact.

About the Role

As a Senior Site Reliability Engineer at Avra, you will be responsible for designing, building, and maintaining the infrastructure that powers our AI platform. You will play a crucial role in ensuring the reliability, scalability, and security of our systems as we process vast amounts of data and deliver real-time insights. Working closely with our engineering and data science teams, you will create resilient infrastructure that supports our heterogeneous graph neural networks and knowledge graph processing capabilities.

Responsibilities

  • Platform Reliability: Design and implement highly available, fault-tolerant systems across our multi-cloud environment (AWS and GCP) that support our graph processing and AI inference workloads.

  • Kubernetes Platform Engineering: Design, implement, and maintain our production Kubernetes environments on GKE and EKS, ensuring high availability, scalability, and security for our graph processing and AI inference workloads.

  • Observability & Monitoring: Develop comprehensive monitoring, alerting, and logging systems to ensure 99.9%+ uptime for critical services and provide visibility into system performance.

  • Infrastructure as Code: Create and maintain infrastructure as code using Terraform to automate provisioning and configuration management.

  • Performance Optimization: Identify and resolve performance bottlenecks in our distributed systems, particularly around graph processing and real-time inference workflows.

  • Security Engineering: Collaborate with security teams to implement robust security practices, supporting our ISO 27001 certification and NIST CSF 2.0 alignment efforts.

  • CI/CD Pipeline Enhancement: Improve and maintain our continuous integration and deployment pipelines to support rapid, reliable software delivery.

  • Incident Response: Lead incident response efforts, conduct post-mortems, and implement systems to prevent recurrence of issues.

You Stand Out If

  • You have experience building and maintaining infrastructure for data-intensive or AI applications, particularly those involving graph processing or machine learning.

  • You have deep expertise with Kubernetes, including advanced concepts such as custom controllers, operators, network policies, and multi-cluster management.

  • You excel at designing scalable, distributed systems that can handle terabytes of data and millions of requests.

  • You are proficient with container orchestration tools such as Kubernetes and have experience managing deployments across AWS and GCP environments.

  • You have significant experience with GKE (Google Kubernetes Engine) and EKS (Amazon Elastic Kubernetes Service) in production environments.

  • You have implemented robust observability solutions and can effectively troubleshoot complex system failures.

  • You practice a security-first mindset and have experience implementing infrastructure security controls.

  • You are passionate about automation and eliminating toil through effective tooling.

Qualifications

  • Experience: 5+ years in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles, with at least 3 years of hands-on Kubernetes experience in production environments.

  • Kubernetes Expertise: Proven experience managing Kubernetes at scale, including cluster architecture, security hardening, resource optimization, and upgrade management.

  • Technical Skills: Proficiency in programming (Go, Python, or similar), cloud platforms (AWS, GCP), containerization (Docker, Kubernetes), and monitoring technologies (OpenTelemetry, Prometheus, Grafana, ELK stack, etc.).

  • System Design: Strong understanding of distributed systems design, failure modes, and mitigation strategies.

  • Problem-Solving: Exceptional debugging skills and the ability to troubleshoot complex issues across the entire technology stack.

  • Collaboration: Excellent communication skills and ability to work effectively with cross-functional teams in a remote environment.

Why Join Avra?

  • Cutting-Edge Technology: Build infrastructure for a deep tech AI platform that processes data from millions of Brazilian companies to enable better business decisions.

  • Competitive Compensation: Attractive salary, equity participation, and full transparency in our compensation structure.

  • Direct Impact: Work closely with the founders to shape the infrastructure vision of a fast-growing startup.

  • Technical Challenges: Solve complex problems around graph processing, real-time inference, and large-scale data systems.

  • Flexible Work Culture: Enjoy the benefits of 100% remote work with access to an office in São Paulo, unlimited vacation, and a comprehensive benefits package including a national health plan and generous parental leave.

If you are passionate about building reliable, scalable infrastructure for AI systems and want to help us revolutionize how businesses make decisions about SMEs in Brazil, we'd love to hear from you. Apply now to join Avra and help us build the future of AI-powered business intelligence in Latin America.
