Senior Site Reliability Engineer (SRE)

Avra

São Paulo

Remote

BRL 80,000 - 120,000

Full-time

Posted yesterday

Job Summary

A leading company in deep tech AI is seeking a Senior Site Reliability Engineer to enhance its infrastructure. You will ensure reliability, scalability, and security while working with advanced technologies in a remote-first environment. Join a dynamic team and make a significant impact on AI-powered business intelligence in Latin America.

Benefits

Unlimited Vacation

National Health Plan

Generous Parental Leave

Equity Participation

Qualifications

5+ years in Site Reliability Engineering or DevOps roles.
3 years of hands-on Kubernetes experience in production.

Responsibilities

Design and implement fault-tolerant systems across multi-cloud environments.
Develop monitoring and alerting systems to ensure high uptime.

Skills

Kubernetes

Problem-Solving

Collaboration

Education

Bachelor's Degree

Tools

Docker

Terraform

Prometheus

Grafana

About Avra

Avra is a deep tech data intelligence platform powered by foundational AI that translates the complexity of SMEs into strategic decisions for large enterprises. We develop our own foundational models from the ground up—without relying on third-party solutions—to deliver innovative insights that empower some of the leading banks and fintechs across Latin America. Founded in 2024 by Bruno Alano (ex-OpenAI) and Viviane Meister, our team brings together expertise from NVIDIA, Palantir, Google, and more to drive real impact.

About the Role

As a Senior Site Reliability Engineer at Avra, you will be responsible for designing, building, and maintaining the infrastructure that powers our AI platform. You will play a crucial role in ensuring the reliability, scalability, and security of our systems as we process vast amounts of data and deliver real-time insights. Working closely with our engineering and data science teams, you will create resilient infrastructure that supports our heterogeneous graph neural networks and knowledge graph processing capabilities.

Responsibilities

Platform Reliability: Design and implement highly available, fault-tolerant systems across our multi-cloud environment (AWS and GCP) that support our graph processing and AI inference workloads.
Kubernetes Platform Engineering: Design, implement, and maintain our production Kubernetes environments on GKE and AWS, ensuring high availability, scalability, and security for our graph processing and AI inference workloads.
Observability & Monitoring: Develop comprehensive monitoring, alerting, and logging systems to ensure 99.9%+ uptime for critical services and provide visibility into system performance.
Infrastructure as Code: Create and maintain infrastructure as code using Terraform to automate provisioning and configuration management.
Performance Optimization: Identify and resolve performance bottlenecks in our distributed systems, particularly around graph processing and real-time inference workflows.
Security Engineering: Collaborate with security teams to implement robust security practices, supporting our ISO 27001 and NIST CSF 2.0 certification efforts.
CI/CD Pipeline Enhancement: Improve and maintain our continuous integration and deployment pipelines to support rapid, reliable software delivery.
Incident Response: Lead incident response efforts, conduct post-mortems, and implement systems to prevent recurrence of issues.

You Stand Out If

You have experience building and maintaining infrastructure for data-intensive or AI applications, particularly those involving graph processing or machine learning.
You have deep expertise with Kubernetes, including advanced concepts such as custom controllers, operators, network policies, and multi-cluster management.
You excel at designing scalable, distributed systems that can handle terabytes of data and millions of requests.
You are proficient with cloud orchestration tools like Kubernetes and have experience managing deployments across AWS and GCP environments.
You have significant experience with GKE (Google Kubernetes Engine) and EKS (Amazon Elastic Kubernetes Service) in production environments.
You have implemented robust observability solutions and can effectively troubleshoot complex system failures.
You practice a security-first mindset and have experience implementing infrastructure security controls.
You are passionate about automation and eliminating toil through effective tooling.

Qualifications

Experience: 5+ years in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles, with at least 3 years of hands-on Kubernetes experience in production environments.
Kubernetes Expertise: Proven experience managing Kubernetes at scale, including cluster architecture, security hardening, resource optimization, and upgrade management.
Technical Skills: Proficiency in programming (Go, Python, or similar), cloud platforms (AWS, GCP), containerization (Docker, Kubernetes), and monitoring technologies (OpenTelemetry, Prometheus, Grafana, ELK stack, etc.).
System Design: Strong understanding of distributed systems design, failure modes, and mitigation strategies.
Problem-Solving: Exceptional debugging skills and the ability to troubleshoot complex issues across the entire technology stack.
Collaboration: Excellent communication skills and ability to work effectively with cross-functional teams in a remote environment.

Why Join Avra?

Cutting-Edge Technology: Build infrastructure for a deep tech AI platform that processes data from millions of Brazilian companies to enable better business decisions.
Competitive Compensation: Attractive salary, equity participation, and full transparency in our compensation structure.
Direct Impact: Work closely with the founders to shape the infrastructure vision of a fast-growing startup.
Technical Challenges: Solve complex problems around graph processing, real-time inference, and large-scale data systems.
Flexible Work Culture: Enjoy the benefits of 100% remote work with access to an office in São Paulo, unlimited vacation, and a comprehensive benefits package including a national health plan and generous parental leave.

If you are passionate about building reliable, scalable infrastructure for AI systems and want to help us revolutionize how businesses make decisions about SMEs in Brazil, we'd love to hear from you. Apply now to join Avra and help us build the future of AI-powered business intelligence in Latin America.
