
Principal Site Reliability Developer

Oracle

Zapopan

On-site

MXN 800,000 - 1,200,000

Full-time

Posted 29 days ago

Job summary

A leading tech company in Mexico is seeking a Senior Site Reliability Engineer to lead the design and support of cloud services. The role requires advanced knowledge of Linux systems, Python programming, and distributed systems. Ideal candidates will have strong automation skills and a proven track record of operational excellence. This is an opportunity to influence the design of highly available systems.

Background

  • Advanced Linux systems administration experience required.
  • Strong programming skills in Python and automation libraries.
  • Deep understanding of distributed systems and networking.

Responsibilities

  • Lead design, automation, and support of OCI services.
  • Own end-to-end reliability metrics for services.
  • Architect high-availability systems for large-scale deployments.

Skills

Linux systems administration
Python programming
Bash/Shell scripting
Distributed systems knowledge
CI/CD pipelines
Technical problem-solving

Tools

Grafana
Prometheus
Terraform
Ansible
Kubernetes

Job description

As a senior member of the Site Reliability Engineering (SRE) team, you'll take ownership of highly available systems, influence service design, and work across teams to drive resiliency, automation, and operational excellence. This is a hands-on engineering role where deep infrastructure knowledge meets software engineering expertise, ideal for experienced SREs ready to take the lead.

Qualifications

Career Level - IC4

Responsibilities

What You’ll Do:

  • Lead the design, automation, and support of OCI services with a focus on resiliency, security, scalability, and performance.
  • Own and improve the end-to-end reliability metrics (SLOs, SLAs, KPIs) for your services.
  • Design and implement high-availability architectures and standards for large-scale distributed systems.
  • Serve as the ultimate escalation point for complex operational issues, using a deep understanding of service topologies and interdependencies.
  • Architect and build automation and orchestration tools that reduce manual work and prevent problem recurrence.
  • Collaborate with development teams to improve service designs, optimize deployments, and implement best practices for operational efficiency.
  • Guide technical decision-making and mentor junior SREs and developers across teams.
  • Participate in and lead postmortems, root cause analysis, and preventative design changes.
  • Contribute to capacity planning, demand forecasting, and long-term service scalability strategies.
  • Participate in a rotational on-call schedule to ensure the health and availability of production services.

What We’re Looking For:

  • Advanced experience with Linux systems administration
  • Strong programming skills in Python (with automation libraries)
  • Advanced Bash/Shell scripting
  • Deep understanding of distributed systems, networking, and service architecture
  • Solid knowledge of databases and how they behave in production (SQL or NoSQL)
  • Strong understanding of CI/CD pipelines, Agile methodologies, and DevOps best practices
  • Experience writing and maintaining unit tests and production-grade software
  • Proven ability to lead cross-functional efforts and technical problem-solving in live environments

Nice to Have:

  • Hands-on experience with monitoring and observability tools (Grafana, Prometheus, New Relic, etc.)
  • Familiarity with Oracle Cloud Infrastructure (OCI) or other cloud platforms (AWS, Azure, GCP)
  • Experience with Infrastructure-as-Code (Terraform, Ansible) and container orchestration (Kubernetes)