Site Reliability Engineer (SRE)

MedTrainer

Santiago de Querétaro

Hybrid

MXN 400,000 - 600,000

Full-time

Posted today

Vacancy description

A leading technology provider in healthcare is seeking a Site Reliability Engineer to build, scale, and maintain cloud platforms. The ideal candidate will have 3+ years of experience in cloud operations and Kubernetes, with a strong focus on reliability engineering practices. This position offers a competitive monthly salary of MXN 45,000 to MXN 70,000 and is 100% remote. Additional benefits include major medical insurance, home office support, and professional development opportunities.

Benefits

Major Medical Insurance

Professional development opportunities

Wellness benefits like gym discounts

Paid time off

Home office support

Background

3+ years working on distributed systems and cloud operations.
Strong hands-on experience with at least two major cloud providers.
Deep experience architecting and/or operating large Kubernetes clusters.

Responsibilities

Design, build, and operate production-grade Kubernetes clusters.
Architect, implement, and maintain CI/CD using GitHub Actions.
Establish comprehensive observability with alerting tied to SLIs.

Skills

Cloud operations

Kubernetes

Python

GitHub Actions

Ansible

Docker/OCI

Education

Bachelor's in Computer Science or equivalent

Tools

Azure

AWS

GCP

Compensation: MXN 45,000 - MXN 70,000 per month

Company Description

MedTrainer is an innovator in the healthcare industry, changing the landscape of technology offerings with its Platform Solution, which comprises a proprietary Learning Management System (LMS), a core focus on Compliance Training, and a Managed Services offering in Credentialing and Compliance Management.

We impact thousands of healthcare providers, and we are building the future of healthcare through innovation, scale, and collaboration.

Job Description

We are looking for a Site Reliability Engineer who can build, scale, maintain, and monitor highly available, secure, and cost-efficient cloud platforms and Kubernetes workloads, with a strong focus on reliability engineering practices (SLIs/SLOs, error budgets, incident response, postmortems). You will own production readiness and operational excellence across infrastructure and delivery tooling, and ensure performance, uptime, and scalability while maintaining high standards of code quality and thoughtful design. You will also lead the transition and continuous improvement of applications and infrastructure toward resilient, automated, and observable systems.

Qualifications

Bachelor's in Computer Science, equivalent degree, or equivalent professional experience.
3+ years working on distributed systems and cloud operations.
Strong hands‑on experience with at least two major cloud providers (Azure, AWS, GCP) and their managed Kubernetes services.
Deep experience architecting and/or operating large Kubernetes clusters: workload identity, networking, storage, autoscaling, upgrades, security, and multi‑tenancy.
Container expertise (Docker/OCI), including packaging and configuration; service mesh experience is a plus.
Advanced GitHub Actions expertise: reusable workflows/composites, concurrency/queueing, environments and approvals, OIDC federation, artifacts, caching, dependency review, and policy‑as‑code.
Strong Python skills (required) for Pulumi‑based IaC, tooling, and automation; Golang knowledge is a plus.
Familiarity with CI/CD and change management, plus experience with progressive delivery.
Observability stack experience and alerting practices tied to SLOs.
Configuration of cloud‑native networking, storage, Linux, security controls, and cost governance.
Experience migrating and scaling infrastructure across clouds.
Relevant certifications (e.g., CKA) are a plus.
Advanced English (optional)

Responsibilities

Design, build, and operate production‑grade Kubernetes (AKS) clusters and supporting services with high availability, security, and cost optimization.
Architect, implement, and maintain CI/CD using GitHub Actions (advanced), including reusable workflows, matrices, environments, required approvals, OIDC‑based cloud auth, self‑hosted runners, and policy controls.
Define, codify, and evolve Infrastructure as Code with Pulumi (Python) as the primary stack; create reusable components, enforce code reviews, testing, and documentation.
Develop and maintain configuration management with Ansible (roles, collections, inventories, playbooks) for OS, middleware, and app operations.
Implement progressive delivery and deployment strategies (blue/green, canary, feature flags) and automate rollback/roll‑forward based on health checks and SLOs.
Establish comprehensive observability (metrics, logs, traces, profiles) with alerting tied to SLIs; drive capacity planning, performance tuning, and chaos/resiliency testing.
Lead incident management and on‑call response; coordinate triage, communication, mitigation, root‑cause analysis, and follow‑through on corrective actions.
Partner with product and engineering to design for reliability (readiness/liveness probes, graceful shutdown, backpressure, retries/timeouts, circuit breakers).
Implement security best practices (least privilege, secrets management) and ensure compliance with internal policies and audits.
Continuously review existing systems, eliminate toil via automation, reduce technical debt, and document operational runbooks and standards.
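
As a rough illustration of the "automate rollback/roll‑forward based on health checks and SLOs" item above, the Python sketch below shows one way an error-budget check could gate a canary. It is not MedTrainer's actual tooling: the names `CanaryWindow`, `error_budget_remaining`, and `should_roll_back`, the 99.9% SLO target, and the 1,000-request minimum are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryWindow:
    """Request counts observed for a canary during one evaluation window (hypothetical type)."""
    total_requests: int
    failed_requests: int


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget; negative means overspent."""
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        # A 100% SLO leaves no budget: any failure overspends it.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def should_roll_back(window: CanaryWindow,
                     slo_target: float = 0.999,   # assumed availability SLO
                     min_requests: int = 1000) -> bool:  # assumed minimum sample size
    """Roll back only when there is enough traffic and the budget is overspent."""
    if window.total_requests < min_requests:
        return False  # too little data to judge the canary
    remaining = error_budget_remaining(
        slo_target, window.total_requests, window.failed_requests
    )
    return remaining < 0.0


if __name__ == "__main__":
    # 20 failures in 10,000 requests against a budget of roughly 10 failures.
    print(should_roll_back(CanaryWindow(total_requests=10_000, failed_requests=20)))
```

In practice the counts would come from the observability stack rather than being passed in by hand, and the decision would normally feed a deployment controller instead of a print statement.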

Essential technologies and/or skills:

Exceptional problem‑solving, with the ability to anticipate and remediate issues before they affect business productivity.
Proven experience handling production environments and being available for emergencies.
Clear, calm communication with technical and non‑technical audiences.
Passion for detail and a structured, methodical mindset in design, execution, and documentation.
Professional, positive approach with strong ethics and high working morale.
Curiosity to learn, bias for automation, and a true can‑do attitude.
Version control tools (Git/GitHub)
Continuous Integration servers (GitHub Actions as primary)
Configuration management (Ansible)
Containers (Docker/OCI)
Monitoring and analytics (metrics/logs/traces, APM, alerting)
Secrets management and security scanning/signing
Incident management and on‑call tooling
Python (scripting level)
MySQL

Additional Information

What We Offer

Competitive monthly net salary: $45,000 – $70,000 MXN.
100% remote work from anywhere in Mexico.
Major Medical Insurance and healthcare coverage.
Home office and ergonomics support (internet, electricity, office chair).
Professional development opportunities, including English classes.
Wellness benefits such as TotalPass gym discounts.
Savings plan.
Paid time off, including personal days.
A collaborative, international, and growth‑oriented environment.

All your information will be kept confidential according to EEO guidelines.
