- Compensation: MXN 45,000 - MXN 70,000 monthly
Company Description
MedTrainer is an innovator in the healthcare industry, changing the landscape of technology offerings with our Platform Solution, which comprises our proprietary Learning Management System (LMS), our core focus on Compliance Training, and our Managed Services offering in Credentialing and Compliance Management.
We impact thousands of healthcare providers, and we are building the future of healthcare through innovation, scale, and collaboration.
Job Description
We are looking for a Site Reliability Engineer who can build, scale, maintain, and monitor highly available, secure, and cost-efficient cloud platforms and Kubernetes workloads, with a strong focus on reliability engineering practices (SLIs/SLOs, error budgets, incident response, postmortems). You will own production readiness and operational excellence across infrastructure and delivery tooling, and ensure performance, uptime, and scalability while maintaining high standards of code quality and thoughtful design. You will also lead the transition and continuous improvement of applications and infrastructure toward resilient, automated, and observable systems.
Qualifications
- Bachelor's in Computer Science, equivalent degree, or equivalent professional experience.
- 3+ years working on distributed systems and cloud operations.
- Strong hands‑on experience with at least two major cloud providers (Azure, AWS, GCP) and their managed Kubernetes services.
- Deep experience architecting and/or operating large Kubernetes clusters: workload identity, networking, storage, autoscaling, upgrades, security, and multi‑tenancy.
- Container expertise (Docker/OCI), including packaging and configuration; service mesh experience is a plus.
- Advanced GitHub Actions expertise: reusable workflows/composites, concurrency/queueing, environments and approvals, OIDC federation, artifacts, caching, dependency review, and policy‑as‑code.
- Strong Python skills (required) for Pulumi‑based IaC, tooling, and automation; Golang knowledge is a plus.
- Familiarity with CI/CD and change management, and experience with progressive delivery.
- Observability stack experience and alerting practices tied to SLOs.
- Experience configuring cloud‑native networking, storage, Linux, security controls, and cost governance.
- Experience migrating and scaling infrastructure across clouds.
- Relevant certifications (e.g., CKA) are a plus.
- Advanced English (optional)
Responsibilities
- Design, build, and operate production‑grade Kubernetes (AKS) clusters and supporting services with high availability, security, and cost optimization.
- Architect, implement, and maintain CI/CD using GitHub Actions (advanced), including reusable workflows, matrices, environments, required approvals, OIDC‑based cloud auth, self‑hosted runners, and policy controls.
- Define, codify, and evolve Infrastructure as Code with Pulumi (Python) as the primary stack; create reusable components and enforce code reviews, testing, and documentation.
- Develop and maintain configuration management with Ansible (roles, collections, inventories, playbooks) for OS, middleware, and app operations.
- Implement progressive delivery and deployment strategies (blue/green, canary, feature flags) and automate rollback/roll‑forward based on health checks and SLOs.
- Establish comprehensive observability (metrics, logs, traces, profiles) with alerting tied to SLIs; drive capacity planning, performance tuning, and chaos/resiliency testing.
- Lead incident management and on‑call response; coordinate triage, communication, mitigation, root‑cause analysis, and follow‑through on corrective actions.
- Partner with product and engineering to design for reliability (readiness/liveness probes, graceful shutdown, backpressure, retries/timeouts, circuit breakers).
- Implement security best practices (least privilege, secrets management) and ensure compliance with internal policies and audits.
- Continuously review existing systems, eliminate toil via automation, reduce technical debt, and document operational runbooks and standards.
Essential technologies and/or skills:
- Exceptional problem‑solving, with the ability to anticipate and remediate issues before they affect business productivity.
- Proven experience handling production environments, and availability for emergencies.
- Clear, calm communication with technical and non‑technical audiences.
- Passion for detail and a structured, methodical mindset in design, execution, and documentation.
- Professional, positive approach with strong ethics and high morale.
- Curiosity to learn, bias for automation, and a true can‑do attitude.
- Version control tools (Git/GitHub)
- Continuous Integration servers (GitHub Actions as primary)
- Configuration management (Ansible)
- Containers (Docker/OCI)
- Monitoring and analytics (metrics/logs/traces, APM, alerting)
- Secrets management and security scanning/signing
- Incident management and on‑call tooling
- Python (scripting level)
- MySQL
Additional Information
What We Offer
- Competitive monthly net salary: $45,000 – $70,000 MXN.
- 100% remote work from anywhere in Mexico.
- Major Medical Insurance and healthcare coverage.
- Home office and ergonomics support (internet, electricity, office chair).
- Professional development opportunities, including English classes.
- Wellness benefits such as TotalPass gym discounts.
- Savings plan.
- Paid time off, including personal days.
- A collaborative, international, and growth‑oriented environment.
All your information will be kept confidential according to EEO guidelines.