(M) Staffing – 3x Operations Engineer

Believe Solutions

United States

Remote

USD 80,000 - 120,000

Full time

2 days ago

Job summary

An innovative company is seeking a Kubernetes On-Premise Operations Engineer to manage their infrastructure. This role involves proactive monitoring, troubleshooting, and ensuring high availability of systems. You will collaborate closely with Level 3 Engineers to maintain seamless production operations. Your expertise in Kubernetes, networking, and monitoring tools like Prometheus and Grafana will be crucial. Join a dynamic team where your contributions will enhance system stability and performance, making a significant impact in a forward-thinking environment.

Qualifications

  • 5+ years in Operations, SRE, or DevOps roles.
  • Strong troubleshooting skills in Networking and Monitoring tools.

Responsibilities

  • Apply patches and updates; monitor and troubleshoot performance issues.
  • Participate in on-call rotation and respond to incidents.

Skills

Kubernetes
Troubleshooting
Networking
Prometheus
Grafana
Loki
Ansible
Helm
CI/CD
Incident Management

Tools

Prometheus
Grafana
Loki
Helm
Ansible
Terraform

Job description

Location: Remote (only Bolivia candidates)
Type: Full-Time
Project Scope: Iridium Panama (end of 2025)

We are seeking a Kubernetes On-Premise Operations Engineer to manage and maintain our on-premise Kubernetes infrastructure. This role is focused on day-to-day operations, proactive monitoring, troubleshooting, and ensuring high availability and system stability. The engineer will collaborate closely with Level 3 Engineers who provide the infrastructure backbone, ensuring seamless and reliable production operations.

Scope of Applications Supported

Tigo Sports – Available in 6 countries

KannelGateway – Used across 9 countries

Key Responsibilities

  • Apply patches and updates
  • Monitor and troubleshoot performance issues

Incident Management & On-Call Support

  • Participate in on-call rotation
  • Respond to incidents, perform root cause analysis (RCA), and document resolutions

Networking & Ingress Management

  • Operate and troubleshoot Cilium, Nginx Ingress Controller, and Traefik

Storage & Databases

  • Support and maintain NFS, MongoDB, MySQL, and PostgreSQL, ensuring performance and data integrity

Observability & Monitoring

  • Manage Prometheus, Grafana, and Loki for proactive alerting and system logging

Automation & Configuration Management

  • Use Helm, Ansible, and CI/CD pipelines to apply and manage infrastructure configurations

Production Deployments

  • Execute, monitor, and manage production deployments with proper rollback strategies

OS & Security Management

  • Maintain Ubuntu-based systems, ensuring they are patched, secure, and performant

Requirements

  • 5+ years in Operations, SRE, or DevOps roles
  • Strong troubleshooting skills in Networking
  • Proficient in monitoring tools: Prometheus, Grafana, Loki
  • Familiar with operational processes, incident management, and runbooks
  • Experience with Helm, Ansible, and optionally Terraform
  • Prior experience with production on-call support and incident resolution
