Site Reliability Engineer

Logile, Inc.

Khordha

On-site

INR 15,00,000 - 25,00,000

Full time

Job summary

A technology solutions provider is seeking a Site Reliability Engineer to ensure the reliability and performance of infrastructure. The ideal candidate will have experience with monitoring tools like Prometheus and Grafana, cloud expertise, and strong automation skills. This onsite role requires collaboration with teams and availability for overlap with US working hours. Compensation is competitive with industry standards.

Benefits

Shift allowances
Home pickup and drop off

Qualifications

  • 2 - 5 years of experience with monitoring, logging, and tracing tools.
  • Proficient in Linux system administration and networking fundamentals.
  • Solid skills in infrastructure automation.

Responsibilities

  • Design and manage observability systems.
  • Define and maintain SLAs, SLOs, and SLIs.
  • Build automation for infrastructure and incident response.

Skills

Monitoring tools (Prometheus, Grafana)
Cloud expertise (AWS, Azure, GCP)
Linux system administration
Infrastructure automation (Terraform, Ansible)
Programming (Python, Go, Bash)
Kubernetes
CI/CD practices

Tools

Terraform
Ansible
ELK/EFK
Jaeger

Job description

Company Overview

Logile is the leading retail labor planning, workforce management, inventory management and store execution provider deployed in thousands of retail locations across North America, Europe, Australia, and Oceania.

Our proven AI, machine-learning technology and industrial engineering accelerate ROI and enable operational excellence with improved performance and empowered employees. Retailers worldwide rely on Logile solutions to boost profitability and competitive advantage by delivering the best service and products at optimal cost.

From labor standards development and modeling to unified forecasting, storewide scheduling, and time and attendance, to inventory management, task management, food safety, and employee self-service — we transform retail operations with a unified store-level solution. Gain the Advantage with The Logic of Retail. One Platform for store planning, scheduling and execution.

For more information, visit www.logile.com.

Job Summary

We are seeking a motivated and experienced Site Reliability Engineer (SRE) to join our dynamic engineering team. The ideal candidate will have a strong background in ensuring the reliability, scalability, and performance of infrastructure and applications. The SRE will focus on building robust monitoring systems, automating operations, and bridging the gap between development and operations to achieve high service availability.

Key Responsibilities
  • Design, implement, and manage observability systems (Prometheus, Grafana, ELK/EFK, Jaeger, OpenTelemetry).
  • Define and maintain SLAs, SLOs, and SLIs for services, ensuring reliability goals are met.
  • Build automation for infrastructure, monitoring, scaling, and incident response using Terraform, Ansible, and scripting (Python/Bash).
  • Collaborate with developers to design resilient and scalable systems following SRE best practices.
  • Lead incident management: monitoring alerts, root cause analysis, postmortems, and continuous improvement.
  • Implement chaos engineering and fault-tolerance testing to validate system resilience.
  • Drive capacity planning, performance tuning, and cost optimization across environments.
  • Ensure security, compliance, and governance in infrastructure monitoring.
Job Location & Schedule
  • This is an on-site role at the Logile Bhubaneswar office.
  • The selected candidate is expected to be available for some hours of overlap with US working hours.
Required Skills & Experience
  • 2 to 5 years of strong experience with monitoring, logging, and tracing tools (Prometheus, Grafana, ELK, EFK, Jaeger, OpenTelemetry, Loki).
  • Cloud expertise: AWS, Azure, or GCP monitoring and reliability practices (CloudWatch, Azure Monitor).
  • Proficiency in Linux system administration and networking fundamentals.
  • Solid skills in infrastructure automation (Terraform, Ansible, Helm).
  • Programming/scripting skills: Python, Go, Bash.
  • Experience with Kubernetes and containerized workloads.
  • Proven track record in CI/CD and DevOps practices.
Preferred Skills
  • Experience with chaos engineering tools (Gremlin, Litmus).
  • Strong collaboration skills to drive SRE culture across Dev & Ops teams.
  • Experience with Agile/Scrum environments.
  • Knowledge of security best practices (DevSecOps).
Compensation And Benefits
  • The compensation and benefits for this role are benchmarked against the best in the industry and the job location.
  • Applicable shift allowances and home pickup and drop-off will be provided by Logile.