Job Overview
Plan A Technologies is looking for an experienced SRE/Observability Engineer with expertise in Grafana Cloud, Loki, and Prometheus to enhance the reliability, scalability, and performance of an observability platform. In this role, you will work closely with DevOps, software engineers, and infrastructure teams to build and maintain monitoring, logging, and alerting solutions that ensure the health of our systems. This is an exciting job with room for significant career growth.
Please note: you must have at least 5 years of experience as an SRE and solid experience with Grafana Cloud, Loki, and Prometheus to be considered for this job.
Job Responsibilities
- Design, implement, and manage monitoring, logging, and alerting solutions using Grafana Cloud, Loki, and Prometheus.
- Develop and maintain LogQL queries for effective log aggregation, parsing, and alerting.
- Optimize Prometheus metrics collection, storage, and query performance.
- Automate and improve incident response processes, including defining SLIs, SLOs, and SLAs.
- Collaborate with development teams to ensure observability best practices are followed in application and infrastructure design.
- Troubleshoot and resolve performance bottlenecks, log ingestion issues, and metric anomalies.
- Build dashboards in Grafana to provide visibility into key system health indicators.
- Implement highly available, scalable, and resilient monitoring architectures in cloud or hybrid environments.
- Write and maintain Infrastructure-as-Code (IaC) for the monitoring and observability stack.
Experience
- 5+ years of experience in Site Reliability Engineering, DevOps, or Observability roles.
- Hands-on experience with Grafana Cloud, Loki, and Prometheus in large-scale environments.
- Strong expertise in LogQL for querying and analyzing logs in Loki.
- In-depth knowledge of PromQL for querying metrics in Prometheus.
- Experience with Grafana dashboards, alerting rules, and integrations.
- Proficiency in Kubernetes, Docker, and cloud platforms (AWS, GCP, or Azure).
- Experience with Terraform, Helm, or Ansible for infrastructure automation.
- Familiarity with SRE principles, SLIs, SLOs, and incident management.
- Experience with distributed systems, microservices, and networking concepts.
- Excellent verbal and written English communication skills.
- Initiative and drive to do great things.
Preferred Qualifications
- Experience with other logging and monitoring tools like Elastic Stack, Datadog, or OpenTelemetry.
- Knowledge of CI/CD pipelines and GitOps methodologies.
- Certifications in Kubernetes or observability-related fields.
About The Company/Benefits
Plan A Technologies is an American software development and technology advisory firm that brings top-tier engineering talent to clients around the world. Our software engineers tackle custom product development projects, staff augmentation, major integrations and upgrades, and much more. The team is far more hands-on than the giant outsourcing shops, but still big enough to handle major enterprise clients.
Read more about us here: www.PlanAtechnologies.com
Location: Work From Home 100% of the time, or come in to one of our global offices. Up to you.
Great colleagues and an upbeat work environment: You'll join an excellent team of supportive engineers and project managers who work hard but don't ever compete with each other.
Benefits: You’ll get a generous vacation schedule, a brand-new laptop, and other goodies.
If this sounds like you, we'd love to hear from you!