Job Description
Lead the offshore engineering team in building and deploying a unified observability platform across non-production and production environments (GCP/Anthos clusters). The goal is to provide end-to-end visibility with a modern, open-source OpenTelemetry (OTel) stack while replacing legacy monitoring and correlation tools.
Key Responsibilities
- Platform Architecture & Deployment: Lead the deployment of the end-to-end open-source observability stack on GCP/Anthos Kubernetes clusters.
- Logging Layer Implementation: Design and configure the logging pipeline using Data Prepper (for data enrichment), OpenSearch (for indexing and storage), OpenSearch Dashboards, and ElastAlert.
- Metrics & Distributed Tracing: Implement the observability layer to capture metrics and traces from target API clusters using OTel Agents and Collectors, routing metrics to Prometheus and traces to Grafana Tempo.
- UX & Infrastructure Monitoring: Set up Synthetic and Real User Monitoring (RUM) using k6 Synthetics, Blackbox Exporter, and Grafana Faro, alongside infrastructure monitoring via Node Exporter and kube-state-metrics.
- Configuration as Code (CaC) & GitOps: Ensure all observability infrastructure, OTel Collector configurations, and dashboards are version-controlled in Git and deployed automatically using Jenkins pipelines and Terraform.
- Dashboarding & Alerting: Develop unified, SLO/SLI-driven Grafana dashboards covering the four golden signals (Latency, Errors, Traffic, Saturation). Configure alert deduplication, correlation, and routing using Prometheus and Alertmanager.
- Migration & Parity Validation: Run the legacy and new stacks in parallel to map and migrate existing New Relic dashboards/alerts and Splunk logs to the new Grafana/OpenSearch stack, confirming full parity before the legacy platforms are decommissioned.
- Team Leadership: Mentor offshore engineers, identify and bridge skill gaps within the team, and collaborate closely with onshore leads and enterprise architects.
Required Skills & Qualifications
- Core Observability: 8+ years of experience in observability, APM, and logging. Deep expertise in OpenTelemetry (OTel) instrumentation (Agents/Collectors) is mandatory.
- Open-Source Stack: Strong hands-on experience with Prometheus, Grafana, Tempo, and OpenSearch/ELK stack.
- Legacy Tool Migration: Proven experience working with or migrating away from commercial tools like Splunk and New Relic.
- Cloud & Containers: Strong proficiency in Kubernetes orchestration (GCP/Anthos environment experience preferred) and microservices architecture.
- Infrastructure as Code (IaC): Advanced skills in Terraform and CI/CD pipelines (Jenkins) for automated deployments.
- Scripting & Synthetics: Experience with synthetic transaction scripting (k6, Playwright, or Puppeteer).
- Leadership: Demonstrated experience leading offshore technical teams, managing project milestones, and driving architectural best practices. Excellent communication skills for onshore-offshore coordination.