Key Accountabilities
- Monitoring Effectiveness – Ensure the monitoring framework and its enhancements are set up to increase proactive identification & resolution of issues prior to customer impact.
- Set up & maintain centralized monitoring configuration as code
- Consistently drive the alert volume down and eliminate false alerts
- Set up advanced monitoring alerts for the golden signals, i.e. Latency, Errors, Throughput & Saturation.
- Transform from traditional symptomatic CPU & Memory monitors to more advanced alert correlation that pinpoints issues directly and enables predictive monitoring
- Create & implement Synthetic or End User Monitoring using Python and Selenium for customer experience monitoring
- Set up API endpoint monitoring & measure uptime & availability across customer, product & infrastructure endpoints.
- Implement SLO, SLI & Error Budget concepts to measure performance & set up a maturity model
- Maintain & manage a code repository built for scale, with appropriate security measures
- Leverage Automation to push changes on monitoring tools
- Set up an orchestration mechanism for on-boarding & decommissioning to ensure Operational Readiness
- Set up dashboards & create visibility across all cross-functional teams
- Establish Telemetry for automated collection of data across Metrics, Logs & Traces
- Continuously analyze data to identify gaps and implement improvements
Minimum Requirements
- Associate’s degree (or equivalent) in Computer Science, Information Technology or a related field preferred
- 10-12 years of IT experience, with 6 years of monitoring experience
- Experience in administering monitoring tools – AppDynamics, SolarWinds, Grafana, Zabbix, Datadog, ELK Stack etc.
- Hands-on experience with Logs, Metrics, Traces, Parsing, RegEx, Tagging
- Hands-on experience implementing APM, EUM, Synthetics, API endpoint monitoring etc.
- Hands-on experience with integrations with ITSM tools such as ServiceNow & Jira
- Hands-on experience with Ansible, Python, Selenium, Shell
- Hands-on experience with enterprise-scale Azure, VMware & AWS environments
- Hands-on experience creating dashboards and performing analysis
- Excellent interpersonal and influencing skills: interacts appropriately with colleagues of many technical skill levels and remains calm and courteous while resolving problems in high-stress situations.
Skills:
- Technical Skills: Monitoring Tool Administration, Log Indexing & Pipelines, Azure, VMware, Ansible, Python, Selenium, Terraform, Shell, Windows, Linux, Grok parsing
- Problem-solving skills – Should be able to devise technical and creative solutions, and use analytics to understand patterns and proactively identify gaps
- Communication skills – Effective communication is key in this role to gather data about problems, prepare detailed notes and reports, and update users with further steps
- Time management – Must maintain excellent time management skills and be able to set priorities when handling multiple cases.
- Team collaboration – Routinely works with other functions to resolve user issues, so must collaborate successfully with team members and coworkers.
- Highly motivated, hands-on personality.
- Ability to learn quickly in a challenging environment.
Our Values
If you want to know the heart of a company, take a look at its values. Ours unite us. They are what drive our success – and the success of our customers. Does your heart beat like ours? Find out here: Core Values
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability or protected veteran status.