- Oversee deployment, configuration, and lifecycle management of internal AI-driven productivity tools and proprietary AI applications.
- Ensure the reliability, uptime, and high performance of AI workloads and services. Drive observability practices with robust monitoring and alerting in place.
- Architect and maintain scalable, resilient infrastructure to support AI usage across thousands of users. Plan and manage resource capacity to meet growth demands.
- Build and maintain automation (IaC and CI/CD pipelines) to accelerate environment setup, monitoring, and support. Take part in sandbox testing of new use cases.
- Partner closely with engineering, ML, infosec, and business operations teams to deploy and support AI solutions that drive internal productivity.
- Apply best practices in data protection, access controls, and audit readiness, especially in environments subject to regulatory oversight.
- Be part of the on-call rotation and handle troubleshooting, root cause analysis, and response for AI-related outages or degradation.
- Drive deployment efforts across major public cloud platforms (AWS/GCP), leveraging native services for compute, orchestration, and security.
- Write, debug, and optimize code (Python, Java, or Go preferred) supporting integrations and back-end services for AI-based tooling.
- Present technical insights, incident reports, and roadmap plans to both technical peers and non-technical leadership.
- Strong experience in a site reliability or infrastructure engineering role supporting enterprise platforms
- Direct experience deploying or supporting AI tools or intelligent automation platforms
- Deep expertise with cloud-native services in AWS and/or GCP
- Comfortable coding in Python, Java, or Go, especially in back-end systems or automation pipelines
- Proficient with tools like Terraform, Ansible, Bash, and observability stacks (e.g., Prometheus, Grafana, Datadog)
- Working knowledge of security and privacy frameworks, ideally within regulated industries (finance, healthcare, etc.)
- Hands-on experience in incident response, playbook creation, and postmortem analysis
- Confident communicating across business, technical, and leadership stakeholders
Seniority level
Mid-Senior level
Employment type
Job function
Information Technology
Industries
Staffing and Recruiting
Referrals increase your chances of interviewing at Madison-Davis, LLC by 2x