Senior AI SRE

Madison-Davis, LLC

United States

Remote

USD 64,000 - 720,000

Full time

5 days ago

Job summary

A leading company is seeking a Senior Site Reliability Engineer to oversee the deployment and management of internal AI-driven productivity tools. The role involves ensuring the reliability of AI workloads, architecting scalable infrastructure, and collaborating with engineering, ML, infosec, and business operations teams. Candidates should have strong experience in site reliability, cloud-native services in AWS and/or GCP, and coding in Python, Java, or Go. This position offers a competitive salary and the opportunity to work in a dynamic environment focused on AI solutions.

Qualifications

Strong experience in site reliability or infrastructure engineering.
Direct experience deploying or supporting AI tools.
Deep expertise with cloud-native services in AWS and/or GCP.

Responsibilities

Oversee deployment and management of AI-driven productivity tools.
Architect scalable infrastructure for AI usage.
Drive deployment efforts across major public cloud platforms.

Skills

Python

Java

Terraform

Ansible

Bash

Prometheus

Grafana

Datadog

Job description

Oversee deployment, configuration, and lifecycle management of internal AI-driven productivity tools and proprietary AI applications.
Ensure the reliability, uptime, and high performance of AI workloads and services. Drive observability practices with robust monitoring and alerting in place.
Architect and maintain scalable, resilient infrastructure to support AI usage across thousands of users. Plan and manage resource capacity to meet growth demands.
Build and maintain automation (IaC and CI/CD pipelines) to accelerate environment setup, monitoring, and support. Participate in sandbox testing environments for new use cases.
Partner closely with engineering, ML, infosec, and business operations teams to deploy and support AI solutions that drive internal productivity.
Apply best practices in data protection, access controls, and audit-readiness—especially in environments subject to regulatory oversight.
Be part of the on-call rotation and handle troubleshooting, root cause analysis, and response for AI-related outages or degradation.
Drive deployment efforts across major public cloud platforms (AWS/GCP), leveraging native services for compute, orchestration, and security.
Write, debug, and optimize code (Python, Java, or Go preferred) supporting integrations and back-end services for AI-based tooling.
Present technical insights, incident reports, and roadmap plans to both technical peers and non-technical leadership.
Requirements

Strong experience in a site reliability or infrastructure engineering role supporting enterprise platforms.
Direct experience deploying or supporting AI tools or intelligent automation platforms.
Deep expertise with cloud-native services in AWS and/or GCP.
Comfortable coding in Python, Java, or Go, especially in back-end systems or automation pipelines.
Proficient with tools like Terraform, Ansible, Bash, and observability stacks (e.g., Prometheus, Grafana, Datadog).
Working knowledge of security and privacy frameworks, ideally within regulated industries (finance, healthcare, etc.).
Hands-on experience in incident response, playbook creation, and postmortem analysis.
Confident communicating across business, technical, and leadership stakeholders.

Seniority level

Mid-Senior level

Employment type

Contract

Job function

Information Technology

Industries

Staffing and Recruiting
