Site Reliability Engineer - AI Cloud

Super Micro Computer Spain, S.L.

San Jose (CA)

On-site

USD 145,000 - 165,000

Full time

4 days ago

Job summary

A leading technology company is seeking a Site Reliability Engineer for their AI Cloud platforms. The role involves deploying and scaling cloud infrastructure, enhancing observability, and ensuring high availability across GPU-accelerated environments. Candidates with strong Linux, Kubernetes, and automation skills will thrive in this dynamic position.

Qualifications

8 years of experience required in relevant fields.
Proficiency in GPU compute clusters (NVIDIA/CUDA).
Experience with monitoring tools (Prometheus, Grafana).

Responsibilities

Design and provision cloud infrastructure using Infrastructure as Code.
Implement observability tools to monitor system health.
Collaborate with DevOps and MLOps teams to ensure reliable delivery pipelines.

Skills

Linux

Containers

Orchestration

Scripting

Observability tools

Collaboration

Communication skills

Education

Bachelor’s degree in Computer Science, Engineering, or related field

Tools

Terraform

Ansible

Kubernetes

Python

Date: Jun 12, 2025

Location: San Jose, California, United States

Company: Super Micro Computer

Job Req ID: 26861

About Supermicro

Supermicro is a top-tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC, and IoT/Embedded customers worldwide. We are the #5 fastest growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.

Job Summary

As a Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy and scale those platforms and ensure high availability, performance, and security across GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage/network infrastructure. You’ll bridge Dev and Ops by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI and MLOps environments.

Essential Duties And Responsibilities

Includes the following essential duties and responsibilities (other duties may also be assigned):

Cloud Infra Automation: Design and provision cloud infrastructure using Infrastructure as Code (Terraform, Ansible, or Helm) on bare metal or cloud platforms. Develop custom automation and tooling in Python or Go to extend deployment workflows and streamline operations.
Platform Reliability: Deploy, scale, maintain, and optimize uptime for AI cloud services including GPU clusters, Kubernetes (K8s), and storage systems (e.g., Ceph, BeeGFS, or Weka). Understand the tools required to benchmark and assure consistent application performance.
Monitoring & Alerting: Implement observability tools (e.g., Prometheus, Grafana, ELK, Loki, Fluentd) to monitor system health and alert on anomalies or performance degradation.
Capacity Planning: Analyze usage trends and forecast infrastructure needs to support AI workloads and large-scale model training/inference.
Incident Management: Lead root cause analysis and resolution for system outages or degraded performance. Define and maintain service level objectives (SLOs), indicators (SLIs), and agreements (SLAs) aligned with uptime and performance goals.
CI/CD Integration: Collaborate with DevOps and MLOps teams to ensure reliable delivery pipelines using GitLab CI/CD, ArgoCD, or similar tools.
Security & Compliance: Harden Linux systems, manage TLS certificates, and enforce secure access controls via Role-Based Access Control (RBAC), LDAP-integrated SSO, and network segmentation policies.
Documentation & Playbooks: Maintain clear, version-controlled documentation, including architecture diagrams, runbooks, and incident response playbooks to support cross-team knowledge transfer and rapid onboarding.

Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience), plus 8 years of experience in the areas below.
Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes).
Experience managing GPU compute clusters (NVIDIA/CUDA, AMD/ROCm).
Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.).
Strong scripting and coding skills (Bash, Python, or Go).
Exposure to secure multi-tenant environments and zero trust architectures.
Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics.
Excellent collaboration and communication skills for cross-team, partner, and customer initiatives.

Preferred Qualifications

Understanding of AI/ML reference architectures and experience with workflows, MLflow, or Kubeflow.
Familiarity with storage backends optimized for AI (CephFS, BeeGFS, WekaFS).
Prior experience in bare-metal provisioning via PXE, Ironic, or Foreman.
Understanding of NVIDIA GPU telemetry and NCCL testing for performance benchmarking.
Familiarity with ITIL processes or structured change management in production systems is a plus.
Certifications: CKA, CKAD, Linux+, or related credentials.

Salary Range

$145,000 - $165,000

The salary offered will depend on several factors, including your location, level, education, training, specific skills, years of experience, and comparison to other employees already in this role. In addition to a comprehensive benefits package, candidates may be eligible for other forms of compensation, such as participation in bonus and equity award programs.

EEO Statement

Supermicro is an Equal Opportunity Employer and embraces diversity in our employee population. It is the policy of Supermicro to provide equal opportunity to all qualified applicants and employees without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, protected veteran status or special disabled veteran, marital status, pregnancy, genetic information, or any other legally protected status.

Job Segment: Cloud, Linux, Computer Science, Engineer, Change Management, Technology, Engineering, Management

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

IT Services and IT Consulting
