Platform System Engineer (AI Labs)

Krutrim

Palo Alto (CA)

On-site

USD 90,000 - 150,000

Full time

30+ days ago

Job summary

An innovative firm is looking for an AI Cloud Platform System Engineer to enhance and optimize their AI training and inference platforms. This role involves designing scalable solutions for distributed AI/ML systems, optimizing workload distribution across GPU clusters, and integrating cutting-edge frameworks. You will collaborate with AI researchers to improve model architectures and ensure a resilient platform for both training and production workloads. Join a dynamic team that values collaboration and problem-solving, where your contributions will directly impact the efficiency and effectiveness of AI systems in a fast-paced environment.

Qualifications

  • 2+ years in ML infrastructure with focus on LLM training/inference platforms.
  • Proficient in Kubernetes, PyTorch, and cloud-native systems.

Responsibilities

  • Design scalable platforms for distributed AI/ML training and serverless inference.
  • Optimize GPU clusters for performance and cost efficiency.

Skills

Machine Learning Infrastructure
Kubernetes
PyTorch
Distributed Training Optimization
Networking Solutions
Problem-Solving

Education

MS/PhD in Computer Science
Equivalent hands-on experience

Tools

AWS
GCP
Azure
NVIDIA Nsight
Kubeflow
Kafka

Job description

Job Title: AI Cloud Platform System Engineer

Position Type: Full-Time

Job Summary

We seek an AI Cloud Platform System Engineer to build, scale, and optimize platforms for LLM training, inference, and data. This role spans distributed training systems, GPU/CPU compute optimization, inference framework optimization, and the data platform that supports training and inference. You will ensure a resilient, cost-efficient platform for both training and production inference workloads, leveraging Kubernetes-native solutions.

Key Responsibilities

Distributed Training/Inference Platform Development
  • Design and maintain scalable platforms for distributed AI/ML training and serverless inference.
  • Optimize workload distribution across GPU clusters (e.g., model parallelism, mixed-precision training) for performance and cost.
  • Integrate frameworks like PyTorch, DeepSpeed, Triton, vLLM, and NVIDIA NeMo.
  • Collaborate with AI researchers to optimize model architectures for training/inference latency and throughput.
Platform & System Optimization
  • Compute: Profile and debug bottlenecks using tools like PyTorch Profiler and NVIDIA Nsight.
  • Storage/Caching: Build high-throughput data pipelines using S3, PVC, or distributed streaming (e.g., Kafka).
  • Networking: Reduce bottlenecks via RDMA/InfiniBand, NCCL, and TCP/IP tuning.
  • GPU Utilization: Implement kernel fusion, memory optimization, and auto-scaling.
  • Develop Kubernetes Custom Resource Definitions (CRDs) to automate deployment, scaling, fault recovery, and monitoring of AI workloads.
  • Build operators for intelligent resource scheduling, autoscaling (HPA/VPA), and fault tolerance for distributed training/inference jobs.
  • Build observability tools for GPU utilization, model latency, and system health.
  • Leverage tools like Kubeflow, KServe, KubeRay, or SkyPilot for workflow orchestration.

Preferred Qualifications

Technical Skills
  • 2+ years of experience in ML infrastructure (LLM training/inference platforms preferred).
  • Proficiency in Kubernetes (CRDs, Operators, Helm, Knative, KServe), PyTorch, and cloud-native systems (AWS/GCP/Azure).
  • Expertise in distributed training optimizations (e.g., NeMo, PyTorch, DeepSpeed) and inference frameworks (e.g., Triton, vLLM, SGLang).
  • Experience with networking (InfiniBand, NCCL) and storage solutions (S3, Ceph/MinIO, PVC).
Education & Soft Skills
  • MS/PhD in Computer Science, AI/ML, or equivalent hands-on experience.
  • Strong collaboration skills to interface with research and engineering teams.
  • Problem-solving agility to balance performance, cost, and scalability.

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Information Technology and Research

Industries

Research Services
