Platform System Engineer (AI Labs)

Krutrim

Palo Alto (CA)

On-site

USD 90,000 - 150,000

Full time

30+ days ago

Job summary

An innovative firm is looking for an AI Cloud Platform System Engineer to enhance and optimize their AI training and inference platforms. This role involves designing scalable solutions for distributed AI/ML systems, optimizing workload distribution across GPU clusters, and integrating cutting-edge frameworks. You will collaborate with AI researchers to improve model architectures and ensure a resilient platform for both training and production workloads. Join a dynamic team that values collaboration and problem-solving, where your contributions will directly impact the efficiency and effectiveness of AI systems in a fast-paced environment.

Qualifications

2+ years in ML infrastructure, with a focus on LLM training/inference platforms.
Proficient in Kubernetes, PyTorch, and cloud-native systems.

Responsibilities

Design scalable platforms for distributed AI/ML training and serverless inference.
Optimize GPU clusters for performance and cost efficiency.

Skills

Machine Learning Infrastructure

Kubernetes

PyTorch

Distributed Training Optimization

Networking Solutions

Problem-Solving

Education

MS/PhD in Computer Science

Equivalent hands-on experience

Tools

AWS

GCP

Azure

NVIDIA Nsight

Kubeflow

Kafka

Job Title: AI Cloud Platform System Engineer

Position Type: Full-Time

Job Summary

We seek an AI Cloud Platform System Engineer to build, scale, and optimize our LLM training, inference, and data platforms. This role spans distributed training systems, GPU/CPU compute optimization, inference framework optimization, and the data platform that supports training and inference. You will ensure a resilient, cost-efficient platform for both training and production inference workloads, leveraging Kubernetes-native solutions.

Key Responsibilities

Distributed Training/Inference Platform Development

Design and maintain scalable platforms for distributed AI/ML training and serverless inference.
Optimize workload distribution across GPU clusters (e.g., model parallelism, mixed-precision training) for performance and cost.
Integrate frameworks like PyTorch, DeepSpeed, Triton, vLLM, and NVIDIA NeMo.
Collaborate with AI researchers to optimize model architectures for training/inference latency and throughput.

Platform & System Optimization

Compute: Profile and debug bottlenecks using tools like PyTorch Profiler and NVIDIA Nsight.
Storage/Caching: Build high-throughput data pipelines using S3, PVC, or distributed streaming (e.g., Kafka).
Networking: Reduce bottlenecks via RDMA/InfiniBand, NCCL, and TCP/IP tuning.
GPU Utilization: Implement kernel fusion, memory optimization, and auto-scaling.
Develop Kubernetes Custom Resource Definitions (CRDs) to automate deployment, scaling, fault recovery, and monitoring of AI workloads.
Build operators for intelligent resource scheduling, autoscaling (HPA/VPA), and fault tolerance for distributed training/inference jobs.
Build observability tools for GPU utilization, model latency, and system health.
Leverage tools like Kubeflow, KServe, KubeRay, or SkyPilot for workflow orchestration.

Preferred Qualifications

Technical Skills

2+ years of experience in ML infrastructure (LLM training/inference platforms preferred).
Proficiency in Kubernetes (CRDs, Operators, Helm, Knative, KServe), PyTorch, and cloud-native systems (AWS/GCP/Azure).
Expertise in distributed training optimization (e.g., NeMo, PyTorch, DeepSpeed) and inference frameworks (e.g., Triton, vLLM, SGLang).
Experience with networking (InfiniBand, NCCL) and storage solutions (S3, Ceph/MinIO, PVC).

Education & Soft Skills

MS/PhD in Computer Science, AI/ML, or equivalent hands-on experience.
Strong collaboration skills to interface with research and engineering teams.
Problem-solving agility to balance performance, cost, and scalability.

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Information Technology and Research

Industries

Research Services
