Senior Software Engineer, MLOps

Rivian

Remote

GBP 70,000 - 90,000

Full time

Yesterday

Job summary

A leading electric vehicle manufacturer in the United Kingdom is seeking a skilled MLOps Engineer. This role involves architecting large-scale ML training platforms using Kubernetes, optimizing GPU resource allocation, and owning continuous training and deployment pipelines. The ideal candidate will have over 5 years of engineering experience, with a strong background in MLOps and proficiency in Python. Join us in revolutionizing electric vehicles while ensuring that data-driven decisions are at the heart of our innovation. Competitive compensation and benefits are available.

Benefits

  • Health insurance
  • Remote work options

Qualifications

  • 5+ years of engineering experience, minimum 3 years specifically in MLOps.
  • Extensive experience in Kubernetes for ML workloads.
  • Strong proficiency in Python and familiarity with Go.
  • Hands-on experience with Ray, Spark, or Kubeflow.

Responsibilities

  • Architect large-scale ML training platforms with Kubernetes.
  • Optimize GPU utilization and scheduling logic.
  • Own CI/CD pipelines for machine learning.
  • Build low-latency inference services.

Skills

  • Kubernetes expertise
  • ML frameworks proficiency
  • Python programming
  • Distributed compute experience
  • Cloud native ML knowledge

Experience

  • 5+ years of engineering experience
  • 3+ years in MLOps or ML Infrastructure

Tools

  • TensorFlow
  • Ray
  • AWS services
  • Docker
  • Terraform

Job description

About Us

Rivian is on a mission to keep the world adventurous forever. This goes for the emissions‑free Electric Adventure Vehicles we build, and the curious, courageous souls we seek to attract.

As a company, we constantly challenge what’s possible, never simply accepting what has always been done. We reframe old problems, seek new solutions and operate comfortably in areas that are unknown. Our backgrounds are diverse, but our team shares a love of the outdoors and a desire to protect it for future generations.

Responsibilities
  • Architect Training Platforms: Lead the design and implementation of large‑scale distributed training clusters using Kubernetes (EKS) and framework‑native distributed strategies (e.g., PyTorch Distributed, Ray Train).
  • Orchestrate GPU Resources: Optimize GPU utilization and scheduling logic (using tools like Kueue, Volcano, and Karpenter) to maximize training throughput and minimize idle costs across thousands of GPUs.
  • ML CI/CD (CT/CD): Own the pipelines for Continuous Training and Continuous Deployment, automating the path from code commit through training, model evaluation, and model registration to deployment.
  • Model Serving Infrastructure: Build and optimize high‑throughput, low‑latency inference services using technologies like NVIDIA Triton, TorchServe, or vLLM.
  • Observability for ML: Implement monitoring specifically for ML workloads, including GPU‑level metrics, training stability, model drift, and inference latency (using Prometheus, Grafana, Weights & Biases, or similar).
  • Developer Experience: Create abstractions and CLI tools that allow Data Scientists to launch experiments without needing deep Kubernetes expertise.
  • Cost Optimization: Drive cost‑efficiency strategies for AWS GPU instances (Spot instances, mixed‑instance policies) and storage tiers.
  • Fault Tolerance: Design checkpointing and recovery strategies for long‑running training jobs to ensure resilience against node failures (a minimal checkpoint/resume sketch follows this list).
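
As a rough illustration of the fault‑tolerance responsibility above, the sketch below shows a checkpoint‑and‑resume pattern for a single‑process PyTorch training loop. This is a minimal sketch, not Rivian's implementation: the checkpoint path, model, and save interval are hypothetical, and distributed coordination (DDP/FSDP, Kueue scheduling, preemption signals) is deliberately left out.

    # Minimal checkpoint/resume sketch for a long-running training job (illustrative only).
    # Assumes a shared filesystem path visible across restarts; the model, path, and
    # interval below are hypothetical, and distributed training is intentionally omitted.
    import os
    import torch
    import torch.nn as nn

    CKPT_PATH = "/mnt/shared/checkpoints/latest.pt"  # hypothetical shared location
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)

    model = nn.Linear(128, 10)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    start_step = 0

    # Resume if a previous attempt left a checkpoint behind (e.g. after a node failure).
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, 10_000):
        optimizer.zero_grad()
        loss = model(torch.randn(32, 128)).sum()  # stand-in for a real training step
        loss.backward()
        optimizer.step()

        # Periodically persist state so a rescheduled pod can pick up where it left off.
        if step % 500 == 0:
            tmp = CKPT_PATH + ".tmp"
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step}, tmp)
            os.replace(tmp, CKPT_PATH)  # atomic rename avoids half-written checkpoints

In practice the same pattern is usually wired into the training framework's own checkpoint hooks and the cluster scheduler's preemption handling rather than hand-rolled like this.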
Qualifications
  • 5+ years of engineering experience, with at least 3 years specifically in MLOps or ML Infrastructure.
  • Deep Kubernetes Expertise: Extensive experience managing EKS for batch workloads, including familiarity with CRDs, Operators, and scheduling specifically for ML (e.g., KubeRay, MPIOperator).
  • ML Frameworks: Strong familiarity with the operational side of PyTorch, TensorFlow, or JAX. You understand how distributed data parallel (DDP) and FSDP work at an infrastructure level.
  • Distributed Compute: Hands‑on experience with orchestration frameworks like Ray, Spark, or Kubeflow.
  • Infrastructure as Code: Proficiency with Terraform, AWS CDK, or Helm for defining ML infrastructure.
  • Cloud Native ML: Experience with AWS services specific to ML (SageMaker, FSx for Lustre, EFA/Elastic Fabric Adapter networking) or similar experience from GCP or Azure.
  • Programming: Strong proficiency in Python (required for ML tooling) and Go (preferred for K8s controllers/infra).
  • Model Lifecycle: Experience with Model Registries (MLflow or similar) and Feature Stores (a minimal registry sketch follows this list).
  • Containerization: Expertise in optimizing Docker containers for GPU workloads (multi‑stage builds, CUDA drivers, reducing image size).
  • Debugging: Experience performing Root Cause Analysis (RCA) on complex distributed systems (e.g., diagnosing NCCL communication hangs or OOM errors).
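
As a small illustration of the model‑lifecycle qualification above, the sketch below logs a model to an MLflow tracking server and registers it as a new version. It is a minimal sketch under assumed names: the tracking URI, experiment, and registered model name are placeholders, not anything specified in this posting.

    # Minimal MLflow Model Registry sketch (illustrative; URI and names are placeholders).
    import mlflow
    import mlflow.pytorch
    import torch.nn as nn

    mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical tracking server
    mlflow.set_experiment("demo-experiment")                 # hypothetical experiment name

    model = nn.Linear(128, 10)  # stand-in for a trained model

    with mlflow.start_run():
        mlflow.log_metric("eval_accuracy", 0.91)  # illustrative evaluation metric
        # Log the model artifact and register it as a new version in the registry.
        mlflow.pytorch.log_model(
            model,
            artifact_path="model",
            registered_model_name="demo-model",
        )

From there, a CT/CD pipeline can promote specific registered versions toward deployment instead of shipping loose artifacts.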
Bonus Points
  • Experience with NVIDIA Triton Inference Server or TensorRT optimization.
  • Knowledge of high‑performance networking (Infiniband, EFA, RDMA).
  • Contributions to open‑source MLOps projects (Ray, Kubeflow, etc.).
Equal Opportunity

Rivian is an equal opportunity employer and complies with all applicable federal, state, and local fair employment practices laws. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, sex, sexual orientation, gender, gender expression, gender identity, genetic information or characteristics, physical or mental disability, marital/domestic partner status, age, military/veteran status, medical condition, or any other characteristic protected by law.

Rivian is committed to ensuring that our hiring process is accessible for persons with disabilities. If you have a disability or limitation, such as those covered by the Americans with Disabilities Act, that requires accommodations to assist you in the search and application process, please email us at candidateaccommodations@rivian.com.

Please note that we are currently not accepting applications from third-party application services.
