Lead AI Runtime Engineer

Bespoke Technologies

Bengaluru

On-site

INR 15,00,000 - 25,00,000

Full time

Today

Job summary

A leading technology firm in Bengaluru seeks an AI Runtime Engineer to design and optimize runtime infrastructure for AI models. This hands-on leadership position involves owning core architecture, enhancing system performance, and collaborating with cross-functional teams. The ideal candidate has 3+ years in systems engineering with strong skills in Python and C++, and experience with distributed systems and AI runtime optimization. Join us to push the boundaries of AI capabilities.

Qualifications

  • 3+ years of experience in systems/software engineering with exposure to AI runtime.
  • Experience in delivering PaaS services.
  • Proven experience optimizing and scaling learning runtimes.
  • Strong programming skills in Python and C++.

Responsibilities

  • Own the core runtime architecture supporting AI training and inference at scale.
  • Profile and enhance low-level system performance across training and inference pipelines.
  • Design and maintain libraries and services supporting the model lifecycle.
  • Work cross-functionally with Research and Infrastructure teams.

Skills

Python
C++
AI runtime optimization
Distributed systems
Containerized workloads

Tools

Kubernetes
Ray
PyTorch
TensorFlow

Job description

As Lead/Staff AI Runtime Engineer, you'll play a pivotal role in the design, development, and optimization of the core runtime infrastructure that powers distributed training and deployment of large AI models (LLMs and beyond).

This is a hands-on leadership role, perfect for a systems-minded software engineer who thrives at the intersection of AI workloads, runtimes, and performance-critical infrastructure. You'll own critical components of the PyTorch-based stack, lead technical direction, and collaborate across engineering, research, and product to push the boundaries of elastic, fault-tolerant, high-performance model execution.

What you’ll do:
Lead Runtime Design & Development
  • Own the core runtime architecture supporting AI training and inference at scale.
  • Design resilient and elastic runtime features (e.g. dynamic node scaling, job recovery) within our custom PyTorch stack.
  • Optimize distributed training reliability, orchestration, and job‑level fault tolerance.
Drive Performance at Scale
  • Profile and enhance low‑level system performance across training and inference pipelines.
  • Improve packaging, deployment, and integration of customer models in production environments.
  • Ensure consistent throughput, latency, and reliability metrics across multi‑node, multi‑GPU setups.
Build Internal Tooling & Frameworks
  • Design and maintain libraries and services that support the model lifecycle: training, checkpointing, fault recovery, packaging, and deployment.
  • Implement observability hooks, diagnostics, and resilience mechanisms for deep learning workloads.
  • Champion best practices in CI/CD, testing, and software quality across the AI Runtime stack.
Collaborate & Mentor
  • Work cross‑functionally with Research, Infrastructure, and Product teams to align runtime development with customer and platform needs.
  • Guide technical discussions, mentor junior engineers, and help scale the AI Runtime team’s capabilities.
What you’ll need to be successful:
  • 3+ years of experience in systems/software engineering, with deep exposure to AI runtime, distributed systems, or compiler/runtime interaction.
  • Experience in delivering PaaS services.
  • Proven experience optimizing and scaling learning runtimes (e.g. PyTorch, TensorFlow, JAX) for large-scale training and/or inference.
  • Strong programming skills in Python and C++ (Go or Rust is a plus).
  • Familiarity with distributed training frameworks, low-level performance tuning, and resource orchestration.
  • Experience with multi-GPU, multi-node, or cloud-native AI workloads.
  • Solid understanding of containerized workloads, job scheduling, and failure recovery in production environments.
Bonus Points:
  • Contributions to PyTorch internals or open-source DL infrastructure projects.
  • Familiarity with LLM training pipelines, checkpointing, or elastic training orchestration.
  • Experience with Kubernetes, Ray, TorchElastic, or custom AI job orchestrators.
  • Background in systems research, compilers, or runtime architecture for HPC or ML.