Overview
Job Title: Machine Learning Engineer (Inference & Systems), also known as Inference Engineer
Location: Work from home (fully remote, flexible)
Job Timing: Part-time with fully flexible hours; option to transition to full-time
About the Role
We’re building a next-generation cloud platform to serve multimodal AI at scale: LLMs, vision, audio, and other machine learning models. As a Machine Learning Engineer (Inference & Systems), you’ll design and optimize runtime systems, OpenAI-compatible APIs, and distributed GPU pipelines for fast, cost-efficient inference and fine-tuning.
You’ll work with frameworks like vLLM, TensorRT-LLM, and TGI to design, optimize, and deploy distributed inference engines that serve text, vision, and multimodal models with low latency and high throughput. This includes deploying models such as LLaMA 3 and Mistral, as well as diffusion, ASR, TTS, and embedding models, while focusing on GPU/accelerator optimizations, software–hardware co-design, and fault-tolerant large-scale systems that power real-world applications and developer tools.
You’ll work at the intersection of machine learning, cloud infrastructure, and systems engineering, focusing on high-throughput, low-latency inference and cost-efficient deployment. This role offers a unique opportunity to shape the future of AI inference infrastructure, from cutting-edge model serving systems to production-grade deployment pipelines.
If you’re passionate about pushing the boundaries of AI inference, we’d love to hear from you!
Key Responsibilities
- Deploy and maintain LLMs (e.g., LLaMA 3, Mistral) and ML models using serving engines such as vLLM, Hugging Face TGI, TensorRT-LLM, or FasterTransformer.
- Design and develop large-scale distributed inference engines for text, image, LLM, and multimodal models that are fault-tolerant, high-concurrency, high-performance, and cost-efficient.
- Implement and optimize distributed inference and parallelism strategies: Mixture of Experts (MoE), tensor parallelism, pipeline parallelism, and related techniques for high-performance serving.
- Integrate vLLM, TGI, SGLang, FasterTransformer, and explore emerging inference frameworks.
- Build and scale an OpenAI-compatible API layer to expose models for customer use (a minimal illustrative sketch follows this list).
- Experiment with model quantization, caching, and parallelism to reduce inference costs.
- Optimize GPU usage, memory, and batching to achieve low-latency, high-throughput inference.
- Optimize GPU performance using CUDA graph optimizations, TensorRT-LLM, Triton kernels, PyTorch compilation (torch.compile), quantization, and speculative decoding to maximize efficiency.
- Work with cloud GPU providers (RunPod, Vast.ai, AWS, GCP, Azure) to manage costs and availability.
- Develop runtime inference services and APIs for LLMs, multimodal models, and fine-tuning pipelines.
- Build monitoring and observability for inference services, integrating metrics (latency, throughput, GPU utilization) into dashboards and telemetry stacks (Grafana, Prometheus, Loki, OpenTelemetry).
- Collaborate with backend and DevOps engineers to ensure secure, reliable APIs with rate-limiting and billing hooks.
- Document deployment processes and provide guidance to other engineers using the platform.
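For a flavor of the serving work above, here is a minimal, illustrative sketch of exposing a model through vLLM’s OpenAI-compatible server and calling it with the standard OpenAI client. The model name, port, and parameters are placeholders for illustration, not a description of our production stack.

```python
# Illustrative sketch only. First launch vLLM's OpenAI-compatible server, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 2
# Then query it with the standard OpenAI client pointed at the local endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint/key

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Explain what an inference engine does."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

In production the same pattern sits behind batching, autoscaling, rate limiting, and billing hooks, which is where much of this role’s work lives.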
Requirements
- Experience: 3+ years in deep learning inference, fault-tolerant distributed systems, or high-performance computing.
- Proven experience in deploying ML/LLM models to production.
- Inference: Hands-on experience with at least one inference engine: vLLM, TGI, SGLang, TensorRT-LLM, FasterTransformer, or Triton.
- Runtime Services: Prior work implementing large-scale inference or serving pipelines.
- Solid understanding of GPU memory management, batching, and distributed inference; strong knowledge of GPU programming (CUDA, Triton, TensorRT), ML compilers, model quantization, and GPU cluster scheduling.
- Experienced in the GPU/ML stack, including PyTorch, Hugging Face Transformers, and GPU-accelerated inference.
- Deep understanding of Transformer architectures; LLM/VLM/diffusion model optimization; and KV-cache systems (e.g., Mooncake, PagedAttention, or custom in-house variants) that support long-context serving and inference optimization techniques.
- Comfortable working with cloud GPU platforms (AWS/GCP/Azure) or GPU marketplaces (RunPod, Vast.ai, TensorDock) to profile bottlenecks and optimize GPU utilization.
- Ability to benchmark and tune multi-GPU clusters for throughput and memory efficiency.
- Experience building REST APIs or gRPC services (FastAPI, Flask, or similar); a minimal illustrative sketch follows this list.
- Programming: Proficient in Python, Go, Rust, C++, and CUDA for high-performance systems.
- Systems knowledge: Distributed systems experience (storage, search, compute, or inference); strong understanding of multi-threading, memory management, networking, storage, and performance tuning.
- Familiarity with containerization (Docker) and orchestration (Kubernetes).
- Strong problem-solving and debugging skills across ML + infra stack.
- Familiarity with distributed storage (Ceph, HDFS, 3FS).
- Knowledge of datacenter networking (RDMA, RoCE).
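To illustrate the runtime-service side of these requirements, here is a minimal, hypothetical sketch of a FastAPI inference endpoint wrapping a Hugging Face pipeline. The model, route, and request schema are placeholders; a real deployment would sit behind a dedicated serving engine (vLLM, TGI, etc.) with proper batching and observability.

```python
# Hypothetical sketch of a small inference endpoint (FastAPI + Hugging Face pipeline).
# Model name, route, and schema are placeholders, not our production design.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Tiny placeholder model so the sketch runs anywhere; swap in a served LLM in practice.
generator = pipeline(
    "text-generation",
    model="gpt2",
    device=0 if torch.cuda.is_available() else -1,
)

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"text": out[0]["generated_text"]}
```

Run it with `uvicorn app:app` (assuming the file is saved as app.py) and POST a JSON body like `{"prompt": "Hello"}` to `/generate`.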
Nice to Have
- Experience with Stripe or other billing systems for metered API usage.
- Knowledge of Redis or Envoy for request rate limiting (a small Redis-based sketch follows this list).
- Familiarity with observability tools (Grafana, Prometheus, Loki).
- Exposure to MLOps pipelines (CI/CD with Azure DevOps or GitHub Actions).
- Experience with model fine-tuning pipelines and GPU scheduling.
- Understanding of rate limiting, quota enforcement, and billing hooks in ML APIs.
- Prior work at an AI infra company (Together.ai, Modal, Anyscale, Replicate, etc.)
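Since several of the items above touch rate limiting and metered usage, here is a minimal, hypothetical sketch of a fixed-window rate limiter using the redis-py client. The key format, limits, and window are placeholders; production systems typically use more robust schemes (token buckets, Envoy filters, etc.).

```python
# Hypothetical fixed-window rate limiter using redis-py; key names and limits are placeholders.
import time

import redis

r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis instance

def allow_request(api_key: str, limit: int = 60, window_s: int = 60) -> bool:
    """Allow at most `limit` requests per `window_s` seconds for a given API key."""
    bucket = f"ratelimit:{api_key}:{int(time.time() // window_s)}"
    count = r.incr(bucket)          # atomically count this request in the current window
    if count == 1:
        r.expire(bucket, window_s)  # expire the bucket after the window closes
    return count <= limit
```

The same counter can double as a metering signal for billing hooks, which is why the two concerns often live next to each other in an API gateway.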
Why Join Us?
- Work from Anywhere — 100% remote, with the freedom to work from anywhere in the world.
- Fully Flexible Shifts — complete control over your working hours; results matter more than clocking in.
- Career Growth & Fast-Track Promotions — clear pathways for advancement, with fast promotion opportunities for strong performers.
- Professional Development — training budget, mentorship, and exposure to cutting-edge AI/ML and cloud technologies.
- Global Collaboration — work with an international, diverse, and inclusive team.
- Innovative Environment — freedom to experiment with new tools, frameworks, and ideas.
- Accelerated Salary Growth + Performance Incentives — ambitious and hard-working team members are rewarded with fast upward salary progression alongside strong performance bonuses.