AI Inference Engineer

Perplexity AI

City Of London

On-site

GBP 60,000 - 80,000

Full time

30+ days ago

Job summary

An innovative AI company based in London is seeking an AI Inference Engineer to develop and optimize AI inference APIs for real-time applications. The ideal candidate has experience with ML systems and deep learning frameworks such as PyTorch and TensorFlow. Responsibilities include improving system reliability and exploring novel techniques for LLM inference optimization. Competitive compensation is offered.

Qualifications

Experience with deep learning frameworks (e.g., PyTorch, TensorFlow).
Familiarity with LLM architectures and optimization techniques.
Experience deploying distributed real-time model serving.

Responsibilities

Develop APIs for AI inference for internal and external use.
Benchmark and address bottlenecks in the inference stack.
Improve system reliability and observability.

Skills

Experience with ML systems and deep learning frameworks

Familiarity with common LLM architectures

Experience with deploying reliable model serving

Understanding of GPU architectures

Tools

PyTorch

TensorFlow

CUDA

Overview

Perplexity is an AI-powered answer engine founded in December 2022 and growing rapidly as one of the world’s leading AI platforms. Our objective is to build accurate, trustworthy AI that powers decision-making for people, and for assistive AI, wherever decisions are being made. Our current stack includes Python, Rust, C++, PyTorch, Triton, CUDA, and Kubernetes. You will have the opportunity to work on large-scale deployment of machine learning models for real-time inference.

Responsibilities

Develop APIs for AI inference that will be used by both internal and external customers
Benchmark and address bottlenecks throughout our inference stack
Improve the reliability and observability of our systems and respond to system outages
Explore novel research and implement LLM inference optimizations

Qualifications

Experience with ML systems and deep learning frameworks (e.g. PyTorch, TensorFlow, ONNX)
Familiarity with common LLM architectures and inference optimization techniques (e.g. continuous batching, quantization)
Experience with deploying reliable, distributed, real-time model serving at scale
(Optional) Understanding of GPU architectures or experience with GPU kernel programming using CUDA
