A leading AI company in Germany is seeking an Inference Optimization Engineer to enhance the performance of large language models at the GPU level. You will address bottlenecks, optimize resource efficiency, and contribute to open source projects. Ideal candidates have a deep understanding of transformer architecture and experience with inference engines. This position offers competitive salary, equity, and the flexibility to work remotely.
Role:
As an Inference Optimization Engineer, you will improve the speed and efficiency
of large language models at the GPU kernel level, through the inference engine,
and across distributed architectures. You will profile real workloads, remove
bottlenecks, and lift each layer of the stack to new performance ceilings. Every
gain you unlock will flow straight into open source code and power fleets of
production models, cutting GPU costs for teams around the world. By publishing
blog posts and giving conference talks, you will become a trusted voice on
efficient LLM inference at large scale.
Example projects:
https://bentoml.com/blog/structured-decoding-in-vllm-a-gentle-introduction
https://www.bentoml.com/blog/benchmarking-llm-inference-backends
https://bentoml.com/blog/25x-faster-cold-starts-for-llms-on-kubernetes
Responsibilities:
Latency & throughput: Identify bottlenecks and optimize inference efficiency in single-GPU, multi-GPU, and multi-node serving setups.
Benchmarking: Build repeatable tests that model production traffic; track and report results across vLLM, SGLang, TRT-LLM, and future runtimes.
Resource efficiency: Reduce memory use and compute cost with mixed precision, better KV-cache handling, quantization, and speculative decoding.
Serving features: Improve batching, caching, load balancing, and model-parallel execution.
Knowledge sharing: Write technical posts, contribute code, and present findings to the open-source community.
Qualifications:
Deep understanding of transformer architecture and inference engine internals.
Hands-on experience speeding up model serving through batching, caching, and load balancing.
Experienced with inference engines such as vLLM, SGLang, or TRT-LLM (upstream contributions are a plus).
Experienced with inference optimization techniques: quantization, distillation, speculative decoding, or similar.
Proficiency in CUDA and with profiling tools such as Nsight, nvprof, or CUPTI; experience with Triton and ROCm is a bonus.
Track record of blog posts, conference talks, or open-source projects in ML systems is a bonus.
Why join us:
Direct impact – ship optimizations that cut real GPU costs for production LLM deployments.
Technical scope – operate distributed LLM inference and large GPU clusters worldwide.
Customer reach – support organizations around the globe that rely on BentoML.
Influence – mentor teammates, guide open-source contributors, and become a go-to voice on efficient inference in the community.
Remote work – work from where you are most productive and collaborate with teammates in North America and Asia.
Compensation – competitive salary, equity, learning budget, and paid conference travel.