Inference Systems Engineer
Remote
Infrastructure / Serving Systems
$5,651 - $6,469 USD per month
Role Overview
As an Inference Systems Engineer, you will own the serving runtime that powers production LLM inference. This is a deeply technical role focused on system performance and stability: tuning request lifecycle behavior, ensuring streaming correctness, shaping batching/scheduling strategy, improving cache and memory behavior, and raising runtime execution efficiency. You will ship changes that improve time to first token (TTFT), p95/p99 latency, throughput, and cost efficiency while preserving correctness and reliability under multi-tenant load.
You will collaborate closely with platform/infrastructure operations, networking, and API/control-plane teams to ensure the serving system behaves predictably in production and can be debugged quickly when incidents occur. This role is for engineers who can reason about the entire inference pipeline, validate improvements with rigorous measurement, and operate with production‑grade discipline.
Responsibilities
- Own the end‑to‑end serving runtime behavior: request lifecycle, streaming semantics, cancellation, interaction with retries, timeouts, and consistent failure modes.
- Design and implement batching and scheduling strategy: dynamic batching, admission control, fairness under mixed tenants, priority lanes, and backpressure mechanisms to prevent cascading failures.
- Optimize performance at the systems level: reduce time‑to‑first‑token, improve tail latency stability, increase tokens/sec throughput, and improve accelerator utilization under realistic workloads.
- Improve memory behavior and cache efficiency: KV‑cache policies, fragmentation control, eviction strategies, and safeguards against OOM cliffs and performance thrashing.
- Drive runtime execution optimizations: operator‑level improvements, quantization integration, compilation/tuning paths where appropriate, and parameter choices that yield stable performance across deployments.
- Establish a performance measurement discipline: reproducible benchmarks, realistic traffic traces, profiling workflows, regression detection gates, and dashboards tied to production outcomes.
- Build production readiness into the system: feature‑flagged rollouts, canarying, safe configuration changes, and incident playbooks that reduce MTTR.
- Partner with networking and infrastructure operations to align deployment topology, failure domains, and capacity constraints with performance and reliability goals.
- Collaborate with product and API teams to ensure the serving layer's guarantees are reflected accurately in external interfaces and customer expectations.
Requirements
- 5+ years building high‑performance systems (model serving, GPU systems, performance engineering, or low‑latency distributed systems).
- Strong understanding of LLM inference tradeoffs: batching vs latency, prefill vs decode dynamics, cache behavior, memory pressure, and tail latency causes.
- Comfort working across Python/C++ stacks with production profiling and debugging tools.
- Track record of shipping performance improvements that hold up under production variance and operational constraints.
- Strong engineering hygiene: tests, instrumentation, documentation, and careful rollout discipline.
- Ability to communicate clearly across teams and operate calmly during incidents.