
Research Engineer - Distributed Training

CloudWalk

Brazil

Remote

BRL 150,000 - 200,000

Full-time

Posted yesterday

Job summary

A financial technology company in Brazil seeks a Research Engineer to design and evolve its distributed training stack for large language models. The role includes optimizing performance on multi-GPU systems and integrating cutting-edge frameworks into production. Ideal candidates will have a strong background in PyTorch and distributed training techniques. This position offers a competitive salary and equity in a leading AI infrastructure firm.

Benefits

Competitive salary
Equity
Opportunity to shape AI infrastructure

Qualifications

  • Strong background in PyTorch and distributed training.
  • Hands-on experience with large-scale multi-GPU or multi-node training.
  • Familiarity with Transformers, Datasets, and mixed-precision techniques.

Responsibilities

  • Design, implement, and maintain a distributed LLM training pipeline.
  • Orchestrate multi-node, multi-GPU runs across Kubernetes.
  • Optimize performance, memory, and cost across large training workloads.

Skills

PyTorch
Distributed training
DeepSpeed
FSDP
Accelerate

Tools

Ray
MLflow
W&B
Kubernetes
Slurm

Job description

About CloudWalk:

CloudWalk is building the intelligent infrastructure for the future of financial services. Powered by AI, blockchain, and thoughtful design, our systems serve millions of entrepreneurs across Brazil and the US every day.

Our AI team trains large-scale language models that power real products - from payment intelligence and credit scoring to on-device assistants for merchants.

About the Role

We’re looking for a Research Engineer to design, scale, and evolve CloudWalk’s distributed training stack for large language models. You’ll work at the intersection of research and infrastructure - running experiments across DeepSpeed, FSDP, Hugging Face Accelerate, and emerging frameworks like Unsloth, TorchTitan, and Axolotl.

You’ll own the full training lifecycle: from cluster orchestration and data streaming to throughput optimization and checkpointing at scale. If you enjoy pushing the limits of GPUs, distributed systems, and next-generation training frameworks, this role is for you.
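
As a rough illustration of the data-streaming part of that lifecycle, the sketch below streams a Hugging Face dataset and shards it across ranks. The dataset name, buffer size, and launcher environment variables are placeholder assumptions, not details from the posting.

  import os

  from datasets import load_dataset
  from datasets.distributed import split_dataset_by_node

  # Assumed: RANK and WORLD_SIZE are exported by the launcher (torchrun, Kubernetes, Slurm).
  rank = int(os.environ.get("RANK", 0))
  world_size = int(os.environ.get("WORLD_SIZE", 1))

  # Stream the corpus instead of downloading it; the dataset name is a placeholder.
  stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
  stream = stream.shuffle(buffer_size=10_000, seed=42)

  # Give each rank a disjoint shard of the stream.
  shard = split_dataset_by_node(stream, rank=rank, world_size=world_size)

  for example in shard.take(4):
      print(rank, example["text"][:80])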

Responsibilities
  • Design, implement, and maintain CloudWalk’s distributed LLM training pipeline.
  • Orchestrate multi-node, multi-GPU runs across Kubernetes and internal clusters (see the rendezvous sketch after this list).
  • Optimize performance, memory, and cost across large training workloads.
  • Integrate cutting-edge frameworks (Unsloth, TorchTitan, Axolotl) into production workflows.
  • Build internal tools and templates that accelerate research-to-production transitions.
  • Collaborate with infra, research, and MLOps teams to ensure reliability and reproducibility.
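
To make the orchestration bullet concrete, here is a minimal sketch of the rendezvous side of a multi-node, multi-GPU run. It assumes a torchrun-style launcher that exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT; the posting does not specify which launcher is used.

  import os

  import torch
  import torch.distributed as dist

  def init_distributed() -> tuple[int, int, int]:
      """Join the process group using environment variables set by the launcher."""
      # Assumed: torchrun, a Kubernetes indexed Job, or Slurm exports the rendezvous variables.
      dist.init_process_group(backend="nccl", init_method="env://")
      local_rank = int(os.environ["LOCAL_RANK"])
      torch.cuda.set_device(local_rank)
      return dist.get_rank(), dist.get_world_size(), local_rank

  if __name__ == "__main__":
      rank, world_size, local_rank = init_distributed()
      print(f"rank {rank}/{world_size} on local GPU {local_rank}")
      dist.destroy_process_group()

A launcher such as torchrun --nnodes=2 --nproc_per_node=8 train.py (or an equivalent Kubernetes or Slurm job) would start one copy of this per GPU; train.py is a hypothetical entry point, not a file named in the posting.
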
Requirements
  • Strong background in PyTorch and distributed training (DeepSpeed, FSDP, Accelerate); a minimal FSDP sketch follows this list.
  • Hands‑on experience with large‑scale multi‑GPU or multi‑node training.
  • Familiarity with Transformers, Datasets, and mixed‑precision techniques.
  • Understanding of GPUs, containers, and schedulers (Kubernetes, Slurm).
  • Mindset for reliability, performance, and clean engineering.
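
To make the FSDP and mixed-precision requirements concrete, here is a minimal sketch. The toy model, wrapping policy, and hyperparameters are placeholders; a real LLM run would typically use a transformer-block auto-wrap policy on a Transformers model.

  import functools
  import os

  import torch
  import torch.distributed as dist
  import torch.nn as nn
  from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
  from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

  dist.init_process_group(backend="nccl")  # assumes a torchrun-style launcher
  local_rank = int(os.environ["LOCAL_RANK"])
  torch.cuda.set_device(local_rank)

  # Toy stand-in for an LLM.
  model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()

  # bf16 for parameters, gradient reductions, and buffers.
  bf16 = MixedPrecision(
      param_dtype=torch.bfloat16,
      reduce_dtype=torch.bfloat16,
      buffer_dtype=torch.bfloat16,
  )

  model = FSDP(
      model,
      mixed_precision=bf16,
      auto_wrap_policy=functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000),
      device_id=local_rank,
  )

  optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
  loss = model(torch.randn(2, 4096, device="cuda")).sum()
  loss.backward()
  optimizer.step()
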
Bonus
  • Experience with Ray, MLflow, or W&B.
  • Knowledge of ZeRO, model parallelism, or pipeline parallelism (a ZeRO configuration sketch follows this list).
  • Curiosity for emerging open‑source stacks like Unsloth, TorchTitan, and Axolotl.
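
For the ZeRO bonus item, a minimal sketch of how a ZeRO stage-2 run might be configured with DeepSpeed; the model and every value in the config are placeholder assumptions, not numbers from the posting.

  import deepspeed
  import torch
  import torch.nn as nn

  # Toy stand-in for an LLM.
  model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

  # Placeholder config: ZeRO stage 2 shards optimizer state and gradients across ranks.
  ds_config = {
      "train_micro_batch_size_per_gpu": 4,
      "gradient_accumulation_steps": 8,
      "bf16": {"enabled": True},
      "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
      "zero_optimization": {
          "stage": 2,
          "overlap_comm": True,
          "contiguous_gradients": True,
      },
  }

  # deepspeed.initialize wraps the model in an engine that handles ZeRO partitioning,
  # gradient accumulation, and mixed precision; launch with the deepspeed CLI or torchrun.
  engine, optimizer, _, _ = deepspeed.initialize(
      model=model,
      model_parameters=model.parameters(),
      config=ds_config,
  )
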
Process

Our process is simple: a deep conversation on distributed systems and LLM training, and a cultural interview.

Benefits

Competitive salary, equity, and the opportunity to shape the next generation of large-scale AI infrastructure at CloudWalk.
