About the Role
We’re looking for a Research Scientist who blends frontier research curiosity with engineering discipline. You’ll work at the core of our research efforts, training state-of-the-art models and building training infrastructure.
This role is ideal for someone who thrives in high-performance environments, understands the nuances of training large models, and is obsessed with making experimentation fast, reproducible, and reliable.
What You’ll Do
- Own and maintain a modular, high-quality PyTorch training codebase
- Design and build training workflows for scaling, checkpointing, logging, and reproducibility
- Implement new ideas, debug training runs, and accelerate iteration
- Develop and maintain efficient data loading pipelines and training utilities
- Ensure training jobs can scale across multiple GPUs and nodes (e.g., with DDP, NCCL)
- Optimize model training for performance, stability, and hardware utilization
- Maintain long-term code health: organize modules, enforce standards, write clean and testable code
- Contribute to experiment tracking, reproducibility, and versioning infrastructure
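To make the checkpointing-and-reproducibility bullets concrete: a reproducible checkpoint captures more than model weights — it also saves the step counter and RNG state, so a resumed run replays the exact same data order. A minimal, framework-free sketch (the helper names here are illustrative, not part of any specific codebase):

```python
import pickle
import random

def save_checkpoint(path, step, weights, rng):
    # Persist the step counter, weights, and RNG state together so a
    # resumed run continues with the identical data-sampling sequence.
    with open(path, "wb") as f:
        pickle.dump({"step": step, "weights": weights, "rng": rng.getstate()}, f)

def load_checkpoint(path, rng):
    with open(path, "rb") as f:
        state = pickle.load(f)
    rng.setstate(state["rng"])  # restore the sampler's RNG exactly
    return state["step"], state["weights"]

# Toy round trip demonstrating bitwise-identical resumption.
rng = random.Random(0)
rng.random()  # advance the RNG as training would
save_checkpoint("ckpt.pkl", step=100, weights={"w": 0.5}, rng=rng)
expected = rng.random()  # the next draw after the checkpoint

rng2 = random.Random(123)  # a fresh process with a different seed
step, weights = load_checkpoint("ckpt.pkl", rng2)
assert rng2.random() == expected  # RNG resumes exactly where it left off
```

In a real PyTorch codebase the same idea extends to `torch.get_rng_state()`, optimizer state, and the learning-rate scheduler, all bundled into one checkpoint dict.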
You Should Have
- Deep expertise in PyTorch, including custom modules, loss functions, and distributed training
- Proven experience training deep learning models in real-world research or production settings
- Strong engineering skills in Python (and optionally C++ for performance-critical components)
- Experience working with large datasets, complex pipelines, and real-world debugging
- Understanding of training dynamics: what goes wrong, and how to fix it
- Familiarity with job launchers, logging tools (e.g., Weights & Biases, TensorBoard), and checkpointing systems
- A mindset of engineering rigor applied to research — readable code, thoughtful design, and reproducibility
Bonus Points For
- Experience with TorchScript, ONNX, or custom inference runtimes
- Contributions to PyTorch or open-source ML tooling
- Experience working on transformer models, diffusion models, or large-scale vision/NLP tasks
- Familiarity with batch schedulers (SLURM), cluster environments, and GPU resource management
- Ability to collaborate closely with systems engineers or MLOps teams to ensure smooth integration
Why Join Us
- Collaborate with a world-class research team on meaningful, high-impact projects
- Own and shape the core training code infrastructure used daily by the team
- Work on real models, real data, and real scale — not toy problems
- Help bridge the gap between research velocity and engineering quality
- Flexible work environment with a culture that values depth, clarity, and curiosity