Machine Learning Engineer

Oriole Networks

City of London

On-site

GBP 70,000 - 90,000

Full time

Today

Job summary

A technology firm is seeking a Machine Learning Engineer to enhance their AI/ML software stack through innovative GPU communication optimizations. The ideal candidate will have extensive experience in C++ and Python, as well as expertise in GPU programming with CUDA. This role entails close collaboration with hardware teams to seamlessly integrate solutions, offering a vibrant environment for impactful projects in high-performance computing.

Qualifications

Proficient in C++ and Python with a strong track record in HPC or ML.
Expertise in GPU programming with CUDA and optimization techniques.
Hands-on experience debugging GPU kernels using specialized tools.

Responsibilities

Design and optimize custom GPU communication kernels.
Develop and maintain distributed communication frameworks.
Collaborate with hardware and software teams for integration.

Skills

C++

Python

GPU programming with CUDA

High-performance computing

Debugging GPU kernels

Understanding of communication libraries

Distributed deep learning

Tools

cuda-gdb

cuda-memcheck

Nsight Systems

Docker

Linux

Kubernetes

SLURM

Oriole is seeking talented Machine Learning Engineers to help co‑optimize our AI/ML software stack with cutting‑edge network hardware. You’ll be a key contributor to a high‑impact, agile team focused on integrating middleware communication libraries and modelling the performance of large‑scale AI/ML workloads.

Key Responsibilities

Design and optimize custom GPU communication kernels to enhance performance and scalability across multi‑node environments
Develop and maintain distributed communication frameworks for large‑scale deep learning models, ensuring efficient parallelization and optimal resource utilization
Profile, benchmark, and debug GPU applications to identify and resolve bottlenecks in communication and computation pipelines
Collaborate closely with hardware and software teams to integrate optimized kernels with Oriole’s next‑generation network hardware and software stack
Contribute to system‑level architecture decisions for large‑scale GPU clusters, with a focus on communication efficiency, fault tolerance, and novel architectures for advanced optical network infrastructure

Required Skills & Experience

Proficient in C++ and Python, with a strong track record in high‑performance computing or machine learning projects
Expertise in GPU programming with CUDA, including deep knowledge of GPU memory hierarchies and kernel optimization
Hands‑on experience debugging GPU kernels using tools such as cuda-gdb, cuda-memcheck, and Nsight Systems, and inspecting PTX and SASS
Strong understanding of communication libraries and protocols, including NCCL, NVSHMEM, OpenMPI, UCX, or custom collective communication implementations
Familiarity with HPC networking protocols/libraries such as RoCE, InfiniBand, libibverbs, and libfabric
Experience with distributed deep learning/MoE frameworks, including PyTorch Distributed, vLLM, or DeepEP
Solid understanding of deploying and optimizing large‑scale distributed deep learning workloads in production environments, including Linux, Kubernetes, SLURM, OpenMPI, GPU drivers, Docker, and CI/CD automation
