Senior AI/ML Infrastructure Engineer

eBay Inc.

London

On-site

GBP 70,000 - 100,000

Full time

3 days ago

Job summary

A leading company in ecommerce is seeking a proficient HPC architect to enhance AI infrastructure. You will be responsible for designing storage systems, developing monitoring tools, and ensuring fault tolerance. Ideal candidates hold advanced degrees and possess extensive experience in HPC, C/C++, and Linux systems.

Qualifications

  • Over 5 years of experience building HPC systems.
  • Strong preference for experience in Lustre or similar distributed filesystems.

Responsibilities

  • Architect and design high-performance storage systems.
  • Develop monitoring and observability tools for GPU clusters.
  • Streamline AI workflows with AI/ML engineers and data scientists.

Skills

C/C++ Programming
Linux Kernel and OS internals
Filesystems knowledge
Kubernetes
Hardware and Networking familiarity

Education

Master's or PhD in EE or CS

Job description

At eBay, we're more than a global ecommerce leader — we’re changing the way the world shops and sells. Our platform empowers millions of buyers and sellers in more than 190 markets around the world. We’re committed to pushing boundaries and leaving our mark as we reinvent the future of ecommerce for enthusiasts.

Our customers are our compass, authenticity thrives, bold ideas are welcome, and everyone can bring their unique selves to work — every day. We're in this together, sustaining the future of our customers, our company, and our planet.

Join a team of passionate thinkers, innovators, and dreamers — and help us connect people and build communities to create economic opportunity for all.

What you will accomplish:
  • Architect and design high-performance storage systems for GPU clusters, supporting large checkpoints and low-latency context preemption and reloads.

  • Develop monitoring and observability tools for GPU clusters

  • Maintain high availability, fault tolerance, and disaster recovery strategies for AI infrastructure

  • Work closely with AI/ML engineers, data scientists, and DevOps teams to streamline AI workflows.

What you will bring:

  • Master's or PhD in EE or CS

  • Over 5 years of experience building HPC systems

  • C/C++ Programming – for performance-critical components and integration tasks (Lustre, a parallel filesystem, is written in C)

  • Linux Kernel and OS internals – to optimize system behavior and support kernel-level customization for filesystems and networking

  • Filesystems knowledge – with a strong preference for experience in Lustre or similar distributed filesystems

  • Kubernetes – for container orchestration and management at scale

  • Hardware and Networking familiarity – to work effectively with low-level infrastructure and tuning

Good to have:

  • Strong understanding of the RDMA and RoCE v2 protocols

  • Hands-on experience with GPUs

  • Understanding of AI workflows, including training and inference

  • Understanding of AI/ML Python frameworks (TensorFlow, PyTorch)

Please see the Talent Privacy Notice for information regarding how eBay handles your personal data collected when you use the eBay Careers website or apply for a job with eBay.

eBay is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, sex, sexual orientation, gender identity, veteran status, disability, or other legally protected status. If you have a need that requires accommodation, please contact us at talent@ebay.com. We will make every effort to respond to your request for accommodation as soon as possible. View our accessibility statement to learn more about eBay's commitment to ensuring digital accessibility for people with disabilities.
