Senior Software Engineer - Ceph

Boson AI

Toronto

On-site

CAD 90,000 - 150,000

Full time

5 days ago

Job summary

An innovative startup is seeking a Senior Software Engineer to manage Ceph for their deep learning datacenter. This role offers the chance to work with cutting-edge NVIDIA GPUs and large-scale storage systems. The ideal candidate will have a strong background in Ceph management, high-performance computing, and a knack for problem-solving. You'll be responsible for designing and maintaining storage solutions, integrating them with deep learning infrastructure, and automating Linux systems using infrastructure-as-code practices. Join a dynamic team at the forefront of generative AI and contribute to groundbreaking projects.

Qualifications

  • Prior experience with Ceph is mandatory.
  • Strong background in maintaining Ceph clusters.
  • Experience with on-premises data center operations.

Responsibilities

  • Design, manage, and maintain large storage arrays.
  • Integrate storage solutions with deep learning infrastructure.
  • Support troubleshooting for MAAS, Slurm, and Kubernetes.

Skills

Ceph Management
Problem-Solving
Python Programming
High-Performance Computing

Tools

Slurm
MAAS
Infiniband
NVIDIA DeepOps
Kubernetes

Job description

Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola, Mu Li) and a team of deep learning, optimization, NLP, AutoML, and statistics scientists and engineers are working on high-quality generative AI models for language, audio, and entertainment.

About The Role

We are looking for a Senior Software Engineer with deep expertise in managing Ceph for our deep learning datacenter in Toronto. The ideal candidate should have strong problem-solving skills and an ability to learn new tools. Experience with Slurm, MAAS, Infiniband, NVIDIA DeepOps, Layer 3 networking, and related tools is a big plus. Comfort with hardware configuration is also important.

You will have the opportunity to work with NVIDIA H100 and A100 GPUs, over 25PB of disk and over 5PB of flash storage, Terabit networking, and hundreds of computers. Your responsibilities will include deploying and operating Ceph and integrating it with a broad range of infrastructure technologies and hardware systems.

Minimum Requirements

  • Prior experience with Ceph is mandatory.
  • Strong background in maintaining Ceph clusters.
  • Experience with high-performance computing is highly desirable.
  • Experience with on-premises data center operations and technologies.
  • Experience managing large hardware clusters.
  • Proficiency in at least one programming language (e.g., Python) with the ability to write clean, maintainable code.
  • Experience managing firmware and system updates for hardware systems, e.g., SuperMicro.
  • Strong problem-solving skills and a willingness to learn new techniques.

Day-to-Day Responsibilities

  • Design, manage, and maintain large storage arrays.
  • Integrate storage solutions with deep learning infrastructure.
  • Support troubleshooting for MAAS, Slurm, and Kubernetes as needed.
  • Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices.
  • Learn about and deploy new tools.
