SRE specialized in HPC/AI - M/F

Groupe iliad

Paris

On-site

EUR 45,000 - 70,000

Full-time

Posted 29 days ago

Job summary

A dynamic company specializing in cloud solutions is recruiting an HPC technician to deploy and maintain clusters built on Nvidia hardware. This key role involves automation, tool optimization, and HPC lifecycle management, while encouraging collaboration within the team. If you are passionate about technology and SRE practices, come join this rewarding adventure in Paris or Lille.

Qualifications

Experience in system programming with Python, Bash, Go.
Linux systems skills (Debian, CentOS) required.
Knowledge of HPC environments and DevOps best practices.

Responsibilities

Ensure the deployment and health of HPC components.
Build and optimize tools, and resolve critical issues.
Manage the lifecycle of HPC clusters and ensure service quality.

Skills

Python

Bash

Linux

DevOps

SRE

Nvidia

CUDA

MPI

Monitoring

Automation

Tools

Ansible

GitLab

Prometheus

Sentry

Kubernetes

Jira

The position

Our mission

Founded in 1999, Scaleway, the cloud of choice, helps developers and businesses to build, deploy, and scale applications to any infrastructure. Located in Paris, Amsterdam, and Warsaw, Scaleway’s complete cloud ecosystem is used by 25,000+ businesses, including European startups, who choose Scaleway for its multi-AZ redundancy, smooth developer experience, carbon-neutral data centers, and native tools for managing multi-cloud architectures.

With fully managed offerings for bare metal, containerization, and serverless architectures, Scaleway provides options for cloud computing, allowing customers to decide where their data resides, which architecture suits their needs, and to scale responsibly.

Our journey

We aim for all our actions to bring us closer to our vision: building and scaling technologies that matter to us, our customers, and end users. Scaleway is a challenger in the cloud space, and as we grow, serving a diverse, global customer base is key. We believe that incorporating diverse perspectives, knowledge, skills, and cross-cultural understanding is vital for delivering high value and performance.

Our values

Singularity: We do it our own way.
Community: One company, one culture.
Adventure: Level up if you dare, never stop innovating.
Leadership: Be the leader you want to follow.
Excellence: We want to be customers' first choice as a cloud provider.
RockSolid: You can always count on us!

About the job

We are deploying multiple HPC clusters with Nvidia hardware, aiming to rank among the top 15 of the TOP500 list. Reporting to our Engineering Manager Emerick Mounoury, you will ensure the deployment and health of HPC components. A strong background in HPC environments, system administration, DevOps, and SRE best practices is required. Our systems evolve, and so must our monitoring and resilience tools.

Candidate profile

Experience in system programming using Python, Bash, Go, etc.
Ability to troubleshoot production failures
Positive mindset and team spirit
Passion for automation and tooling improvements
Experience with Linux systems (Debian, CentOS)
Experience with batch schedulers like Slurm, OAR, SGE
Understanding of computer networks: TCP/IP, DNS, load balancing, IPv6, firewalls, InfiniBand, VLANs
Storage knowledge: large pools, NAS, S3
Experience with Nvidia, CUDA, MPI
Good command of English
Ability to identify and solve bugs in codebases
Experience with infrastructure-as-code and continuous deployment
Experience with hardware automation
Monitoring & logging system experience
Account management (LDAP)
Knowledge of cloud platforms and use-cases
Experience as an OSS contributor/maintainer
Knowledge of AI/ML/neural networks

Additional responsibilities include creating and optimizing tools, troubleshooting high-impact issues, on-call duties, ensuring high service quality, managing HPC cluster lifecycle, and implementing best practices for stability, resiliency, scalability, security, and performance.

Skills required include Python/Bash, MySQL, S3 API, Lustre, NAS, monitoring tools (Sentry, Prometheus, Grafana, Elasticsearch, Fluentd, Kibana), configuration management (Ansible, Salt), version control (GitLab, Nexus), Linux distributions (Ubuntu, Debian, CentOS), Nvidia hardware/software, MPI, Modules, AI software, Slurm, Kubernetes, and collaboration tools (Jira, Confluence, Slack, GSuite).

Location: Based in Paris or Lille, partial remote possible.

Recruitment Process: Screening call, technical interviews, HR interview, offer within 2-3 weeks. Don’t hesitate to apply even if you don’t meet all criteria.
