Activez les alertes d’offres d’emploi par e-mail !

SRE specialized in HPC/AI - H/F

Groupe iliad

Paris

Sur place

EUR 45 000 - 70 000

Plein temps

Il y a 29 jours

Générez un CV personnalisé en quelques minutes

Décrochez un entretien et gagnez plus. En savoir plus

Repartez de zéro ou importez un CV existant

Résumé du poste

Une entreprise dynamique spécialisée dans les solutions cloud recrute un technicien HPC pour déployer et maintenir des clusters avec matériel Nvidia. Ce rôle clé implique l'automatisation, l'optimisation des outils et la gestion du cycle de vie des HPC, tout en encourageant la collaboration au sein de l'équipe. Si vous avez une passion pour la technologie et les pratiques SRE, venez rejoindre cette aventure enrichissante à Paris ou Lille.

Qualifications

  • Expérience dans la programmation système avec Python, Bash, Go.
  • Compétences en systèmes Linux (Debian, CentOS) requises.
  • Connaissance des environnements HPC et des meilleures pratiques DevOps.

Responsabilités

  • Assurer le déploiement et la santé des composants HPC.
  • Créer et optimiser des outils, résoudre des problèmes critiques.
  • Gérer le cycle de vie des clusters HPC, garantir la qualité du service.

Connaissances

Python
Bash
Linux
DevOps
SRE
Nvidia
Cuda
MPI
Monitoring
Automation

Outils

Ansible
GitLab
Prometheus
Sentry
Kubernetes
Jira

Description du poste

Social network you want to login/join with:

Le poste

Our mission

Founded in 1999, Scaleway, the cloud of choice, helps developers and businesses to build, deploy, and scale applications to any infrastructure. Located in Paris, Amsterdam, and Warsaw, Scaleway’s complete cloud ecosystem is used by 25,000+ businesses, including European startups, who choose Scaleway for its multi-AZ redundancy, smooth developer experience, carbon-neutral data centers, and native tools for managing multi-cloud architectures.

With fully managed offerings for bare metal, containerization, and serverless architectures, Scaleway provides options for cloud computing, allowing customers to decide where their data resides, which architecture suits their needs, and to scale responsibly.

Our journey

We aim for all our actions to bring us closer to our vision: building and scaling technologies that matter to us, our customers, and end users. Scaleway is a challenger in the cloud space, and as we grow, serving a diverse, global customer base is key. We believe that incorporating diverse perspectives, knowledge, skills, and cross-cultural understanding is vital for delivering high value and performance.

Our values

  • Singularity: We do it our own way.
  • Community: One company, one culture.
  • Adventure: Level up if you dare, never stop innovating.
  • Leadership: Be the leader you want to follow.
  • Excellence: We want to be customers' first choice as a cloud provider.
  • RockSolid: You can always count on us!

About the job

We are deploying multiple HPC clusters with Nvidia hardware, aiming to be among the top 15 in the Top500 list. Reporting to our Engineering Manager Emerick Mounoury, you will ensure the deployment and health of HPC components. A strong background in HPC environments, system administration, DevOps, and SRE best practices is required. Our systems evolve, and so must our monitoring and resilience tools.

Profil recherché

  • Experience in system programming using Python, Bash, Go, etc.
  • Ability to troubleshoot production failures
  • Positive mindset and team spirit
  • Passion for automation and tooling improvements
  • Experience with Linux systems (Debian, CentOS)
  • Experience with batch schedulers like Slurm, OAR, SGE
  • Understanding of computer networks: TCP/IP, DNS, load balancing, IPv6, firewall, Infiniband, VLANs
  • Storage knowledge: large pools, NAS, S3
  • Experience with Nvidia, Cuda, MPI
  • Good command of English
  • Ability to identify and solve bugs in codebases
  • Experience with infrastructure-as-code and continuous deployment
  • Experience with hardware automation
  • Monitoring & logging system experience
  • Account management (LDAP)
  • Knowledge of cloud platforms and use-cases
  • Experience as OSS contributor/maintainer
  • Knowledge in AI/ML/Neuronal networks

Additional responsibilities include creating and optimizing tools, troubleshooting high-impact issues, on-call duties, ensuring high service quality, managing HPC cluster lifecycle, and implementing best practices for stability, resiliency, scalability, security, and performance.

Skills required include Python/Bash, MySQL, S3 API, Lustre, NAS, monitoring tools (Sentry, Prometheus, Grafana, ElasticSearch, Fluentd, Kibana), configuration management (Ansible, Salt), version control (GitLab, Nexus), Linux distributions (Ubuntu, Debian, CentOS), Nvidia hardware/software, MPI, Modules, AI software, Slurm, Kubernetes, and collaboration tools (Jira, Confluence, Slack, GSuite).

Location: Based in Paris or Lille, partial remote possible.

Recruitment Process: Screening call, technical interviews, HR interview, offer within 2-3 weeks. Don’t hesitate to apply even if you don’t meet all criteria.

To learn more about us

Obtenez votre examen gratuit et confidentiel de votre CV.
ou faites glisser et déposez un fichier PDF, DOC, DOCX, ODT ou PAGES jusqu’à 5 Mo.