Senior HPC Cluster Support Engineer (Bright Cluster Manager + Slurm)

Sky Systems, Inc. (SkySys)

São Paulo

Remote

BRL 120,000 - 160,000

Part-time

Posted 3 days ago

Job summary

A technology company is seeking a Senior HPC Cluster Support Engineer for remote work to manage and support large-scale production HPC environments. Responsibilities include cluster operations, hardware troubleshooting, and user support. Ideal candidates will have strong experience with Bright Cluster Manager, Slurm, and Linux systems administration. This part-time contract role offers 20 hours per week and is focused on maintaining uninterrupted high-performance computing workloads.

Qualifications

  • Strong experience with Bright Cluster Manager and Slurm.
  • Linux systems administration and advanced troubleshooting skills.
  • Expertise in hardware diagnostics and BMC remote management tools.

Responsibilities

  • Manage and support HPC clusters and troubleshoot user issues.
  • Monitor cluster health and address node failures and networking issues.
  • Diagnose hardware faults using remote checks and BMC tools.

Skills

Bright Cluster Manager
Slurm
Linux systems administration
Hardware diagnostics
BMC remote management tools
InfiniBand

Job description

Role: HPC Cluster Support – CIBA 4 (Senior)

Position Type: Part-Time Contract (20 hrs/week)

Contract Duration: 6 months

Work Hours: EST or PST

Location: 100% Remote

We're seeking a Senior HPC Cluster Support Engineer to maintain and support large-scale production HPC environments running Bright Cluster Manager and Slurm. This role focuses on cluster operations, hardware troubleshooting, user support, and vendor coordination to ensure uninterrupted high-performance computing workloads.

Key Responsibilities
  • Manage and support HPC clusters: job submission issues, queue management, and user troubleshooting
  • Monitor cluster health and resolve node failures, networking issues, and domain problems
  • Diagnose hardware faults (GPUs, boards, power, nodes) and perform remote checks using BMC tools (Dell iDRAC, HPE iLO, Supermicro)
  • Troubleshoot InfiniBand, Panasas storage, and network integration issues
  • Coordinate repairs and escalate issues to vendors (ParkPlace, VDura)
  • Apply system updates, patches, and configurations
  • Collaborate with users and provide regular status updates
Required Skills
  • Strong experience with Bright Cluster Manager and Slurm
  • Linux systems administration and advanced troubleshooting
  • Hardware diagnostics, BMC remote management tools
  • Experience with InfiniBand, HPC storage systems (Panasas), and vendor escalation
  • Active Directory integration for Linux is a plus