Senior HPC Cluster Support Engineer (Bright Cluster Manager + Slurm)

Sky Systems, Inc. (Skysys)

Maceió

Remote

BRL 160,000 - 268,000

Part-time

Posted 3 days ago

Job summary

A tech company is seeking a Senior HPC Cluster Support Engineer to maintain large-scale HPC environments. This part-time, fully remote position involves managing cluster operations, troubleshooting job submissions, and monitoring system health. Ideal candidates will have strong experience with Bright Cluster Manager and Slurm, as well as skills in Linux systems administration and hardware diagnostics. This role includes collaboration with users and coordination with vendors for necessary repairs and updates.

Qualifications

  • Strong experience with Bright Cluster Manager and Slurm.
  • Linux systems administration and advanced troubleshooting.
  • Hardware diagnostics and BMC remote management tools.
  • Experience with InfiniBand, HPC storage systems (Panasas), and vendor escalation.
  • Active Directory integration for Linux is a plus.

Responsibilities

  • Manage cluster operations, job submission issues, and troubleshooting.
  • Monitor cluster health and resolve networking issues and domain problems.
  • Diagnose hardware faults and perform remote checks using BMC tools.
  • Troubleshoot InfiniBand, Panasas storage, and network integration.
  • Coordinate repairs and escalate with vendors.
  • Apply system updates, patches, and configurations.
  • Collaborate with users and provide status updates.

Skills

Bright Cluster Manager
Slurm
Linux systems administration
Hardware diagnostics
BMC remote management tools
InfiniBand
HPC storage systems
Vendor escalation

Job description

Role

HPC Cluster Support – CIBA 4 (Senior)

Position Type

Part-Time Contract (20 hrs / week)

Contract Duration

6 months

Work Hours

EST or PST

Location

100% Remote

Overview

We’re seeking a Senior HPC Cluster Support Engineer to maintain and support large-scale production HPC environments running Bright Cluster Manager and Slurm.

Key Responsibilities

  • Manage cluster operations, job submission issues, queue management, and user troubleshooting.
  • Monitor cluster health and resolve node failures, networking issues, and domain problems.
  • Diagnose hardware faults (GPUs, boards, power, nodes) and perform remote checks using BMC tools (Dell iDRAC, HPE iLO, Supermicro).
  • Troubleshoot InfiniBand, Panasas storage, and network integration issues.
  • Coordinate repairs and escalate to vendors (ParkPlace, VDura).
  • Apply system updates, patches, and configurations.
  • Collaborate with users and provide regular status updates.

Required Skills

  • Strong experience with Bright Cluster Manager and Slurm.
  • Linux systems administration and advanced troubleshooting.
  • Hardware diagnostics and BMC remote management tools.
  • Experience with InfiniBand, HPC storage systems (Panasas), and vendor escalation.
  • Active Directory integration for Linux is a plus.