Senior HPC Cluster Support Engineer (Bright Cluster Manager + Slurm)

Sky Systems, Inc. (Skysys)

Sumaré

Remote

BRL 120,000 - 160,000

Part-time

Posted today

Job summary

A technology firm is seeking a Senior HPC Cluster Support Engineer for a part-time, fully remote contract position listed in Brazil. The successful candidate will manage cluster operations, monitor cluster health, and diagnose hardware faults. Expertise with Bright Cluster Manager, Slurm, and Linux systems administration is essential. Responsibilities include troubleshooting user issues and coordinating repairs with vendors. The role requires strong skills in advanced troubleshooting and hardware diagnostics, along with experience with InfiniBand and HPC storage systems; Active Directory integration for Linux is a plus.

Qualifications

  • Strong experience with Bright Cluster Manager and Slurm.
  • Linux systems administration and advanced troubleshooting.
  • Hardware diagnostics and BMC remote management tools.

Responsibilities

  • Manage cluster operations, job submission issues, queue management, and user troubleshooting.
  • Monitor cluster health and resolve node failures, networking issues, and domain problems.
  • Diagnose hardware faults and perform remote checks using BMC tools.

Skills

Bright Cluster Manager
Linux systems administration
Advanced troubleshooting
Hardware diagnostics
BMC remote management tools
InfiniBand
HPC storage systems (Panasas)
Vendor escalation
Active Directory integration for Linux

Job description

Role

HPC Cluster Support – CIBA 4 (Senior)

Position Type

Part-Time Contract (20 hrs / week)

Contract Duration

6 months

Work Hours

EST or PST

Location

100% Remote

Overview

We’re seeking a Senior HPC Cluster Support Engineer to maintain and support large-scale production HPC environments running Bright Cluster Manager and Slurm.

Key Responsibilities

  • Manage cluster operations, job submission issues, queue management, and user troubleshooting.
  • Monitor cluster health and resolve node failures, networking issues, and domain problems.
  • Diagnose hardware faults (GPUs, boards, power, nodes) and perform remote checks using BMC tools (Dell iDRAC, HPE iLO, Supermicro).
  • Troubleshoot InfiniBand, Panasas storage, and network integration issues.
  • Coordinate repairs and escalate issues with vendors (ParkPlace, VDura).
  • Apply system updates, patches, and configurations.
  • Collaborate with users and provide regular status updates.

Required Skills

  • Strong experience with Bright Cluster Manager and Slurm.
  • Linux systems administration and advanced troubleshooting.
  • Hardware diagnostics and BMC remote management tools.
  • Experience with InfiniBand, HPC storage systems (Panasas), and vendor escalation.
  • Active Directory integration for Linux is a plus.