Our client is seeking a System Administrator to join their team on a remote basis. To be considered for this position, you will need experience maintaining and troubleshooting HPC clusters. Keep on reading to learn more.
About you

To be considered, you will need:
- Hands-on experience with HPC clusters, including hardware, image management, local networking, and schedulers.
- A strong background in troubleshooting HPC environments to resolve incidents efficiently.
- The ability to assess scientists' HPC support needs and develop task plans accordingly.
- Proficiency in building, installing, and troubleshooting applications (GNU, Intel, Fortran, Nvidia).
- Familiarity with open-source and commercial software such as Python, Anaconda, Bash scripts, EasyBuild, Spack, and MPI implementations (MPICH, Open MPI, Intel MPI, HP-MPI).
- System administration skills for Linux OS, user account management, and configuration tools (Git, MS DevOps, Ansible Playbooks).
- Knowledge of RPM/DEB packages, environment modules, and ThinLinc troubleshooting.
- Expertise in job schedulers (PBS Pro/Torque, SLURM, SGE) and CUDA installations, including GPU troubleshooting.
- Hardware management experience, including memory upgrades, storage arrays, and power and network cabling.
- Strong documentation skills to ensure knowledge continuity.
- Secret-level security clearance (or eligibility to obtain it).
About the role

If hired, you will:
- Oversee and maintain an HPC cluster, managing hardware, networking, and scheduler configurations.
- Troubleshoot HPC environments to restore operations quickly in case of incidents.
- Work with scientists to evaluate their HPC needs and develop task plans.
- Install and support applications, resolve runtime issues, and assist with in-house software.
- Manage Linux system operations, including patching, account management, and configuration via Git and Ansible.
- Support and troubleshoot job schedulers and CUDA installations.
- Handle hardware maintenance, including memory upgrades, storage management, and networking.
- Document processes and best practices to ensure knowledge continuity.