Job Summary
The HPC System Administrator will manage day-to-day operations of HPC systems, ensuring stability, security, and performance. This role includes system monitoring, patching, user account management, job queue oversight, and incident resolution to support NSCC’s supercomputing environment.
Roles and Responsibilities
1. System Operations & Maintenance
- Administer HPC compute nodes, storage systems, and internal networks.
- Monitor system health using tools like Grafana, Prometheus, and custom scripts.
- Apply patches, updates, and configuration changes to ensure stability.
2. User & Job Management
- Manage user accounts, access controls, and authentication mechanisms.
- Monitor job queues and assist users with job submission and scheduling issues.
- Implement and enforce resource allocation policies.
3. Incident Response & Troubleshooting
- Respond to system alerts and user-reported issues.
- Document incidents, resolutions, and preventive measures.
- Collaborate with engineers on escalated issues.
4. Security & Compliance
- Perform regular security checks and vulnerability assessments.
- Ensure compliance with organizational and regulatory security policies.
5. Documentation & Reporting
- Maintain system operation logs and configuration documentation.
- Generate reports on system usage, performance, and incidents.
Qualifications
- Degree in Computer Science, Engineering, IT, or a related field.
- Minimum 2 years of experience in Linux system administration, preferably in HPC environments.
- Familiarity with cluster management tools (xCAT, BCM, HPCM).
- Experience with job schedulers (PBS Pro, Slurm).
- Basic understanding of RDMA interconnects (InfiniBand, RoCE) and parallel file systems (Lustre, GPFS, BeeGFS).
- Understanding of basic network protocols such as DHCP, DNS, TFTP, and SMTP.
- Proficient in scripting (Python, Bash).
- Strong troubleshooting and communication skills.