We are seeking a highly motivated and skilled engineer to join our team. The ideal candidate will have a strong background in managing server hardware, including network, storage, compute, and AI systems, as well as experience validating failed server hardware.
Roles and Responsibilities:
- Manage and maintain a fleet of server racks from different OEMs (network, storage, compute, and AI hardware).
- Administer and manage access to high-performance clustered file systems, preferably GPFS (IBM Storage Scale).
- Administer FC/InfiniBand-based SANs.
- Interface with OEM vendors on firmware and driver update maintenance.
- Support failure analysis initiatives by using available hardware resources to validate rack-level, system-level, and module-level failures from various Meta data centers.
- Manage and maintain network infrastructure for the lab, including switches, routers, and firewalls.
- Configure and manage network protocols, such as TCP/IP, DNS, and DHCP.
- Ensure network security and compliance with company policies and industry standards.
- Work with LLMs and popular frameworks such as TensorFlow or PyTorch.
- Design and implement containerized applications using Docker and Kubernetes.
- Manage and maintain virtual machines using popular hypervisors, such as VMware or KVM.
- Support failure analysis labs, including inventory management, safety audits, and access control for critical server hardware.
- Support root cause analysis and diagnose hardware/software issues. Isolate failures in platform, firmware, BIOS, CPLD, and other applications.
- Use DediProg tools for firmware/BIOS debugging.
- Provide regular updates to the failure analysis lead and collaborate with the team on mission-critical projects.
Qualifications:
- Bachelor’s or master’s degree in Computer Science, Electrical Engineering, or a related field.
- 5+ years of experience in server rack management, lab infrastructure management, and/or related fields.
- Experience debugging and troubleshooting complex hardware issues across storage, compute, and AI systems.
- Strong experience with Linux (Red Hat, Fedora, CentOS, etc.) or Unix operating systems.
- Experience with scripting languages such as Python, PowerShell, PHP, or Perl.
- Experience with containerization (Docker, Kubernetes) and virtual machine management.
- Experience with failed server hardware validation, including BIOS/CPLD firmware debugging.
- Knowledge of network protocols, including TCP/IP, DNS, and DHCP.
- Strong knowledge of server hardware components, including motherboards, power distribution boards, and storage systems.
- Strong problem-solving skills and the ability to work independently.
- Excellent communication and documentation skills.