Infrastructure/Server Engineer

Talent Mingle

Freemont

On-site

CAD 80,000 - 100,000

Full time

30+ days ago

Job summary

An innovative firm is on the lookout for a skilled engineer to enhance their server management team. This role involves overseeing a fleet of server racks, ensuring high performance and reliability across various hardware components. You will engage in troubleshooting complex hardware issues while collaborating with vendors for maintenance and updates. The ideal candidate will possess a robust background in Linux systems, scripting, and network protocols, along with a passion for problem-solving. Join this dynamic team to contribute to cutting-edge projects in a supportive and challenging environment.

Qualifications

  • 5+ years of experience in server rack management and lab infrastructure management.
  • Strong experience with Linux or Unix operating systems and scripting languages.

Responsibilities

  • Manage and maintain server racks and network infrastructure for the lab.
  • Support failure analysis initiatives and validate hardware failures.

Skills

Server Rack Management
Hardware Debugging
Linux Operating Systems
Scripting Languages
Network Protocols
Problem-Solving Skills
Communication Skills

Education

Bachelor’s Degree in Computer Science
Master’s Degree in Electrical Engineering

Tools

Docker
Kubernetes
VMware
KVM
GPFS/IBM Scale
Dediprog Tools

Job description

We are seeking a highly motivated and skilled engineer to join our team. The ideal candidate will have a strong background in managing server hardware, including network, storage, compute, and AI hardware, and will be experienced in validating failed server hardware.

Roles and Responsibilities:

  1. Manage and maintain a fleet of server racks from different OEMs (network, storage, compute, and AI hardware).
  2. Access and administer high-performance clustered file systems, preferably GPFS/IBM Scale.
  3. Administer FC/InfiniBand-based SANs.
  4. Interface with OEM vendors for firmware and driver update related maintenance.
  5. Support failure analysis initiatives by using available HW resources to validate rack-level, system-level, and module-level failures from different Meta datacenters.
  6. Manage and maintain network infrastructure for the lab, including switches, routers, and firewalls.
  7. Configure and manage network protocols, such as TCP/IP, DNS, and DHCP.
  8. Ensure network security and compliance with company policies and industry standards.
  9. Experience working with LLMs and popular frameworks such as TensorFlow or PyTorch.
  10. Design and implement containerized applications using Docker and Kubernetes.
  11. Manage and maintain virtual machines using popular hypervisors, such as VMware or KVM.
  12. Provide support for failure analysis labs: inventory management, safety audits, and maintaining access controls to critical server hardware.
  13. Support root cause analysis and diagnose hardware/software issues. Isolate failures in platform, firmware, BIOS, CPLD, and other applications.
  14. Experience working with Dediprog tools (FW/BIOS debug).
  15. Provide regular updates to the failure analysis lead and collaborate with the team on different mission-critical projects.

Qualifications:

  1. Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related field.
  2. 5+ years of experience in server rack management, lab infrastructure management, and/or related fields.
  3. Experience with debugging and troubleshooting complex hardware issues, including storage, compute, and AI.
  4. Strong experience with Linux (RedHat, Fedora, CentOS, etc.) or Unix operating systems.
  5. Experience with scripting languages, such as Python, PowerShell, PHP, Perl, etc.
  6. Experience working with containerization (Docker, Kubernetes) and virtual machine management.
  7. Experience with failed server hardware validation, including BIOS/CPLD FW debug.
  8. Knowledge of network protocols, including TCP/IP, DNS, and DHCP.
  9. Strong knowledge of server hardware components, including motherboards, power distribution boards, and storage systems.
  10. Strong problem-solving skills and ability to work independently.
  11. Excellent communication and documentation skills.
