Failure Analysis Engineer

Jobs via Dice

Menlo Park (CA)

On-site

USD 150,000 - 200,000

Full time

23 days ago

Job summary

A leading technology company is seeking a Failure Analysis Engineer to manage and maintain server racks and network infrastructure. This full-time position involves supporting failure analysis initiatives and requires strong skills in Linux, Python, and AI frameworks. The role is onsite in Fremont, CA, offering a competitive hourly rate.

Qualifications

  • Intermediate-level position requiring experience with server management and network protocols.
  • Familiarity with AI frameworks like TensorFlow or PyTorch is beneficial.

Responsibilities

  • Manage and maintain a fleet of server racks and network infrastructure.
  • Support failure analysis initiatives and root cause analysis.

Skills

Linux
Python
Storage
Artificial Intelligence
Debugging Software
Unix
Kubernetes
Docker
Motherboard

Job description

Roles and Responsibilities:

  1. Manage and maintain a fleet of server racks from different OEMs (network, storage, compute, and AI hardware).
  2. Interface with OEM vendors on maintenance related to firmware and driver updates.
  3. Support failure analysis initiatives by using available hardware resources to validate rack-level, system-level, and module-level failures from Meta's various datacenters.
  4. Manage and maintain network infrastructure for the lab, including switches, routers, and firewalls.
  5. Configure and manage network protocols, such as TCP/IP, DNS, and DHCP.
  6. Ensure network security and compliance with company policies and industry standards.
  7. Experience working with LLMs and frameworks like TensorFlow or PyTorch.
  8. Design and implement containerized applications using Docker and Kubernetes.
  9. Manage and maintain virtual machines using hypervisors like VMware or KVM.
  10. Support failure analysis labs, including inventory management, safety audits, and access controls for critical server hardware.
  11. Support root cause analysis and diagnose hardware/software issues, isolating failures in platform, firmware, BIOS, CPLD, etc.
  12. Experience working with DediProg tools (FW/BIOS debug).
  13. Provide regular updates to failure analysis lead and collaborate on mission-critical projects.

Required skills include Linux, Python, storage, artificial intelligence, software debugging, Unix, Kubernetes, Docker, and motherboard hardware.

The position is full-time, onsite in Fremont, CA, at an intermediate level, with a pay range of $60.00 - $75.00/hr.
