Hardware Systems Engineer, NPI AI Lead

The Rundown AI, Inc.

Menlo Park (CA)

On-site

USD 132,000 - 191,000

Full time

13 days ago

Job summary

An innovative firm is seeking a Systems Engineer to join their dynamic team focused on AI/ML initiatives. This role involves leading the validation and deployment of advanced hardware systems in data center environments. You will collaborate with cross-functional teams to tackle challenges in scaling and deployment while enhancing hardware reliability through data-driven solutions. If you are passionate about cutting-edge technology and want to contribute to transformative projects, this is the perfect opportunity for you to make a significant impact in the tech landscape.

Qualifications

4+ years of experience in troubleshooting HW systems through production cycles.
Expertise in Python, C/C++ for server system management and automation.
Experience as a team lead on wide-reaching debug projects.

Responsibilities

Lead the bring-up, validation, and deployment of cutting-edge hardware systems.
Design experiments and develop tools to monitor system health issues.
Communicate complex technical findings to both technical and non-technical audiences.

Skills

Troubleshooting HW systems

Debugging expertise

Python

C/C++

System validation

AI workloads

Education

Bachelor's degree in Computer Science

Relevant technical field experience

Tools

Oscilloscopes

Protocol analyzers

JTAG

GDB

Meta is seeking a Systems Engineer to join our Release to Production (RTP) team working on AI/ML initiatives supporting large-scale AI training and inference. Our servers and data centers are the foundation upon which our rapidly scaling infrastructure operates efficiently to deliver our innovative services. The RTP team is responsible for the end-to-end hardware lifecycle of all Meta servers, including prototyping of experimental HW, pre-production hands-on system validation and hardware debugging, enabling production-ready system monitoring, automated provisioning, and automated remediation of issues. The RTP team also helps in exploring, developing, and productizing high-performance software and hardware technologies for AI at datacenter scale.

RTP Engineers work closely with HW/SW co-design teams, hardware designers, networking teams, system manufacturers, component vendors, capacity engineering, production engineering, production services, and data center operations teams to enable new systems that will be deployed in our production data centers. Ramping to production and solving datacenter scaling and deployment challenges requires us to take a systems-based approach to hyperscale bring-up and validation.

Hardware Systems Engineer, NPI AI Lead Responsibilities:

Lead the bring-up, validation, and deployment of cutting-edge hardware systems in lab and datacenter environments.
Lead, and contribute directly to, end-to-end system validation (hardware and software), with a focus on datacenter applications.
Utilize experience in accelerator and network architecture and in AI workloads/ML models to design and implement robust system-level test plans, including functional, stress, and performance tests.
Develop and execute validation plans aligned with production system use cases, creating and automating corresponding test cases.
Design experiments and develop tools to detect, diagnose, and monitor system hardware/network/silicon/firmware/software health issues.
Enhance hardware reliability by creating data visualizations and implementing systemic solutions to address recurring health issues.
Communicate complex technical findings and recommendations to broad audiences, including technical and non-technical stakeholders at all levels.

Minimum Qualifications:

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
4+ years of direct experience troubleshooting and debugging HW systems through production cycles, including compute, interconnect, or GPU servers.
Expertise in debugging across several domains, including PCIe, SerDes, networking/interconnect, flash, memory, CPU, GPU, or DRAM (DDR4/5 or HBM).
Expertise in Python, C/C++ and/or similar languages, within a Linux environment, for server system management, automation, version control, CI/CD, or similar.
4+ years of direct experience with developing test cases/plans/specifications, root-causing error codes, failure analysis, and engineering solutions through system troubleshooting and debugging (performance, characterization, integration, FW, or similar).
Experience as a team lead or significant contributor/owner of wide-reaching debug projects.

Preferred Qualifications:

5+ years of experience supporting AI/HPC system architecture at rack level and at scale, as well as debugging and optimizing the performance of AI/HPC systems, including familiarity with relevant tools, libraries, and frameworks (e.g., NCCL, PyTorch, CUDA).
5+ years of expertise with lab debugging tools (oscilloscopes, protocol analyzers, and traffic generators) and SoC debugging tools (e.g., JTAG, GDB, Trace32) for testing and debugging.
3+ years of experience supporting complex chipsets, including functional, stress, and performance testing and validation with focus on automation.
3+ years of direct experience in supporting and building systems/products for datacenter applications such as telemetry, video processing, AI/ML, and networking.

About Meta:

Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today—beyond the constraints of screens, the limits of distance, and even the rules of physics.

Meta is proud to be an Equal Employment Opportunity and Affirmative Action employer. We do not discriminate based upon race, religion, color, national origin, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender, gender identity, gender expression, transgender status, sexual stereotypes, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics. We also consider qualified applicants with criminal histories, consistent with applicable federal, state and local law. Meta participates in the E-Verify program in certain locations, as required by law. Please note that Meta may leverage artificial intelligence and machine learning technologies in connection with applications for employment.

Meta is committed to providing reasonable accommodations for candidates with disabilities in our recruiting process. If you need any assistance or accommodations due to a disability, please let us know at accommodations-ext@fb.com.

$132,000/year to $191,000/year + bonus + equity + benefits

Individual compensation is determined by skills, qualifications, experience, and location. Compensation details listed in this posting reflect the base hourly rate, monthly rate, or annual salary only, and do not include bonus, equity or sales incentives, if applicable. In addition to base compensation, Meta offers benefits. Learn more about benefits at Meta.
