Sr. System Engineer

Support Revolution

San Jose (CA)

On-site

USD 140,000 - 158,000

Full time

11 days ago

Job summary

A leading technology firm seeks a Sr. System Engineer to roll out critical applications and maintain services. The successful candidate will engage in complex testing and support tasks while leveraging expertise in Deep Learning and Machine Learning technologies. This role involves technical leadership and significant interaction with customers and partners.

Qualifications

  • 8+ years of experience in Deep Learning and Machine Learning.
  • Experience with leading AI/ML frameworks such as PyTorch, TensorFlow.
  • Hands-on experience with workload managers/schedulers (Slurm) is preferred.

Responsibilities

  • Execute system-level tests on NVIDIA and AMD GPUs and Intel Xeon processors.
  • Conduct proof of concept design and testing for HPC/AI applications.
  • Deliver on-site deployment services and maintain technical documentation.

Skills

Deep Learning
Machine Learning
Linux debugging
Networking testing
DevOps
AI/ML frameworks
Scripting
Server debugging

Education

BS/MS in Electrical Engineering, Computer Engineering, or Computer Science

Tools

CUDA
oneAPI
Docker
Kubernetes
OpenStack
OpenShift
Azure
AWS

Job description

Location: San Jose, California, United States

About Supermicro:

Supermicro is a Top Tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC, and IoT/Embedded customers worldwide. We are the #5 fastest growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.

Job Summary:

As a Sr. System Engineer, you’ll be the go-to person for rolling out and maintaining business-critical applications and services for Supermicro. You will also resolve escalated service issues, coach other engineers toward resolutions, and engineer and implement complex projects. The role calls for an independent engineer with the leadership to drive technical development and excellent communication skills.

Essential Duties and Responsibilities:

Includes the following essential duties and responsibilities (other duties may also be assigned):
• Execute comprehensive system-level rack tests on the latest NVIDIA and AMD GPUs and on ARM-based, Intel Xeon, and AMD EPYC processors, encompassing functionality, compatibility, performance, stress, and reliability testing, leveraging proprietary in-house tools.
• Establish expertise in HPC/AI applications and benchmarks, delivering impactful training sessions to customers and partners, while addressing complex customer support issues, demonstrating innovative problem-solving skills and building robust processes and procedures for HPC/AI solutions.
• Conduct proof of concept design and testing, providing optimized benchmarks for HPC/AI applications in a timely manner. Fine-tune BIOS settings, optimize OS/network configurations, and develop diverse simulation configurations to enhance efficiency across various workloads.
• Deliver on-site deployment services, ensuring customer acceptance verification and providing post-deployment Level 1 and 2 support. Create and maintain technical documentation, including technical notes, blogs, and diagrams, to facilitate knowledge dissemination.
• Identify and document hardware and software quality issues and collaborate with Product Management and other Engineering teams to integrate customer feedback into future product enhancements.
• Proactively engage in HPC roadmap development, planning software and hardware upgrades to sustain exceptional HPC infrastructure performance.
• Document and analyze test plans, reports, and logs, and actively contribute to the development of test utilities and automation scripts to streamline testing processes.

Qualifications:

• BS/MS in Electrical Engineering, Computer Engineering or Computer Science
• 8+ years of work-related experience in Deep Learning and Machine Learning
• 8+ years of Linux/networking debugging/testing or relevant experience preferred
• Experience with leading AI/ML frameworks such as PyTorch, TensorFlow, ONNX, etc.
• Experience with DevOps or in cloud environments, including but not limited to Docker/Containers and Kubernetes
• Hands-on experience with workload managers/schedulers (Slurm) for racks and clusters
• Familiarity with MLPerf Training/Inference benchmarks, LLMs, HPL-AI, or RCCL/NCCL
• Programming experience with Windows and Linux shell scripting
• Strong sense of teamwork and strong communication skills
• Familiarity with Intel/AMD/NVIDIA development toolkits such as CUDA, oneAPI, or ROCm is a plus
• Experience with server/network hardware debugging and troubleshooting is a plus
• CCNA certification or experience with OpenStack, OpenShift, Azure, or AWS is a plus

Salary Range

$140,000 - $158,000

The salary offered will depend on several factors, including your location, level, education, training, specific skills, years of experience, and comparison to other employees already in this role. In addition to a comprehensive benefits package, candidates may be eligible for other forms of compensation, such as participation in bonus and equity award programs.

EEO Statement

Supermicro is an Equal Opportunity Employer and embraces diversity in our employee population. It is the policy of Supermicro to provide equal opportunity to all qualified applicants and employees without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, protected veteran status or special disabled veteran, marital status, pregnancy, genetic information, or any other legally protected status.

Similar jobs

Sr. Systems Engineer

Motorola Solutions

Plantation

Remote

USD 98,000 - 197,000

14 days ago

Sr Systems Engineer

Basis

Chicago

Remote

USD 111,000 - 190,000

11 days ago

Senior Systems Engineer

Cloudflare

Remote

USD 100,000 - 150,000

14 days ago

Sr. Systems Engineer

ZipRecruiter

Fremont

On-site

USD 110,000 - 150,000

Yesterday
Be an early applicant

Sr. Systems Engineer

ZipRecruiter

Santa Clara

On-site

USD 120,000 - 160,000

Yesterday
Be an early applicant

Sr. Business Systems Engineer

TalentBurst, an Inc 5000 company

San Francisco

Remote

USD 130,000 - 160,000

11 days ago

Sr. Business Systems Engineer

TalentBurst

San Francisco

Remote

USD 120,000 - 160,000

10 days ago

Sr. Business Systems Engineer

Jobs via Dice

San Francisco

Remote

USD 120,000 - 160,000

10 days ago

Sr. System Engineer

Support Revolution

San Jose

On-site

USD 140,000 - 158,000

10 days ago