Technical Systems Engineer - HPC + NVIDIA GPU + DGX (AI Infra Engineer)

Cisco Systems

United States

On-site

USD 120,000 - 180,000

Full time

7 days ago

Job summary

A leading technology company seeks a technical leader for its Cloud Infrastructure and Platform Services team, focusing on building and managing AI platforms. The successful candidate will lead GPU cluster implementations, automate processes, and ensure high performance across integrated systems, driving innovation with cutting-edge technology.

Qualifications

  • Experienced in deploying and administering HPC clusters.
  • Proficient in GPU resource scheduling managers (Slurm preferred).
  • Deep understanding of computer networks and high-performance applications.

Responsibilities

  • Build and manage the internal AI platform at Cisco.
  • Implement GPU compute clusters for deep learning.
  • Automate maintenance of GPU system availability.

Skills

Artificial Intelligence
Machine Learning
Data Analytics
Linux Systems Administration
GPUs
DevOps
Programming
Agile Methodologies

Education

BA, BS, or MS in Computer Science, Electrical Engineering, or Computer Engineering

Tools

Git
Jira
GitLab

Job description

Who We Are

Today’s business environment is more than ambitious: it’s a period of disruption shaped by the pandemic, global business change, and internal process complexity. To focus on simplicity and the best customer experience, we need great talent with the right abilities. This is now a mantra for our Cisco leadership team and for us.

Cisco’s Information Technology team is changing the way we run Cisco’s operations by improving the power of technology, the best of business processes and outstanding data insights. Together, we will Reinvent the Cisco experience, show the world how to Reinvent applications, and demonstrate the future of the Internet to Showcase the power of Cisco: our people, products, processes, systems, and data. Please join us and make this journey together!

What You Will Do

Cisco IT is building, developing, and expanding our artificial intelligence platform, which will empower the business to fundamentally change the world. You will be a critical member of the Cloud Infrastructure and Platform Services (CIPS) organization building and managing the internal AI platform at Cisco. You will provide leadership in the design and implementation of GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. You will be responsible for AI hardware analysis, design, procurement, and support. You will be an expert in identifying architectural changes and/or completely innovative approaches for our artificial intelligence platform.

  • Technical hands-on role in building and supporting NVIDIA-based artificial intelligence platforms.
  • Plan, build and install/upgrade new systems that support NVIDIA DGX hardware and software.
  • Automate configuration management, software updates, and maintenance and monitoring of GPU system availability using modern DevOps tools (Ansible, GitLab, etc.).
  • Lead the advancement of artificial intelligence platforms and practices.
  • Administer Linux systems, ranging from powerful GPU-enabled servers to general-purpose compute systems.
  • Collaborate closely with internal Cisco Business Units, application teams and cross-functional technical domains.
  • Create written technical designs, documents, and presentations.
  • Stay up to date with AI industry advancements and cutting-edge technologies.
  • Accelerate the delivery of AI capabilities across our portfolio.
  • Design new monitoring and alerting tools that help discover failures or issues before our customers do.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.

Who You’ll Work With

When you work with us, you will work as part of a hardware and software engineering team that designs and develops Hybrid-Cloud compute platforms and capabilities that are crucial to keeping Cisco’s critical business applications and processes available.

Who You Are

You are an experienced technical leader in artificial intelligence, machine learning, data analytics, software engineering, and managing complex integrated systems. You are an excellent collaborator who can partner, lead, teach, and communicate advanced technical concepts, and a talented and passionate engineer comfortable working in high-pressure, large-scale enterprise environments.

Our Minimum Requirements include:
  • You have a BA, BS, or MS in CS, EE, CE or equivalent experience.
  • 5+ years of experience deploying and administering HPC clusters.
  • Familiar with GPU resource scheduling managers such as Slurm (preferred) or RunAI.
  • Proficient in Hybrid Cloud, Virtualization, and Container technologies.
  • Deep understanding of operating systems, computer networks, and high-performance applications.
  • Familiar with project tracking tools (e.g. Jira), version control systems (e.g. Git), and CI/CD systems (e.g. GitLab, GitHub Actions, Jenkins).
  • Proficient in general-purpose programming languages (Python, Go, C/C++) and development platforms and technologies (Git, Jira, Jenkins, etc.).
  • Experience with Agile and DevOps operating models.
  • Dedication to providing quality support for your customers.
  • Established record of leading technical initiatives and delivering results.

Why Cisco?

At Cisco, we’re revolutionizing how data and infrastructure connect and protect organizations in the AI era – and beyond. We’ve been innovating fearlessly for 40 years to create solutions that power how humans and technology work together across the physical and digital worlds. These solutions provide customers with unparalleled security, visibility, and insights across the entire digital footprint. Simply put – we power the future.

Fueled by the depth and breadth of our technology, we experiment and create meaningful solutions. Add to that our worldwide network of doers and experts, and you’ll see that the opportunities to grow and build are limitless. We work as a team, collaborating with empathy to make really big things happen on a global scale. Because our solutions are everywhere, our impact is everywhere.

We are Cisco, and our power starts with you.
