Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training - Performance Optimization

Asti
EUR 127,000 - 221,000
Yesterday
Job description

Overview

This Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training - Performance Optimization role sits within AWS Utility Computing (UC) and Annapurna Labs. The role focuses on the development, enablement and performance tuning of ML model training and inference on the AWS Neuron stack, including Trn1/Inf1 servers, for large-scale model families and cutting-edge cloud AI services. The candidate will work with a team to enable distributed training and inference across PyTorch, TensorFlow, and JAX using XLA and the Neuron compiler/runtime stack, and will implement and optimize these workloads using libraries such as FSDP and DeepSpeed.

This role is part of the ML Apps team that collaborates with chip architects, compiler engineers and runtime engineers to build, tune and optimize distributed training solutions for Neuron-based systems.

Key responsibilities

  • Lead efforts building distributed training and inference support into PyTorch, TensorFlow, and JAX using XLA and the Neuron stack.
  • Tune models to achieve the highest performance and efficiency on AWS Trainium and Inferentia silicon and on Trn1/Inf1 servers.
  • Collaborate with chip architects, compiler engineers and runtime engineers to create, build and tune distributed training solutions with Trn1.
  • Develop and enable support for a wide variety of ML model families (e.g., GPT-2, GPT-3 and beyond, Stable Diffusion, Vision Transformers, and more).
  • Train large models with Python and integrate distributed training libraries such as FSDP and DeepSpeed into Neuron-based systems.

Basic Qualifications

  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience in at least one software language
  • 5+ years of leading design or architecture (design patterns, reliability and scaling) of new and existing systems
  • 5+ years of full software development lifecycle, including coding standards, code reviews, source control, build processes, testing, and operations
  • Experience as a mentor, tech lead or leading an engineering team

Preferred Qualifications

  • Bachelor's degree in computer science or equivalent
  • Knowledge of machine learning frameworks and end-to-end model training

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Our inclusive culture supports accommodations for disability during the application and hiring process. For more information, visit the Amazon accommodations page. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

Our compensation reflects the cost of labor across several US geographic markets. The base pay range for this position is $151,300/year to $261,500/year; pay is based on location and experience. Amazon is a total compensation company; depending on the role, equity, sign-on payments and other benefits may be provided.

This position will remain posted until filled. Applicants should apply via our internal or external career site.

Posted: May 16, 2025 (Updated about 17 hours ago)

Important FAQs for current Government employees

Before proceeding, please review the following FAQs

https://www.amazon.jobs/en/faqs#faqs-for-us-government-employees
