Senior Applied Scientist – ML Systems, Training & Inference Optimization
We are seeking an exceptional Senior Applied Scientist specializing in ML systems, training, and inference optimization to join the Deep Science for Systems and Services (DS3) team. This role requires deep expertise in performance engineering, kernel development, distributed systems, and AI workload optimization across heterogeneous compute platforms. You will invent and implement novel optimization techniques that directly impact the performance and cost‑efficiency of ML training and inference for AWS customers worldwide.
As a Senior Applied Scientist in DS3, you will work at the lowest levels of the software stack—writing custom CUDA kernels, optimizing PTX assembly, developing high‑performance operators for GPUs and AWS Neuron, designing efficient communication patterns for multi‑GPU and multi‑node training, and inventing new algorithmic approaches to accelerate transformer models and emerging architectures. Your work will span from single‑node inference optimization to large‑scale distributed training systems, influencing the design of AWS training and inference services and setting new standards for ML systems performance across the industry.
Deep Science for Systems and Services (DS3) is part of AWS Utility Computing (UC), the organization behind foundational services such as Amazon Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2), as well as a steady stream of new products and features that set AWS's services apart in the industry.
Key Job Responsibilities
- Systems‑Level Scientific Innovation: Design and implement novel kernel‑level optimizations for ML inference and training workloads, including custom CUDA kernels, PTX‑level optimizations, and cross‑platform acceleration for CUDA and the AWS Neuron SDK.
- Performance Engineering Leadership: Drive 2–10× performance improvements in latency, throughput, and memory efficiency for production ML inference and training systems through systematic profiling, analysis, and optimization.
- Cross‑Platform Optimization: Develop and port high‑performance ML operators across GPUs, AWS Inferentia/Trainium, and emerging AI accelerators, ensuring optimal performance on each platform.
- Product‑Level Impact: Lead the design, implementation, and delivery of scientifically complex optimization solutions that directly improve customer experience and reduce AWS operational costs at scale.
- Scientific Rigor: Produce technical documentation and internal research reports demonstrating the correctness, efficiency, and scalability of your optimizations. Contribute to external publications when aligned with business needs.
- Technical Leadership: Influence your team's technical direction and scientific roadmap. Build consensus across engineering and science teams on optimization strategies and architectural decisions.
- Mentorship & Knowledge Sharing: Actively mentor junior scientists and engineers on performance engineering best practices, kernel development, and systems‑level optimization techniques.
Qualifications
- PhD in Computer Science, Computer Engineering, or a related technical field, OR a Master’s degree with 8+ years of relevant research or industry experience.
- 5+ years of hands‑on experience in performance optimization and systems programming for AI/ML workloads.
- Expert‑level proficiency in CUDA programming and GPU architecture, with demonstrated ability to write high‑performance custom kernels.
- Proven track record of delivering measurable performance improvements (2× or greater) in production systems.
- Strong C/C++ programming skills with experience in performance profiling tools such as NVIDIA Nsight, Linux Perf, or similar diagnostic frameworks.
- Experience optimizing inference and/or training for large language models (LLMs) and transformer‑based architectures, including MoE models, at scale.
- Hands‑on experience with the AWS Neuron SDK or other non‑NVIDIA AI acceleration platforms.
- Track record of optimizing ML workloads across diverse hardware: embedded devices (ARM Cortex, DSPs, NPUs) and data center GPUs (NVIDIA Ampere/Hopper).
- Experience with low‑level optimization techniques including assembly‑level tuning (NVIDIA PTX, x86/ARM assembly) and cross‑platform kernel porting.
- Experience leading performance optimization initiatives that resulted in significant cost savings or multi‑million dollar business impact.
- Proven ability to mentor and train engineers in performance engineering and low‑level optimization (e.g., mentoring 5+ team members or delivering workshop instruction).
- Entrepreneurial experience or track record of driving technical vision in startup, co‑founder, or product development environments.
Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify and build. Protecting your privacy and the security of your data is a longstanding top priority for Amazon. Please consult our Privacy Notice to learn more about how we collect, use and transfer the personal data of our candidates.