Engineer/Senior Engineer, AI Infrastructure (Perception & Planning)

BLACK SESAME TECHNOLOGIES (SINGAPORE) PTE. LTD.

Singapore

On-site

SGD 80,000 - 120,000

Full time

Job summary

A tech company in Singapore is seeking a highly skilled engineer to design and optimize the GPU/AI infrastructure for their Perception & Planning stack. The ideal candidate will have a strong ML/CV background and expert coding skills in C++ and Python. Responsibilities include architecting large-scale training pipelines, profiling and eliminating bottlenecks, implementing key performance components, and leading distributed training efforts. A Master's or Ph.D. in a relevant field is required, alongside experience with GPU profiling and tuning.

Skills

Expert-level coding in C++
Expert-level coding in Python
Strong foundation in ML/CV
GPU profiling
Performance optimization

Education

Master’s or Ph.D. in Computer Science, Electrical/Computer Engineering

Tools

CUDA
ONNX
TensorRT
NCCL

Job description

Position Overview:

We are looking for a highly skilled engineer to design and optimize the GPU/AI infrastructure behind our Perception & Planning stack, covering object detection, segmentation, depth estimation, and trajectory planning.

This is a hands-on, deeply technical role: you will push the limits of GPU efficiency, distributed training, and real-time inference, turning state-of-the-art research into production-ready systems.

Responsibilities

  • Architect and optimize large-scale training pipelines with advanced techniques (FSDP/ZeRO-DP, tensor/pipeline parallelism, activation checkpointing, CPU/NVMe offloading, FlashAttention, mixed precision/bfloat16, comm/comp overlap).
  • Profile end-to-end pipelines (data → GPU kernels → inference) and eliminate bottlenecks using tools such as torch.profiler, Nsight Systems, Nsight Compute, TensorBoard Profiler, and low-level debuggers (perf, NVTX/NCCL tracing); a brief illustrative sketch follows this list.
  • Implement performance‑critical components in CUDA/C++ (custom kernels, TensorRT plugins, efficient memory layouts).
  • Tune GPU utilization, memory hierarchy (HBM, L2, shared), and communication efficiency (PCIe/NVLink/NCCL) to maximize throughput and minimize latency.
  • Drive model conversion and deployment workflows (ONNX/TensorRT, mixed precision, quantization) with strict real‑time FPS requirements.
  • Lead distributed training scaling and orchestration (multi‑node DDP/FSDP, NCCL tuning, experiment automation).
  • Build reliability and observability into systems with low‑overhead logging, metrics, and health monitoring.
  • Maintain benchmarks, profiling reports, and best‑practice documentation to guide the team.
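For context on the profiling and mixed-precision work listed above, here is a minimal, illustrative sketch only, not this team's actual pipeline: it profiles a few bfloat16 autocast training steps on a toy model with torch.profiler. The model, tensor shapes, and file name are placeholders, and it assumes a recent PyTorch build with an NVIDIA GPU available.

```python
# Illustrative sketch only -- toy model, shapes, and file name are placeholders.
# Assumes a recent PyTorch build and an NVIDIA GPU with CUDA available.
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda"
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(64, 1024, device=device)
y = torch.randint(0, 10, (64,), device=device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for _ in range(5):  # a handful of steps so CUDA kernels appear in the trace
        optimizer.zero_grad(set_to_none=True)
        # bf16 autocast: parameters stay in fp32, matmuls run in bfloat16,
        # so no GradScaler is needed (unlike fp16).
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()

# Rank kernels by GPU time to spot bottlenecks, and export a timeline trace
# viewable in Perfetto or chrome://tracing.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("toy_train_trace.json")
```

In a real perception pipeline the same pattern would extend to the dataloader and to NVTX/record_function ranges, so data loading, host-to-device copies, and kernels all appear on one timeline.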
Qualifications

  • Master’s or Ph.D. in Computer Science, Electrical/Computer Engineering, or related technical discipline.
  • Strong foundation in ML/CV with proven experience in GPU/AI infrastructure and performance optimization.
  • Expert‑level coding in C++ and Python; ability to implement, debug, and optimize CUDA kernels.
  • Hands‑on experience with GPU profiling and tuning, with a track record of improving throughput, utilization, and memory efficiency.
  • Familiarity with ONNX, TensorRT, NCCL, and other performance-oriented frameworks and libraries (a minimal ONNX export sketch follows this list).
  • Demonstrated success deploying real‑time inference systems on GPUs/edge devices.
  • Strong problem‑solving, debugging, and performance‑analysis skills; thrives in low‑level, high‑performance system challenges.
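As a small illustration of the ONNX/TensorRT conversion workflow referenced above, the sketch below exports a toy model to ONNX. The model, input shape, and file names are invented for the example; the TensorRT build step is shown only as a representative trtexec command, since engine building depends on the target hardware, precision, and plugins.

```python
# Illustrative sketch only -- toy model, shapes, and file names are placeholders.
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
).eval()

dummy = torch.randn(1, 3, 224, 224)  # stand-in for a camera frame

torch.onnx.export(
    model,
    dummy,
    "toy_perception.onnx",
    input_names=["images"],
    output_names=["logits"],
    dynamic_axes={"images": {0: "batch"}},  # allow variable batch size at runtime
    opset_version=17,
)

# A typical next step outside Python is building a TensorRT engine, e.g.:
#   trtexec --onnx=toy_perception.onnx --fp16 --saveEngine=toy_perception.plan
# Production flows add INT8 calibration for quantization and custom TensorRT
# plugins for ops the converter does not support natively.
```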