Test Software Development Manager

Overclock

Seberang Perai

On-site

MYR 100,000 - 140,000

Full time

Job summary

A tech company in Penang is seeking a Test Software Development Manager to lead the design, development, and maintenance of diagnostic software and validation frameworks for AI servers. Candidates should have over 7 years of experience in software development, strong debugging skills, and technical leadership abilities. The role involves defining test strategies, overseeing testing processes, and ensuring team competency through structured training. This position offers the chance to work on cutting-edge technologies in a dynamic environment.

Qualifications

  • 7+ years in software development, with at least 2–3 years in a lead role.
  • Hands-on experience with Linux, Python, and CI/CD for hardware validation.
  • Exposure to GPU/AI server platforms or diagnostic tools.

Responsibilities

  • Define the overall test software architecture for AI servers.
  • Oversee development of diagnostic software and DiagOS.
  • Lead a team of test software/automation engineers.

Skills

Test automation frameworks
Debugging and root-cause analysis
Team leadership
Clear communication

Education

Bachelor’s or Master’s degree in Computer Science or related field

Tools

Linux
Python
CI/CD pipelines

Job description

The Test Software Development Manager leads the team responsible for designing, developing, and maintaining the diagnostic software, DiagOS, and automated validation framework used to qualify AI servers and racks. This role owns the full test software stack from low‑level hardware access and NVIDIA SDK‑based diagnostics to large‑scale automated system validation in the lab.

The manager will define the test software architecture, translate product and hardware requirements into test strategies, and coach engineers through a structured training plan covering hardware architecture, firmware & diagnostic software development, and test/automation practices.

Key Responsibilities
  • 1. Technical Ownership & Strategy

    • Define the overall test software architecture for AI servers and racks, including DiagOS, diagnostic toolchains, and automation framework.
    • Set the validation strategy and coverage targets across:
      • Hardware architecture & components
      • Test, validation, and automation processes
    • Work with System, Networking, Thermal and R&D teams to ensure tests reflect real‑world workloads and customer use cases.
  • 2. Diagnostic Software & DiagOS Leadership

    • Oversee development and maintenance of DiagOS and board support packages (BSPs), including drivers and tools required for validation.
    • Guide the use of NVIDIA SDKs & toolkits (CUDA, NVML, DCGM) and hardware‑level diagnostic tools (i2c‑tools, pciutils, fio, iperf3, stress‑ng, memtest86+).
    • Ensure robust, reusable diagnostic libraries and command‑line tools exist for GPU, CPU, memory, storage, networking, and management subsystems (a wrapper sketch follows the responsibilities list).
  • 3. Automation Framework & Lab Infrastructure

    • Own the automation framework (Python/pytest or equivalent) that orchestrates tests across the lab: BMC, OS, network, PDUs, console, and PXE provisioning (see the sketch after this section).
    • Define and maintain CI/CD integration (Jenkins/GitLab) for “hardware‑in‑the‑loop” pipelines that run on real DUTs after each software change.
    • Ensure the lab testbed (orchestrators, management & data switches, console servers, PDUs, provisioning server) is fully integrated into the automation framework.
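
As a concrete illustration, a single diagnostic check in such a framework might look like the sketch below. This is only a sketch under assumptions: it presumes a pytest-based framework and the pynvml (NVML) Python bindings, and the 85 °C limit is an illustrative placeholder rather than a real pass/fail specification.

# Sketch: one GPU health check in a pytest-based diagnostic framework.
# Assumes the pynvml (NVML) bindings are installed; the limit is illustrative only.
import pytest

pynvml = pytest.importorskip("pynvml")  # skip cleanly on hosts without NVML

GPU_TEMP_LIMIT_C = 85  # hypothetical pass/fail criterion


@pytest.fixture(scope="module")
def nvml():
    # Initialize NVML once per module and shut it down afterwards.
    pynvml.nvmlInit()
    yield pynvml
    pynvml.nvmlShutdown()


def test_gpu_temperatures_within_limit(nvml):
    """Fail if any GPU on the DUT reports a core temperature above the limit."""
    for idx in range(nvml.nvmlDeviceGetCount()):
        handle = nvml.nvmlDeviceGetHandleByIndex(idx)
        temp = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
        assert temp <= GPU_TEMP_LIMIT_C, f"GPU {idx} at {temp} C exceeds limit"

In a CI/CD "hardware‑in‑the‑loop" pipeline, a job runner on the lab orchestrator would invoke checks like this against real DUTs after each software change and publish the results.
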
  • 4. Validation Scope & Quality

    • Define and review test plans for:
      • Firmware & bring‑up validation (BIOS/BMC, Redfish/IPMI, sensors, PMBus); see the Redfish sketch after this section.
      • Component & performance validation (CPU/memory stress, NVMe, RDMA, GPU subsystem tests using CUDA samples, DCGM, NCCL).
      • System integration & resilience (HPL, MLPerf, fault injection scenarios, PSU/network pull tests, power capping, thermal stress).
    • Set pass/fail criteria, debug workflows, and reporting standards for test results and defects.
    • Work closely with Quality and NPI to ensure test coverage supports manufacturing release and customer acceptance.
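
As an illustration of the firmware and bring‑up checks above, a sensor read over Redfish might look like the following sketch. The BMC address, credentials, and chassis ID are placeholders; real BMCs vary in chassis naming and authentication (many require Redfish sessions rather than basic auth).

# Sketch: reading chassis temperature sensors over Redfish for bring-up validation.
# BMC address, credentials, and chassis ID are placeholders; adjust per platform.
import requests

BMC = "https://10.0.0.10"       # hypothetical BMC address
AUTH = ("admin", "password")    # placeholder credentials
VERIFY_TLS = False              # lab BMCs often present self-signed certificates


def read_chassis_temperatures(chassis_id="1"):
    """Return {sensor name: reading in Celsius} from the Redfish Thermal resource."""
    url = f"{BMC}/redfish/v1/Chassis/{chassis_id}/Thermal"
    resp = requests.get(url, auth=AUTH, verify=VERIFY_TLS, timeout=10)
    resp.raise_for_status()
    readings = {}
    for sensor in resp.json().get("Temperatures", []):
        if sensor.get("ReadingCelsius") is not None:
            readings[sensor["Name"]] = sensor["ReadingCelsius"]
    return readings

Readings like these can then be asserted against expected sensor inventories and thresholds from the same pytest framework as the GPU example above.
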
  • 5. Team Leadership & Development

    • Lead a team of test software/automation engineers and validation engineers, assigning tasks and reviewing technical output.
    • Use the structured training curriculum as a roadmap to grow junior engineers’ competency in:
      • Hardware architecture and component “encyclopedia”.
      • Firmware & DiagOS, diagnostic toolkits.
      • Test automation, CI/CD, and failure analysis.
    • Provide coaching on debugging methodology, code quality, documentation, and test design.
    • Drive a culture of discipline, learning, and continuous improvement in the team.
  • 6. Cross‑Functional & Partner Collaboration

    • Interface with Head of R&D, System Architects, GPU partners, and Manufacturing teams to align test requirements with product roadmaps.
    • Collaborate with Production and Supply Chain where needed to support burn‑in and manufacturing test strategies.
    • Communicate risks, test gaps, and readiness status clearly to management and stakeholders.
  • 7. Process, Compliance & Documentation

    • Define and enforce development processes: version control practices, code review, test documentation, and release procedures for test tools.
    • Ensure all test tools and frameworks are properly documented, versioned, and auditable.
    • Support internal/external audits and customer reviews with clear evidence of validation coverage and methodology.
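
To make the "reusable diagnostic libraries and command‑line tools" responsibility concrete, the sketch below wraps one of the hardware‑level stress tools mentioned above in a small Python helper. The flags shown are standard stress‑ng options, but the duration and worker count are illustrative defaults, not project requirements.

# Sketch: wrapping a hardware-level stress tool in a reusable diagnostic helper.
# Duration and worker count are illustrative; passing 0 workers lets stress-ng
# spawn one worker per online CPU.
import shutil
import subprocess


def run_cpu_stress(duration_s=60, workers=0):
    """Run a stress-ng CPU soak and return True if it completes cleanly."""
    if shutil.which("stress-ng") is None:
        raise RuntimeError("stress-ng is not installed on this DUT")
    cmd = [
        "stress-ng",
        "--cpu", str(workers),
        "--timeout", f"{duration_s}s",
        "--metrics-brief",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # stress-ng exits non-zero if a stressor fails or a verification error occurs.
    return result.returncode == 0

Helpers of this shape can be composed into longer burn‑in or fault‑injection scenarios and invoked from the automation framework.
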
Qualifications
  • Education

    Bachelor’s or Master’s degree in Computer Science, Software Engineering, Electrical/Computer Engineering, or related field.

  • Experience

    • 7+ years in software development, test automation, or validation engineering, with at least 2–3 years in a lead or managerial role.
    • Hands‑on experience with Linux, Python (or similar scripting), and CI/CD pipelines for hardware validation.
    • Exposure to GPU/AI server platforms, diagnostic tools, or data‑center/server validation is strongly preferred.
    • Strong technical depth in at least two of:
      • Test automation frameworks (pytest or similar).
      • Diagnostic or validation tools for CPU/GPU/network/storage.
    • Familiarity with some of the tools and concepts in the training plan, e.g. CUDA/NVML/DCGM, i2c‑tools, pciutils, fio, iperf3, stress‑ng, memtest86+, Redfish/IPMI, PXE provisioning, and BMC/BIOS basics.
    • Proven ability to design test strategies and frameworks, not just write individual test cases.
    • Strong debugging and root‑cause analysis mindset.
    • Good people‑leadership skills: coaching, delegation, feedback, and performance management.
    • Clear written and verbal communication for collaboration with cross‑functional teams and external partners.
