A technology company in Johor Bahru seeks a data center system operation engineer. The role involves overseeing GPU clusters, ensuring system efficiency, and resolving incidents. Candidates should have a Bachelor’s in Computer Science and 2+ years in system operations. Strong familiarity with GPU hardware and networking concepts is preferred. The company values cross-functional teamwork and systems documentation.
We are seeking a data center system operations engineer to join our team and support the daily operation of a state-of-the-art GPU cluster. The role ensures the reliability, scalability, and efficiency of our data center operations, supporting high-performance GPU infrastructure for cutting-edge AI workloads.
Key Responsibilities
• Oversee daily operations of GPU clusters and data center systems.
• Monitor system health, performance, and capacity using industry-standard tools and frameworks.
• Respond to and resolve operational incidents, ensuring minimal downtime and maximum availability.
• Manage the deployment, configuration, and optimization of GPU servers, network devices, and supporting infrastructure (e.g., CPU servers and storage).
• Perform hardware diagnostics and preventative maintenance for GPU servers, storage, and networking equipment.
• Troubleshoot system issues related to hardware, operating systems, and applications.
• Work closely with cross-functional teams, including network engineers, system administrators, and developers, to support AI workloads.
• Maintain accurate documentation for system configurations, processes, and incident reports.
• Implement and enforce security best practices in system operations.
• Identify and propose improvements to enhance system performance, reduce costs, and optimize resource utilization.
Desired Skills
• Familiarity with GPU hardware (e.g., NVIDIA GPUs) and AI/ML workloads is a strong advantage.
• Experience with storage systems (e.g., NVMe, SAN, NAS) and networking protocols (e.g., TCP/IP, RDMA) will be advantageous.
• Knowledge of operating ticketing systems and of troubleshooting processes in CPU/GPU clusters.
• Familiarity with networking concepts, including TCP/IP, VLANs, and load balancing.
• Experience in managing bare metal servers, GPU infrastructure, or high-performance computing systems will be an added advantage.
Qualifications
• Bachelor’s degree in Computer Science, Information Technology, Electrical Engineering, or a related field. Equivalent experience will be considered.
• 2+ years of experience in system operations within IT infrastructure or cloud services.
• Hands-on experience in IT hardware replacement.
• Experience in data center operations, system administration, or a similar role.
• Knowledge of server hardware, including GPU cards, CPU configurations, and storage solutions.
• Understanding of Linux fundamentals and Kubernetes environments.
• Familiarity with monitoring tools (e.g., Prometheus, Grafana) and logging frameworks.