System Operations Engineer (Data Center)

Neuron Solutions Sdn Bhd

Johor Bahru

On-site

MYR 60,000 - 80,000

Full time

3 days ago

Job summary

A technology company in Johor Bahru seeks a data center system operations engineer. The role involves overseeing GPU clusters, ensuring system efficiency, and resolving incidents. Candidates should have a Bachelor’s degree in Computer Science or a related field and 2+ years of experience in system operations. Familiarity with GPU hardware and networking concepts is preferred. The company values cross-functional teamwork and systems documentation.

Qualifications

  • 2+ years of experience in system operations within IT infrastructure or cloud services.
  • Hands-on experience in IT hardware replacement.
  • Experience in data center operations, system administration, or a similar role.

Responsibilities

  • Oversee daily operations of GPU clusters and data center systems.
  • Monitor system health, performance, and capacity using standard tools.
  • Respond to and resolve operational incidents.

Skills

Familiarity with GPU hardware
Experience with storage systems
Knowledge of operating ticketing systems
Networking concepts
Experience in managing bare metal servers

Education

Bachelor’s degree in Computer Science or related field

Tools

Prometheus
Grafana

Job description

We are seeking a data center system operations engineer to join our team and support daily operations in a state-of-the-art GPU cluster. The role ensures the reliability, scalability, and efficiency of our data center operations, supporting high-performance GPU infrastructure for cutting-edge AI workloads.

Key Responsibilities

• Oversee daily operations of GPU clusters and data center systems.

• Monitor system health, performance, and capacity using industry-standard tools and frameworks.

• Respond to and resolve operational incidents, ensuring minimal downtime and maximum availability.

• Manage the deployment, configuration, and optimization of GPU servers, network devices, and supporting infrastructure (e.g. CPU servers and storage).

• Perform hardware diagnostics and preventative maintenance for GPU servers, storage, and networking equipment.

• Troubleshoot system issues related to hardware, operating systems, and applications.

• Work closely with cross-functional teams, including network engineers, system administrators, and developers, to support AI workloads.

• Maintain accurate documentation for system configurations, processes, and incident reports.

• Implement and enforce security best practices in system operations.

• Identify and propose improvements to enhance system performance, reduce costs, and optimize resource utilization.

Desired Skills

• Familiarity with GPU hardware (e.g., NVIDIA GPUs) and AI/ML workloads is a strong advantage.

• Experience with storage systems (e.g., NVMe, SAN, NAS), networking concepts, and protocols (e.g., TCP/IP, RDMA) will be advantageous.

• Knowledge of operating ticketing systems and of troubleshooting processes in CPU/GPU clusters.

• Familiarity with networking concepts, including TCP/IP, VLANs, and load balancing.

• Experience in managing bare metal servers, GPU infrastructure, or high-performance computing systems will be an added advantage.

Qualifications

• Bachelor’s degree in Computer Science, Information Technology, Electrical Engineering, or a related field. Equivalent experience will be considered.

• 2+ years of experience in system operations within IT infrastructure or cloud services.

• Hands-on experience in IT hardware replacement.

• Experience in data center operations, system administration, or a similar role.

• Knowledge of server hardware, including GPU cards, CPU configurations, and storage solutions.

• Understanding of Linux fundamentals and Kubernetes environments.

• Familiarity with monitoring tools (e.g., Prometheus, Grafana) and logging frameworks.
