Responsible for installing, configuring, debugging, and maintaining physical servers to ensure stable operation.
Responsible for monitoring and optimizing server hardware resources (CPU, GPU, GPU memory, RAM, disk, network, etc.), and for promptly identifying and resolving performance bottlenecks.
Responsible for the daily maintenance of Linux servers, including system updates, patch installation, and kernel upgrades.
Write and optimize automation tools for operations and maintenance to make deployment, monitoring, and management more efficient.
Develop and implement emergency response plans for server failures; quickly locate and resolve faults to ensure business continuity.
Analyze the root causes of failures and produce reports that drive improvements in system stability.
Responsible for server security hardening, access control, log auditing, and security policy configuration to ensure system security.
Cooperate with the security team to complete vulnerability scanning, intrusion detection, and emergency response to security incidents.
Write and maintain server configuration records, operations manuals, troubleshooting procedures, and other technical documents.
Bachelor's degree or above in computer science, network engineering, information security, or a related field.
More than 3 years of server operations and maintenance experience; experience in large-scale internet, cloud service, or data center environments preferred.
Familiar with Linux distributions (CentOS, Ubuntu, Red Hat), with experience in system tuning and performance optimization.
Proficient in scripting and automation tools such as Shell, Python, Perl, and Ansible, with the ability to automate operations and maintenance tasks.
Familiar with virtualization or container technologies such as Docker, Kubernetes, VMware, and OpenStack.
Familiar with storage technologies such as RAID, LVM, NFS, and iSCSI, with experience in storage system operations and maintenance.
Excellent analytical and troubleshooting skills, with the ability to respond quickly and solve complex problems.
Good communication skills and a strong team spirit, with the ability to work closely with development, network, and security teams.
Priority will be given to candidates who hold certifications such as RHCE, LPIC, or MCSE.
Candidates with experience in using monitoring platforms such as Prometheus, Grafana, and Zabbix are preferred.
Familiarity with GPU server operations and maintenance or high-performance computing (HPC) environments is preferred.