Site Reliability Engineer (Linux/Kernel)
Job Description:
We are looking for a skilled Site Reliability Engineer to join our client's global SRE Team in Singapore.
Responsibilities:
- Overseeing and ensuring the continuous operation of the firm's Linux-based trading infrastructure, addressing day-to-day operational needs.
- Providing second-level support, including:
  - Rapid response to emergencies.
  - Implementing scheduled updates and deployments.
  - In-depth analysis and resolution of performance issues.
- Participating in a rotational on-call schedule, including early-morning and occasional weekend shifts, to provide timely support.
- Contributing towards the development of automated solutions for server provisioning, configuration, and monitoring, targeting scalable management of thousands of servers.
- Collaborating with the Trading and Core Engineering teams.
- Managing essential core services such as DHCP, LDAP, DNS, and NFS across on-premises and hosted data centers as well as public clouds.
Qualifications:
- Sound expertise in Linux production environments.
- Basic knowledge of Python and Bash scripting.
- Experience with automation and monitoring toolsets.
- Comprehensive knowledge of operating system principles, with a particular focus on Linux internals.
- Familiarity with Intel-based server hardware and components.
- Competence in server-side networking, including understanding network protocols and configurations.
- Familiarity with cloud services and architectural solutions.
- Experience in designing, building, and troubleshooting complex systems.
- Good problem-solving skills, underpinned by a methodical approach to technical challenges.
- Effective communication and strong interpersonal skills, a sense of responsibility, and a commitment to driving projects to completion.