We are looking for a skilled Site Reliability Engineer to join our client's global SRE Team in Singapore.
Responsibilities:
Overseeing and ensuring the continuous operation of the firm's Linux-based trading infrastructure, addressing day-to-day operational needs.
Providing second-level support, including rapid response to emergencies, implementation of scheduled updates and deployments, and in-depth analysis and resolution of performance issues.
Engaging in a rotational on-call schedule, including early morning and weekend shifts, to provide timely support.
Contributing to the development of automated solutions for server provisioning, configuration, and monitoring, with the goal of managing thousands of servers at scale.
Collaborating with the Trading and Core Engineering teams.
Managing essential core services such as DHCP, LDAP, DNS, and NFS across on-premises and hosted data centers as well as public clouds.
Participating in an on-call rotation and occasional weekend shifts.
Qualifications:
Sound expertise in Linux production environments.
Basic knowledge of Python and Bash scripting.
Hands-on experience with automation and monitoring toolsets.
Comprehensive knowledge of operating system principles, with a particular focus on Linux internals.
Familiarity with Intel-based server hardware and components.
Competence in server-side networking, including understanding network protocols and configurations.
Familiarity with cloud services and architectural solutions.
Experience in designing, building, and troubleshooting complex systems.
Good problem-solving skills, underpinned by a methodical approach to technical challenges.
Effective communication and strong interpersonal skills, a sense of responsibility, and a commitment to driving projects to completion.