Site Reliability Engineer (Linux Kernel, Kubernetes, Cloud, Automation, Networking)
EXASOFT CONSULTING PTE. LTD.
Singapore
On-site
SGD 120,000 - 150,000
Full time
Job summary
A financial technology firm in Singapore seeks a Senior Systems Engineer with over 10 years of experience in system administration for financial markets infrastructure. The role encompasses developing performance-oriented solutions, managing hybrid cloud setups, and employing automation tools like Kubernetes and Ansible. Candidates must possess advanced Linux skills and a strong knowledge of cloud operations. Competitive salary and benefits are offered.
Qualifications
- 10+ years of experience in system administration and performance engineering.
- Advanced proficiency in Linux internals and kernel performance tuning.
- Hands-on experience with Kubernetes and Docker for automation.
Responsibilities
- Develop performance-critical infrastructure for financial markets.
- Build high-availability environments using Kubernetes and Docker.
- Manage hybrid cloud infrastructure with strict performance SLAs.
Skills
Linux kernel expertise
Kubernetes
Docker
Ansible
Bash
Python
AWS
Azure
GCP
Networking protocols
Tools
ELK Stack
Grafana
Splunk
VMware
Responsibilities
- Develop and oversee performance-critical infrastructure for financial markets, ensuring maximum throughput, high resiliency, and minimal operational risk.
- Leverage deep Linux kernel expertise to fine-tune scheduling policies, interrupt routing, and NUMA resource allocation, ensuring predictable performance at scale.
- Build and maintain high-availability containerized environments using Kubernetes, Docker, and advanced orchestration tools with a strong focus on scalability and security.
- Lead automation initiatives with Ansible, Bash, and Python, eliminating manual intervention and improving system efficiency.
- Manage hybrid cloud infrastructure (AWS, Azure, GCP) with strict performance SLAs, security compliance, and cost-optimized deployments.
- Oversee infrastructure monitoring and observability using ELK Stack, Grafana, Site24x7, Splunk, and other enterprise-grade tools, ensuring proactive incident detection and resolution.
- Administer and troubleshoot enterprise storage and networking stacks such as RAID, NFS, SAN/NAS, TCP/IP networking, VMware/vCenter, and BIG-IP load balancers.
- Collaborate with development, DevOps, and security teams to design fault-tolerant systems and enforce infrastructure governance policies.
- Perform predictive capacity modeling, OS hardening, and patch compliance, together with benchmark-driven performance optimization for trading and real-time compute platforms.
- Provide expert-level outage resolution, coordinating cross-functional teams to deliver sustainable remediation and operational resilience.
Requirements
- 10+ years of progressive experience in system administration, performance engineering, and reliability operations across enterprise and financial domains.
- Advanced proficiency in Linux internals with specialization in kernel performance tuning, NUMA-aware optimizations, and real-time workload handling.
- Proven hands-on experience with Kubernetes, Docker, and Ansible for large-scale automation and orchestration.
- Strong scripting and programming skills in Bash and Python, plus experience with perf/eBPF for system analysis.
- Demonstrated expertise in cloud operations across AWS, Azure, and GCP.
- Strong background in networking protocols (TCP/IP, FIX) and high-performance trading environments.
- Familiarity with storage systems (SAN, NAS, RAID) and database tuning (MySQL optimization).
- Experience implementing observability and monitoring solutions such as ELK, Grafana, Splunk, and Corvil.