Key Responsibilities
- Cloud Infrastructure Operations: Maintain and manage AWS services (Lambda, ECS, EKS, Redshift, Glue, SES, GuardDuty, etc.) in production, ensuring uptime, availability, and secure operations.
- Incident Management: Monitor infrastructure, manage alerts, and provide timely resolution of production incidents.
- Infrastructure-as-Code (IaC): Design and maintain infrastructure deployment pipelines using tools like Terraform, CloudFormation, and Ansible.
- Patch and Lifecycle Management: Oversee patch management for RHEL and Windows environments using AWS Patch Manager, WSUS, and YUM/DNF, ensuring compliance with security standards.
- SSL & EOL Management: Track SSL certificate renewals and manage end-of-life components like OS versions and Lambda runtimes.
- Tool Integration & Monitoring: Integrate and optimize tools such as NGINX, and work with SRE teams to enhance infrastructure monitoring.
- Documentation & Reporting: Maintain accurate and up-to-date documentation (runbooks, change logs, post-mortems, and audit reports).
- Collaboration & Mentorship: Collaborate with cross-functional teams and mentor junior engineers in cloud operations and best practices.
- Security & Compliance: Ensure infrastructure adheres to strict security policies and to compliance and audit requirements.
- Continuous Improvement: Drive automation, performance optimizations, and proactive incident prevention to enhance overall cloud operations.
Key Requirements
- Education: Bachelor’s degree in Computer Science, Information Systems, or a related field.
- Experience: At least 6 years in DevOps/SRE roles, with a minimum of 4 years in public sector or regulated cloud environments.
- Cloud Expertise: Hands-on experience running AWS services in production, such as Lambda, ECS, and EKS.
- IaC Skills: Proficiency in Terraform, CloudFormation, and Ansible for infrastructure automation.
- OS Administration: Strong administration skills in RHEL (v8→v9) and Windows Server (2016→2025).
- Patching Expertise: Experience managing patches across multiple operating systems using AWS Patch Manager, WSUS, and YUM/DNF.
- Security & Compliance: Knowledge of SSL certificate management and end-of-life (EOL) remediation processes.
- Incident Management & Troubleshooting: Strong problem-solving and incident management skills with the ability to troubleshoot complex systems.
- Soft Skills: Excellent communication, collaboration, adaptability, and time management, with a continuous-learning mindset.
To apply, please email your updated resume to weizhe.teoh@tg-hr.com
We regret that only shortlisted candidates will be notified.
CEI: R25127749
EA License: 14C7275