Overview
At Xtremax, our Operations Support Engineers play a key role in ensuring the reliability, stability, and efficiency of mission-critical systems. In this role, you’ll work closely with developers, product managers, and user support teams to monitor system performance, resolve technical issues, and implement preventive measures. Your contributions will directly support smooth operations, high uptime, and reliable services for our users. Candidates with public sector experience are preferred, as this role supports IT projects for government agencies.
Responsibilities
System Monitoring & Performance
- Monitor and analyse product runtime environments (production and non-production) to ensure optimal system performance.
- Implement continuous improvement strategies to enhance system reliability and efficiency.
Incident & Problem Management
- Manage application and security incidents, performing problem determination and coordinating with internal teams and vendors for resolution.
- Escalate issues as necessary to minimise business impact.
Operational Processes & Compliance
- Develop and maintain operations and process guides to meet audit and compliance requirements.
- Handle day-to-day operational activities, analyse performance data, and prepare status reports for stakeholders and management.
- Ensure operational processes align with IM8 and ISO 27001 standards.
- Conduct periodic compliance drills and support audit preparation.
Team Coordination & Support
- Lead and coordinate with operations teams and vendors to ensure 24/7 system support availability.
- Facilitate communication between teams to resolve operational issues efficiently.
Automation & Proactive Operations
- Build self-healing systems with automated remediation for common alerts.
- Implement Infrastructure as Code (IaC) pipelines to reduce manual configuration drift.
Observability & Incident Readiness
- Deploy full-stack monitoring with predictive analytics (e.g., CloudWatch Anomaly Detection, Stackdriver, Azure Monitor).
- Integrate alerting with central NOC/SOC for faster escalation and resolution.
Collaboration & Enablement
- Serve as the bridge between app teams and infra teams, enabling self-service for troubleshooting.
- Train agency teams on operational best practices and tool adoption (e.g., ITSM workflows, DevOps pipelines).
Must Have
- A Bachelor’s degree in Computer Science, Information Technology, or a related field.
- 2–5 years of relevant experience.
- Proven experience as an Operations Engineer or in a similar IT role.
- Familiarity with ITSM tools (e.g., Remedy, Zendesk, ServiceDesk) for change and incident management workflows.
- Experience in implementing security and access controls for production and test environments.
- Proficiency with full-stack monitoring tools (e.g., APM tools, CloudWatch, Stackdriver, OpenAPM stack).
- Experience with automation tools (e.g., Terraform, Ansible) to minimise downtime and reduce human error.
- Knowledge of agile methodologies, DevOps pipelines, test-driven development, and information security practices.
- Cloud infrastructure experience.
- Strong problem-solving and communication skills, with the ability to explain complex issues to non-technical audiences.
- A collaborative, resourceful mindset with the ability to deliver innovative solutions.
- Experience with Linux and Windows system administration.
Good to Have
- Experience with Singapore Government projects.
- Database and scripting experience (shell script, PowerShell, or Python).
Preferred Certifications
- AWS Certified DevOps Engineer – Professional
- Microsoft Certified: Azure DevOps Engineer Expert
- Google Professional Cloud DevOps Engineer
- HashiCorp Certified: Terraform Associate
- Certified Kubernetes Administrator (CKA)
- ITIL 4 Foundation