Site Reliability Engineering

INFINITE COMPUTER SOLUTIONS PTE LTD

Singapore

On-site

USD 60,000 - 100,000

Full time

14 days ago

Job summary

An established industry player is seeking a Managed Services Cross Technology Engineer (L2) SRE to ensure operational excellence in IT infrastructure. This role emphasizes automation, incident management, and monitoring to maintain high availability and performance. You will work with cutting-edge tools and practices to optimize systems and enhance client experiences. If you are passionate about bridging development and operations while advocating for best practices, this opportunity is perfect for you. Join a dynamic team committed to innovation and excellence in managed services.

Qualifications

  • Bachelor's degree or equivalent in IT/Computing or relevant experience.
  • Certifications in Microsoft, AWS, or VMware are a plus.

Responsibilities

  • Develop automation scripts to reduce manual intervention.
  • Set up monitoring and alerting using tools like Prometheus and Grafana.
  • Participate in on-call rotations and manage incident responses.

Skills

Infrastructure Monitoring & Observability
Incident & Problem Management
CI/CD and Deployment Automation
Linux Systems Administration
Cloud Platforms (AWS, GCP, Azure)
Scripting (PowerShell, Bash, Python)
Disaster Recovery & High Availability
ITIL / SRE Best Practices

Education

Bachelor's degree in IT/Computing
Relevant Certifications (Microsoft, AWS, VMware)

Tools

Ansible
Puppet
Prometheus
Grafana
PagerDuty
Jenkins
GitLab CI/CD
ServiceNow

Job description

Job Description Summary

The Managed Services Cross Technology Engineer (L2) SRE is a developing engineering role, responsible for providing a managed service to clients to ensure that their IT infrastructure and systems remain operational.

By proactively monitoring, identifying, investigating, and resolving technical incidents and problems, the Managed Services Cross Technology Engineer (L2) SRE restores service to clients.

The primary objective of this role is to ensure that systems are reliable, scalable, and efficient, with minimal manual intervention.


From an operations perspective:

  • Ensuring High Availability and Uptime

Keep production systems running smoothly and within defined Service Level Objectives (SLOs).

Minimize downtime and reduce Mean Time to Recovery (MTTR) during incidents.

  • Automating Operations

Identify and eliminate manual, repetitive tasks (also called “toil”).

Build automation for deployment, monitoring, incident response, and infrastructure management.

  • Managing Incidents and On-Call

Respond quickly to outages and performance degradation.

Lead or participate in incident response, create postmortems, and implement preventative measures.

  • Monitoring and Observability

Set up monitoring, logging, and alerting tools to detect issues before they impact users.

Provide visibility into system health and performance.

  • Capacity Planning and Performance Optimization

Ensure systems can handle current and future loads.

Optimize infrastructure usage and reduce waste (cost-efficiency).

  • Bridging Development and Operations

Advocate for and implement DevOps and SRE best practices.

SREs make sure that production systems are always available, fast, and efficient by combining software engineering with traditional IT operations practices.
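As an illustration of the SLO and MTTR terms used above, the arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not the employer's tooling: the function names, the 99.9% availability target, the 30-day window, and the incident durations are all hypothetical values chosen for the example.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO allows within a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return window * (1.0 - slo_target)


def mttr(recovery_times: list[timedelta]) -> timedelta:
    """Return Mean Time to Recovery: the average per-incident recovery time."""
    if not recovery_times:
        raise ValueError("need at least one incident to compute MTTR")
    return sum(recovery_times, timedelta()) / len(recovery_times)


if __name__ == "__main__":
    # Hypothetical figures: a 99.9% SLO measured over a 30-day window.
    budget = error_budget(0.999, timedelta(days=30))

    # Hypothetical recovery times for three incidents in that window.
    incidents = [
        timedelta(minutes=10),
        timedelta(minutes=12),
        timedelta(minutes=8),
    ]
    downtime = sum(incidents, timedelta())

    print("Allowed downtime:", budget)
    print("MTTR:", mttr(incidents))
    print("Budget remaining:", budget - downtime)
```

A shrinking remaining budget is one common signal for prioritising reliability work over new changes; how a given team acts on it is a process decision that this posting does not specify.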

This role may also contribute to or support project work as and when required.

Job Description

Key Responsibilities:

  • Develops automation scripts to reduce manual intervention, cutting recurring operational toil.
  • Sets up and maintains monitoring and alerting using tools like Prometheus, Grafana, and PagerDuty.
  • Participates in on-call rotations, driving fast resolution of P1/P2 incidents and contributing to root cause analysis and postmortem documentation.
  • Develops and maintains deployment pipelines (Jenkins, GitLab CI/CD).
  • Hardens security and compliance across production systems via configuration management and patching.
  • Proactively monitors the work queues.
  • Performs operational tasks to resolve all incidents/requests in a timely manner and within the agreed SLA.
  • Updates tickets with resolution tasks performed.
  • Identifies, investigates, and analyses issues and errors before or as they occur, and logs all such incidents in a timely manner.
  • Captures all required and relevant information for immediate resolution.
  • Provides second-level support for all incidents and requests, and identifies the root cause of incidents and problems.
  • Communicates with other teams and clients to extend support.
  • Executes changes with clear identification of risks and mitigation plans to be captured into the change record.
  • Follows the shift handover process, where applicable, highlighting key tickets to focus on and handing over upcoming critical tasks for the next shift.
  • Escalates tickets to obtain the right focus from the CoE and other teams and, if needed, continues the escalation to management.
  • Works with automation teams for effort optimization and automating routine tasks.
  • Works across other resolver groups (internal and external), such as service providers and TAC.
  • Identifies problems and errors before they impact a client’s service.
  • Leads and manages all initial client escalations for operational issues.
  • Contributes to the change management process by logging all change requests with complete details, for both standard and non-standard changes, including patching and any other changes to Configuration Items.
  • Ensures all changes are carried out with proper change approvals.
  • Plans and executes approved maintenance activities.
  • Audits and analyses incident and request tickets for quality and recommends improvements with updates to knowledge articles.
  • Produces trend analysis reports for identifying tasks for automation, leading to a reduction in tickets and optimization of effort.
  • May also contribute to or support project work as and when required.
  • May work on implementing and delivering Disaster Recovery functions and tests.
  • Performs any other related task as required.

Knowledge and Attributes:

  • Ability to communicate and work across different cultures and social groups.
  • Ability to plan activities and projects well in advance, taking possible changing circumstances into account.
  • Ability to maintain a positive outlook at work.
  • Ability to work well in a pressurized environment.
  • Ability to work hard and put in longer hours when it is necessary.
  • Ability to apply active listening techniques such as paraphrasing the message to confirm understanding, probing for further relevant information, and refraining from interrupting.
  • Ability to adapt to changing circumstances.
  • Ability to place clients at the forefront of all interactions, understanding their requirements, and creating a positive client experience throughout the total client journey.

Academic Qualifications and Certifications:

  • Bachelor's degree or equivalent qualification in IT/Computing (or demonstrated equivalent work experience).
  • Certifications relevant to the services provided (certifications carry additional weight in assessing a candidate's qualification for the role).
  • Professional certifications include (but are not limited to):
    • Microsoft Certified
    • AWS Certified
    • VMware Certified Professional
    • Google Cloud Platform (GCP)
    • VMware Certified Cloud Management and Automation
    • SRE Certifications

Required Experience:

  • Infrastructure Monitoring, Observability & Telemetry
  • Incident & Problem Management
  • CI/CD and Deployment Automation
  • Linux Systems Administration
  • Configuration Management (Ansible, Puppet, etc.)
  • Cloud Platforms (AWS, GCP, Azure)
  • Scripting (PowerShell, Bash, Python)
  • Disaster Recovery & High Availability
  • ITIL / SRE Best Practices
  • Familiarity with JSON (data formatting and processing)
  • Familiarity with APIs, automation, Ansible, CI/CD, etc.
  • Moderate level of relevant managed services experience handling cross-technology infrastructure.
  • Moderate level of knowledge of ticketing tools, preferably ServiceNow.
  • Moderate level of working knowledge of ITIL processes.
  • Moderate level of experience working with vendors and/or third parties.

EA License # 14C6941
