Lead Systems Engineer (DevOps & SRE)

EPAM Systems

Singapore

On-site

SGD 90,000 - 120,000

Full time

5 days ago

Job summary

A leading technology solutions firm in Singapore is looking for a Lead Systems Engineer to ensure the reliability and scalability of their infrastructure. This role involves implementing CI/CD pipelines, mentoring teams, and collaborating on best practices. The ideal candidate has over 8 years of experience in a DevOps/SRE role with strong cloud and containerization expertise. This position offers competitive compensation and opportunities for professional growth.

Qualifications

  • 8+ years of experience in a DevOps/SRE role.
  • Strong experience with cloud platforms (AWS, GCP, Azure).
  • Proficiency in infrastructure as code (IaC) tools (Terraform, CloudFormation, etc.).
  • Extensive experience with containerization and orchestration (Docker, Kubernetes).
  • Strong knowledge of CI/CD tools (Jenkins, GitLab CI, CircleCI, etc.).
  • Proficiency in scripting languages (Python, Bash, etc.).
  • Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack, etc.).
  • Excellent problem-solving skills and the ability to work under pressure.
  • Strong communication and collaboration skills.
  • English proficiency at B2 level or higher.

Responsibilities

  • Lead the design, development, and maintenance of scalable infrastructure.
  • Implement and manage CI/CD pipelines.
  • Monitor system performance and reliability.
  • Develop and maintain automation tools.
  • Collaborate with development teams on best practices.
  • Ensure security and compliance across operations.
  • Mentor junior SREs and DevOps engineers.
  • Conduct root cause analysis of system failures.
  • Optimize resource utilization for cost-effective operations.

Skills

CI/CD
Jenkins
Docker
Kubernetes
Terraform
Ansible
Python
Prometheus
Grafana
ELK stack
Splunk
Dynatrace
Datadog

Job description

Join our organization as a Lead Systems Engineer (DevOps & SRE) and play a crucial role in ensuring the reliability, scalability, capacity planning, and performance of our infrastructure and applications. The ideal candidate will have a strong background in software engineering, system administration, containerization, and cloud technologies, and will lead the design, development, and maintenance of scalable and reliable infrastructure. You will also be responsible for implementing and managing CI/CD pipelines, monitoring system performance and reliability, developing and maintaining automation tools, ensuring security and compliance, mentoring and guiding junior SREs and DevOps engineers, and staying up to date with the latest industry trends and technologies.

Responsibilities

  • Lead the design, development, and maintenance of scalable and reliable infrastructure.
  • Implement and manage CI/CD pipelines to ensure efficient and smooth software releases.
  • Monitor system performance and reliability, proactively identifying and resolving issues.
  • Develop and maintain automation tools to streamline infrastructure management and deployment processes.
  • Collaborate with development teams to ensure best practices for software development, deployment, and operations.
  • Ensure security and compliance across all infrastructure and operations.
  • Mentor and guide junior SREs and DevOps engineers, fostering a culture of collaboration and continuous learning.
  • Conduct root cause analysis of system failures and implement solutions to prevent recurrence.
  • Optimize resource utilization to ensure cost-effective operations.
  • Stay up to date with the latest industry trends and technologies, integrating them into our processes where appropriate.

Requirements

  • 8+ years of experience in a DevOps/SRE role.
  • Strong experience with cloud platforms (AWS, GCP, Azure).
  • Proficiency in infrastructure as code (IaC) tools (Terraform, CloudFormation, etc.).
  • Extensive experience with containerization and orchestration (Docker, Kubernetes).
  • Strong knowledge of CI/CD tools (Jenkins, GitLab CI, CircleCI, etc.).
  • Proficiency in scripting languages (Python, Bash, etc.).
  • Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack, etc.).
  • Ability to participate in capacity planning and scalability assessments to support business growth and requirements.
  • Strong understanding of SLI, SLO, SLA, and error budget concepts and their implementation.
  • Willingness to provide on-call support and participate in incident management and response activities as needed.
  • Solid understanding of networking and security principles.
  • Excellent problem-solving skills and the ability to work under pressure.
  • Strong communication and collaboration skills.
  • English proficiency at B2 level or higher.

Technologies

CI/CD, Jenkins, Docker, Kubernetes, Terraform, Ansible, Python, Prometheus, Grafana, ELK stack, Splunk, Dynatrace, Datadog (or similar), and SLI, SLO, SLA, and error budget concepts.