Senior Site Reliability Engineer
Menteng, Jakarta – Salary IDR120,000,000 to IDR360,000,000 – PT ALTO Network
Menteng, Jakarta – Salary IDR20,000,000 to IDR25,000,000 – BookCabin
Posted today
Job Description
- Demonstrate excellent change management in implementing changes safely and efficiently in the production environment.
- Demonstrate excellent incident and problem response, resolving issues within SLA.
- Demonstrate excellent service request handling from other parties within SLA.
- Demonstrate excellent efficiency in automating tasks and reducing manual effort.
- Demonstrate excellent implementation of a comprehensive monitoring system to detect issues early and respond proactively.
- Demonstrate strong curiosity in investigating root causes and reviewing root cause analyses.
- Demonstrate excellence in reviewing system performance and producing an action plan.
- Demonstrate excellent problem‑solving and produce an action plan.
- Demonstrate excellence in reviewing change activity in production.
- Take responsibility for incident and problem resolution.
- Enable process automation for each product.
- Understand customer (internal & external) needs and deliver the expected outcomes.
- Execute plans and strategies.
- Foster a customer‑focused working environment with clear responsibilities and expectations.
- Create and execute deployment strategies.
- Establish and maintain active and constructive relationships with other internal teams in the organization.
- Address and close risk and audit findings.
- Support good corporate governance within your specific area of work.
Responsibilities
- Design, build, and maintain scalable, reliable, and secure infrastructure across AWS (including Elastic Beanstalk) and Azure.
- Develop and manage CI/CD pipelines using Azure DevOps, GitHub Actions, or similar tools to ensure smooth and automated deployments.
- Operate, monitor, and troubleshoot Kubernetes clusters (EKS, AKS, or self‑managed) to ensure system stability and uptime.
- Implement comprehensive observability solutions using Prometheus, Grafana, Loki, and Alertmanager.
- Automate infrastructure provisioning and configuration using Terraform, Helm, CloudFormation, and/or Ansible.
- Define, measure, and improve system reliability through SLOs, SLIs, and SLAs.
- Enhance system resilience and incident response through proactive monitoring and capacity planning.
- Manage secrets, access control, and security policies to maintain a robust and compliant infrastructure.
- Participate in on‑call rotations, respond to incidents, and drive root cause analysis and post‑incident reviews.
- Collaborate closely with development teams to embed reliability and scalability best practices throughout the software lifecycle.
Qualifications
- At least 5 years of proven experience as an SRE, IT Support, Application Support, or System Engineer, or in a similar position.
- CKA (Certified Kubernetes Administrator) certification would be a plus.
Knowledge
- ISO 8583
- REST API
- Networking
- Postman / API testing
- ITIL v4 / IT Service Management
- Agile Methodology
- Financial Digital Product (Biller, disbursement, virtual account, QR)
Non-Technical
- Reporting and emergency response planning
- Strong relationship management
- Excellent communication and interpersonal skills
- Strong motivational and empowerment skills
- Committed and reliable
- Outstanding organizational and leadership skills
- Take initiative and remain calm under pressure
Technical
- Docker
- Windows
- SQL queries
- CI/CD
- Scripting (bash/python)