Role & responsibilities
- Architect, design, and maintain highly available, scalable, and resilient infrastructure to support business-critical applications.
- Lead the implementation and management of Infrastructure as Code (IaC) using AWS CDK, ensuring infrastructure is automated, repeatable, and secure.
- Develop and optimize automation for deployments, configuration management, and infrastructure provisioning across cloud (AWS) and container orchestration platforms (Kubernetes, EKS, ECS).
- Enhance and maintain CI/CD pipelines, ensuring smooth and automated application and infrastructure deployments.
- Design and implement monitoring and observability solutions using tools such as Datadog, Prometheus, and Grafana to proactively identify and resolve performance bottlenecks and failures.
- Collaborate with development teams to ensure infrastructure aligns with application requirements and follows best practices for performance, security, and cost efficiency.
- Lead incident response and root cause analysis efforts, ensuring high levels of service availability and quick resolution of infrastructure issues.
- Continuously improve infrastructure performance, scalability, and reliability through best practices, automation, and innovation.
- Mentor and coach junior engineers, sharing knowledge, best practices, and expertise in site reliability engineering.
- Stay up to date with trends and advancements in cloud computing, containerization, and DevOps methodologies to drive improvements in our technology stack.
Preferred candidate profile
- 6-10+ years of experience in Site Reliability Engineering, DevOps, or a related field.
- Expertise in cloud computing, particularly AWS, with deep knowledge of infrastructure design and best practices.
- Experience with multi-cloud environments, including Azure and GCP, is highly desirable.
- Proficiency with AWS CDK is essential, with additional experience in Terraform and Ansible considered a strong advantage.
- Strong experience with Kubernetes and container orchestration platforms (EKS, ECS), including deploying, scaling, and managing workloads.
- Extensive experience with CI/CD tools and practices, with hands-on expertise in automating infrastructure (EKS, ALB, NLB, Route 53, WAF, network components) and application deployments.
- Advanced scripting and programming skills (Python, Bash, or similar) for automation and infrastructure management.
- In-depth knowledge of monitoring, logging, and observability tools (Datadog, Prometheus, Grafana, ELK, etc.).
- Knowledge of Content Delivery Networks (CDNs) for optimizing application performance and scalability is preferred.
- Strong troubleshooting and problem-solving skills, with a proactive approach to incident management and root cause analysis.
- Strong application knowledge, including building and deploying Java Spring Boot and Angular applications.
- Experience in setting up unit tests and code quality tools, such as SonarQube, to ensure robust application development.
- Proven ability to work independently and lead initiatives while collaborating with cross-functional teams.
- Excellent communication and leadership skills, with experience mentoring junior engineers and driving technical excellence.