Following the creation of a new internal structure, we are looking for an experienced Site Reliability Engineer (SRE) to join our Infrastructure team.
Responsibilities:
- System Reliability: Ensuring the reliability and availability of our platforms and technological systems through robust monitoring, reporting, and incident response procedures.
- Infrastructure Automation: Automating the deployment, scaling, and management of services and infrastructure components for critical applications such as digital channels and branches.
- Resource Planning: Collaborating with cross-functional teams to forecast and plan future resource requirements for all infrastructure systems.
- Performance Optimization: Analyzing platform performance to improve efficiency, ensuring an optimal experience for users and end customers.
- Incident Management Support: Participating in troubleshooting sessions, supporting operational and application teams, analyzing monitoring data and root causes, and proposing solutions.
- Security: Supporting the implementation and maintenance of security best practices, and participating in vulnerability assessments and threat mitigation.
- Continuous Improvement: Improving system reliability through root cause analysis, incident reporting, and proactive maintenance and evolution of systems and platforms.
Required Experience:
- Excellent knowledge of Terraform and Ansible
- Understanding of containerization technologies (e.g., Docker, containerd)
- Expertise in Kubernetes management and components (e.g., ingresses, monitoring stacks, custom autoscalers)
- Strong troubleshooting skills
- Understanding of delivery systems (e.g., Helm, GitOps)
- Knowledge of at least one major cloud provider
- Scripting and programming skills (e.g., Bash, Python, Go)
- Understanding of networking
- Experience with databases (e.g., Oracle DB, MongoDB, PostgreSQL)
Nice to Have:
- Experience with GCP, AWS, Azure
- Experience with distributed systems such as caching systems (e.g., Redis), message brokers (e.g., RabbitMQ), log collection systems (e.g., ELK)
What We Offer:
- Autonomy and responsibility: freedom to choose, try, fail, and learn
- Career growth: evaluations every six months to guide your development
- Continuous training: access to courses and learning opportunities with industry experts
Location: Reggio Emilia, Italy