The purpose of the Site Reliability Engineer (SRE) role is to enhance and maintain the high availability and reliability of systems and applications, ensuring they effectively support business operations and contribute to a positive user experience. This role combines practices from software engineering and operations to create robust and efficient systems. Responsibilities include:
- Enhancing system availability and reliability.
- Managing incidents proactively to minimize downtime.
- Optimizing system performance and scalability.
- Implementing automation to improve operational efficiency.
- Collaborating with security teams to enhance system protection.
- Maintaining documentation for knowledge sharing.
- Working with development teams to embed reliability in design.
- Continuously evaluating and optimizing system performance and processes.
- Supporting business growth through reliable infrastructure.
What does the SRE do? (tasks) :
- Participate in architecture decisions to ensure system resiliency from the start of software development.
Automation and Orchestration :
- Develop scripts and use tools to automate deployment, infrastructure provisioning, configuration management, and scaling, using CI/CD practices.
- Orchestrate workflows across environments to ensure consistency and reliability.
CI/CD :
- Design, implement, and manage CI/CD pipelines for rapid, reliable code deployment with minimal manual intervention, including automated testing.
Infrastructure as Code :
- Promote the use of IaC tools and practices for reproducible, scalable, and maintainable environments.
Monitoring, Logging, and Alerting :
- Implement monitoring and logging solutions to analyze performance data and generate alerts.
- Use observability data to proactively resolve issues, ensuring high availability.
Performance Optimization :
- Regularly assess and optimize system response times and resource use to improve user satisfaction.
Incident Management and Reliability Engineering :
- Participate in on-call rotations, resolve incidents swiftly, and conduct post-mortems to prevent recurrence.
- Develop resilience and recovery strategies to meet SLOs.
Security and Compliance :
- Ensure adherence to security and compliance standards in all operations.
- Conduct security audits and address vulnerabilities.
Quality Assurance (QA) :
- Support QA by setting up environments and deploying tools.
- Collaborate with QA teams to automate testing and evaluate the outcomes of non-functional tests.
Responsibilities :
- Design, build, and scale systems using automation; develop automation scripts.
- Lead incident management, conduct post-mortems, and develop preventive strategies.
- Define and monitor reliability metrics; analyze data for improvements.
- Collaborate with development teams to ensure reliability from the design phase; promote SRE principles.
- Lead capacity planning and scalability strategies.
- Identify inefficiencies and champion new technologies.
- Strengthen system security through security initiatives and vulnerability management.
Mandatory Skills :
- Monitoring, Logging, Observability : Advanced strategies.
- Automation : Proficiency in Python and Bash.
- Configuration as Code : Advanced skills in Ansible.
- Containerization and Orchestration : Intermediate Docker and basic Kubernetes.
- Databases : Advanced management of relational and non-relational databases.
- Version Control : Advanced Git proficiency.
Recommended Skills :
- Infrastructure as Code : Advanced Terraform skills.
- Programming : Proficiency in Java and Spring Boot.
- Cloud Platforms : Advanced knowledge.
- Networking and Security : Advanced understanding.
- Databases : Advanced management.
- CI/CD : Knowledge and experience.
Soft Skills :
- Effective communication.
- Teamwork and collaboration.
- Problem-solving skills.
- Adaptability and resilience.
- Customer-focused mindset.
- Leadership and time management.