Site Reliability Engineering (SRE) is a discipline that applies software engineering practices to infrastructure and operations problems. Its main goal is to create scalable and highly reliable software systems.
Responsibilities and Duties
- Use software tools and automation to continuously monitor applications and keep them reliable;
- Act swiftly in response to emergency situations impacting system reliability in production environments, performing root cause analysis for ongoing incidents;
- Oversee and streamline change management processes to enhance system performance and reliability, and own releases to production environments;
- Work closely with development teams throughout the software lifecycle, focusing on resolving system-related issues and automating routine tasks to improve productivity;
- Ensure the reliability and scalability of systems, maintaining high performance and efficiency standards;
- Proficiency in monitoring tools such as Azure Monitor, Application Insights, Prometheus, and Grafana, and in project tracking and version control tools such as Jira, SVN, and GitHub;
- Experience with Infrastructure as Code tools such as Terraform, ARM / Bicep, and Pulumi, and with release management tools such as Argo CD, Harness, and Octopus;
- Experience with incident alerting tools such as PagerDuty and Opsgenie, and with container orchestration platforms such as Kubernetes and AKS.
About Encora
Encora is a leading digital engineering and modernization partner for top enterprises and digital-native companies worldwide. With over 9,000 experts across 47+ offices and innovation labs, our practices include Product Engineering & Development, Cloud Services, Quality Engineering, DevSecOps, Data & Analytics, Digital Experience, Cybersecurity, and AI & LLM Engineering.
At Encora, we hire professionals based solely on skills and qualifications, without discrimination based on age, disability, religion, gender, sexual orientation, socioeconomic status, or nationality.