We are looking for a skilled Site Reliability Engineer (SRE) with expertise in Ansible and Linux to join our dynamic team. The successful candidate will play a critical role in maintaining the reliability, scalability, and performance of our infrastructure, driving automation, and collaborating with development teams to optimize system efficiency.
Key Responsibilities
- Infrastructure Automation
  - Automate and maintain IT infrastructure using Ansible to streamline operations.
- System Administration (Linux and Windows)
  - Manage virtual and physical Windows and Linux servers.
  - Automate server patching and updates to ensure systems remain current.
  - Implement automated security measures for all servers.
  - Monitor server performance and health.
  - Maintain comprehensive system documentation, including configuration and troubleshooting guides.
  - Conduct troubleshooting and root cause analysis as needed.
  - Ensure robust backup, disaster recovery, and business continuity plans are in place and followed.
- Azure Cloud Management
  - Collaborate with DevOps to deploy, configure, and manage Azure virtual machines and resources.
  - Monitor cloud services for availability, performance, and security.
  - Work with the networking team to implement, monitor, and secure cloud networking infrastructure.
  - Ensure backup, disaster recovery, and business continuity plans are maintained for cloud systems.
- System Monitoring and Optimization
  - Deploy and maintain monitoring tools for proactive system oversight and alerting.
  - Analyze performance data to identify and resolve bottlenecks.
  - Conduct capacity planning to support scalability and meet business needs.
  - Partner with development teams to enhance application performance on infrastructure.
- Documentation and Collaboration
  - Create and update technical documentation, including system configurations and procedures.
  - Work with cross-functional teams to provide technical support and solutions.
  - Participate in on-call rotations and respond promptly to system emergencies.
  - Stay informed on industry trends, emerging technologies, and best practices in system administration, cloud computing, and virtualization.
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).
- Relevant certifications (e.g., Linux Professional Institute Certification (LPIC), Microsoft Certified: Azure Administrator Associate) are a plus.
Experience & Technical Skills
- Minimum of 8 years of experience in an enterprise IT environment, including at least 3 years in a DevOps or SRE role.
- Strong expertise in Ansible for automation and configuration management.
- Proficient in Linux system administration (installation, configuration, troubleshooting).
- Hands-on experience with hypervisor technologies (e.g., VMware, Hyper-V, Proxmox).
- Knowledge of containerization technologies (e.g., Docker, Kubernetes).
- Experience managing Azure cloud services, including VMs, storage, networking, and security.
- Proficiency in scripting languages (e.g., Bash, PowerShell, Python) for automation.
Skills & Competencies
- Excellent problem-solving skills and the ability to work independently or as part of a high-performing team.
- Strong sense of ownership over tasks, projects, and issues.
- Effective communication and interpersonal skills to collaborate with stakeholders at all levels.