Site Reliability Engineer (SRE / DevOps)
At Professional.me, the Site Reliability Engineer (SRE) will play a mission-critical role in scaling and securing the core infrastructure that powers our AI-driven hiring platform. As an internal hire, you’ll bring deep familiarity with our systems, products, and priorities, and you’ll now take ownership of reliability, performance, and cost efficiency across environments.
This is a senior-level position that blends software engineering with systems thinking. You’ll work closely with engineering, product, and data teams to architect and maintain the infrastructure that keeps Professional.me fast, secure, and reliable as we scale.
Key Responsibilities
- Architect, implement, and maintain highly available and scalable infrastructure on AWS, leveraging advanced services and best practices for security, reliability, and cost optimization.
- Manage, monitor, and tune databases including PostgreSQL, Redis, ClickHouse, and OpenSearch / Elasticsearch, ensuring optimal performance, data integrity, and high availability.
- Design, deploy, and maintain robust queueing and messaging systems such as Kafka and NATS, supporting high-throughput, low-latency distributed applications.
- Develop and maintain Infrastructure as Code (IaC) using Terraform, ensuring reproducibility, version control, and automated provisioning of cloud resources.
- Set up, configure, and optimize CI / CD pipelines using GitHub Actions, automating build, test, and deployment workflows for rapid and reliable software delivery.
- Create, manage, and enhance monitoring and observability solutions with Grafana, including the development of comprehensive dashboards and alerting systems for proactive incident response.
- Conduct regular cost analysis and optimization of cloud resources, identifying opportunities to reduce spend while maintaining performance and reliability.
- Collaborate closely with development, QA, and product teams to ensure seamless integration of reliability practices throughout the software lifecycle.
- Lead incident response, root cause analysis, and post-mortem processes, driving continuous improvement in system resilience and operational processes.
- Document infrastructure, processes, and best practices to ensure knowledge sharing and operational transparency across teams.
- Stay current with industry trends, emerging technologies, and best practices in SRE, DevOps, and cloud infrastructure.
Required Experience & Skills
- Extensive hands-on experience with AWS, including advanced services (EC2, RDS, S3, Lambda, VPC, IAM, CloudWatch, ECS / EKS, etc.), with a proven track record of architecting and managing large-scale cloud environments.
- Deep expertise in managing, tuning, and troubleshooting databases such as PostgreSQL, Redis, ClickHouse, and OpenSearch / Elasticsearch, including backup, replication, and disaster recovery strategies.
- Advanced proficiency in Infrastructure as Code using Terraform, including module development, state management, and integration with CI / CD workflows.
- Strong experience with queueing and messaging systems like Kafka and NATS, including setup, scaling, monitoring, and troubleshooting in production environments.
- Demonstrated ability to design, implement, and optimize CI / CD pipelines using GitHub Actions, with a focus on automation, reliability, and security.
- Expert-level skills in monitoring, observability, and alerting using Grafana, Prometheus, and related tools, including dashboard creation and metric analysis.
- Proven experience in cost optimization strategies for cloud infrastructure, including resource right-sizing, reserved instances, and usage monitoring.
- Solid scripting and automation skills in languages such as Python, Bash, or Go, enabling efficient operations and process automation.
- Strong understanding of networking, security best practices, and compliance requirements in cloud environments.
- Excellent problem-solving, analytical, and troubleshooting abilities, especially in high-pressure, production-critical situations.
- Effective communication and collaboration skills, with experience working in cross-functional teams and fast-paced startup environments as well as large organizations.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- 5+ years of experience in Site Reliability Engineering, DevOps, or related roles, with demonstrated impact in both startup and enterprise settings.
- Relevant certifications such as AWS Certified Solutions Architect, AWS Certified DevOps Engineer, or Terraform Associate are highly desirable.
- Experience with agile methodologies and modern software development practices.
- Familiarity with incident management frameworks and ITIL processes is a plus.
Tools & Technologies
- Infrastructure as Code: Terraform
- CI / CD: GitHub Actions, Jenkins, CircleCI
- Monitoring & Observability: Grafana, Prometheus, ELK Stack, CloudWatch
- Scripting: Python, Bash, Go
- Version Control: Git, GitHub
- Configuration Management: Ansible, Chef, or Puppet
This role offers the opportunity to shape and optimize mission-critical infrastructure in a dynamic, technology-driven environment. The SRE will have a direct impact on system reliability, scalability, and cost efficiency, while working with cutting-edge tools and collaborating with talented teams across the organization. Success in this position will be measured by improvements in uptime, deployment velocity, cost savings, and the overall resilience of the platform.
By applying to this position, you are granting us permission to process your CV and keep your profile on file for consideration for this and future opportunities.
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
IT Services and IT Consulting and Software Development