Senior Site Reliability Engineer - DevOps

LogicMonitor

Greater London

On-site

GBP 60,000 - 100,000

Full time

30+ days ago

Job summary

An established industry player is seeking a Senior Site Reliability Engineer to lead operational excellence for their AI infrastructure. This role involves maintaining high availability, implementing resilient IT applications, and automating infrastructure processes. You'll work with cutting-edge technologies, including Docker, Kubernetes, and AWS, while enhancing system performance and security. Join a dynamic team dedicated to innovation and excellence, where your contributions will directly impact the company's success and drive the future of AI solutions. If you are passionate about DevOps and eager to tackle complex challenges, this opportunity is perfect for you.

Qualifications

5+ years as a DevOps Engineer or SRE with a focus on resilient IT applications.
Proficient in Python and experienced with Docker and Kubernetes.

Responsibilities

Maintain uptime of LogicMonitor's SaaS-based service and drive technical enhancements.
Design and implement new production deployments and ensure security.

Skills

DevOps Engineering

Site Reliability Engineering (SRE)

Linux System Administration

Networking Technologies

Infrastructure as Code (IaC)

Containerization (Docker/Kubernetes)

Amazon Web Services (AWS)

CI/CD Pipeline Design

Python Programming

Security Principles

Tools

Terraform

Prometheus

Senior Site Reliability Engineer - DevOps

Artificial Intelligence, London, UK

About Us:

We love going to work and think you should too. Our team is dedicated to trust, customer obsession, agility, and striving to be better every day. These values serve as the foundation of our culture, guiding our actions and driving us towards excellence.

This position is located in London, England. Our office is situated in a core location near Waterloo and Blackfriars on the Southbank.

What You'll Do:

This role takes the lead on the operational uptime and continued expansion of the LM Edwin AI infrastructure, serving as a facilitator of operational excellence. Responsibilities include designing and implementing new production deployments of SOA-based software across cloud datacentres, as well as providing guidance on organising, securing, and automating existing infrastructure and deployments.

Maintain uptime of LogicMonitor's (Edwin AI) SaaS-based service and drive technical/process enhancements to improve uptime.
Lead efforts to design and implement resilient IT applications using DevOps and SRE principles.
Deploy production applications and drive improvements to the deployment process.
Monitor system performance and troubleshoot issues to ensure high availability and reliability.
Design and deploy new application components.
Design and deploy new infrastructure components and integrations.
Ensure security of the production environment.
Develop and implement automated disaster recovery processes to minimise system downtime.
Identify opportunities for improvement in system performance, deployment speed, and scalability.
Write high-quality code to automate various aspects of infrastructure maintenance and deployment.
Support engineering and work closely with engineers to drive operational and architectural/design changes.
Own, manage, and execute multiple large and technically complex projects across teams.
Provide direct technical guidance to help team members achieve goals and improve their productivity.
Participate in the recruitment and hiring of new engineers.

What You'll Need:

5+ years as a DevOps Engineer or SRE, with experience designing and implementing resilient IT applications using DevOps and SRE principles.
Good understanding of Linux system administration, with 3+ years of hands-on experience.
Good understanding of networking technologies.
Experience building IaC automations using Terraform.
Production experience of containers and container orchestration tools (Docker/Kubernetes).
Good understanding of Amazon Web Services.
Experience of designing/implementing CI/CD pipelines including production deployments.
Experience building and working with logging and metrics solutions such as Prometheus.
Experience programming with RESTful web services.
Proficient Python developer.
Well-versed in security principles, both systems and network.
Excellent written and verbal communication skills, with a track record of improving documentation and processes.
Experience carrying out problem determination and root cause analysis across complex distributed systems.
