About us
King's e-Research department supports cutting-edge computational and data-intensive research across all disciplines at the College. We provide high performance compute, private and public cloud infrastructure and trusted research environments as the core building blocks of modern, data-driven research. Alongside these infrastructure services, e-Research provides Research Software Engineering, Infrastructure Engineering and Data Governance expertise to individual research projects.
About the role
We are expanding our Principal Infrastructure Engineering team to support two exciting new projects (in addition to our existing services):
- Pharos AI, a £43m (£18.9m DSIT, £24m partners) grant to build an AI development platform to unlock the value of large multi-modal cancer datasets hosted in a pair of biobank secure data environments operated by Guy's and Thomas' and Bart's NHS Trusts. This includes extensive support from and collaboration with leading-edge industry partners, e.g. in AI precision medicine and drug discovery.
- The King's AI+ strategic investment recruiting 20 AI focused fellows and adding £2m of next-generation GPU capacity to King's Computational Research Engineering and Technology Environment (CREATE).
Work for these projects will be shared across a team of four principal infrastructure engineers, who will deliver AI/ML ops infrastructure services, scale out compute to national AI supercomputers and public cloud providers, and produce high-quality, portable, open-source infrastructure as code.
Our Principal Infrastructure Engineers work collaboratively with large amounts of technical freedom and decision making autonomy. We build almost exclusively with FOSS and wish to put more of our work back into the community over time. You can expect to work with the following technologies: Apache, Bacula, Ceph (CephFS, RADOSGW, RBD), Discourse, Flask, Git, GitLab, GLPI, Grafana, Laravel, Let's Encrypt, Linux (primarily Ubuntu), mkdocs, Nginx, OpenStack, OpenSSL, OpenSSH, OpenTofu, OpenOnDemand, OpenVPN, ProxMox, Python, Puppet, SLURM, Spack, Squid, Trivy, VSCode, Wireguard, ZFS.
To get a feel for our work to date, please take a look at our docs and GitHub, and watch this presentation at CIUK 2023.
This is a full-time (35 hours per week) position, offered on a fixed-term contract currently funded until 31/5/2027, with plans to convert it to a permanent post.
We also anticipate that a second, similar position will become available shortly, subject to funding approval. Candidates may be considered for this additional role if funding is confirmed.
About you
To be successful in this role, candidates should have the following skills and experience:
Essential criteria
1. Demonstrable ability to deploy and maintain large-scale compute, storage and/or networking infrastructure through code (e.g. Ansible, Puppet, Terraform, OpenTofu)
2. Demonstrable ability to develop software with experience as the primary developer of projects with a large modular codebase, ideally dealing with issues such as concurrency, caching and performance scaling
3. Demonstrable ability to diagnose network and operating system level issues with tools such as strace and tcpdump
4. Demonstrable ability to mentor and train more junior technical staff including review of software and infrastructure project code
5. Demonstrable ability to deploy and maintain public or private cloud infrastructure, and high performance compute clusters with experience of stability and storage engineering in relation to these
6. Demonstrable ability to deploy and maintain monitoring and metrics platforms at scale
7. Strong knowledge of security fundamentals and practical experience of securing Linux systems and related infrastructure
8. Proven ability to work with a high degree of autonomy within a high performing engineering team, fostering a culture of transparent collaboration and building technical consensus where necessary
Desirable criteria
1. Experience profiling and optimising AI/ML workloads
2. Experience deploying, configuring and maintaining Kubernetes clusters
3. Experience developing applications for and deployed onto Kubernetes clusters
4. Performance profiling of compute and/or IO intensive workloads
5. Ability to read, understand and troubleshoot open-source software written in C
Downloading a copy of our Job Description
Full details of the role and the skills, knowledge and experience required can be found in the Job Description document, provided at the bottom of the next page after you click "Apply Now". This document provides information on which criteria will be assessed at each stage of the recruitment process.
Further Information
Benefits:
- KCL Grade 8 £64,139 - £73,529
- Highly flexible, mostly remote working (typically between 1 and 8 days in the office per month depending on personal preference)
- 1 day every 2 weeks dedicated to personal development on relevant technology of your choosing
- Conference attendance (e.g. CERN storage week, FOSDEM Brussels, CIUK, AI UK)
- 35 hour week
- 30 days annual leave (plus Christmas closure)
We pride ourselves on being inclusive and welcoming. We embrace diversity and want everyone to feel that they belong and are connected to others in our community.
We are committed to working with our staff and unions on these and other issues, to continue to support our people and to develop a diverse and inclusive culture at King's.
We ask all candidates to submit a copy of their CV and a supporting statement detailing how they meet the essential criteria listed in the advert. If we receive a strong field of candidates, we may use the desirable criteria to choose our final shortlist, so please include your evidence against these where possible.
To find out how our managers will review your application, please take a look at our 'How we Recruit' pages.