Manager, DevOps at Penguin Solutions
Overview
Penguin Solutions Managed Services provides dedicated, remote, Linux systems DevOps for complex, integrated environments involving high-performance computing, cloud, and enterprise systems. This position requires both technical skills, including the ability to understand, document, configure, administer, troubleshoot, and resolve issues in Linux environments, and the ability to manage people and processes. This is a customer-facing position.
Responsibilities
- Manage a team of skilled DevOps Engineers.
- Conduct reviews and staff analysis, and present business plans to meet current and future needs.
- Work in data center environments with software, hardware, and network components.
- Install, monitor, and maintain data center equipment.
- Build and maintain CI/CD pipelines.
- Integrate systems and platforms through infrastructure as code.
- Build automation workflows to enable lights-out operations.
- Collaborate with team members, provide IT support, and resolve errors.
- Stay updated on advancements in data center infrastructure and technologies.
- Document network processes with support from Sr. Onsite Hardware Technicians.
- Respond to network and server errors after hours.
- Participate in a weekly on-call rotation.
- Collaborate with customers to enable initiatives.
- Serve as a subject matter expert on HPC and related technologies.
Qualifications
- Bachelor’s Degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience).
- 6+ years of experience managing DevOps teams.
- 12+ years of hands-on experience with UNIX/Linux servers, CI/CD pipelines, and infrastructure as code.
- Must be a US Citizen.
Skills
- Strong leadership and mentorship skills.
- Excellent customer-facing skills.
- Ability to prioritize and deliver tasks on time.
- Strong Linux systems administration skills and experience with open-source technologies.
- Understanding of Linux networking protocols.
- Extensive experience with Ansible scripting (5+ years).
- Proficiency in Python (5+ years).
- Knowledge of Infrastructure as Code, CI/CD, and DevOps concepts.
- Ability to troubleshoot performance issues across the infrastructure stack.
- Experience with HPC/AI performance optimization and administration of HPC technologies.
- Ability to run benchmarks on large HPC clusters and optimize code (C, Fortran).
- Familiarity with CPU and GPU compilers (gcc, Intel, AMD, NVIDIA).
- Knowledge of HPC schedulers (SLURM, PBS, LSF).
- Effective communication skills with team members and clients.
Preferred Skills
- HPC systems management knowledge (e.g., Scyld ClusterWare).
- Broad tech knowledge in HPC, AI, Cloud, and Data Storage.
- Experience with Linux cluster technologies and optimization techniques.
- Linux Certifications (e.g., RHCSA, RHCE).
- Cloud Certifications (e.g., AWS, GCP).
- Ability to install, configure, and support software applications.
- Proactive in liaising with OEMs and vendors for application support.
- Excellent verbal, written, and interpersonal communication skills.
Location
Remote position within the United States.
Travel
10-25% required.
Compensation & Benefits
Base salary range: $148,000 - $175,000, with potential variation based on experience and skills. The package includes bonus eligibility, medical/dental/vision benefits, 401(k), PTO, life insurance, and an Employee Assistance Program.
Inclusion & Belonging
We are committed to creating an inclusive environment that embraces differences and fosters belonging for all.
Equal Opportunity
We are an Affirmative Action/Equal Opportunity Employer, providing equal opportunity without regard to age, race, gender, disability, veteran status, or other protected characteristics.