Job Description
A Mission-Oriented Opportunity
This role is with a team that builds software enabling data-driven decision-making and operational effectiveness at global scale. Their platforms help partners solve real-world problems—from forecasting supply chain disruptions to accelerating medical breakthroughs.
The Role
A team focused on mission-critical production infrastructure, spanning hundreds of Kubernetes clusters across on-premise environments from large data centers to edge devices, is seeking a Senior Infrastructure Engineer with deep expertise in Ceph. This individual will enhance the scale, reliability, and performance of ruggedized Kubernetes offerings operating under complex and novel constraints.
Ideal candidates are passionate about infrastructure at scale, adept in Ceph, and eager to contribute to the broader open-source ecosystem.
Key Responsibilities
- Manage Ceph at Scale: Design, deploy, and maintain Ceph storage solutions across a variety of hardware environments with an emphasis on high availability and performance.
- Automate Deployments: Create automation frameworks and tooling to manage large-scale Ceph deployments, minimizing manual effort and maximizing operational efficiency.
- Innovate and Contribute: Drive the integration of emerging tools and features from the Ceph and CNCF ecosystems, and contribute upstream to relevant open-source projects.
- Community Engagement: Actively participate in the Ceph developer and CNCF communities through collaboration, contribution, and knowledge sharing.
- Infrastructure Evolution: Partner with peers to architect and build scalable, secure, and resilient infrastructure for next-generation deployments.
Qualifications
- Ceph & Rook Mastery: Proven experience managing Ceph clusters in production environments, ideally via Rook.
- Automation Skills: Proficiency with tools like Terraform, Kubernetes Operators, and programming in Go, Java, or equivalent.
- Systems Programming Experience: Background in Go, Rust, or C/C++ for system-level development.
- Hardware & OS Knowledge: Strong familiarity with system hardware, Linux-based OS internals, and diagnostic tools.
- Networking Insight: Understanding of network architectures and experience with CNIs or cloud networking solutions.
- Data Center Experience: Hands-on experience managing on-premise hardware or serving as a sysadmin/Site Reliability Engineer in production environments.
Minimum Requirements
- 4+ years of software development focused on infrastructure and operational excellence
- 2+ years of system design experience, particularly in scaling and reliability
- 1+ year managing production-grade Ceph clusters
- Bachelor’s degree in Computer Science or equivalent experience