We’re looking for a Staff Operations Engineer – Ceph Storage to support our storage team in the Cloud Platform division. Our scale spans the globe, with transactions happening 24x7 across our data centers. Every second, millions of requests are evaluated across our exchange. To achieve our mission, global efficiency and reliability are crucial, as every millisecond counts in our business.
What We’re Looking For:
- Facilitator: Ability to relay information and ideas effectively within and across teams. While technical skills are vital, your ability to collaborate is equally important.
- Adaptable: Capable of keeping up with fast-paced industry changes and prioritizing tasks amid competing scopes and timelines.
- Technical: Strong foundation in Operations, with experience solving complex problems and building solutions, including CI/CD, real-time monitoring, and handling production issues.
- Rigorous: Experience designing and managing massive, globally distributed systems that handle billions of transactions daily. Your approach should be thorough, scalable, and reliable.
Here’s What You’ll Be Doing:
- Design, build, and operate a highly scalable, performant, and resilient storage layer on a global scale.
- Develop and maintain automation for logging, monitoring, and maintenance of the storage layer.
- Work with technologies such as Hadoop, Spark, Aerospike, and Kafka to enhance and optimize systems.
- Participate in complex security system designs and mentor junior team members.
- Take ownership of large projects and components as a senior contributor.
- Champion process and procedure improvements within the team and division.
- Influence the team’s direction, fostering accountability, trust, and goal focus.
- Promote company values internally and externally.
Here’s What You Need:
- Experience building, maintaining, and troubleshooting open-source distributed storage solutions like Ceph and storage orchestrators such as Rook, in an automated and large-scale environment.
- Experience with Infrastructure as Code (IaC) and configuration management tools like Salt, Ansible, Puppet, or Terraform.
- Experience with storage-level replication technologies.
- Strong skills in capacity planning, disaster recovery, and monitoring.