As a Staff Engineer in Distributed Systems, you will be a hands-on expert driving the architecture, performance, and operational excellence of our core systems. You will apply deep technical expertise to our hardest distributed systems problems, ensuring our infrastructure is performant, reliable, and scalable. A systems-level programmer at heart, you are motivated by debugging and optimizing complex systems to deliver simple, high-impact results. You bring extensive hands-on experience with major cloud platforms and modern technologies.
Key Responsibilities
- Systems Design and Architecture: Design and implement the architecture for complex, large-scale distributed systems. You will focus on fundamental principles like scalability, high availability, and fault tolerance, and favor simple, straightforward solutions to complex problems.
- Distributed Systems Debugging: Debug and resolve intricate, obscure issues that span multiple nodes and network environments. You will lead debugging efforts for customer-facing issues and drive root-cause analysis.
- Performance Engineering and Optimization: Identify and eliminate system bottlenecks by profiling, optimizing, and tuning core components. Your work will directly influence critical system metrics like latency and throughput.
- Hands-On Development: Write and review high-quality, performant, and reliable software using low-level systems languages such as Go, C++, or Rust. Your code contributions will directly impact critical, high-volume production systems. Experience in Python is a plus.
- Cloud Platform Expertise: Design, develop, and operate systems deployed in modern cloud environments such as AWS, Google Cloud, or Azure, drawing on hands-on experience with these platforms. You will leverage their ecosystems and best practices to deliver resilient solutions.
- Operational Excellence: Embed a culture of operational rigor within the team by participating in on-call rotations, managing incident response, and conducting blameless post-mortems. You will build and expand monitoring, logging, and alerting systems to enhance system observability.
- Mentorship and Technical Leadership: Serve as a technical leader and mentor to other engineers, raising the overall skill level of the organization. You will contribute to the evolution of our engineering standards and processes.
Basic Qualifications
- 8+ years of hands-on software development experience, including several years focused on large-scale distributed systems.
- Deep experience with distributed systems design, including failure modes and performance optimization.
- Expert-level knowledge of at least one low-level systems language (e.g., Go, C++, or Rust).
- Proven experience debugging complex multi-service and network-related issues in distributed production environments.
- Strong hands-on experience designing, developing, and operating services on cloud platforms such as AWS, Google Cloud, or Azure.
- Experience with containerization and orchestration tools like Docker and Kubernetes.
- Excellent written and verbal communication skills.
Preferred Qualifications
- Deep expertise in a major technical area of distributed systems (e.g., consensus algorithms, distributed databases, large-scale data processing).
- Familiarity with cloud-native technologies, microservices architecture, and DevOps practices.
- Proven track record of defining and driving engineering initiatives that involve cross-team collaboration.