Site Reliability Engineer II

IBM Computing

Austin (TX)

Remote

USD 90,000 - 150,000

Full time

Yesterday

Job summary

Join a forward-thinking company as a Reliability Engineer on the Boundary Product team, where you'll enhance customer experiences through innovative cloud solutions. This role focuses on driving service reliability, developing tools for metric visibility, and collaborating across teams to improve software performance. You'll be empowered to troubleshoot issues, implement reliable design patterns, and participate in a 24/7 on-call rotation. If you have a passion for developer productivity and a desire to make a difference in a dynamic environment, this opportunity is perfect for you.

Qualifications

5+ years of experience with production applications, especially in Golang.
Strong debugging skills for performance bottlenecks in live services.

Responsibilities

Develop tooling for service reliability and metric visibility.
Participate in incident management and advocate for best practices.

Skills

Golang

AWS

Incident Management

Database Systems

Clear Communication

Performance Debugging

Tools

PagerDuty

incident.io

AWS Aurora

Postgres

Nomad

Traefik

Introduction

A career in IBM Software means you'll be part of a team that transforms our customers' challenges into industry-leading solutions. We are an infinitely curious team, always seeking new possibilities, and dedicated to creating the world's leading AI-powered, cloud-native software solutions. Our renowned legacy creates endless global opportunities for our network of IBMers. We are a team of deep product experts, ensuring exceptional client experiences, with a focus on delivery, excellence, and obsession over customer outcomes. This position involves contributing to HashiCorp's offerings, now part of IBM, which empower organizations to automate and secure multi-cloud and hybrid environments. You will join a team managing the lifecycle of infrastructure and security, enhancing IBM's cloud solutions to ensure enterprises achieve efficiency, security, and scalability in their cloud journey.

Your role and responsibilities

HashiCorp Boundary aims to give customers seamless, just-in-time remote access to their infrastructure and other web applications without having to worry about passwords, certificates, or other credentials. Boundary is offered as a cloud platform, and this role is part of the Boundary Enterprise Enablement team, whose primary focus is scale and reliability to enable hypergrowth among medium and large enterprises.

What you’ll do (responsibilities)

As an engineer on the Boundary Product Reliability team, you will:

Develop a deep understanding of how customers use Boundary Cloud and enhance their experience through reliability

Drive service reliability by developing tooling that enables metric visibility using SLIs, SLOs, and SLAs

Champion incident management processes that directly impact customer experience

Reduce the operational overhead of the HashiCorp Boundary product and leverage data to understand the largest sources of reliability risk

Deploy, manage, and monitor a large-scale Boundary Cloud

Predict our future failures and work proactively to mitigate them

Have a passion for developer productivity to make other engineers' lives better

Empower engineers to troubleshoot their own issues by developing tools, frameworks, and guardrails for safety

Advocate for and implement reliable design patterns (circuit breakers, graceful degradation, zero-downtime upgrades, etc.)

Partner with the broader HashiCorp organization to learn from incidents through a blameless postmortem process

Collaborate across teams to improve our tools based on experience gained from running our own software in production

Participate in a 24/7 on-call rotation that supports our production services

This job can be performed from anywhere in the US

Required technical and professional expertise

5+ years of handling production applications at scale: backend applications written in Golang, databases, observability, and AWS primitives
Strive for quality through maintainable code and comprehensive testing from development to deployment
Clear communication skills while remaining empathetic and kind
An eagerness to learn through humility and reflection
Experience debugging performance bottlenecks for live services and database systems
Experience leading or participating in incidents using incident management tools such as incident.io or PagerDuty

Preferred technical and professional experience

Working knowledge of industry best practices related to information security
Working knowledge of AWS Aurora or Postgres, Nomad or other orchestration platforms, and Traefik or other load-balancing technologies
Experience or willingness to conceive, document and advocate for best practices

IBM is committed to creating a diverse environment and is proud to be an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, gender, gender identity or expression, sexual orientation, national origin, caste, genetics, pregnancy, disability, neurodivergence, age, veteran status, or other characteristics. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status.
