Technical Duty Officer / Incident Commander (SRE)

Xero

United States

Remote

USD 80,000 - 140,000

Full time

30+ days ago

Job summary

An established industry player is seeking experienced Site Reliability Engineers to enhance their incident management processes. This role is pivotal in driving best practices and building a world-class SRE culture. You will lead technical discussions, coordinate responses during critical outages, and develop scalable frameworks to improve service reliability. With a focus on customer satisfaction and continuous learning, you'll work closely with engineering teams to mitigate risks and enhance operational efficiency. Join a forward-thinking company that values your contributions and offers generous benefits to support your wellbeing and career development.

Benefits

Generous Paid Leave
Health Insurance
Life Insurance
Income Protection
Employee Assistance Program
Flexible Working
Career Development
Employee Share Plan
Wellbeing and Sports Programmes
26 Weeks Paid Parental Leave

Qualifications

  • Experience in Site Reliability Engineering or Operations environments.
  • Hands-on experience troubleshooting AWS hosted services.

Responsibilities

  • Own the incident management process ensuring reliability across products.
  • Lead during critical outages coordinating teams for quick resolution.

Skills

Site Reliability Engineering
AWS Troubleshooting
Networking Knowledge
TCP/IP Troubleshooting
Python Coding
Strong Communication Skills

Job description

Our Purpose

At Xero, we’re here to help you supercharge your business. We do this by automating routine tasks, surfacing actionable insights and connecting businesses with the right data, advisors and apps. When that happens, we’re not only making life better for small business, we’re building a stronger economy that can change the world.

About the team

Xero’s Incident and Problem Management team is part of the Site Reliability Engineering (SRE) organization and is responsible for building, delivering and maintaining robust processes and tooling for incident management.

The team drives enduring reliability at Xero through a robust, consistent and fast response to high-severity incidents. It is responsible for building a world-class process and ensuring that the process matures as the demands of the business grow.

About the roles

These positions require experienced SRE professionals with a strong technical background, a passion for building and delivering robust processes, and extensive experience leading the technical response to high-severity cloud issues.

They will drive best practice across the business and contribute to the ongoing transformation of the Xero SRE culture. As expert communicators, they will lead technical discussions to identify and track the actions that arise during incidents.

Across our SRE function, we're looking for people who are keen to dive deep into the causes of incidents and proactively examine the potential causes of future incidents, working with engineering teams to remove the risk of each failure scenario. Ultimately, they will build playbooks and automation to ensure quick and effective responses. They will also provide ongoing training across the business to ensure the process is well understood and followed.

These roles will form the backbone of a new team, providing a Technical Duty Officer (TDO) function within the business. TDOs are incident commanders who use SRE skillsets to drive fast mitigation and enduring resolution of impactful events.

What you'll do:

  • Own the incident management process, ensuring it drives enduring reliability across all products and services within Xero.
  • Provide expert leadership during critical outages, coordinating multiple teams to ensure streamlined decision-making and quick resolution.
  • Lead and advocate for the transformation to a world-leading SRE organization, promoting SRE principles within the Engineering Department.
  • Promote a customer-focused approach by addressing and mitigating global customer environment issues, and fostering a culture of continuous learning and technical excellence within the SRE team.
  • Develop and implement scalable process frameworks and observability strategies to ensure rapid problem diagnosis, response, and service reliability.
  • Collaborate with product teams to thoroughly analyze failures and integrate insights to improve service reliability, scalability, and operational efficiency.

What you'll bring:

  • Previous experience as a Site Reliability Engineer, or in an Operations or Engineering environment
  • Hands-on experience troubleshooting AWS-hosted services
  • Networking knowledge and the ability to troubleshoot TCP/IP, SSL/TLS, DNSSEC, IPsec, and BGP issues
  • Coding experience (preferably Python) building tools, scripts, or automation
  • Strong oral and written communication skills, including the ability to translate technical issues and concepts into agreed actions

Why Xero?

Xero offers very generous paid leave to use however you’d like (plus statutory holidays!), dedicated paid leave to care for your physical and mental wellbeing, and an Employee Assistance Program that gives you and your family access to mental health care. You also get health insurance, life insurance and income protection, wellbeing and sports programmes, employee resource groups, 26 weeks of paid parental leave for primary caregivers, an Employee Share Plan, beautiful offices, flexible working, and career development. With these and many other benefits that reflect our human value, you’ll do the best work of your life at Xero.
