Major Incident Manager, Eng Support-Incident Management Team - USDS

TikTok

Mountain View (CA)

Hybrid

USD 80,000 - 140,000

Full time

30+ days ago

Job summary

Join a dynamic team at a leading platform where creativity thrives! As an Incident & Problem Manager, you will play a pivotal role in ensuring business continuity by effectively managing high-priority incidents. Collaborate across functions to enhance the reliability of services while driving process improvements. With a hybrid work model, you’ll enjoy flexibility while contributing to a mission that inspires creativity and brings joy to millions. Embrace the opportunity to learn, innovate, and grow in a vibrant environment that values your contributions and fosters teamwork.

Qualifications

  • 2+ years of experience in Incident Management with a focus on high-severity incidents.
  • Strong oral and written communication skills for diverse audiences.

Responsibilities

  • Lead resolution of critical incidents to minimize customer impact.
  • Monitor and report on incident management programs, driving process improvements.

Skills

Incident Management
Problem-Solving
Communication Skills
Technical Knowledge of Cloud Architecture
Troubleshooting Techniques

Education

Bachelor’s degree in Computer Science
Equivalent work experience

Tools

Grafana
Kubernetes

Job description

Responsibilities

About TikTok U.S. Data Security
TikTok is the leading destination for short-form mobile video. Our mission is to inspire creativity and bring joy. U.S. Data Security (“USDS”) is a subsidiary of TikTok in the U.S. This new, security-first division was created to bring heightened focus and governance to our data protection policies and content assurance protocols to keep U.S. users safe. Our focus is on providing oversight and protection of the TikTok platform and U.S. user data, so millions of Americans can continue turning to TikTok to learn something new, earn a living, express themselves creatively, or be entertained. The teams within USDS that deliver on this commitment daily span across Trust & Safety, Security & Privacy, Engineering, User & Product Ops, Corporate Functions and more.

Why Join Us

Creation is the core of TikTok's purpose. Our platform is built to help imaginations thrive. This is doubly true of the teams that make TikTok possible. Together, we inspire creativity and bring joy - a mission we all believe in and aim towards achieving every day. To us, every challenge, no matter how difficult, is an opportunity: to learn, to innovate, and to grow as one team. Status quo? Never. Courage? Always. At TikTok, we create together and grow together. That's how we drive impact - for ourselves, our company, and the communities we serve. Join us.

About the Team

USDS Tech and Product at TikTok provides core product platforms and services with leading infrastructure and applications. The Incident Management team plays a critical role in ensuring business continuity by addressing and mitigating high-priority incidents effectively. This role offers the opportunity to collaborate across functions to minimize impact, improve processes, and enhance the reliability of TikTok’s platforms and services.

About the Role

The Incident & Problem Manager will oversee the resolution of high-priority incidents, ensuring minimal disruption and swift resolution. This includes owning incident escalations, documenting processes, and collaborating with cross-functional teams to identify root causes and implement short-term and long-term solutions.

To enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.

Responsibilities
  1. Serve as a subject matter expert in incident management, leading the resolution of critical incidents to minimize customer/business impact.
  2. Partner with SRE teams and service owners to ensure timely resolution of high-severity incidents and create high-quality RCAs.
  3. Act as an escalation point for critical incidents and lead crisis response processes as required.
  4. Prioritize incidents based on customer and operational impact, ensuring optimal resource allocation for swift resolution.
  5. Monitor, evaluate, and report on incident management programs, identifying trends and areas for improvement.
  6. Drive process improvements to minimize incident frequency and severity while enhancing efficiency.
  7. Implement automated procedures to capture incident data consistently, supporting data-driven decision-making.
  8. Lead post-incident reviews with cross-functional teams, identifying actionable insights and process optimizations.
  9. Partner with senior leaders to facilitate incident management communications and project delivery.
  10. Generate communications tailored for technical and non-technical audiences, including customer-facing updates.
  11. Collaborate with cross-functional teams to ensure effective containment and remediation strategies.
  12. Work a Sunday-to-Thursday schedule, from 5 PM PT to 2 AM PT.
  13. Provide rotational on-call support (24x7x365) to ensure incidents are handled promptly and effectively.
  14. Stay updated on infrastructure dependencies and emerging technologies to proactively mitigate risks.

Qualifications

Minimum Qualifications:

  1. Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent work experience.
  2. 2+ years of experience in Incident Management, including leadership of high-severity incidents.
  3. Experience with monitoring solutions and applications such as Grafana.
  4. Technical knowledge of cloud architecture and design.
  5. Proficiency in troubleshooting techniques and problem-solving in a 24x7x365 environment.
  6. Strong oral and written communication skills, with the ability to effectively communicate with diverse audiences.
  7. Must be willing to be flexible with working hours depending on the needs of the business.

Preferred Qualifications:

  1. Proven ability to lead incident response calls confidently, driving toward resolution and minimizing downtime.
  2. Experience analyzing incident trends and operational metrics to inform prevention strategies.
  3. Expertise in microservices architecture and Linux environments, with foundational knowledge of Kubernetes.
  4. Demonstrated success in process improvement, including conducting root cause analyses and implementing efficient solutions.
  5. Strong interpersonal and influencing skills to collaborate effectively across teams without direct authority.
  6. Familiarity with leading investigations in a large-scale enterprise environment.

Candidates for this position must be legally authorized to work in the United States. This position is not eligible for visa sponsorship or support.

TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the
