Major Incident Manager, Eng Support-Incident Management Team - USDS

TikTok

Los Angeles (CA)

On-site

USD 80,000 - 120,000

Full time

30+ days ago

Job summary

An innovative firm is seeking an Incident & Problem Manager to oversee critical incident resolutions and enhance data security protocols. This role is pivotal in ensuring minimal disruption during high-severity incidents, collaborating with cross-functional teams to implement effective solutions. The ideal candidate will bring experience in incident management, strong problem-solving skills, and a technical background in cloud architecture. Join a dynamic environment where your contributions will directly impact the safety and satisfaction of millions of users. If you are passionate about technology and thrive in a fast-paced setting, this opportunity is perfect for you.

Qualifications

  • 2+ years in Incident Management with a focus on high-severity incidents.
  • Strong communication skills to engage diverse audiences effectively.

Responsibilities

  • Lead resolution of critical incidents to minimize customer impact.
  • Drive process improvements and monitor incident management programs.

Skills

Incident Management
Problem-Solving
Communication Skills
Cloud Architecture
Monitoring Solutions
Technical Troubleshooting
Flexibility

Education

Bachelor’s degree in Computer Science
Equivalent work experience

Tools

Grafana
Kubernetes

Job description

Responsibilities

About TikTok U.S. Data Security
TikTok is the leading destination for short-form mobile video. Our mission is to inspire creativity and bring joy. U.S. Data Security (“USDS”) is a subsidiary of TikTok in the U.S. This new, security-first division was created to bring heightened focus and governance to our data protection policies and content assurance protocols to keep U.S. users safe. Our focus is on providing oversight and protection of the TikTok platform and U.S. user data, so millions of Americans can continue turning to TikTok to learn something new, earn a living, express themselves creatively, or be entertained.

About the Role
The Incident & Problem Manager will oversee the resolution of high-priority incidents, ensuring minimal disruption and swift resolution. This includes owning incident escalations, documenting processes, and collaborating with cross-functional teams to identify root causes and implement short-term and long-term solutions.

Responsibilities:

  1. Serve as a subject matter expert in incident management, leading the resolution of critical incidents to minimize customer/business impact.
  2. Partner with SRE teams and service owners to ensure timely resolution of high-severity incidents and create high-quality RCAs.
  3. Act as an escalation point for critical incidents and lead crisis response processes as required.
  4. Prioritize incidents based on customer and operational impact, ensuring optimal resource allocation for swift resolution.
  5. Monitor, evaluate, and report on incident management programs, identifying trends and areas for improvement.
  6. Drive process improvements to minimize incident frequency and severity while enhancing efficiency.
  7. Implement automated procedures to capture incident data consistently, supporting data-driven decision-making.
  8. Lead post-incident reviews with cross-functional teams, identifying actionable insights and process optimizations.
  9. Partner with senior leaders to facilitate incident management communications and project delivery.
  10. Generate communications tailored for technical and non-technical audiences, including customer-facing updates.
  11. Collaborate with cross-functional teams to ensure effective containment and remediation strategies.
  12. Work Sunday through Thursday, 5 PM to 2 AM PT.
  13. Provide rotational on-call support (24x7x365) to ensure incidents are handled promptly and effectively.
  14. Stay updated on infrastructure dependencies and emerging technologies to proactively mitigate risks.

Qualifications

Minimum Qualifications:

  1. Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent work experience.
  2. 2+ years of experience in Incident Management, including leadership of high-severity incidents.
  3. Experience with monitoring solutions and applications such as Grafana.
  4. Technical knowledge of cloud architecture and design.
  5. Proficiency in troubleshooting techniques and problem-solving in a 24x7x365 environment.
  6. Strong oral and written communication skills, with the ability to effectively communicate with diverse audiences.
  7. Willingness to work flexible hours depending on the needs of the business.

Preferred Qualifications:

  1. Proven ability to lead incident response calls confidently, driving toward resolution and minimizing downtime.
  2. Experience analyzing incident trends and operational metrics to inform prevention strategies.
  3. Expertise in microservices architecture and Linux environments, with foundational knowledge of Kubernetes.
  4. Demonstrated success in process improvement, including conducting root cause analyses and implementing efficient solutions.
  5. Strong interpersonal and influencing skills to collaborate effectively across teams without direct authority.
  6. Familiarity with leading investigations in a large-scale enterprise environment.

Candidates for this position must be legally authorized to work in the United States. This position is not eligible for visa sponsorship or support.
