Site Reliability Engineer, AI/ML Platforms

Adobe Systems GmbH

San Jose (CA)

On-site

USD 133,000 - 242,000

Full time

Yesterday

Job summary

A leading company in digital experiences is seeking a Site Reliability Engineer for Adobe's AI Training and Inference Platforms. Join a dynamic team working on enhancing the reliability and performance of Generative AI solutions, where your expertise in automation and scaling systems will drive impactful changes and innovations in machine learning deployment.

Benefits

Equity awards
Short-term incentives
Career development opportunities

Qualifications

  • 5+ years relevant industry experience.
  • Experience in building and scaling distributed systems.
  • Production level expertise with container orchestration engines.

Responsibilities

  • Identify and implement methodologies to increase reliability and efficiency.
  • Ensure high uptime and Quality of Service (QoS).
  • Support and maintain globally distributed, multi-cloud environments.

Skills

Reliability Engineering
Security
Scalability
Efficiency
Automation

Education

Bachelor's or Master's degree in Computer Science, Electrical Engineering, or related field

Tools

Kubernetes
Ansible
Terraform
Prometheus
Elastic Stack

Job description

Our Company

Changing the world through digital experiences is what Adobe’s all about. We give everyone—from emerging artists to global brands—everything they need to design and deliver exceptional digital experiences! We’re passionate about empowering people to create beautiful and powerful images, videos, and apps, and transform how companies interact with customers across every screen.

We’re on a mission to hire the very best and are committed to creating exceptional employee experiences where everyone is respected and has access to equal opportunity. We realize that new ideas can come from everywhere in the organization, and we know the next big idea could be yours!

The Opportunity


We're looking for an outstanding Site Reliability Engineer for Adobe’s AI Training and Inference Platforms within Adobe Firefly. You will be part of a team of Site Reliability Engineers working closely with the Engineering teams on building, scaling, and securing the AI Platform. This enables the Firefly product teams to easily manage and deploy Machine Learning capabilities used by Adobe client applications.

The Applied Research groups from Adobe Research and other App Teams in Adobe will deploy thousands of models onto this platform in a variety of lifecycle stages (early research, development, productization, optimization, etc.). This platform will offer ML model training and serving at scale, with high cost efficiency, and on a wide variety of hardware platforms across multiple clouds.


What You'll Do

  • Identify and implement methodologies and solutions to increase reliability, scalability, security, and efficiency.
  • Ensure the highest uptime and Quality of Service (QoS) for Adobe’s customers through operational excellence.
  • Define service level objectives (SLOs) and indicators (SLIs) to represent and measure service quality.
  • Support and maintain globally distributed, multi-cloud (public and/or private) environments.
  • Automate common, repeatable tasks at a large scale to streamline operational procedures.
  • Identify areas to improve service resiliency through techniques such as chaos engineering and performance/load testing.
  • Coordinate with other Adobe platform teams and service providers (primarily AWS) to innovate on Generative AI as a Service.


What You’ll Need to Succeed

  • A Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field, and 5+ years of relevant industry experience.
  • You excel in undefined environments and get excited about finding pragmatic solutions to complex technical or organizational challenges.
  • You keep up with industry trends and grow your knowledge and skills to solve technical problems.
  • Experience in building and scaling distributed systems, as well as experience with containerization and orchestration technologies like Kubernetes.
  • Production-level expertise with container orchestration engines (e.g. Kubernetes) and a proven understanding of modern continuous development techniques and pipelines (IaC, CI/CD, ArgoCD, Git).
  • Fundamental programming skills, ideally with practical experience in one or more of the following languages: Python, Go.
  • Good knowledge of infrastructure configuration management tools like Ansible and Terraform.
  • Experience in using observability and tracing-related tools like InfluxDB, Prometheus, and Elastic Stack.
  • An understanding of AI/ML, including ML frameworks, public cloud, and commercial AI/ML solutions. Familiarity with PyTorch, SageMaker, Hugging Face, NVIDIA TensorRT, or OpenAI Triton is a plus.

#FireflyGenAI

Our compensation reflects the cost of labor across several U.S. geographic markets, and we pay differently based on those defined markets. The U.S. pay range for this position is $133,900 - $242,000 annually. Pay within this range varies by work location and may also depend on job-related knowledge, skills, and experience. Your recruiter can share more about the specific salary range for the job location during the hiring process.

At Adobe, starting salaries for sales roles are expressed as total target compensation (TTC = base + commission), and short-term incentives are in the form of sales commission plans. Starting salaries for non-sales roles are expressed as base salary, and short-term incentives are in the form of the Annual Incentive Plan (AIP).

In addition, certain roles may be eligible for long-term incentives in the form of a new hire equity award.

State-Specific Notices:

California:

Fair Chance Ordinances

Adobe will consider qualified applicants with arrest or conviction records for employment in accordance with state and local laws and “fair chance” ordinances.

Colorado:

Application Window Notice

There is no deadline to apply to this job posting because Adobe accepts applications for this role on an ongoing basis. The posting will remain open based on hiring needs and position availability.

Massachusetts:

Massachusetts Legal Notice

It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment. An employer who violates this law shall be subject to criminal penalties and civil liability.

Adobe is proud to be an Equal Employment Opportunity employer. We do not discriminate based on gender, race or color, ethnicity or national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, or any other applicable characteristics protected by law.

Adobe aims to make Adobe.com accessible to any and all users. If you have a disability or special need that requires accommodation to navigate our website or complete the application process, email accommodations@adobe.com or call (408) 536-3015.
