Site Reliability Engineer

ZipRecruiter

Ulm

Remote

EUR 80,000 - 100,000

Full-time

Today

Summary

A leading company in semiconductor manufacturing is seeking an expert to enhance their distributed data and compute platform infrastructure. This role involves troubleshooting, automating, and improving system reliability, critical for producing next-generation microchips for major clients like Apple and Samsung. Candidates must have extensive experience in distributed systems, scripting, and CI/CD methodologies to foster innovation and customer satisfaction.

Qualifications

Practical experience with and knowledge of distributed computing systems is crucial.
Ability to work with build and release infrastructure (Maven, Nexus, etc.).
Linux expertise and scripting skills are mandatory.

Responsibilities

Create awareness about methods to prevent repetitive help requests.
Support feature and service requests from the field.
Make the VCP reliable by improving system resilience.

Skills

Distributed computing systems

Networking issues

Automatic testing

CI/CD pipeline

Education

Practical experience in distributed computing systems

Linux expertise

Experience with Ansible

Knowledge of Maven, Nexus, Bamboo, GitHub

Scripting knowledge (e.g., Python)

Job Description

Our client is one of the world’s leading manufacturers of semiconductor chip-making equipment. A majority of the world’s microchips receive their critical lithographic patterning in machines made by this organisation. In addition, they produce metrology tools and advanced applications to analyze and optimize the performance of the customer production process.

Job Mission

Troubleshoot short-term problems and translate them into structural improvements to our distributed data and compute platform infrastructure. Be accurate and precise, and help drive up the aggregate availability of the installations of these distributed computing systems in Korea, Taiwan, Israel, China, the US and elsewhere. Be part of the computing platform that is one of the main pillars supporting the production of next-generation microchips for Apple, Samsung and many others.

Responsibilities:

Create awareness in other teams about methods and procedures we use to help them to prevent repetitive help requests.
Help application developers to understand the infrastructure / cluster / system
“We are the team that is in charge of understanding & explaining how the system fits into the customer’s ecosystem”
Share knowledge / mindset with other teams (dev/infra engineers)
Work cross-functionally and share knowledge among infra engineers
Contribute towards building VCP as a Product which meets our standards of quality
Increase stability and reliability of VCP by automated testing and automation
Customer satisfaction and product reliability
Improve the functionality and reliability of VCP
Translate customer ecosystem needs to engineering deliverables
Find the broken pieces of the puzzle at system/cluster level
Combine individual ‘stories’ into a complete book
Make the VCP reliable by improving system resilience (bug-fixing and beyond)
Resolve bugs in a sustainable way (implement regression tests, design structural fixes)
Ambassador of predictable component lifecycle management
Technical roadmap maintenance (App life cycle management)
Support feature and service requests from the field
Suggest improvements to our technical solutions and way of working, and implement them in alignment with your team and their stakeholders

Highly valued qualifications & experiences:

Experience with DC/OS
Experience with new technology introduction @ zero downtime including data migration
Fan of automated testing and qualification, especially where it can be part of a CI/CD pipeline.
Affinity to dig deep into the details of networking issues
Available to work (remotely) outside regular office hours when it turns out that our attempt to build a fail-safe system has not yet succeeded. We really want this to be the exception, not the rule.

Required qualifications & experiences: