Software Engineering LMTS

Salesforce, Inc.

Dublin

On-site

EUR 80,000 - 120,000

Full time

12 days ago

Job summary

A leading company in the tech sphere is seeking a Staff Site Reliability Engineer to enhance their large-scale distributed systems. The ideal candidate will possess extensive experience in software engineering, automation, and cloud technologies, playing a vital role in scaling and maintaining critical services.

Qualifications

  • 7+ years of experience in Python, Go, or Java for automation.
  • Hands-on experience with large-scale distributed systems.
  • Strong understanding of software engineering best practices.

Responsibilities

  • Support and scale multi-cloud, multi-region services.
  • Build automation and self-healing capabilities.
  • Improve CI/CD practices and drive reliability.

Skills

Python
Go
Java
API fundamentals
Linux systems
Cloud environments
Kubernetes
Microservices
Observability solutions
Design patterns

Job description

We're looking for a Staff Site Reliability Engineer to make a significant impact on our large-scale distributed systems. If you're an experienced and passionate individual who thrives in a challenging environment and possesses a strong background in software engineering best practices, automation, and cloud technologies, we encourage you to apply. You'll be instrumental in driving the reliability, scalability, and performance of our critical services.

Responsibilities

  • Support and scale multi-cloud, multi-region services.
  • Build automation and self-healing capabilities to reduce manual operations.
  • Operate and scale monitoring, alerting, and tracing systems for proactive detection.
  • Improve CI/CD practices to accelerate safe, frequent deployments.
  • Define and implement SLIs/SLOs with engineering teams, driving reliability into system architecture.
  • Collaborate on integrating AI-driven automation and observability to enhance reliability.
  • Work within Agile teams, participating in Scrum ceremonies and iterative delivery.
  • Lead post-incident analysis, conduct postmortems, and ensure effective root cause resolution.
  • Use data to uncover trends, inform prioritization, and drive platform improvements.

Required Skills

  • 7+ years of experience in Python, Go, or Java for automation, tooling, and integration.
  • Hands-on experience designing, building, and operating large-scale distributed systems, identifying shortcomings and optimization opportunities.
  • Demonstrated experience in developing and deploying production-grade software applications or services.
  • Proven ability to contribute directly to application codebase improvements for reliability and scalability.
  • Strong understanding of software engineering best practices, including design patterns, testing methodologies, and code reviews, applied in a production environment.
  • Excellent knowledge of Internet technologies and protocols (TCP/IP, DNS, HTTP, SSL, etc.).
  • Ability to locate and address sources of instability in high-traffic, large-scale distributed systems.
  • Strong experience with API fundamentals (SOAP, REST).
  • Experience in public cloud environments, Kubernetes, and modern container orchestration.
  • Knowledge of microservices, service mesh, and zero-trust infrastructure.
  • Solid knowledge of large-scale complex systems from a reliability and availability perspective.
  • Hands-on experience with large-scale SDLC pipelines.
  • Strong Linux systems knowledge and troubleshooting skills.
  • Experience in fault modeling and tolerance, chaos engineering, performance and load testing.


Desired Skills

  • Experience operating in global, multi-tenant, or compliance-sensitive environments.
  • Understanding of SRE principles: SLIs/SLOs, availability, resiliency, and incident metrics (TTD, TTR).
  • Data-driven mindset for identifying systemic issues and improving service reliability.
  • Experience designing and implementing observability solutions.
  • Strong written and verbal communication, with emphasis on documentation and knowledge sharing.
  • Experience building and integrating AI-driven automation and observability to enhance reliability.