Site Reliability Engineer II

PROS

United States

Remote

USD 80,000 - 120,000

Full time

30+ days ago

Job summary

An established industry player is seeking a Site Reliability Engineer II to join their dynamic team. In this pivotal role, you will monitor and enhance service performance while troubleshooting complex systems. You will collaborate with product teams to optimize reliability and scalability, leveraging your expertise in scripting and automation. Your contributions will directly impact the efficiency and stability of critical systems, making this an exciting opportunity for those passionate about technology and innovation. If you're ready to take on challenges in a supportive environment, this position is perfect for you.

Qualifications

Advanced scripting and automation skills for deployment and maintenance.
Proficiency in high-level programming languages like Ruby, Go, or Java.
Experience with monitoring tools like Prometheus and Grafana.

Responsibilities

Monitor service performance and troubleshoot complex systems.
Implement reliability enhancements and maintain documentation.
Collaborate with teams to resolve performance bottlenecks.

Skills

Operating Systems Knowledge

Networking

Database Management

Scripting and Automation

Ruby

Java

Monitoring and Alerting (Prometheus, Grafana)

Cloud Environment Optimization

RESTful API Design

API Testing Tools (Postman)

Communication Skills

Time Management

Crisis Management

Problem-Solving Skills

Teamwork

Innovation

IT Security Best Practices

Education

University Degree in Computer Science

Tools

Prometheus

Grafana

Postman

PROS Holdings, Inc. (NYSE: PRO) provides AI-powered solutions that optimize selling in the digital economy. PROS solutions make it possible for companies to price, configure, and sell their products and services in an omnichannel environment with speed, precision, and consistency. Our customers, who are leaders in their markets, benefit from decades of data science expertise infused into our industry solutions.

The Site Reliability Engineer II is a primary team member who administers, supports, and troubleshoots complex systems and services.

A Day in the Life of the Site Reliability Engineer II:

Monitor service performance, reliability metrics, and infrastructure stability.
Perform in-depth analysis of system performance and identify areas for improvement.
Participate in disaster recovery testing and implement reliability enhancements.
Define and maintain Service Level Objectives (SLOs) and related visualizations/alerts.
Collaborate with product teams to resolve performance bottlenecks.
Implement and maintain automated deployments and self-service tools.
Create and troubleshoot automation scripts for operational tasks.
Leverage automation to improve system scalability and efficiency.
Participate in follow-the-sun on-call rotations and respond to incidents promptly.
Troubleshoot and resolve production incidents, identifying root causes and creating detailed post-incident reports.
Work with development teams to address reliability and performance concerns.
Maintain and update documentation, including user stories and operational processes.
Share knowledge through team sessions and contribute to continuous improvement.
Implement automation for security auditing and vulnerability mitigation.
Collaborate with security teams to enhance cloud security posture.
Identify root causes of incidents and outages and participate in detailed post-incident analysis and documentation.

Required Qualifications - About you:

We are looking for candidates who possess the rare combination of the following achievements, skills, and behaviors.

Working knowledge of operating systems, networking and database management.
Advanced scripting and automation for deployment, scaling and maintenance tasks.
Proficiency in at least one high-level programming language (Ruby, Go, Java).
Knowledge of infrastructure and configuration management via automation.
Advanced skills in creating monitoring and alerting rules (Prometheus, Grafana).
Experience implementing and optimizing cloud environments.
Knowledge of RESTful API design and development.
Familiarity with API testing tools (e.g., Postman).
Excellent communication skills.
Excellent time management, organizational skills, crisis management and problem-solving skills.
Ability to work in a team and independently.
Willing to innovate, learn and share knowledge.
University degree in computer science or a related field.
Experience developing and implementing IT security best practices and procedures.
Excellent command of the English language.

It would be considered a plus:

Applicable IT Certifications.
System administrator experience.
Previous experience with cloud services, including open-source technology, software development, system engineering, scripting languages, and multi-cloud provider environments.

Skills & Personal Characteristics:

Ownership
Innovation
Care

Work Environment:

Most work activities are performed in an office or home-office environment and require little to moderate physical exertion. Work activities may involve periods of extended hours, critical deadlines, and stressful situations. To successfully complete the tasks of this position, individuals must be able to communicate clearly (in writing and orally), comprehend business terminology, and interpret numerical data.

This job description is intended to convey information essential to understanding the scope of the job and the general nature and level of work performed by job holders within this job. This job description is not intended to be an exhaustive list of qualifications, skills, efforts, duties, responsibilities or working conditions associated with the position.