Site Reliability Engineer

W. R. Berkley Corporation

Wilmington (DE)

On-site

USD 90,000 - 130,000

Full time

30+ days ago

Job summary

An established industry player is seeking a Site Reliability Engineer (SRE) to enhance the reliability and performance of their software systems. This dynamic role involves collaboration with cross-functional teams to implement best practices in cloud and on-premises environments. You will utilize your expertise in scripting languages and observability tools to monitor system health and automate processes, ensuring seamless operations. Join a forward-thinking company that values innovation and offers opportunities for career growth in a supportive environment. If you're passionate about technology and eager to make an impact, this is the perfect opportunity for you.

Qualifications

5+ years of IT experience in Development, Operations, and Infrastructure support.
Strong expertise in observability tools and logging architectures.
Experience with cloud computing principles and hybrid resiliency solutions.

Responsibilities

Ensure reliability, scalability, and performance of software systems.
Implement monitoring and alerting systems for proactive health checks.
Collaborate with teams to enhance reliability and disaster recovery.

Skills

Python

Bash

JavaScript

Shell Scripting

Problem-Solving

Communication

Education

Bachelor's degree in Computer Science

Equivalent experience

Tools

Dynatrace

Datadog

ELK Stack

GitHub Actions

Terraform

Ansible

Chef

Puppet

Kubernetes

Helm

Prometheus

Company Details

Company URL: https://www.berkleytechnologyservices.com/

Berkley Technology Services (BTS) is the dynamic technology solution for W. R. Berkley Corporation, a Fortune 500 Commercial Lines Insurance Company. With key locations in Urbandale, IA and Wilmington, DE, BTS provides innovative and customer-focused IT solutions to the majority of WRBC’s 60+ operating units across the globe. BTS’s wide reach ensures that ideas and opinions are considered at every level of the organization to guarantee we find the best solutions possible.

Driven by a commitment to collaboration, BTS acts as a consultant to our customers and Operating Units, providing comprehensive solutions that not only address the challenge at hand but also proactively plan for the “What’s Next” in our industry and beyond.

With a culture centered on innovation and entrepreneurial spirit, BTS stands as a community of technology leaders with eyes toward the future -- leaders who genuinely care about growing not only their team members but also themselves, and who take pride in employees who shine. BTS offers endless ways to get involved and the chance to grow your career into a wide range of roles you never knew existed. Come join us as we push forward into the future of the industry’s leading technological solutions.

Berkley Technology Services: Right Team, Right Technology, Simple and Secure.

Responsibilities

As a Site Reliability Engineer (SRE), you will play a crucial role in ensuring the reliability, scalability, and performance of our software systems. Collaborating closely with cross-functional teams, you will set and enforce SRE best practices, ensuring the scalability, reliability, and security of our cloud and on-premises environments. This technically broad role requires a strong understanding of the entire technology stack (network, storage, OS, virtualization, database, development, applications) to observe, monitor, troubleshoot, and automate activities within the Berkley environment.

Define and Track OKRs: Establish and monitor reliability and observability OKRs, including Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Monitoring and Alerting: Implement robust monitoring and alerting systems to proactively monitor health, identify potential issues, analyze system performance, and facilitate quick incident response.
AIOps Implementation: Enable auto-response, self-healing, and anomaly trend analysis through AIOps functionality.
Automation Solutions: Develop and implement automation solutions to eliminate “toil,” streamline processes, reduce manual interventions, and enhance overall efficiency.
Performance Optimization: Identify and address performance bottlenecks in applications and infrastructure to improve efficiency and user experience.
Incident Management: Work closely with incident management to quickly resolve system outages or performance issues, minimizing downtime and user impact.
Collaboration: Collaborate actively with development and operations teams to implement observability and resiliency requirements for smooth software deployment and operation.
Reliability Improvement: Enhance reliability by identifying and addressing gaps in our architecture, services, and tooling.
Disaster Recovery: Modernize disaster recovery programs for both on-premises and cloud-based Berkley solutions.

Qualifications

Experience: 5+ years of IT experience in Development, Operations, and Infrastructure support; 3+ years in Site Reliability Engineering and DevOps.
Scripting Languages: Proficiency in Python, Go, Bash, JavaScript, and Shell Scripting.
Observability Tools: Strong expertise in Dynatrace, Datadog, ELK Stack.
Logging and Monitoring: Practical expertise in creating and implementing logging and monitoring architectures.
Resiliency Solutions: Expertise in designing and implementing on-premises, cloud, and hybrid resiliency solutions (HA, AA, AP), disaster recovery, and business continuity planning.
Cloud Computing: Deep understanding of cloud computing principles (IaaS, PaaS, SaaS).
Kubernetes: Experience with Kubernetes and auto-scaling tools, including Helm and Prometheus.
GitOps and CI/CD: Proficient in leveraging GitOps with containerization technologies and CI/CD pipelines.
Automation Tools: Experience with infrastructure automation and configuration management tools (GitHub Actions, Terraform, Ansible, Chef, Puppet).
Security Best Practices: Solid understanding of security best practices in on-premises, cloud, and hybrid environments.
Industry Standards: Understanding of industry-standard security frameworks and ability to interpret them for Berkley environments.
Problem-Solving: Excellent problem-solving skills and ability to troubleshoot complex issues in a distributed hybrid environment.
Communication: Strong communication skills to collaborate effectively with cross-functional teams and convey technical concepts to non-technical stakeholders.
Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).

The Company is an equal employment opportunity employer.
