
Site Reliability Engineer

Insight Global

United States

On-site

USD 100,000 - 125,000

Full time

14 days ago

Job summary

A dynamic engineering team is seeking a Mid-to-Senior Level Site Reliability Engineer to ensure the reliability and scalability of mission-critical systems on Google Cloud Platform. In this role, you'll apply SRE principles and tools such as Datadog, PagerDuty, ChaosSearch, and HashiCorp Terraform to proactively identify and resolve issues, automate operational tasks, and improve infrastructure and deployment processes. You'll collaborate with high-performing teams in a fast-paced environment while continuing to build your skills. This is an opportunity to make a significant impact on the reliability and performance of critical applications.

Benefits

Medical insurance
Vision insurance
401(k)

Qualifications

  • 5+ years of experience in Site Reliability Engineering or DevOps.
  • Proficiency in at least one scripting language, such as Python or Bash.
  • Extensive experience with HashiCorp Terraform for infrastructure-as-code.

Responsibilities

  • Design and manage scalable infrastructure on GCP.
  • Develop monitoring and alerting solutions using Datadog.
  • Automate operational tasks using scripting (Python, Bash) and HashiCorp Terraform.

Skills

Site Reliability Engineering
Google Cloud Platform
Python
Bash
HashiCorp Terraform
Datadog
PagerDuty
Kubernetes
Docker
Linux

Education

Bachelor's degree in Computer Science, Engineering, or a related field

Tools

Google Cloud Spanner
GCP Cloud Logging
ChaosSearch

Job description

Base pay range

$65.00/hr - $72.00/hr

This range is provided by Insight Global. Your actual pay will be based on your skills and experience; talk with your recruiter to learn more.

Insight Global is seeking a talented and passionate Mid-to-Senior Level Site Reliability Engineer to join our dynamic engineering team. You will play a critical role in ensuring the reliability, scalability, and performance of our mission-critical systems and applications running on Google Cloud Platform. You will leverage your deep understanding of SRE principles and GCP services, along with tools like Datadog, PagerDuty, ChaosSearch, and HashiCorp Terraform, to proactively identify and resolve potential issues, automate operational tasks, and continuously improve our infrastructure and deployment processes. As the newest member of the SRE team, you'll have the opportunity to work alongside high performers in a small, fast-paced environment, applying your existing expertise while learning new skills.

Responsibilities:

  • Design, implement, and manage scalable and highly available infrastructure on GCP, utilizing services such as Compute Engine, Kubernetes Engine (GKE), Cloud Storage, BigQuery, and Spanner.
  • Develop and maintain comprehensive monitoring, alerting, and logging solutions using Datadog and GCP Cloud Logging to provide deep visibility into system health and performance.
  • Utilize PagerDuty for effective incident management, ensuring timely response and resolution of critical issues.
  • Proactively identify potential bottlenecks and failure points through capacity planning and performance testing, leveraging ChaosSearch for log analysis.
  • Automate repetitive operational tasks using scripting languages (e.g., Python, Bash) and infrastructure-as-code tools, primarily HashiCorp Terraform, within the GCP ecosystem.
  • Participate in incident response, root cause analysis, and post-mortem reviews to drive continuous improvement and prevent future occurrences.
  • Collaborate closely with development teams to ensure that new services and features are designed, deployed, and operated with reliability and scalability in mind on GCP, including our Spanner database.
  • Define and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure and track system reliability.
  • Contribute to the development and maintenance of CI/CD pipelines leveraging GCP services like Cloud Build and Artifact Registry.
  • Stay up-to-date with the latest GCP services and best practices, as well as advancements in Datadog, PagerDuty, ChaosSearch, and HashiCorp Terraform, and advocate for their adoption where appropriate.

Qualifications:

  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in a Site Reliability Engineering, DevOps, or similar role.
  • Significant hands-on experience designing, deploying, and managing applications and infrastructure on Google Cloud Platform, including experience with Google Cloud Spanner.
  • Strong understanding of core SRE principles and practices, such as toil reduction, automation, monitoring, and incident management.
  • Proficiency in at least one scripting language (e.g., Python, Bash).
  • Extensive experience with HashiCorp Terraform for infrastructure-as-code.
  • Experience with containerization and orchestration technologies, particularly Docker and Kubernetes (GKE preferred).
  • Proven experience with monitoring and logging tools, specifically Datadog and GCP Cloud Logging.
  • Experience with PagerDuty for incident management.
  • Experience with Linux operating systems and a solid understanding of command-line tools (e.g., terraform, kubectl, helm).
  • Excellent problem-solving and troubleshooting skills in complex distributed systems.
  • Strong communication and collaboration skills.

Preferred Qualifications:

  • Google Cloud certifications (e.g., Professional Cloud Architect, Professional Cloud DevOps Engineer).
  • Experience with database administration and optimization on Google Cloud Spanner.
  • Experience with networking concepts and GCP networking services (e.g., VPC, Load Balancing, Cloud DNS).
  • Experience with security best practices in a cloud environment, potentially with HashiCorp Vault.

Seniority level

Mid-Senior level

Employment type

Contract

Job function

Information Technology

Industries

IT Services and IT Consulting

Similar jobs

Site Reliability Engineer

Charter Global

Remote

USD 100,000 - 150,000

4 days ago

[Hiring] Senior Site Reliability Engineer @Wisp

Wisp

Remote

USD 120,000 - 150,000

5 days ago

Lead Site Reliability Engineer (Remote -CST)

Cognizant

Juneau

Remote

USD 81,000 - 142,000

Yesterday

Senior Site Reliability Engineer (US Shift)

AlphaSense

Remote

USD 120,000 - 160,000

Yesterday

Site Reliability Engineer (FULLY REMOTE)

Splunk

Nevada

Remote

USD 82,000 - 106,000

2 days ago

Site Reliability Engineer, Customer Security

Coalition, Inc.

Remote

USD 108,000 - 164,000

6 days ago

Senior Site Reliability Engineer (US Shift)

AlphaSense, Inc.

Mission

Remote

USD 120,000 - 160,000

2 days ago

Reliability Engineer

Jones Lang LaSalle Incorporated

Chicago

Remote

USD 100,000 - 120,000

7 days ago

Principal Systems Safety Engineer Avionics (REMOTE)

Collins Aerospace

South Carolina

Remote

USD 101,000 - 203,000

4 days ago