Datacenter Observability and Site Reliability Engineer

ApTask

United States

Remote

Full time

2 days ago

Job summary

A leading company seeks a Datacenter Observability and Site Reliability Engineer to enhance their infrastructure solutions. The role involves designing observability systems, implementing SRE practices, and ensuring compliance with security standards. Candidates should have extensive experience in datacenter environments and proficiency in tools like Prometheus and Kubernetes. This position offers a competitive salary and the potential for remote work.

Qualifications

  • 8+ years of experience in datacenter observability and site reliability engineering.
  • Proven experience in managing large-scale datacenter environments.

Responsibilities

  • Design and maintain observability solutions for datacenter infrastructure.
  • Implement SRE best practices for reliability and scalability.
  • Develop automation scripts for infrastructure management.

Skills

  • Observability tools proficiency
  • SRE practices
  • Strong programming skills
  • Problem-solving skills
  • Excellent communication skills

Education

Bachelor's or Master's in Computer Science or related field

Tools

  • Prometheus
  • Grafana
  • Kubernetes
  • Docker
  • Terraform
  • AWS
  • Azure
  • GCP

Job description

Datacenter Observability and Site Reliability Engineer

This range is provided by ApTask. Your actual pay will be based on your skills and experience — talk with your recruiter to learn more.

Base pay range

$75.00/hr - $85.00/hr

Role: Datacenter Observability and Site Reliability Engineer

Location: Remote. Candidates located in Seattle or Mountain View who can work in the Korea time zone as needed are preferred. These roles may require travel to Seoul at some point for training and to meet the team there.

Skillset Description Summary: This team is responsible for the overall site reliability solution, including alerting, monitoring, and incident management for the hardware and Kubernetes infrastructure layers. The team includes L1, L2, and L3 support.

Roles and Responsibilities:

Observability and Monitoring:

  • Design, implement, and maintain observability solutions for datacenter infrastructure.
  • Develop, deploy, and maintain the operational and reliability components of a large-scale Observability and Telemetry collection platform, emphasizing performance at scale, real-time monitoring, logging, and alerting.
  • Participate in and enhance the entire lifecycle of services, from inception and design to deployment, operation, and refinement.
  • Develop and optimize monitoring systems to ensure high availability and performance.
  • Create and manage dashboards, alerts, and reports to provide visibility into system health and performance.

Site Reliability Engineering (SRE):

  • Implement SRE best practices to improve the reliability, scalability, and performance of datacenter services.
  • Develop and maintain automation scripts for infrastructure provisioning, monitoring, and management.
  • Conduct root cause analysis and post-mortem reviews to prevent recurrence of incidents.

Performance Optimization:

  • Analyze and optimize the performance of datacenter systems and applications.
  • Implement best practices for resource utilization and efficiency.
  • Work closely with other engineering teams to understand and meet their observability and reliability requirements.
  • Collaborate with hardware and software vendors to evaluate and integrate new technologies.

Security and Compliance:

  • Ensure that observability and reliability solutions comply with security policies and industry standards.
  • Implement and maintain security measures to protect data and infrastructure.

Troubleshooting and Support:

  • Provide support for observability and reliability-related issues, including debugging and resolving hardware and software problems.
  • Develop and maintain documentation for troubleshooting procedures and best practices.
  • Stay updated with the latest advancements in observability and SRE technologies and integrate them into the infrastructure.
  • Continuously improve the reliability, scalability, and performance of datacenter services.

Qualifications:

Education:

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

Experience:

  • 8+ years of experience in datacenter observability and site reliability engineering.
  • Proven experience in managing and optimizing large-scale datacenter environments.

Technical Skills:

  • Proficiency in observability tools and technologies (e.g., Prometheus, Grafana, ELK Stack).
  • Experience with SRE practices and tools (e.g., Kubernetes, Docker, Terraform).
  • Strong programming and scripting skills (e.g., Python, Go, Bash).
  • Familiarity with cloud platforms (AWS, Azure, GCP) and their observability and reliability services.

Soft Skills:

  • Strong problem-solving skills and attention to detail.
  • Excellent communication and collaboration skills.
  • Ability to work in a fast-paced, dynamic environment.

  • Seniority level: Mid-Senior level
  • Employment type: Contract
  • Job function: Information Technology
  • Industries: IT Services and IT Consulting
