Remote Site Reliability Engineer

Insight Global

Orlando (FL)

Remote

USD 100,000 - 130,000

Full time

Job summary

A leading technology firm is seeking a Site Reliability Engineer to oversee the reliability and performance of its backend data platform. The ideal candidate will work extensively with Python and AWS services, monitoring custom data pipelines and using observability tools to safeguard data integrity. The employer is committed to fostering a diverse and inclusive workplace.

Qualifications

  • Strong proficiency in Python, especially in backend and infrastructure contexts.
  • Experience with AWS services, particularly Lambda, S3, and Kinesis.
  • Familiarity with monitoring tools such as Datadog.

Responsibilities

  • Monitor and maintain a custom data pipeline from ingestion to delivery.
  • Build and tune Datadog dashboards and alerts.
  • Investigate and resolve issues in the pipeline.

Skills

Python
AWS (Lambda, S3, Kinesis)
Datadog
Docker
Infrastructure as Code

Tools

AWS CloudWatch
Snowflake
Terraform

Job description

We are seeking a Site Reliability Engineer (SRE) to support the reliability, observability, and performance of our backend data platform. This platform ingests high-volume data from hotel systems, ticketing, and MagicBand readers, flowing through custom pipelines into our data warehouse. The ideal candidate will have a strong background in Python, cloud-native technologies, and observability tools, with a focus on ensuring data integrity and system reliability across multiple touchpoints.

Responsibilities:
  • Monitor and maintain a custom data pipeline from ingestion to delivery, ensuring data integrity and performance.
  • Instrument and observe systems built on cloud and serverless technologies, including:
    • AWS Lambda
    • Amazon S3
    • Amazon Kinesis
    • Snowflake
    • Docker containers on ECS
  • Migrate observability workflows from AWS CloudWatch to Datadog, centralizing metrics, dashboards, and alerts.
  • Build and tune Datadog dashboards and alerts to support SLAs and system health.
  • Graph and analyze metrics to ensure pipeline reliability and performance.
  • Investigate and resolve issues in the pipeline, ensuring expected behavior across all stages.
  • Work within the Python codebase (~20–40% of time) to:
    • Create coherent tickets for issues
    • Fix bugs and improve instrumentation
  • Perform click-ops tasks (~60–80% of time) in Datadog, including:
    • Dashboard creation and maintenance
    • Access request handling
    • Alert tuning and incident response

We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances.

Skills and Requirements
  • Strong proficiency in Python, especially in backend and infrastructure contexts.
  • Experience with AWS services (Lambda, S3, Kinesis, CloudWatch).
  • Familiarity with Datadog for monitoring, alerting, and dashboarding.
  • Understanding of data pipelines, data integrity, and observability best practices.
  • Experience with Docker and ECS in production environments.
  • Familiarity with infrastructure as code (e.g., Terraform, CloudFormation).
  • Exposure to SLAs, incident response, and data reliability engineering.
  • Experience with Snowflake and data warehouse integrations.