Senior Site Reliability Engineer

Xilis, Inc.

Durham (NC)

Remote

USD 120,000 - 150,000

Full time

Today

Job summary

A leading biotech company in Durham is seeking a Senior Site Reliability Engineer to enhance its cloud infrastructure for cancer research. The role involves optimizing AWS services, managing Kubernetes clusters, and ensuring system reliability. Candidates should have extensive experience in DevOps and a passion for innovation in precision oncology.

Benefits

Health Insurance
Vision Insurance
Dental Insurance
Retirement Plans
Unlimited PTO

Qualifications

  • 5+ years of experience in software engineering with a focus on DevOps and Site Reliability Engineering.
  • Expert-level knowledge of AWS services and architecture.

Responsibilities

  • Build and manage AWS cloud infrastructure with appropriate security controls.
  • Implement infrastructure as code using Terraform.
  • Set up monitoring, logging, and alerting systems.

Skills

AWS
Kubernetes
Terraform
Git
Agile Development
Troubleshooting

Tools

Prometheus
Grafana
CloudWatch
Argo
Metaflow

Job description

Xilis, Inc. is an innovation-driven biotech company developing its proprietary MicroOrganoSphere (“MOS”) Technology for functional precision oncology. Xilis’ MOS Technology enables rapid and scalable generation of patient tumor models that retain patient-specific tumor biology and tumor microenvironment, representing one of the most translationally-relevant ex vivo technologies for precision oncology drug discovery and development. Located in Research Triangle Park, Durham, NC, Xilis is building a functional precision medicine platform that incorporates scaled multi-modal profiling of therapeutic activity and AI/ML-enabled analytics to catalyze functional precision medicine drug discovery, development and diagnostics. Collectively, Xilis aims to harness its MOS Platform to enable development of the most effective therapeutics and guide them to the right patients at the right time.

Impact

As Senior Site Reliability Engineer, you'll power our breakthrough cancer research by delivering reliable, scalable, and cost-effective cloud infrastructure. Working in our fast-paced startup environment, you'll ensure our scientists and engineers can innovate without constraints, keeping our critical systems running smoothly while continuously improving our technology platform.


Main Objectives

  • Optimize our AWS cloud infrastructure for research applications, focusing on performance, reliability, and cost while managing infrastructure through code
  • Implement monitoring, alerting, and disaster recovery systems to ensure high availability
  • Collaborate with data science teams on ML pipelines and infrastructure
  • Drive technical excellence in cloud operations, MLOps, deployment pipelines, containerization, and security

Responsibilities

  • Build and manage our AWS cloud infrastructure (compute, storage, networking, databases) with appropriate security controls and IAM policies
  • Deploy and manage Kubernetes clusters utilizing tools like Karpenter, Helm, and Prometheus
  • Implement infrastructure as code using Terraform, with expertise in modules, state management, and custom providers
  • Set up monitoring, logging, and alerting systems using tools such as Prometheus, Grafana, and CloudWatch
  • Actively track and optimize cloud costs across all environments
  • Support MLOps infrastructure by maintaining data pipelines using tools such as Argo and Metaflow
  • Troubleshoot infrastructure issues and execute routine maintenance tasks including patching and backups
  • Document critical infrastructure configurations and operational procedures

Requirements

  • 5+ years of experience in software engineering with a focus on DevOps and Site Reliability Engineering
  • Expert-level knowledge of AWS services and architecture, plus strong coding skills
  • Deep expertise with Kubernetes ecosystem (Karpenter, Helm, Istio, etc.)
  • Advanced skills with Terraform and building modern CI/CD pipelines in an agile environment
  • Strong understanding of networking (VPCs, VPNs, Direct Connect) and security best practices
  • Fully proficient with Git and agile development practices
  • Exceptional troubleshooting skills and cost optimization experience
  • Demonstrated ability to lead projects and initiatives in a fast-paced environment

Bonus Points

  • Experience with MLOps infrastructure (Argo, Metaflow, Airflow)
  • Knowledge of serverless Kubernetes platforms (Knative, KubeVirt, OpenFaaS)
  • Strong Python backend engineering skills with ability to contribute to application development tasks
  • Prior experience in fast-moving startups
  • Domain experience in life sciences or biotechnology
  • Enterprise networking experience (Palo Alto firewalls, Juniper switches/APs)

Xilis was created when its three founders — an engineer, a physician, and a biologist — decided to come together and commercialize their technology to transform cancer care. We are committed to building a team that represents a variety of backgrounds, perspectives, and skills. We do not discriminate on the basis of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, or Veteran status. Furthermore, even if your work experience isn't perfectly aligned with what we've described above, if you're excited about what we're building then we want to talk to you!

Xilis offers comprehensive health, vision, dental & retirement plans, and unlimited PTO. We are a remote-friendly team: our headquarters are in Durham, North Carolina, but we have team members across the US (and beyond!).

