Cloud Infrastructure Site Reliability Engineer (SRE)

Intelliswift - An LTTS Company

Berkeley Heights (NJ)

On-site

USD 100,000 - 130,000

Full time

Today

Job summary

A technology services firm is looking for a Cloud Infrastructure Site Reliability Engineer (SRE) in Berkeley Heights, NJ. You will manage cloud infrastructure and ensure its reliability and performance. The ideal candidate has 3+ years of experience in software development, strong cloud platform knowledge, and programming proficiency. The position emphasizes automation and collaboration across teams, and offers a dynamic work environment with competitive compensation.

Qualifications

3+ years of experience in software development.
Experience administering cloud platforms (AWS, GCP, Azure).
Deep understanding of observability tools.

Responsibilities

Design, build, and maintain scalable cloud infrastructure.
Develop automation for provisioning and monitoring.
Monitor system reliability and address issues proactively.

Skills

Cloud platform expertise (AWS, GCP, Azure)

Programming proficiency (Python, Go, Java, C++)

Linux systems knowledge

Problem-solving skills

Incident management

Education

Bachelor’s degree in Computer Science, Engineering, or related field

Tools

Terraform

CloudFormation

Ansible

Monitoring tools

Job Posting Title: Cloud Infrastructure Site Reliability Engineer (SRE)

Overview

Position Summary: As a Cloud Infrastructure Site Reliability Engineer (SRE) with expertise in multiple public cloud service provider platforms, you will be responsible for operating infrastructure solutions, following the principles and practices pioneered by Google’s SRE model. Your work will ensure our cloud services meet uptime, reliability, and performance targets, and you will drive automation and continuous improvement across our production environments. This role will involve collaborating with cross-functional teams to enhance our cloud reliability posture and streamline processes through automation.

Responsibilities

Design, build, and maintain highly available, scalable, and secure cloud infrastructure on platforms such as AWS, GCP, or Azure.
Develop and implement automation for provisioning, monitoring, scaling, and incident response using Infrastructure-as-Code tools (e.g., Terraform, CloudFormation, Ansible).
Monitor system reliability, capacity, and performance; proactively detect and address issues before they impact users.
Respond to production incidents, participate in on-call rotations, and lead post-incident reviews to drive root cause analysis and reliability improvements.
Collaborate with software engineering and security teams to ensure new services and features are production-ready and meet reliability standards.
Build and maintain tools for deployment, monitoring, and operations; automate manual processes to reduce toil.
Document operational processes and system architectures to ensure knowledge sharing and repeatability.
Continuously evaluate and implement new technologies to improve system reliability, security, and efficiency.

Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
3+ years of experience in software development with proficiency in at least one programming language (e.g., Python, Go, Java, C++).
Experience administering cloud platforms (AWS, GCP, Azure), including networking, security, containerization, storage, data management, and serverless technologies.
Solid understanding of Linux systems, networking fundamentals, virtualized and distributed systems, file systems, and system processes and configurations.
Deep understanding of observability (monitoring, alerting, and logging) tools in cloud environments. Ability to set up and maintain monitoring dashboards, alerts, and logs.
Familiarity with Continuous Integration/Continuous Deployment (CI/CD) tools for automated testing, deployments, provisioning, and observability.
Ability to manage and respond to incidents, perform root cause analysis, and conduct post-mortem reviews.
Understanding of setting, monitoring, and maintaining Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs) for system reliability.

Additional Qualifications

Experience working with enterprise-scale financial services or other regulated industries.
5+ years of experience in SRE, DevOps, infrastructure, or cloud engineering roles, preferably supporting large-scale, distributed systems.
Excellent problem-solving, troubleshooting, and communication skills.
Experience leading technical projects or mentoring junior engineers.
Certifications: Certified Engineer, DevOps, SRE, CSREF

Additional location and compensation details may be provided during the interview process.

Thanks for your interest in Intelliswift - An LTTS Company.
