
Lead SRE Engineer

Diverse Lynx

Princeton (NJ)

Remote

USD 130,000 - 160,000

Full time

3 days ago

Job summary

A technology solutions provider is seeking a Lead SRE Engineer for a remote position to drive the reliability and performance of its applications. The ideal candidate has more than 5 years of experience leading SRE work, expertise in cloud platforms, and strong skills in both automation technologies and team collaboration. Key responsibilities include designing robust monitoring systems, guiding incident response, and implementing effective change management processes.

Qualifications

  • 5+ years of experience leading SRE teams.
  • Hands-on experience with cloud platforms (AWS, Azure, GCP).
  • Experience with containers (Docker) and orchestration (Kubernetes).
  • Strong collaboration skills to work with cross-functional teams.

Responsibilities

  • Lead the end-to-end reliability and performance of applications.
  • Design and maintain monitoring and alerting systems.
  • Guide incident response efforts across application teams.
  • Define cloud strategies aligned with IT requirements.

Skills

SRE principles
.NET
Java
Microservices
Spring Boot
Angular
UNIX
C/C#
Monitoring tools
Cloud platforms
Kubernetes
CI/CD
Communication skills
Documentation skills

Tools

Dynatrace
Splunk
Elastic APM
Terraform
ServiceNow
Rally

Job description


Position: Lead SRE Engineer
Remote Position
Full-time with Wipro

Responsibilities:
- System Reliability and Performance: Lead and drive the end-to-end (Supply Chain) reliability, availability, and performance of applications in Digital Experience.
- Monitoring and Alerting: Help design, implement, and maintain robust monitoring and alerting systems to proactively identify and resolve issues.
- Capacity Planning: Help with capacity planning to ensure that systems can handle current and future workloads.
- Incident Response: Guide organization-level application teams in incident response efforts, ensuring quick and effective resolution of issues.
- Performance Tuning: Help teams gather and analyze metrics from application monitoring logs to support performance tuning and identify bottlenecks.
- Post-Incident Reviews: Help with post-incident (P1/P2) reviews to identify root causes and prevent future incidents.
- Security: Help application teams adopt industry-standard best practices for managing security certificates, secrets, and non-user IDs to avoid issues and outages.
- Change Management: Help application teams implement robust change management processes so that changes to the system are deployed safely and reliably.
- Peak Season (PS) Readiness: Help application teams prepare for peak season by ensuring overall E2E system resiliency and redundancy to handle expected peak usage volumes.
- War Room Playbooks: Help teams prepare playbooks covering war room scenarios.
- Auto Failover & Auto Scaling: Help application teams adopt effective auto failover and auto scaling strategies to maintain overall system resiliency.
- Collaboration with Developers: Work with application development teams to understand their needs, identify potential reliability issues, and improve the software development lifecycle.
- Cloud: Define and develop the cloud strategy for the enterprise, focusing on AWS and aligned with IT requirements.

Requirements:
- A solid understanding of SRE principles and at least 5 years of experience leading and guiding SRE engineers.
- Experience in .NET, Java, microservices, Spring Boot, Angular, UNIX, C, and C#.
- Experience leading SRE teams or projects.
- Monitoring and observability (APM) tools such as Dynatrace Cloud, Splunk, Elastic APM, Interlink, and Grafana.
- Hands-on experience with cloud platforms (e.g., AWS, Azure, GCP) and their services.
- Experience with containers (Docker) and container orchestration (Kubernetes).
- AI tools such as GitHub Copilot and Chat Playground.
- Incident management tools such as ServiceNow.
- Rally.
- Understanding of CI/CD, including GitHub Actions.
- Strong communication and collaboration skills to work effectively with cross-functional teams.
- Databases: MongoDB and MySQL.
- Proficiency in automation technologies and tools such as Terraform.
- Good documentation skills.

Outcomes:
- Increased system reliability and 99.999% availability.
- E2E (Supply chain) Resiliency and Redundancy.
- Improved Scalability.
- Faster Incident Resolution or Restoration.
- Continuous Improvement.

Diverse Lynx LLC is an Equal Employment Opportunity employer. All qualified applicants will receive due consideration for employment without any discrimination. All applicants will be evaluated solely on the basis of their ability, competence and their proven capability to perform the functions outlined in the corresponding role. We promote and support a diverse workforce across all levels in the company.