D.Engage is a leading SaaS company dedicated to delivering innovative solutions that drive digital engagement and enhance customer experiences. Our team is passionate about technology and committed to fostering an environment where talent can thrive and grow. We are currently looking for a SaaS Resilience Manager to join our technology team: someone who is agile, results-driven, customer-obsessed, and loves learning!
This position offers a valuable opportunity for an engineer to deepen their expertise and contribute to impactful projects.
Key Responsibilities:
- Resilience Planning and Strategy:
  - Participate in developing and implementing a comprehensive service resilience strategy for all SaaS products.
  - Design and maintain disaster recovery and business continuity plans.
  - Conduct regular risk assessments and impact analyses to identify vulnerabilities and mitigate risks.
- Ownership of Production Environment:
  - Own the production environment, including cloud and on-premises infrastructure.
  - Monitor production environments in collaboration with the VP of Development.
  - Work with the VP of Security to ensure the security of the production environment.
- Team Building and Improvement:
  - Build and lead a high-performing resilience team, continuously improving its quality.
  - Train and enhance the skills of technical support teams, including preparing training materials.
  - Provide feedback to teams on problem detection and troubleshooting steps (logging, monitoring, health checks).
- Service Monitoring and Incident Management:
  - Establish and manage robust monitoring systems to detect and respond to service disruptions promptly.
  - Lead incident response efforts, including root cause analysis, resolution, and post-incident reviews.
  - Develop and maintain incident response playbooks and procedures.
- Infrastructure and Performance Optimization:
  - Collaborate with IT and engineering teams to design resilient infrastructure and applications.
  - Implement redundancy, failover, and load balancing strategies to ensure high availability.
  - Continuously monitor and optimize system performance, capacity, and scalability.
- Collaboration and Communication:
  - Assist product and development teams with analysis when necessary.
  - Analyze large-scale bugs, hand them off to the relevant teams, and coordinate their resolution.
  - Troubleshoot server problems with teams when necessary.
  - Provide regular updates on service resilience status, metrics, and improvements to stakeholders.
  - Fix small-scale bugs (minimum 3 years of coding experience required).
- Compliance and Documentation:
  - Ensure compliance with relevant industry standards and regulations.
  - Maintain comprehensive documentation of resilience strategies, processes, and incident responses.
  - Participate in audits and reviews as required.
Requirements:
- Bachelor's degree in Computer Science, Software Engineering, or a related field.
- Proficiency in the .NET framework.
- Strong knowledge of cloud platforms such as AWS and Azure, as well as on-premises servers.
- Familiarity with version control tools like Git.
- Experience with complex L3 queries and solutions related to server scalability.
- Interest and enthusiasm for technology processes.
- Collaborative skills and a team-oriented mindset.
- Accountability and commitment to the job.
- Willingness to learn and adapt to new technologies.
- Fast learning ability and problem-solving skills.
- Effective communication skills and analytical thinking.
D.Engage is an equal opportunity employer committed to diversity and creating an inclusive workplace.