As a Senior Site Reliability Engineer, you will play a critical role in architecting and maintaining the infrastructure that powers our AI-driven SaaS and private cloud offerings. You will collaborate with cross-functional teams to implement best-in-class reliability practices, optimize system performance, and ensure seamless operations for our enterprise clients.
Key Responsibilities
- Design and Build Infrastructure: Architect and implement scalable, secure, and highly available cloud infrastructure to support our SaaS and private cloud platforms.
- System Reliability: Develop and maintain the monitoring, alerting, and incident response systems needed to sustain 99.99% uptime.
- Automation: Drive automation of infrastructure provisioning, configuration management, and deployment pipelines to improve efficiency and reduce human error.
- Performance Optimization: Identify and resolve performance bottlenecks in distributed systems, ensuring low-latency and high-throughput operations.
- Security and Compliance: Implement security best practices and ensure compliance with industry standards (e.g., SOC 2, GDPR, HIPAA) for our private cloud deployments.
- Incident Management: Lead incident response, root cause analysis, and post-mortem processes to prevent recurrence and improve system resilience.
- Collaboration: Work closely with software engineering, data science, and product teams to align infrastructure capabilities with business needs.
- Capacity Planning: Forecast resource requirements and plan for scalable growth to meet increasing customer demand.
- Documentation: Maintain clear and comprehensive documentation of infrastructure designs, processes, and operational procedures.
Your skills and experience
- Experience: 5+ years of experience in site reliability engineering, DevOps, or a related field, with a focus on cloud-based systems.
- Technical Skills:
  - Expertise in cloud platforms (e.g., AWS, Azure, GCP) and infrastructure-as-code tools (e.g., Terraform, Ansible, CloudFormation).
  - Proficiency in containerization and orchestration (e.g., Docker, Kubernetes).
  - Strong scripting and programming skills (e.g., Python, Go, Bash).
  - Experience with CI/CD pipelines and tools (e.g., Jenkins, GitLab CI, ArgoCD).
  - Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack).
  - Knowledge of the OSI model.
  - Familiarity with networking, security, and database systems (e.g., SQL, NoSQL).
- Problem-Solving: Proven ability to troubleshoot complex, distributed systems and resolve issues under pressure.
- Communication: Excellent verbal and written communication skills, with the ability to collaborate effectively with technical and non-technical stakeholders.
- Education: Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Nice-to-Have:
  - Experience in AI/ML infrastructure or high-performance computing.
  - Certifications in cloud platforms or SRE-related disciplines (e.g., AWS Certified DevOps Engineer, Google SRE).
  - Familiarity with private cloud deployments and hybrid infrastructure.
Why you'll love working here
Impactful Work:
- Be a key contributor to a fast-growing AI company transforming the B2B SaaS landscape.
- Take part in high-impact projects with opportunities to quickly develop your skills, lead teams, and grow your career globally.
- See your contributions recognized with both professional advancement and strong financial rewards.
Collaborative Culture:
- Join a passionate, innovative, and young team that values diversity, creativity, and open communication.
- Experience a dynamic, democratic work environment with regular team building, sports events, and company trips that strengthen bonds and make work more enjoyable.
Competitive Compensation:
- Enjoy an attractive, negotiable salary based on your experience and capabilities, along with equity options.
- Benefit from a comprehensive package including social, health, and unemployment insurance, aligned with FPT Corporation’s standards.
Professional Growth:
- Access continuous learning opportunities, including AWS training and certification programs.
- Take initiative, gain leadership experience, and explore international career development pathways.
Flexible Work:
- Work from anywhere with a remote-friendly policy and flexible hours designed to support your productivity and work-life balance.