Job Description & Requirements
We are seeking an experienced DevOps Engineer to join our dynamic team in Singapore. The ideal candidate will be responsible for designing, implementing, and maintaining our cloud infrastructure, CI/CD pipelines, and monitoring systems to ensure high availability, scalability, and security of our services.
Required Technical Skills
- Core Programming & Frameworks: Proficiency in Python with hands-on experience in FastAPI or Flask frameworks
- Core Programming & Frameworks: Strong understanding of RESTful API development and microservices architecture
- Containerization & Orchestration: Extensive experience with Docker and Docker Compose
- Containerization & Orchestration: Container lifecycle management and optimization
- Google Cloud Platform (GCP) Infrastructure: Automated backup solutions using Google Cloud Storage (GCS)
- Google Cloud Platform (GCP) Infrastructure: Google Container Registry for private Docker repositories
- Google Cloud Platform (GCP) Infrastructure: Google Artifact Registry for private PyPI package management
- Google Cloud Platform (GCP) Infrastructure: Cloud IAM for access control and permission management
- Google Cloud Platform (GCP) Infrastructure: VPC Firewall Rules configuration and network security
- Google Cloud Platform (GCP) Infrastructure: Cloud Load Balancing and Cloud CDN setup
- MySQL/PostgreSQL Database Administration: Database performance monitoring and query optimization
- MySQL/PostgreSQL Database Administration: Disk space management and storage optimization
- MySQL/PostgreSQL Database Administration: Automated backup strategies and data archival/purging policies
- MySQL/PostgreSQL Database Administration: Database maintenance and cleanup procedures
- Logging & Observability: Implementation of centralized logging solutions (ELK Stack, Cloud Logging)
- Logging & Observability: Log aggregation, parsing, and visualization using tools like Grafana, Kibana, or Cloud Monitoring
- Logging & Observability: Structured logging best practices
- Monitoring & Alerting: Performance monitoring setup using Prometheus, Grafana, or Cloud Monitoring
- Monitoring & Alerting: Integration of Application Performance Monitoring (APM) tools
- Monitoring & Alerting: Alert configuration and incident response automation
- Monitoring & Alerting: SLA/SLO monitoring and reporting
- Message Queue Management: Apache Kafka cluster setup, configuration, and maintenance
- Message Queue Management: Topic management, partition optimization, and consumer group monitoring
- Message Queue Management: Kafka Connect and Schema Registry management
- CI/CD Pipeline: Design and implementation of automated deployment pipelines
- CI/CD Pipeline: CI/CD integration with GitHub (e.g., GitHub Actions) and pipeline optimization
- CI/CD Pipeline: Automated testing integration and deployment strategies
- CI/CD Pipeline: Blue-green and canary deployment patterns
- Network & Security: Cloudflare Tunnel (cloudflared) configuration and management
- Network & Security: SSH tunneling and secure remote access solutions
- Network & Security: Nginx web server configuration and optimization
- Network & Security: systemd service management and daemon configuration
- Network & Security: Network troubleshooting and performance optimization
Preferred Additional Skills
- Container Orchestration: Kubernetes (K8s) deployment and cluster management experience
- Container Orchestration: Helm chart creation and management
- Container Orchestration: Pod autoscaling and resource optimization
- Task Management & Automation: Celery (Python) for distributed task queue management
- Task Management & Automation: Cron job centralization and monitoring
- Task Management & Automation: Experience with workflow orchestration tools
- Data Analysis & Documentation: Jupyter Notebook and pandas for data analysis and reporting
- Data Analysis & Documentation: Data pipeline automation and ETL processes
- Knowledge Management: Experience creating and maintaining internal wikis or runbooks
- Knowledge Management: Technical documentation and knowledge sharing platforms
- Knowledge Management: GitBook, Confluence, or similar documentation tools
- Additional Technical Areas: Redis caching solutions
- Additional Technical Areas: Cost optimization and resource management in cloud environments
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field
- 3+ years of experience in DevOps, Site Reliability Engineering, or related roles
- Strong problem-solving skills and ability to work in a fast-paced environment
- Excellent communication skills
- Experience working in Agile/Scrum development environments