ENVIRONMENT :
A provider of cutting-edge Financial Tools in Joburg seeks the technical expertise of a Platform Engineer to manage Heroku pipelines, CI / CD, review apps, and production environments. You will also operate Celery workers and queues, monitor their health, and handle missed task check-ins; manage Cloudflare for DNS, edge security, and performance optimisation; and collaborate with Developers to streamline workflows and educate them on secure coding practices. The ideal candidate must have 3+ years’ experience operating production apps on Heroku, AWS, DigitalOcean, or similar; hands‑on experience with CI / CD pipelines using GitHub Actions, Heroku CI, or equivalent, backed by solid Git fundamentals; and experience in monitoring and incident response using Sentry, Papertrail (or similar), logs, and uptime / performance dashboards.
DUTIES :
Reliability & Operations -
- Own uptime, performance, and monitoring for all production applications.
- Manage Heroku pipelines, CI / CD, review apps, and production environments.
- Operate Celery workers and queues, monitor health, and handle missed task check‑ins.
- Define and track service level objectives (SLOs) for availability, latency, and task success rate.
- Maintain runbooks and a centralised incident response wiki, and lead post‑mortems.
- Run periodic disaster recovery drills and coordinate penetration tests.
Platform Engineering -
- Keep environments current (Heroku stacks, Postgres / Redis versions, DO / AWS base images).
- Manage daily backups and ensure that restore tests and disaster recovery runbooks are in place.
- Standardise infrastructure (Terraform or scripts for DO / AWS; app.json for Heroku).
- Manage Cloudflare for DNS, edge security, and performance optimisation.
- Tune performance (DB indices, query optimisation, cache usage, Celery queue design).
- Optimise infrastructure costs across Heroku, DigitalOcean, and AWS.
Developer Experience & CI / CD -
- Maintain CI pipelines with type checking, linting, and security scanning.
- Enforce test coverage and automate deploy checks (smoke tests, migration health, error budgets).
- Support Developers with tooling for local / staging environments and build self‑service dashboards (e.g., Celery queue status).
- Collaborate with Developers to streamline workflows and educate them on secure coding practices.
Security & Compliance -
- Own vulnerability management and the dependency patching cadence.
- Manage access reviews, secrets, MFA / SSO, and enforce least‑privilege IAM policies.
- Implement encryption for data at rest and in transit (e.g., S3 server‑side encryption).
- Contribute evidence and responses for security questionnaires and SOC 2 audits.
- Maintain a “security pack” with architecture, sub‑processors, and DR / backup processes.
Monitoring & Alerting -
- Configure Sentry ownership rules, Cron Monitors, and release health.
- Centralise metrics / logs (Heroku metrics, Papertrail, Sentry, APM, Prometheus / New Relic).
- Set up alerts on golden signals (latency, errors, traffic, saturation) while avoiding alert fatigue.
- Conduct capacity planning and track resource usage trends.
Vendor & External Services -
- Evaluate and manage vendor relationships (e.g., Mailgun, Twilio) to ensure that service level agreements (SLAs) and performance targets are met.
- Assess new tools / services to enhance platform capabilities (e.g., observability, security).
- Track costs, security posture, and integration quality for all third‑party services.
REQUIREMENTS : Must‑Haves -
- Cloud Infrastructure Management : 3+ years’ experience operating production apps on Heroku, AWS, DigitalOcean, or similar.
- CI / CD pipelines : Hands‑on experience with GitHub Actions, Heroku CI, or equivalent; solid Git fundamentals.
- Monitoring & incident response : Experience with Sentry, Papertrail (or similar), logs, and uptime / performance dashboards.
- Security Fundamentals : Understanding of IAM, encryption in transit / at rest, MFA / SSO, and secure configuration practices.
- Disaster recovery & backups : Experience implementing and operating automated backups, restore testing, and writing / maintaining incident runbooks.
- Communication & collaboration : Ability to document processes clearly and work closely with Developers in a small team.
Strong Plus -
- Infrastructure as Code & automation : Experience with Terraform, Docker, or equivalent tooling.
- Asynchronous workloads : Familiarity with Celery, Redis, or other task queues and message brokers.
- Scaling & cost optimisation : Capacity planning, performance tuning, and managing infrastructure spend.
- Compliance frameworks : Exposure to SOC 2, GDPR, or supporting client security questionnaires.
- Incident management : Participation in on‑call rotations, leading post‑mortems, or serving as incident commander.
Nice-to-Haves -
- Certifications (AWS Certified DevOps Engineer, CKS, or equivalent).
- Proficiency in Python; familiarity with Django / Flask.
- Experience with DNS / CDN / edge security (e.g., Cloudflare).
- Observability platforms (Prometheus, Grafana, New Relic).
- Static analysis and code quality tools (mypy, Bandit, SonarQube).
- Prior exposure to multi‑tenant SaaS environments.