Role Summary
As Data Scientist, you will turn GPU, facility, and DC telemetry and operational data into predictive models, patterns, and insights that help Firmus AI Factory users optimize workload performance and energy consumption. You'll build anomaly detection, forecasting, and efficiency scoring that differentiate the platform. Your models and insights power the energy analytics that are a key competitive advantage for Firmus FactoryOS and drive customer cost optimization decisions.
Key Responsibilities
- Analyze GPU and facility time-series data: identify patterns and leading indicators of degradation, thermal stress, and power throttling.
- Build predictive models: forecast power demand, detect anomalies, predict resource contention, and recommend optimal batch sizes.
- Quantify energy consumption per workload: kWh or joules per training job, per-token energy for inference, and energy vs. performance curves.
- Build AI workload profiles that correlate energy consumption with AI work type and stage of work.
- Build energy efficiency scoring: rate jobs/clusters/tenants on efficiency (e.g., “this cluster runs at 40% MFU; optimal is 65%”).
- Implement anomaly detection models (Isolation Forest, autoencoders, statistical methods) for real‑time cluster monitoring.
- Implement event correlation: when anomalies are detected, correlate with telemetry events to suggest root causes.
- Create incident copilot features: anomaly detected → summarize relevant telemetry → suggest likely causes and actions.
- Build RAG evaluation metrics: retrieval accuracy (NDCG, MRR), reranking quality, end‑to‑end answer quality.
- Implement continuous monitoring for model drift; retrain models as patterns evolve.
- Productionize models into pipelines: batch prediction, real‑time scoring, metric updates.
Skills and Experience
- 5–7 years of data science experience focused on time‑series analysis, anomaly detection, or operational data.
- Proficiency with ARIMA, Prophet, state‑space models, autoencoders, or deep learning for time‑series forecasting.
- Strong statistical foundation: hypothesis testing, confidence intervals, uncertainty quantification.
- Expert Python/R: pandas, scikit‑learn, PyTorch/TensorFlow, Jupyter; can build end‑to‑end analysis and productionize models.
- Hands‑on data quality practices: handling missing data, sensor noise, and outliers; validating data before modelling.
- Experience with Prometheus, Grafana, or observability platforms for accessing operational metrics.
- Comfort with anomaly detection frameworks (Isolation Forest, LOF, autoencoders) and event correlation.
Key Competencies
- Time‑Series Mastery: deeply understands seasonality, trend, noise, stationarity, forecasting trade‑offs.
- Production Mindset: goes beyond Jupyter notebooks; thinks about model deployment, retraining, and monitoring in production.
- Communication: explains findings to both technical engineers and non‑technical operators/customers clearly.
- Rigor: validates models on hold‑out test sets; reports false‑positive rates, detection latency, uncertainty.
- Curiosity: asks “why” questions; doesn’t just fit models; understands the business impact.
Success Metrics
- Actionable detection (not noise): anomalies are detected quickly with acceptable false positives and strong operator confidence.
- Forecasting & planning accuracy improves: forecasts are accurate enough to inform capacity and energy planning decisions.
- Measured efficiency impact: insights drive reductions in waste and/or cost (GPU‑hours, energy‑per‑workload) where adopted.
- Telemetry trust & completeness stays high: data quality supports billing/ops/optimization decisions reliably.
- RCA acceleration via analytics: analytics shorten investigations for repeat incident classes and reduce time‑to‑identify‑likely‑cause.
Location & Reporting
- Singapore or Australia (Launceston, TAS or Sydney, NSW)
- Reporting to Head of AI & Applications
Employment Basis
Full‑time
Diversity
At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.
Join us in our mission to revolutionize the AI industry through sustainable practices and cutting‑edge engineering. Apply now to be part of shaping the future of sustainable AI infrastructure.