Lead Data Acquisition Engineer – UK Commercial Energy Platform

APEXION

Remote

GBP 50,000 - 70,000

Full time

Job summary

A leading data solutions company seeks a Lead Data Acquisition Engineer to build data pipelines for a national commercial energy platform. This role involves designing scraping processes, managing large public datasets, and building an entity resolution pipeline to ensure accurate data modelling. Candidates should have experience with Python, SQL, and data scraping pipelines, along with a solid understanding of UK datasets. This position is remote and requires close collaboration with the founding team.

Qualifications

  • 3-6+ years as a Data Engineer or Data Acquisition Engineer.
  • Experience scraping large public datasets at scale.
  • Strong background in entity resolution and fuzzy matching.

Responsibilities

  • Own data acquisition and scraping.
  • Design and run scraping pipelines for public datasets.
  • Build entity resolution pipelines to normalize company names.

Skills

Strong Python
Strong SQL
Data scraping experience
Entity resolution
Basic geospatial knowledge

Tools

Scrapy
BeautifulSoup
Airflow

Job description

We’re building a national-scale data platform for UK commercial energy.

At the core is a unified view of every commercial building in the UK, and an estimate of annual energy consumption and load profile for each occupant.

We’ve already built the core spine (AddressBase, VOA, leases, CCOD / OCOD, INSPIRE, planning, NNDR, EPC, permits, renewables, Companies House). Now we need someone to own data acquisition and occupant modelling on top of this.

ROLE
Lead Data Acquisition Engineer – UK Commercial Energy Platform

Type: Full-time or long-term contract

Location: Remote (UK or European time zone preferred)

WHAT WE’VE BUILT SO FAR

Our current building / occupant spine includes:

  • OS AddressBase Core as the UPRN spine
  • VOA valuation and floor area data
  • Long leases
  • CCOD / OCOD + INSPIRE polygons
  • Planning application data and NNDR (where available)
  • EPC non-domestic data
  • Environment Agency & DEFRA permitting datasets
  • UK coverage of existing renewable projects
  • Companies House API linkage

Your job is to sit on top of this spine and turn it into something truly useful for per-occupant energy modelling.

THE PROBLEM YOU’RE SOLVING

For each of ~2 million UK commercial buildings we want to know:

  • Who the actual occupant(s) are
  • How they operate in detail
  • What that implies for energy use and load shape

A plastics manufacturer is not the same as a frozen food warehouse, an office, or a logistics hub.

We care about:

  • What they manufacture or do
  • What machinery they have on-site
  • What processes they run, and when they run them

This is not a one-off scrape. It’s a systematic, repeatable pipeline that touches millions of rows.

WHAT YOU WILL DO
  • Own data acquisition and scraping
  • Design and run scraping / ingestion pipelines for:
    • DNO and other network datasets
    • Government and regulator datasets
    • Company-level and facility-level data beyond Companies House
    • Public signals of operations: websites, “our plant” pages, datasheets, job ads, fleet pages, Google Maps / Street View, industry directories, etc.
  • Build robust scrapers at scale (see the first sketch after this list):
    • Parallelisation, retries, throttling, proxy management, error handling
    • Logging and monitoring so we know what ran, what failed, and why
  • Resolve who actually occupies each building:
    • Extend our NNDR-based approach and close the gaps: link buildings to occupants using NNDR, Companies House, planning and permitting data, web presence and other public sources
  • Build an entity resolution pipeline (see the second sketch after this list) that:
    • Normalises and matches company names and addresses
    • Uses fuzzy matching with confidence scores
    • Maintains a master building-to-occupant table with history and provenance
  • Engineer occupant‑specific, process‑level variables:
    • For each building occupant, design and populate variables that matter for energy, for example:
      • Industry and sub‑industry (SIC + text classification)
      • Building function / process type:
        • Manufacturing vs distribution vs office vs retail
        • Plastics vs food vs metals vs pharma, etc.
        • Cold storage, data centre, heavy process, light assembly
      • Operational characteristics:
        • Opening hours and shift patterns
        • 24/7 vs office hours
        • Indicative vehicle and truck movements
        • Refrigeration, compressed air, process heat, HVAC type
      • Machinery and equipment indicators, where possible:
        • Presence of large motors, injection moulders, CNC machines, presses, ovens, kilns, furnaces, chillers, freezers, compressors, data centre racks, etc.
        • Signals from permits, product specs, job adverts (“CNC milling centre”, “ammonia refrigeration plant”), site photos, equipment lists, OEM case studies and similar
  • Join all of this back to:
    • VOA dimensions
    • EPC primary energy and HVAC / fuel indicators
    • Scope 2 and emissions disclosures where available
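
To make the scraping requirement concrete, here is a minimal sketch of the retry / throttle / logging pattern above, using httpx and asyncio. The concurrency limit, backoff schedule and target URL are illustrative placeholders, not a prescribed stack:

    import asyncio
    import logging
    import random

    import httpx

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("scraper")

    SEMAPHORE = asyncio.Semaphore(10)  # throttle: at most 10 requests in flight

    async def fetch(client: httpx.AsyncClient, url: str,
                    retries: int = 3) -> str | None:
        """Fetch one URL with retries, exponential backoff and logging."""
        for attempt in range(1, retries + 1):
            try:
                async with SEMAPHORE:
                    resp = await client.get(url, timeout=30.0)
                resp.raise_for_status()
                log.info("OK %s", url)
                return resp.text
            except httpx.HTTPError as exc:
                wait = 2 ** attempt + random.random()  # backoff with jitter
                log.warning("attempt %d failed for %s (%s); retrying in %.1fs",
                            attempt, url, exc, wait)
                await asyncio.sleep(wait)
        log.error("giving up on %s", url)
        return None  # record the failure instead of crashing the whole run

    async def main(urls: list[str]) -> None:
        async with httpx.AsyncClient(follow_redirects=True) as client:
            pages = await asyncio.gather(*(fetch(client, u) for u in urls))
        log.info("fetched %d of %d pages",
                 sum(p is not None for p in pages), len(urls))

    if __name__ == "__main__":
        asyncio.run(main(["https://example.com"]))  # placeholder target

Proxy management is omitted for brevity; in practice each request would also draw from a rotating proxy pool.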
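
Similarly, the fuzzy-matching core of the entity resolution pipeline could start as small as this rapidfuzz sketch. The candidate register, normalisation rules and 85-point threshold are hypothetical:

    import re

    from rapidfuzz import fuzz, process  # pip install rapidfuzz

    LEGAL_SUFFIXES = re.compile(r"\b(ltd|limited|plc|llp)\b\.?")

    def normalise(name: str) -> str:
        """Crude company-name normalisation: case, punctuation, legal suffixes."""
        name = LEGAL_SUFFIXES.sub(" ", name.lower())
        name = re.sub(r"[^a-z0-9 ]+", " ", name)
        return " ".join(name.split())

    def best_match(raw_name: str, register: dict[str, str],
                   threshold: float = 85.0) -> tuple[str, float] | None:
        """Return (company_number, confidence) for the best match, else None.

        `register` maps normalised Companies House names to company numbers;
        it is a stand-in for the real master table.
        """
        result = process.extractOne(normalise(raw_name), list(register),
                                    scorer=fuzz.token_sort_ratio)
        if result and result[1] >= threshold:
            matched_name, score, _ = result
            return register[matched_name], score
        return None  # below threshold: route to review or to other signals

    register = {normalise("Acme Plastics Ltd"): "01234567"}  # toy register
    print(best_match("ACME PLASTICS LIMITED", register))     # ('01234567', 100.0)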

The key is depth and uniformity. A cold‑storage warehouse will have different variables from a law firm, and a plastics injection‑moulding plant different again – but everything must land in a consistent, model‑ready structure across ~2M rows.
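
As one illustration of what “consistent, model-ready” could look like, a single row per building occupant might carry fields like the following. The field names and values here are hypothetical, not our actual schema:

    from dataclasses import dataclass, field

    @dataclass
    class OccupantRecord:
        """One model-ready row per building occupant (illustrative fields)."""
        uprn: int                        # OS AddressBase building key
        company_number: str | None       # Companies House link, if resolved
        match_confidence: float          # entity-resolution score, 0-100
        sic_code: str | None             # industry classification
        function: str | None             # e.g. "manufacturing", "cold_storage"
        hours_24_7: bool | None          # None means unknown, not False
        floor_area_m2: float | None      # from VOA
        equipment_flags: dict[str, bool] = field(default_factory=dict)
        sources: list[str] = field(default_factory=list)  # provenance trail

    row = OccupantRecord(uprn=100023336956, company_number="01234567",
                         match_confidence=97.5, sic_code="22290",
                         function="manufacturing", hours_24_7=True,
                         floor_area_m2=4200.0,
                         equipment_flags={"injection_moulder": True},
                         sources=["nndr", "company_website"])

The point is that a cold store and a law firm fill different flags, but land in the same table shape.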

  • Build and document the data layer for the modelling team
  • Design schemas for long‑term use and refresh
  • Implement ETL / ELT workflows (ingest → clean → enrich → publish); a minimal sketch follows this list
  • Add basic data‑quality checks and reporting
  • Document sources, joins and assumptions so others can work confidently on top of your layer
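
A minimal pandas sketch of that ingest → clean → enrich → publish flow, with a basic quality gate that fails loudly rather than publishing bad data. The file names, join key and gate threshold are placeholders:

    import pandas as pd

    def ingest(path: str) -> pd.DataFrame:
        return pd.read_csv(path)  # raw scrape or API dump

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        return df.dropna(subset=["uprn"]).drop_duplicates()

    def enrich(df: pd.DataFrame, voa: pd.DataFrame) -> pd.DataFrame:
        return df.merge(voa, on="uprn", how="left")  # e.g. add VOA floor areas

    def publish(df: pd.DataFrame, path: str) -> None:
        null_rate = df["uprn"].isna().mean()  # quality check before publishing
        if null_rate > 0:
            raise ValueError(f"null UPRNs after clean: {null_rate:.2%}")
        df.to_parquet(path, index=False)
        print(f"published {len(df)} rows to {path}")

    if __name__ == "__main__":
        occupants = enrich(clean(ingest("raw_occupants.csv")),
                           pd.read_csv("voa.csv"))
        publish(occupants, "occupants.parquet")

In production the same stages would hang off Airflow (or Prefect / Dagster) tasks rather than a script.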

WHAT YOU SHOULD ALREADY HAVE DONE
  • 3–6+ years as a Data Engineer, Data Acquisition Engineer or similar
  • Proven experience scraping and integrating large public or government datasets at scale
  • A track record of production scraping pipelines, not just one‑off scripts
  • Strong entity‑resolution background:
    • Fuzzy matching, deduplication, record linkage across messy sources
    • Ideally with companies and addresses
  • Experience turning unstructured information (websites, PDFs, job ads, photos) into structured variables
  • Experience with UK data (ONS, EPC, VOA, NNDR, planning, AddressBase, etc.) is a strong plus

TECHNICAL SKILLS – MUST HAVE
  • Strong Python:
    • requests or httpx
    • BeautifulSoup or lxml
    • Scrapy and/or Playwright or Selenium for JS‑heavy sites
  • Strong SQL and experience with a relational warehouse (Postgres, BigQuery, Snowflake or similar)
  • Experience with an orchestration tool: Airflow, Prefect, Dagster or similar
  • Comfort with:
    • Parallel and async scraping
    • Proxy rotation and basic anti‑bot strategies
    • Designing and versioning schemas
    • Normalising and matching UK addresses and postcodes (see the sketch after this list)
  • Basic geospatial comfort:
    • UPRN / UARN, postcodes, lat‑long
    • GeoPandas / Shapely / PostGIS at a practical level
  • Git and collaborative development workflows
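
For a flavour of the address and geospatial side, here is a small sketch that normalises a UK postcode and tests an address point against a land parcel with Shapely. The regex is a simplified validity check and the polygon is a toy; real INSPIRE polygons arrive as GML:

    import re

    from shapely.geometry import Point, Polygon

    def normalise_postcode(raw: str) -> str | None:
        """Normalise to the canonical 'OUTWARD INWARD' form, else None."""
        pc = re.sub(r"\s+", "", raw).upper()
        if not re.fullmatch(r"[A-Z]{1,2}\d[A-Z\d]?\d[A-Z]{2}", pc):
            return None  # not a valid-looking postcode (simplified check)
        return f"{pc[:-3]} {pc[-3:]}"  # the inward code is always 3 characters

    assert normalise_postcode("ec1a1bb") == "EC1A 1BB"

    # Toy title polygon and address point (coordinates are arbitrary).
    parcel = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
    print(parcel.contains(Point(5, 5)))  # True: the point falls in this parcel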

NICE TO HAVE
  • Direct exposure to OS AddressBase, VOA, EPC, NNDR, INSPIRE polygons or similar datasets
  • Experience in energy, utilities, carbon accounting or real‑estate analytics
  • Use of NLP for text classification and keyword tagging over large corpora (a toy sketch follows this list)
  • Experience with graph databases for relationship modelling
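
By way of example, keyword tagging over scraped text can start as simply as the sketch below; the pattern map is hypothetical and would be curated per sector before any heavier NLP is justified:

    import re

    # Hypothetical signal-to-tag map; real lists would be curated per sector.
    EQUIPMENT_PATTERNS = {
        "injection_moulder": r"injection\s+mould",
        "cnc": r"\bcnc\b",
        "ammonia_refrigeration": r"ammonia\s+refrigeration",
        "furnace": r"\b(furnace|kiln)s?\b",
    }

    def tag_text(text: str) -> set[str]:
        """Keyword-tag one document (job ad, 'our plant' page, datasheet)."""
        text = text.lower()
        return {tag for tag, pattern in EQUIPMENT_PATTERNS.items()
                if re.search(pattern, text)}

    ad = "Hiring a CNC milling operator for our injection moulding plant."
    print(tag_text(ad))  # tags: cnc, injection_moulder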

WHAT KIND OF PERSON WILL FIT
  • You like turning messy, inconsistent public data into clean, reliable tables
  • You enjoy thinking about data models and feature design, not just writing scrapers
  • You’re comfortable working closely with founders and making pragmatic trade‑offs
  • You care about building pipelines that can run repeatedly without babysitting