Lead Data Acquisition Engineer – UK Commercial Energy Platform

APEXION

Remote

GBP 50,000 - 70,000

Full time

Job summary

A leading data solutions company seeks a Lead Data Acquisition Engineer to build data pipelines for a national commercial energy platform. This role involves designing scraping processes, managing large public datasets, and building an entity resolution pipeline to ensure accurate data modelling. Candidates should have experience with Python, SQL, and data scraping pipelines, along with a solid understanding of UK datasets. This position is remote and requires close collaboration with the founding team.

Qualifications

  • 3-6+ years as a Data Engineer or Data Acquisition Engineer.
  • Experience scraping large public datasets at scale.
  • Strong background in entity resolution and fuzzy matching.

Responsibilities

  • Own data acquisition and scraping.
  • Design and run scraping pipelines for public datasets.
  • Build entity resolution pipelines to normalize company names.

Skills

Strong Python
Strong SQL
Data scraping experience
Entity resolution
Basic geospatial knowledge

Tools

Scrapy
BeautifulSoup
Airflow

Job description

We’re building a national-scale data platform for UK commercial energy.

At the core is a unified view of every commercial building in the UK, and an estimate of annual energy consumption and load profile for each occupant.

We’ve already built the core spine (AddressBase, VOA, leases, CCOD / OCOD, INSPIRE, planning, NNDR, EPC, permits, renewables, Companies House). Now we need someone to own data acquisition and occupant modelling on top of this.

ROLE
Lead Data Acquisition Engineer – UK Commercial Energy Platform

Type: Full-time or long-term contract

Location: Remote (UK or European time zone preferred)

WHAT WE’VE BUILT SO FAR

Our current building / occupant spine includes:

  • OS AddressBase Core as the UPRN spine
  • VOA valuation and floor area data
  • Long leases
  • CCOD / OCOD + INSPIRE polygons
  • Planning application data and NNDR (where available)
  • EPC non-domestic data
  • Environment Agency & DEFRA permitting datasets
  • UK coverage of existing renewable projects
  • Companies House API linkage

Your job is to sit on top of this spine and turn it into something truly useful for per-occupant energy modelling.

THE PROBLEM YOU’RE SOLVING

For each of ~2 million UK commercial buildings we want to know:

  • Who the actual occupant(s) are
  • How they operate in detail
  • What that implies for energy use and load shape

A plastics manufacturer is not the same as a frozen food warehouse, an office, or a logistics hub.

We care about:

  • What they manufacture or do
  • What machinery they have on-site
  • What processes they run, and when they run them

This is not a one-off scrape. It’s a systematic, repeatable pipeline that touches millions of rows.

WHAT YOU WILL DO
  • Own data acquisition and scraping
  • Design and run scraping / ingestion pipelines for:
    • DNO and other network datasets
    • Government and regulator datasets
    • Company-level and facility-level data beyond Companies House
    • Public signals of operations: websites, “our plant” pages, datasheets, job ads, fleet pages, Google Maps / Street View, industry directories, etc.
  • Build robust scrapers at scale (see the first sketch after this list):
    • Parallelisation, retries, throttling, proxy management, error handling
    • Logging and monitoring so we know what ran, what failed, and why
  • Resolve who actually occupies each building:
    • Extend our NNDR-based approach and close the gaps: link buildings to occupants using NNDR, Companies House, planning and permitting data, web presence and other public sources
  • Build an entity resolution pipeline (see the second sketch after this list) that:
    • Normalises and matches company names and addresses
    • Uses fuzzy matching with confidence scores
    • Maintains a master building-to-occupant table with history and provenance
  • Engineer occupant‑specific, process‑level variables:
    • For each building occupant, design and populate variables that matter for energy, for example:
      • Industry and sub‑industry (SIC + text classification)
      • Building function / process type:
        • Manufacturing vs distribution vs office vs retail
        • Plastics vs food vs metals vs pharma, etc.
        • Cold storage, data centre, heavy process, light assembly
      • Operational characteristics:
        • Opening hours and shift patterns
        • 24/7 vs office hours
        • Indicative vehicle and truck movements
        • Refrigeration, compressed air, process heat, HVAC type
      • Machinery and equipment indicators, where possible:
        • Presence of large motors, injection moulders, CNC machines, presses, ovens, kilns, furnaces, chillers, freezers, compressors, data centre racks, etc.
        • Signals from permits, product specs, job adverts (“CNC milling centre”, “ammonia refrigeration plant”), site photos, equipment lists, OEM case studies and similar
  • Join all of this back to:
    • VOA dimensions
    • EPC primary energy and HVAC / fuel indicators
    • Scope 2 and emissions disclosures where available
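
To make the scraping requirement concrete, here is a minimal sketch of the retry / throttle / logging pattern above, using httpx and asyncio. The concurrency limit, backoff schedule and target URL are illustrative placeholders, not a prescribed stack:

    import asyncio
    import logging
    import random

    import httpx

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("scraper")

    SEMAPHORE = asyncio.Semaphore(10)  # throttle: at most 10 requests in flight

    async def fetch(client: httpx.AsyncClient, url: str,
                    retries: int = 3) -> str | None:
        """Fetch one URL with retries, exponential backoff and logging."""
        for attempt in range(1, retries + 1):
            try:
                async with SEMAPHORE:
                    resp = await client.get(url, timeout=30.0)
                resp.raise_for_status()
                log.info("OK %s", url)
                return resp.text
            except httpx.HTTPError as exc:
                wait = 2 ** attempt + random.random()  # backoff with jitter
                log.warning("attempt %d failed for %s (%s); retrying in %.1fs",
                            attempt, url, exc, wait)
                await asyncio.sleep(wait)
        log.error("giving up on %s", url)
        return None  # record the failure instead of crashing the whole run

    async def main(urls: list[str]) -> None:
        async with httpx.AsyncClient(follow_redirects=True) as client:
            pages = await asyncio.gather(*(fetch(client, u) for u in urls))
        log.info("fetched %d of %d pages",
                 sum(p is not None for p in pages), len(urls))

    if __name__ == "__main__":
        asyncio.run(main(["https://example.com"]))  # placeholder target

Proxy management is omitted for brevity; in practice each request would also draw from a rotating proxy pool.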
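
Similarly, the fuzzy-matching core of the entity resolution pipeline could start as small as this rapidfuzz sketch. The candidate register, normalisation rules and 85-point threshold are hypothetical:

    import re

    from rapidfuzz import fuzz, process  # pip install rapidfuzz

    LEGAL_SUFFIXES = re.compile(r"\b(ltd|limited|plc|llp)\b\.?")

    def normalise(name: str) -> str:
        """Crude company-name normalisation: case, punctuation, legal suffixes."""
        name = LEGAL_SUFFIXES.sub(" ", name.lower())
        name = re.sub(r"[^a-z0-9 ]+", " ", name)
        return " ".join(name.split())

    def best_match(raw_name: str, register: dict[str, str],
                   threshold: float = 85.0) -> tuple[str, float] | None:
        """Return (company_number, confidence) for the best match, else None.

        `register` maps normalised Companies House names to company numbers;
        it is a stand-in for the real master table.
        """
        result = process.extractOne(normalise(raw_name), list(register),
                                    scorer=fuzz.token_sort_ratio)
        if result and result[1] >= threshold:
            matched_name, score, _ = result
            return register[matched_name], score
        return None  # below threshold: route to review or to other signals

    register = {normalise("Acme Plastics Ltd"): "01234567"}  # toy register
    print(best_match("ACME PLASTICS LIMITED", register))     # ('01234567', 100.0)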

The key is depth and uniformity. A cold‑storage warehouse will have different variables from a law firm, and a plastics injection‑moulding plant different again – but everything must land in a consistent, model‑ready structure across ~2M rows.
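
As one illustration of what “consistent, model-ready” could look like, a single row per building occupant might carry fields like the following. The field names and values here are hypothetical, not our actual schema:

    from dataclasses import dataclass, field

    @dataclass
    class OccupantRecord:
        """One model-ready row per building occupant (illustrative fields)."""
        uprn: int                        # OS AddressBase building key
        company_number: str | None       # Companies House link, if resolved
        match_confidence: float          # entity-resolution score, 0-100
        sic_code: str | None             # industry classification
        function: str | None             # e.g. "manufacturing", "cold_storage"
        hours_24_7: bool | None          # None means unknown, not False
        floor_area_m2: float | None      # from VOA
        equipment_flags: dict[str, bool] = field(default_factory=dict)
        sources: list[str] = field(default_factory=list)  # provenance trail

    row = OccupantRecord(uprn=100023336956, company_number="01234567",
                         match_confidence=97.5, sic_code="22290",
                         function="manufacturing", hours_24_7=True,
                         floor_area_m2=4200.0,
                         equipment_flags={"injection_moulder": True},
                         sources=["nndr", "company_website"])

The point is that a cold store and a law firm fill different flags, but land in the same table shape.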

  • Build and document the data layer for the modelling team
  • Design schemas for long‑term use and refresh
  • Implement ETL / ELT workflows (ingest → clean → enrich → publish); a minimal sketch follows this list
  • Add basic data‑quality checks and reporting
  • Document sources, joins and assumptions so others can work confidently on top of your layer
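
A minimal pandas sketch of that ingest → clean → enrich → publish flow, with a basic quality gate that fails loudly rather than publishing bad data. The file names, join key and gate threshold are placeholders:

    import pandas as pd

    def ingest(path: str) -> pd.DataFrame:
        return pd.read_csv(path)  # raw scrape or API dump

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        return df.dropna(subset=["uprn"]).drop_duplicates()

    def enrich(df: pd.DataFrame, voa: pd.DataFrame) -> pd.DataFrame:
        return df.merge(voa, on="uprn", how="left")  # e.g. add VOA floor areas

    def publish(df: pd.DataFrame, path: str) -> None:
        null_rate = df["uprn"].isna().mean()  # quality check before publishing
        if null_rate > 0:
            raise ValueError(f"null UPRNs after clean: {null_rate:.2%}")
        df.to_parquet(path, index=False)
        print(f"published {len(df)} rows to {path}")

    if __name__ == "__main__":
        occupants = enrich(clean(ingest("raw_occupants.csv")),
                           pd.read_csv("voa.csv"))
        publish(occupants, "occupants.parquet")

In production the same stages would hang off Airflow (or Prefect / Dagster) tasks rather than a script.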

WHAT YOU SHOULD ALREADY HAVE DONE
  • 3–6+ years as a Data Engineer, Data Acquisition Engineer or similar
  • Proven experience scraping and integrating large public or government datasets at scale
  • A track record of production scraping pipelines, not just one‑off scripts
  • Strong entity‑resolution background:
    • Fuzzy matching, deduplication, record linkage across messy sources
    • Ideally with companies and addresses
  • Experience turning unstructured information (websites, PDFs, job ads, photos) into structured variables
  • Experience with UK data (ONS, EPC, VOA, NNDR, planning, AddressBase, etc.) is a strong plus

TECHNICAL SKILLS – MUST HAVE
  • Strong Python:
    • requests or httpx
    • BeautifulSoup or lxml
    • Scrapy and/or Playwright or Selenium for JS‑heavy sites
  • Strong SQL and experience with a relational warehouse (Postgres, BigQuery, Snowflake or similar)
  • Experience with an orchestration tool: Airflow, Prefect, Dagster or similar
  • Comfort with:
    • Parallel and async scraping
    • Proxy rotation and basic anti‑bot strategies
    • Designing and versioning schemas
    • Normalising and matching UK addresses and postcodes (see the sketch after this list)
  • Basic geospatial comfort:
    • UPRN / UARN, postcodes, lat‑long
    • GeoPandas / Shapely / PostGIS at a practical level
  • Git and collaborative development workflows
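
For a flavour of the address and geospatial side, here is a small sketch that normalises a UK postcode and tests an address point against a land parcel with Shapely. The regex is a simplified validity check and the polygon is a toy; real INSPIRE polygons arrive as GML:

    import re

    from shapely.geometry import Point, Polygon

    def normalise_postcode(raw: str) -> str | None:
        """Normalise to the canonical 'OUTWARD INWARD' form, else None."""
        pc = re.sub(r"\s+", "", raw).upper()
        if not re.fullmatch(r"[A-Z]{1,2}\d[A-Z\d]?\d[A-Z]{2}", pc):
            return None  # not a valid-looking postcode (simplified check)
        return f"{pc[:-3]} {pc[-3:]}"  # the inward code is always 3 characters

    assert normalise_postcode("ec1a1bb") == "EC1A 1BB"

    # Toy title polygon and address point (coordinates are arbitrary).
    parcel = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
    print(parcel.contains(Point(5, 5)))  # True: the point falls in this parcel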

NICE TO HAVE
  • Direct exposure to OS AddressBase, VOA, EPC, NNDR, INSPIRE polygons or similar datasets
  • Experience in energy, utilities, carbon accounting or real‑estate analytics
  • Use of NLP for text classification and keyword tagging over large corpora (a toy sketch follows this list)
  • Experience with graph databases for relationship modelling
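
By way of example, keyword tagging over scraped text can start as simply as the sketch below; the pattern map is hypothetical and would be curated per sector before any heavier NLP is justified:

    import re

    # Hypothetical signal-to-tag map; real lists would be curated per sector.
    EQUIPMENT_PATTERNS = {
        "injection_moulder": r"injection\s+mould",
        "cnc": r"\bcnc\b",
        "ammonia_refrigeration": r"ammonia\s+refrigeration",
        "furnace": r"\b(furnace|kiln)s?\b",
    }

    def tag_text(text: str) -> set[str]:
        """Keyword-tag one document (job ad, 'our plant' page, datasheet)."""
        text = text.lower()
        return {tag for tag, pattern in EQUIPMENT_PATTERNS.items()
                if re.search(pattern, text)}

    ad = "Hiring a CNC milling operator for our injection moulding plant."
    print(tag_text(ad))  # tags: cnc, injection_moulder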

WHAT KIND OF PERSON WILL FIT
  • You like turning messy, inconsistent public data into clean, reliable tables
  • You enjoy thinking about data models and feature design, not just writing scrapers
  • You’re comfortable working closely with founders and making pragmatic trade‑offs
  • You care about building pipelines that can run repeatedly without babysitting