Semantic Backend Engineer (Contract, Remote)

Infuse

Polokwane

Remote

ZAR 600 000 - 900 000

Full time

Today

Job summary

A leading technology company in Polokwane is seeking an experienced professional to own the ETL pipeline and enhance their machine learning models. In this role, you will work with diverse tools such as Python, PyTorch, and FastAPI to manage content ingestion and ensure resource quality. If you are passionate about transforming unstructured data into valuable insights, we would love to hear from you. This is a dynamic opportunity to contribute to innovative solutions and make a significant impact.

Qualifications

  • Experience building ML pipelines that impact users.
  • Familiarity with semantic search and handling unstructured data.
  • Ability to work in fast-paced environments and track vital metrics.

Responsibilities

  • Own the ETL pipeline from raw PDFs to structured resources.
  • Finalize summarization and classification flow using open-source models.
  • Implement freshness logic for content indexing.

Skills

ETL pipeline management
Machine learning
Semantic search
Data filtering
Python programming

Tools

PyTorch
FastAPI
Docker

Job description

Overview

INKHUB is ingesting 10 million raw PDFs to build the Internet's richest catalog of marketing-grade B2B content — tagged, summarized, and searchable by topic, company, or intent.

What You’ll Do

  • Own the ETL pipeline from raw PDFs (S3-ingested) to structured resources.
  • Finalize our summarization and classification flow using open-source models with GPT-4o fallback.
  • Apply filtering logic (e.g., a 3‑year maximum age, page count) to enforce resource quality.
  • Map each asset to the specific topic taxonomy (10+ per topic across ~9 topics).
  • Generate dense embeddings using sentence‑transformers.
  • Load and query embeddings using Milvus or pgvector.
  • Implement “freshness” logic to identify and index only new or updated content based on file diffing, crawl timestamp, or document hash.
  • Build a QA/eval harness: format compliance, recall@5, drift monitoring.
  • Expose /v1/semantic‑search via FastAPI, with filtering and rank fusion.
  • Collaborate closely with our Tech Lead on UX integration and snippet generation.
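
As an illustration of the document-hash variant of the “freshness” logic listed above, here is a minimal sketch. It assumes each PDF is identified by a storage key and that the hash of every previously indexed file is kept in a simple mapping; the function names and the in-memory store are hypothetical and not part of INKHUB's actual pipeline.

```python
import hashlib


def document_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def select_for_indexing(
    documents: dict[str, bytes],
    indexed_hashes: dict[str, str],
) -> list[str]:
    """Return the keys of documents that are new or whose content changed.

    `documents` maps a storage key to raw file bytes (hypothetical layout).
    `indexed_hashes` maps a storage key to the hash recorded at last indexing.
    """
    to_index = []
    for key, data in documents.items():
        # A missing key means the document is new; a different hash means it changed.
        if indexed_hashes.get(key) != document_hash(data):
            to_index.append(key)
    return sorted(to_index)


# Example: one unchanged file, one new file, one updated file.
indexed = {
    "a.pdf": document_hash(b"old report"),
    "c.pdf": document_hash(b"pricing guide v1"),
}
incoming = {
    "a.pdf": b"old report",
    "b.pdf": b"new whitepaper",
    "c.pdf": b"pricing guide v2",
}
print(select_for_indexing(incoming, indexed))  # ['b.pdf', 'c.pdf']
```

Hashing the raw bytes catches any change to a file, including metadata-only edits; a production version might hash the extracted text instead, or combine the hash with the crawl timestamp.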

Your Toolbox

Python, PyTorch, sentence‑transformers, and OpenAI APIs or similar pretrained LLMs. FastAPI, Milvus or pgvector, PyPDF/Tika, and Airflow or Lambda for orchestration. Docker, GPU scheduling, and Athena/Redshift SQL.

You Might Be a Fit If

  • You’ve built ML pipelines that touched real users, not just notebooks.
  • You’ve worked on semantic search, embeddings, or large‑scale tagging.
  • You’ve wrestled with unstructured data and love turning chaos into clarity.
  • You like working fast, iterating with feedback, and tracking metrics that matter.

Why This Role Matters

Your models decide what gets found, how it's tagged, and which content and companies stand out. You’ll help define what “relevance” and “freshness” mean for over a million resources and 50+ company pages, and you'll make sure INKHUB stays ahead of the curve.
