Build a Batch ETL Data Pipeline (Ingest → Transform → Warehouse)

Track: Data Engineering / ETL

Analysts query data. Data engineers move it — reliably, on a schedule, with checks that catch bad records before they reach a dashboard. A pipeline that pulls from a source, cleans it, loads a warehouse, and validates itself is the single most convincing thing a junior data engineer can show. Build one.

What you’ll build: a batch ETL pipeline in Python — ingest data from a real source, transform and clean it, load it into a warehouse, orchestrate the steps so they run in order, and add data-quality checks that fail loudly when something’s wrong. Pick any source with fresh data (a public API, an open dataset); the pipeline mechanics are the point.
Get the starter repo on GitHub →

Why this project gets interviews

Data-engineering interviews are full of “what happens when the source sends garbage?” A pipeline with explicit extract/transform/load stages, an orchestrator, and quality assertions answers that before they ask. It maps to the keywords these roles list: Python, ETL/ELT, data pipeline, SQL, data warehouse, Airflow, orchestration, data quality.

Skills & keywords you’ll demonstrate

Starter repo

Clone github.com/OptimalMatch/resume-project-data-pipeline — a pipeline/ with extract/transform/load/run stubs, a quality-checks module, and an Airflow DAG stub. Build it under your own account, committing per milestone. It pairs naturally with the SQL analytics project — this fills the warehouse it queries.

Build it in milestones

  1. Extract. Pull data from your source into a raw file or table, unchanged. Commit.
  2. Transform. Clean types, drop/repair bad rows, normalize columns into the shape you want to store. Commit.
  3. Load. Write the result into a warehouse table with a defined schema. Commit.
  4. Orchestrate. A run.py that executes extract → transform → load in order and can be re-run safely. Commit.
  5. Validate. Add quality checks (expected row count, no nulls in keys, unique IDs) that fail the run when violated. Commit.
  6. Schedule & document. Express the pipeline as an Airflow DAG; README the architecture and how to run it. Commit.

Stretch goals

Put it on your résumé

Update your résumé and check it with the free ATS resume score — data-engineering roles weight exactly these keywords.

Frequently asked questions

Do I need Airflow installed to build this?
Not at first. Build the extract/transform/load stages and an orchestrator in plain Python, then express the same steps as an Airflow DAG to show you know the tool. The starter repo includes a DAG stub you can read without installing Airflow.

What makes this different from a SQL analytics project?
Analytics queries existing data; data engineering moves and prepares it. This project builds the pipeline that fills the warehouse — with quality checks that catch bad records — which is exactly what data-engineering roles screen for. It pairs naturally with the SQL analytics project.

Score your new data-engineering résumé — free →