Build a Batch ETL Data Pipeline (Ingest → Transform → Warehouse)
Track: Data Engineering / ETL
Analysts query data. Data engineers move it — reliably, on a schedule, with checks that catch bad records before they reach a dashboard. A pipeline that pulls from a source, cleans it, loads a warehouse, and validates itself is the single most convincing thing a junior data engineer can show. Build one.
What you’ll build: a batch ETL pipeline in Python — ingest data from a real source, transform and clean it, load it into a warehouse, orchestrate the steps so they run in order, and add data-quality checks that fail loudly when something’s wrong. Pick any source with fresh data (a public API, an open dataset); the pipeline mechanics are the point.
Data-engineering interviews are full of “what happens when the source sends garbage?” A pipeline with explicit extract/transform/load stages, an orchestrator, and quality assertions answers that before they ask. It maps to the keywords these roles list: Python, ETL/ELT, data pipeline, SQL, data warehouse, Airflow, orchestration, data quality.
Skills & keywords you’ll demonstrate
Extracting from a source (API or files) into a raw layer
Transforming & cleaning with pandas / SQL
Loading into a warehouse (Postgres or DuckDB) with a sensible schema
Orchestration — running the stages in order, re-runnably (and an Airflow DAG)
Data-quality checks — row counts, nulls, uniqueness — that fail the run on bad data
Starter repo
Clone github.com/OptimalMatch/resume-project-data-pipeline — a pipeline/ with extract/transform/load/run stubs, a quality-checks module, and an Airflow DAG stub. Build it under your own account, committing per milestone. It pairs naturally with the SQL analytics project — this fills the warehouse it queries.
Build it in milestones
Extract. Pull data from your source into a raw file or table, unchanged. Commit.
Transform. Clean types, drop/repair bad rows, normalize columns into the shape you want to store. Commit.
Load. Write the result into a warehouse table with a defined schema. Commit.
Orchestrate. A run.py that executes extract → transform → load in order and can be re-run safely. Commit.
Validate. Add quality checks (expected row count, no nulls in keys, unique IDs) that fail the run when violated. Commit.
Schedule & document. Express the pipeline as an Airflow DAG; README the architecture and how to run it. Commit.
Stretch goals
Incremental loads — only process new/changed records since the last run.
A formal data-quality framework (Great Expectations) with a generated report.
Containerize it and run it on a schedule (cron or a managed scheduler).
Put it on your résumé
“Built a batch ETL pipeline in Python — ingesting from [source], transforming with pandas, and loading a [Postgres/DuckDB] warehouse — orchestrated as an Airflow DAG.”
“Added data-quality checks (row counts, null and uniqueness assertions) that fail the run on bad data, preventing corrupt records from reaching downstream tables.”
Update your résumé and check it with the free ATS resume score — data-engineering roles weight exactly these keywords.
Frequently asked questions
Do I need Airflow installed to build this? Not at first. Build the extract/transform/load stages and an orchestrator in plain Python, then express the same steps as an Airflow DAG to show you know the tool. The starter repo includes a DAG stub you can read without installing Airflow.
What makes this different from a SQL analytics project? Analytics queries existing data; data engineering moves and prepares it. This project builds the pipeline that fills the warehouse — with quality checks that catch bad records — which is exactly what data-engineering roles screen for. It pairs naturally with the SQL analytics project.