Data engineering · Personal project · Nov 2025

Turn AFL Tables HTML into analysis-ready datasets.

AFL Tables is one of the best resources for historical AFL data, but it is not built for programmatic access. This pipeline scrapes season and match-level pages, parses them into a clean schema, and exports two joinable CSVs: one with a row per match, the other with a row per player per match. Both are ready for modelling, analysis, or exploration.

The problem

Great data. No programmatic access.

HTML-only access

AFL Tables is an excellent resource, but not built for programmatic access. Extracting historical match and player data meant navigating complex HTML tables, season pages, and individual match links — by hand.

No reproducible datasets

Without a pipeline, any analysis required manual exports and copy-paste. Revisiting a season or adding new data meant doing the same manual work again.

Inconsistent page structures

Season pages and match stats pages have different layouts. A scraper that handles one often breaks on the other, requiring constant maintenance.

Two page types. One parser each.

AFL Tables uses different HTML structures for season pages and match stats pages. Season pages encode each match as two rows — one for the home team, one for the away — with the match stats link buried in the second row. Match stats pages use separate tables for each team with per-player columns labelled in AFL shorthand.

The pipeline handles both: a season parser walks all rows and buffers two-row pairs into a single match record, while the match stats parser maps shorthand headers to clean column names using a central lookup. Both parsers cache raw HTML locally so you can iterate on the logic without hitting the site repeatedly.
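The two-row buffering can be sketched roughly as follows. This is a minimal illustration rather than the pipeline's actual parser: it assumes the table rows have already been extracted into dictionaries, and the function name pair_match_rows, the keys team, score and stats_link, and the sample values are all hypothetical.

```python
from typing import Iterable, Iterator


def pair_match_rows(rows: Iterable[dict]) -> Iterator[dict]:
    """Combine consecutive (home, away) rows into one match record.

    Assumes each row has hypothetical 'team' and 'score' keys, and that
    the second row of each pair carries the 'stats_link'.
    """
    buffered = None  # holds the home row until its away row arrives
    for row in rows:
        if buffered is None:
            buffered = row
            continue
        yield {
            "home_team": buffered["team"],
            "home_score": buffered["score"],
            "away_team": row["team"],
            "away_score": row["score"],
            "margin": buffered["score"] - row["score"],
            "stats_link": row.get("stats_link"),
        }
        buffered = None  # start buffering the next pair
    # An unpaired trailing row is silently ignored in this sketch.


# Placeholder rows, not real match data.
rows = [
    {"team": "Home FC", "score": 95},
    {"team": "Away FC", "score": 80, "stats_link": "stats/example.html"},
]
matches = list(pair_match_rows(rows))
```

Writing the pairing as a generator keeps it independent of how the rows were extracted, so the same logic could sit behind a BeautifulSoup row walker or a test fixture.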

Stat columns extracted

KI → Kicks

MK → Marks

HB → Handballs

DI → Disposals

GL → Goals

BH → Behinds

HO → Hitouts

TK → Tackles

CL → Clearances

CP → Contested possessions

UP → Uncontested possessions

GA → Goal assists
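A central lookup for these headers might look something like the sketch below. The abbreviations come from the list above; the snake_case output names, the STAT_COLUMNS constant and the rename_headers helper are assumptions made for illustration, not the pipeline's confirmed naming.

```python
# Shorthand header -> clean column name (output names are assumed).
STAT_COLUMNS = {
    "KI": "kicks",
    "MK": "marks",
    "HB": "handballs",
    "DI": "disposals",
    "GL": "goals",
    "BH": "behinds",
    "HO": "hitouts",
    "TK": "tackles",
    "CL": "clearances",
    "CP": "contested_possessions",
    "UP": "uncontested_possessions",
    "GA": "goal_assists",
}


def rename_headers(headers: list[str]) -> list[str]:
    """Map known shorthand headers to clean names; pass others through."""
    return [STAT_COLUMNS.get(h.strip().upper(), h.strip()) for h in headers]


clean = rename_headers(["Player", "KI", " mk ", "DI", "XX"])
```

Unknown headers are passed through unchanged here, so a new column on the source page shows up in the output instead of being dropped silently.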

The data model it produces

The pipeline produces two clean, joinable CSVs. matches.csv has one row per game — teams, scores, quarter breakdowns, margin, venue, attendance, and a stable match ID. player_games.csv has one row per player per match, with all the stat columns mapped from AFL shorthand.

Both tables share match_id, which lets you join them cleanly in pandas, SQL, or any ML workflow. The schema is designed to be stable across seasons — add more years and the same columns apply.
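As a rough illustration of that join in pandas, the sketch below builds two tiny in-memory tables and merges them on match_id. Only match_id is taken from the description above; the other column names and all values are placeholders, and in practice the frames would come from reading matches.csv and player_games.csv.

```python
import pandas as pd

# Placeholder data standing in for matches.csv and player_games.csv.
matches = pd.DataFrame(
    {
        "match_id": ["m1", "m2"],
        "home_team": ["Home FC", "Other FC"],
        "venue": ["Ground A", "Ground B"],
    }
)
player_games = pd.DataFrame(
    {
        "match_id": ["m1", "m1", "m2"],
        "player": ["J. Doe", "A. Smith", "J. Doe"],
        "kicks": [28, 19, 24],
    }
)

# Many player rows map to one match row; validate raises if match_id
# is duplicated in matches, which would silently multiply player rows.
joined = player_games.merge(
    matches, on="match_id", how="left", validate="many_to_one"
)
```

The validate argument is optional, but it is a cheap guard against a duplicated match_id creeping into the match table.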

player_games.csv — sample output

Player	Team	KI	MK	HB	DI	GL	BH	TK
J. Doe	Home	28	8	14	42	2	1	6
J. Doe	Home	19	5	11	30	0	0	4
J. Doe	Away	24	11	9	33	3	2	5
J. Doe	Away	15	3	18	33	0	1	8

One row per player per match — joins to matches.csv on match_id. Ready for pandas or ML.

Pipeline

From HTML table to analysis-ready CSV in five steps.

Build season URLs

Construct AFL Tables season URLs for each configured year — e.g. /seas/2025.html.

Fetch and cache

Download HTML and cache it locally. Subsequent runs read from disk — no repeated requests to the site.

Parse season pages

Extract match rows (teams, quarter scores, dates, attendance, venue) and collect links to individual match stats pages.

Parse match stats

Follow each match link and extract per-player statistics, mapped from AFL shorthand (KI, MK, DI…) to clean column names.

Export CSVs

Write matches.csv and player_games.csv — both join on match_id, ready for pandas, SQL, or ML workflows.
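The first two steps could be sketched along these lines. This is an assumption-laden outline, not the pipeline's code: the base URL is a guess (only the /seas/<year>.html path is given above), and season_url, download, fetch_cached and the hashed cache filenames are hypothetical names and choices.

```python
import hashlib
from pathlib import Path
from typing import Callable

# Assumption: only the /seas/<year>.html path appears in the text above.
BASE_URL = "https://afltables.com/afl"


def season_url(year: int) -> str:
    """Build the season page URL for one configured year."""
    return f"{BASE_URL}/seas/{year}.html"


def download(url: str) -> str:
    """Fetch a page over HTTP using Requests."""
    import requests  # imported lazily so caching can be exercised offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def fetch_cached(
    url: str, cache_dir: Path, fetch: Callable[[str], str] = download
) -> str:
    """Return cached HTML if present; otherwise fetch it and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a filesystem-safe cache filename.
    path = cache_dir / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

Passing the fetch function as a parameter lets the cache be tested with a stub, and a second call for the same URL reads from disk without touching the network.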

Technologies used

Python

Requests

BeautifulSoup

pandas

Have a data source that's valuable but hard to access programmatically?

Let's build a pipeline that turns it into something you can actually work with — clean, reproducible, and ready for analysis.
