|
1 | 1 | # predoc-coding-sample |
2 | 2 |
|
3 | | -Clean, reproducible research pipeline for an applied micro / health & labor project on UKHLS data. The pipeline identifies latent health types via K-Means clustering on frailty trajectories, then estimates how those types relate to labour-market outcomes. |
| 3 | +Clean, reproducible research pipeline for an applied micro / health & labour |
| 4 | +project on UKHLS data. The pipeline identifies latent health types via K-Means |
| 5 | +clustering on frailty trajectories, fits HC3 OLS regressions with full |
| 6 | +diagnostics, and emits publication-ready tables and figures that the paper |
| 7 | +and slide deck consume directly. |
| 8 | + |
| 9 | +[CI](https://github.com/gavinjqu/predoc-coding-sample/actions/workflows/ci.yml) |
4 | 10 |
|
5 | 11 | ## Setup |
| 12 | + |
| 13 | +The repo is managed with [uv](https://docs.astral.sh/uv/): |
| 14 | + |
6 | 15 | ```bash |
7 | | -python3 -m venv .venv |
8 | | -source .venv/bin/activate |
9 | | -python -m pip install -e . |
| 16 | +uv sync |
10 | 17 | ``` |
11 | 18 |
|
| 19 | +This creates `.venv/`, uses a Python interpreter satisfying `>=3.11`, and installs the |
| 20 | +dependencies declared in `pyproject.toml` at the versions locked in `uv.lock`. |
| 21 | + |
12 | 22 | ## Data |
13 | 23 |
|
14 | | -- `data/raw/frailty_long_panel.parquet` — frailty index, age, wave, death (gitignored) |
15 | | -- `data/raw/ukhls_demographic_panel.parquet` — sex, education, labour-force status, hourly pay (gitignored, produced by `ingest_ukhls`) |
16 | | -- `data/raw/death_data_long_panel.parquet` — death records (gitignored) |
17 | | -- `data/derived/` — intermediate pipeline outputs (gitignored, regenerated each run) |
| 24 | +UKHLS microdata is restricted and **gitignored**: |
18 | 25 |
|
19 | | -The raw `*.dta` files in `data/ukhls/` are gitignored. Once `ukhls_demographic_panel.parquet` is generated they can be deleted. |
| 26 | +| Path | Contents | Source | |
| 27 | +|------|----------|--------| |
| 28 | +| `data/ukhls/{a..m}_indresp.dta` | Raw UKHLS Stata files | UK Data Service licence | |
| 29 | +| `data/raw/frailty_long_panel.parquet` | Per-(pidp, wave) frailty index, age, death, raw `hcond*` flags | `notebooks/archive/frailty_main.ipynb` | |
| 30 | +| `data/raw/ukhls_demographic_panel.parquet` | Per-(pidp, wave) sex, education, labour-force status, hourly pay | `python -m src.pipeline.ingest_ukhls` | |
| 31 | +| `data/raw/death_data_long_panel.parquet` | Death records | UKHLS death linkage | |
| 32 | +| `data/derived/` | Intermediate pipeline outputs (regenerated each run) | `./run.sh` | |
20 | 33 |
|
21 | | -## One-time setup: build the demographic panel |
| 34 | +The `.dta` directory can be deleted once `ukhls_demographic_panel.parquet` |
| 35 | +exists; the pipeline reads only the parquet artefacts. |
22 | 36 |
|
23 | | -If `data/raw/ukhls_demographic_panel.parquet` doesn't exist yet, build it from the UKHLS Stata files: |
| 37 | +### One-time setup: build the demographic panel |
| 38 | + |
| 39 | +If `data/raw/ukhls_demographic_panel.parquet` doesn't exist yet, build it from |
| 40 | +the UKHLS Stata files: |
24 | 41 |
|
25 | 42 | ```bash |
26 | | -python -m src.pipeline.ingest_ukhls --ukhls-dir data/ukhls --output data/raw/ukhls_demographic_panel.parquet |
| 43 | +uv run python -m src.pipeline.ingest_ukhls \ |
| 44 | + --ukhls-dir data/ukhls \ |
| 45 | + --output data/raw/ukhls_demographic_panel.parquet |
27 | 46 | ``` |
28 | 47 |
|
| 48 | +This is a one-shot conversion (~30s on first run, 5 MB output). Once it's |
| 49 | +done, you can `rm -rf data/ukhls/`. |
| 50 | + |
29 | 51 | ## Run the pipeline |
30 | 52 |
|
31 | 53 | ```bash |
32 | 54 | ./run.sh |
33 | | -# or |
34 | | -.venv/bin/python3 -m src.cli --config configs/config.yaml |
| 55 | +# equivalent to |
| 56 | +uv run python -m src.cli --config configs/config.yaml |
35 | 57 | ``` |
36 | 58 |
|
37 | | -**Pipeline steps:** |
38 | | -1. `ingest` — reads frailty + demographic parquets, merges on `(pidp, wave)`, computes lagged frailty |
39 | | -2. `cluster` — K-Means on per-individual frailty trajectories at ages 50-60 (k=3 by default) |
40 | | -3. `estimate` — OLS of frailty on demographic controls, with cluster dummies; HC3 SEs and full diagnostic suite (joint F-test, Breusch-Pagan, Durbin-Watson, VIFs) |
41 | | -4. `report` — generates summary tables and figures |
| 59 | +**Pipeline steps** (defined in [src/cli.py](src/cli.py)): |
| 60 | + |
| 61 | +1. **`ingest`** — reads frailty + demographic parquets, merges on `(pidp, wave)`, renames keys, converts wave letters to integers, computes lagged frailty. |
| 62 | +2. **`cluster`** — K-Means on per-individual frailty trajectories at ages 50–60 (k=3 by default), labels mapped back to the full long panel. |
| 63 | +3. **`estimate`** — OLS of frailty on demographic controls, with cluster dummies. HC3 SEs, joint Wald F-test, Breusch–Pagan, Durbin–Watson, VIFs. |
| 64 | +4. **`report`** — generates 5 tables (CSV + booktabs `.tex`) and 11 figures, all driven by [src/analysis/figures.py](src/analysis/figures.py) and [src/analysis/tables.py](src/analysis/tables.py). |
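As a rough illustration of the `ingest` step, the sketch below merges two small frames on `(pidp, wave)`, converts wave letters to integers, and computes a within-person lag. It is not the repository's code: the helper names, the `frailty` and `frailty_lag` column names, and the inner join are assumptions made for the example.

```python
import pandas as pd


def wave_letter_to_int(wave: pd.Series) -> pd.Series:
    """Map UKHLS wave letters ('a', 'b', ...) to integers (1, 2, ...)."""
    return wave.str.lower().map(lambda c: ord(c) - ord("a") + 1).astype("int64")


def build_panel(frailty: pd.DataFrame, demo: pd.DataFrame) -> pd.DataFrame:
    """Merge on (pidp, wave), recode waves, and add lagged frailty."""
    # validate= raises if either side has duplicate (pidp, wave) keys.
    panel = frailty.merge(
        demo, on=["pidp", "wave"], how="inner", validate="one_to_one"
    )
    panel["wave"] = wave_letter_to_int(panel["wave"])
    panel = panel.sort_values(["pidp", "wave"]).reset_index(drop=True)
    # shift(1) takes the previous *observed* row per person; if a person
    # skips a wave, this is not the previous calendar wave.
    panel["frailty_lag"] = panel.groupby("pidp")["frailty"].shift(1)
    return panel


# Toy data standing in for the two parquet inputs.
frailty = pd.DataFrame(
    {"pidp": [1, 1, 2], "wave": ["a", "b", "a"], "frailty": [0.10, 0.15, 0.30]}
)
demo = pd.DataFrame({"pidp": [1, 1, 2], "wave": ["a", "b", "a"], "sex": [1, 1, 2]})
panel = build_panel(frailty, demo)
print(panel)
```

The `validate="one_to_one"` argument is a cheap guard against duplicate keys, which is one of the properties the test suite is described as asserting.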
42 | 65 |
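The `cluster` step could look roughly like the following: restrict to ages 50 to 60, pivot to one row per individual, run K-Means with k=3, and map the labels back onto the long panel. This is a sketch on synthetic data, not the project's implementation; the function name, the `age`, `frailty`, and `health_type` column names, and the row-mean gap filling are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def cluster_health_types(panel, k=3, age_min=50, age_max=60, seed=0):
    """Label each individual by K-Means on their frailty trajectory."""
    window = panel[panel["age"].between(age_min, age_max)]
    # One row per individual, one column per age in the window.
    wide = window.pivot_table(
        index="pidp", columns="age", values="frailty", aggfunc="mean"
    )
    # Assumption: fill unobserved ages with the individual's own mean.
    wide = wide.apply(lambda row: row.fillna(row.mean()), axis=1)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = pd.Series(km.fit_predict(wide.to_numpy()), index=wide.index)
    out = panel.copy()
    # Map labels back to every row of the long panel, including ages
    # outside the clustering window.
    out["health_type"] = out["pidp"].map(labels)
    return out


# Synthetic panel: 30 people at three clearly separated frailty levels.
rng = np.random.default_rng(0)
rows = []
for pidp in range(30):
    level = [0.05, 0.20, 0.40][pidp % 3]
    for age in range(48, 63):
        noise = rng.normal(0.0, 0.005)
        rows.append(
            {"pidp": pidp, "age": age, "frailty": level + 0.002 * (age - 48) + noise}
        )
panel = pd.DataFrame(rows)
clustered = cluster_health_types(panel)
print(clustered.groupby("health_type")["pidp"].nunique())
```

K-Means cluster indices are arbitrary, so any downstream ordering of types (for example by mean frailty) would need an explicit relabelling step.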
|
43 | 66 | ### Outputs |
44 | 67 |
|
45 | | -Tables (`output/tables/`): |
46 | | -- `tab01_summary_stats.csv` — counts and means by health type |
47 | | -- `tab02_frailty_by_wave.csv` — frailty by wave × health type |
48 | | -- `tab03_main_regression.csv` — basic vs full OLS coefficients |
49 | | -- `tab04_employment_by_type.csv` — employment / unemployment / inactivity shares |
50 | | -- `tab05_education_by_type.csv` — educational attainment shares (ages 50-60) |
| 68 | +**Tables** (`output/tables/` — both `.csv` and `.tex`): |
| 69 | + |
| 70 | +| File | Contents | |
| 71 | +|------|----------| |
| 72 | +| `tab01_summary_stats` | Counts, means, share-of-sample by health type | |
| 73 | +| `tab02_frailty_by_wave` | Mean frailty by wave × type | |
| 74 | +| `tab03_main_regression` | Stargazer-rendered basic vs. full OLS table | |
| 75 | +| `tab04_employment_by_type` | Employment / unemployment / inactivity shares | |
| 76 | +| `tab05_education_by_type` | Educational attainment shares (ages 50–60) | |
| 77 | + |
| 78 | +**Figures** (`output/figures/`, all rendered with the custom paper style in |
| 79 | +[src/analysis/_style.py](src/analysis/_style.py)): |
| 80 | + |
| 81 | +| File | Section | Contents | |
| 82 | +|------|---------|----------| |
| 83 | +| `fig01_frailty_trajectories.png` | Appx B | Frailty trajectories by wave | |
| 84 | +| `fig02_frailty_distribution.png` | Appx B | Within-cluster frailty distributions | |
| 85 | +| `fig03_cluster_diagnostics.png` | §3 | Elbow + silhouette twin-axis | |
| 86 | +| `fig04_frailty_by_age.png` | §4.1 | Binned mean frailty by age × type (headline) | |
| 87 | +| `fig05_employment_by_type.png` | §4.2 | Employment rate by age × type | |
| 88 | +| `fig06_earnings_by_type.png` | §4.2 | Mean hourly pay by age × type (95% CIs) | |
| 89 | +| `fig07_frailty_by_age_scatter.png` | §2 | Binned mean frailty by age, full panel | |
| 90 | +| `fig08_education_by_type.png` | §4.3 | Stacked-bar education shares by type | |
| 91 | +| `fig09_pay_by_education.png` | §4.3 | Mean hourly pay by qualification | |
| 92 | +| `fig10_healthcond_variables_by_wave.png` | §2 | UKHLS variable-family wave coverage | |
| 93 | +| `fig11_mortality_by_age.png` | §2 | Mortality (frailty=1) by age | |
| 94 | + |
| 95 | +**Metrics:** `output/metrics/metrics.json` — appended per-run with run-id, git |
| 96 | +commit, cluster counts, regression diagnostics. |
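One way to append a record per run is sketched below using only the standard library. It assumes `metrics.json` holds a JSON list of run records; the function names, the field names, and the `"unknown"` fallback for the commit hash are illustrative, not taken from the repository.

```python
import json
import subprocess
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def current_commit() -> str:
    """Return the current git commit hash, or 'unknown' outside a repo."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def append_metrics(path, metrics: dict) -> dict:
    """Append one run record to a JSON list stored at `path`."""
    path = Path(path)
    runs = json.loads(path.read_text()) if path.exists() else []
    record = {
        "run_id": uuid.uuid4().hex,
        "git_commit": current_commit(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **metrics,
    }
    runs.append(record)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(runs, indent=2))
    return record


# Demonstrate two runs against a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "metrics" / "metrics.json"
    append_metrics(target, {"cluster_counts": {"0": 10, "1": 12, "2": 8}})
    append_metrics(target, {"cluster_counts": {"0": 11, "1": 11, "2": 8}})
    history = json.loads(target.read_text())
print(len(history), "runs recorded")
```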
| 97 | + |
| 98 | +**Log:** `output/logs/pipeline.log`. |
51 | 99 |
|
52 | | -Figures (`output/figures/`): |
53 | | -- `fig01_frailty_trajectories.png` — frailty trajectories by wave |
54 | | -- `fig02_frailty_distribution.png` — within-cluster frailty distributions |
55 | | -- `fig03_cluster_diagnostics.png` — elbow and silhouette plots |
56 | | -- `fig04_frailty_by_age.png` — binned mean frailty by age and health type |
57 | | -- `fig05_employment_by_type.png` — employment rate by age and health type |
58 | | -- `fig06_earnings_by_type.png` — mean hourly pay by age and health type with 95% CIs |
| 100 | +## Paper |
| 101 | + |
| 102 | +- **LaTeX sources:** [paper/tex/](paper/tex/) — working-paper layout with title page (JEL J14, J21, J31, I12, C38; keywords), abstract, body sections 1–7, lettered appendices A–C. |
| 103 | +- **Build:** `./paper/build.sh` → [paper/final/final.pdf](paper/final/final.pdf) (~21 pages, 1.5 MB). Auto-runs `./run.sh` first if any required figure/table is missing. |
| 104 | +- **Slide deck:** [paper/slides/seminar.tex](paper/slides/seminar.tex) — 24-frame Beamer deck for a ~30-min seminar talk. Built via `./paper/slides/build.sh` → [paper/final/slides/seminar.pdf](paper/final/slides/seminar.pdf). Uses the `primary-blue`/`primary-gold` palette from [Preambles/header.tex](paper/slides/header.tex). |
59 | 105 |
|
60 | | -Metrics: `output/metrics/metrics.json` (run-id, git commit, cluster counts, regression diagnostics) |
61 | | -Log: `output/logs/pipeline.log` |
| 106 | +Both build scripts run the full pipeline first if any required artefact is |
| 107 | +missing, so a fresh clone with UKHLS data in place is one command away from a |
| 108 | +compiled PDF. |
62 | 109 |
|
63 | 110 | ## Tests |
64 | 111 |
|
65 | 112 | ```bash |
66 | | -.venv/bin/python3 -m pytest tests/test_pipeline.py -v |
| 113 | +uv run pytest -q |
67 | 114 | ``` |
68 | 115 |
|
69 | | -Uses a 200-individual subsample from the real frailty panel. |
| 116 | +The test suite ([tests/test_pipeline.py](tests/test_pipeline.py)) takes a |
| 117 | +200-individual subsample of the frailty panel and runs the pipeline |
| 118 | +end-to-end against it, asserting schema, no-duplicate keys, valid frailty |
| 119 | +range, and expected outputs. **When the UKHLS data isn't present** |
| 120 | +(e.g. in CI on a public runner) the suite cleanly skips every test instead |
| 121 | +of erroring, so the build stays green. |
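A common way to get this skip-when-data-is-missing behaviour in pytest is a reusable `skipif` marker, sketched below. The README does not say which mechanism `tests/test_pipeline.py` actually uses, so the marker name and the example test are assumptions; only the parquet path and the `(pidp, wave)` key come from the text above.

```python
from pathlib import Path

import pytest

FRAILTY_PATH = Path("data/raw/frailty_long_panel.parquet")

# Hypothetical marker: skip (rather than fail) when restricted data is absent.
requires_ukhls = pytest.mark.skipif(
    not FRAILTY_PATH.exists(),
    reason="UKHLS data not available on this machine",
)


@requires_ukhls
def test_no_duplicate_keys():
    # Imported lazily so that collection works without touching the data.
    import pandas as pd

    panel = pd.read_parquet(FRAILTY_PATH)
    assert not panel.duplicated(["pidp", "wave"]).any()
```

With this pattern, `pytest -q` reports the test as skipped with the given reason on a runner without the data, and runs it normally where the parquet file exists.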
70 | 122 |
|
71 | | -## Notes |
| 123 | +## Continuous integration |
72 | 124 |
|
73 | | -- Notebooks are archived in `notebooks/archive/` for provenance only; the pipeline does not depend on them. |
| 125 | +[.github/workflows/ci.yml](.github/workflows/ci.yml) runs a minimal smoke |
| 126 | +check on every push and PR, against Python 3.11 and 3.12: |
74 | 127 |
|
75 | | -## Paper |
| 128 | +1. `uv sync` — catches dependency drift / lock-file inconsistencies. |
| 129 | +2. Import smoke — imports every module under `src/` to catch syntax errors |
| 130 | + and broken cross-module references. |
| 131 | +3. `pytest -q` — collects the suite. Tests skip in CI because UKHLS data |
| 132 | + isn't available; what's verified is that pytest collection succeeds and |
| 133 | + the test code itself parses. |
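The import smoke check in step 2 can be approximated with `pkgutil` and `importlib`, as in the sketch below. The workflow's actual command is not shown here, so the `import_all` helper is hypothetical; the demo walks the standard-library `email` package so that it runs anywhere, and in the repo one would presumably call it with `"src"`.

```python
import importlib
import pkgutil


def import_all(package_name: str) -> list[str]:
    """Import a package and every module beneath it; return their names.

    Any syntax error or broken cross-module import surfaces as an
    exception from importlib.import_module.
    """
    package = importlib.import_module(package_name)
    imported = [package.__name__]
    prefix = package.__name__ + "."
    for info in pkgutil.walk_packages(package.__path__, prefix=prefix):
        importlib.import_module(info.name)
        imported.append(info.name)
    return imported


# Demo on a standard-library package.
modules = import_all("email")
print(f"imported {len(modules)} modules")
```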
| 134 | + |
| 135 | +For a restricted-data project, CI on a public runner can verify only |
| 136 | +dependency, import, and syntax health; it does not attempt full reproduction. |
| 137 | + |
| 138 | +## Notes |
76 | 139 |
|
77 | | -- LaTeX sources: `paper/tex/` |
78 | | -- Compiled output: `paper/final/` |
| 140 | +- Notebooks are archived in [notebooks/archive/](notebooks/archive/) for |
| 141 | + provenance only; the pipeline does not depend on them. |
| 142 | +- All figures are generated with the custom matplotlib rcParams in |
| 143 | + [src/analysis/_style.py](src/analysis/_style.py) — no `plt.style.use(...)` |
| 144 | + of any built-in theme. |
| 145 | +- The regression LaTeX table is rendered via Stargazer in |
| 146 | + [src/pipeline/estimate.py](src/pipeline/estimate.py); descriptive tables |
| 147 | + use `pandas.to_latex()` wrapped in booktabs in |
| 148 | + [src/analysis/tables.py](src/analysis/tables.py). |