update readme

gavinjqu · gavinjqu · commit f54b248f9892 · 2026-05-06T13:00:32.000-07:00
diff --git a/README.md b/README.md
@@ -1,78 +1,148 @@
 # predoc-coding-sample
 
-Clean, reproducible research pipeline for an applied micro / health & labor project on UKHLS data. The pipeline identifies latent health types via K-Means clustering on frailty trajectories, then estimates how those types relate to labour-market outcomes.
+Clean, reproducible research pipeline for an applied micro / health & labour
+project on UKHLS data. The pipeline identifies latent health types via K-Means
+clustering on frailty trajectories, fits OLS regressions with HC3 standard
+errors and full diagnostics, and emits publication-ready tables and figures
+that the paper and slide deck consume directly.
+
+[![smoke](https://github.com/gavinjqu/predoc-coding-sample/actions/workflows/ci.yml/badge.svg)](https://github.com/gavinjqu/predoc-coding-sample/actions/workflows/ci.yml)
 
 ## Setup
+
+The repo is managed with [uv](https://docs.astral.sh/uv/):
+
 ```bash
-python3 -m venv .venv
-source .venv/bin/activate
-python -m pip install -e .
+uv sync
 ```
 
+This creates `.venv/` with a Python interpreter satisfying `>=3.11` and
+installs every dependency from `pyproject.toml` at the versions locked in `uv.lock`.
+
 ## Data
 
-- `data/raw/frailty_long_panel.parquet` — frailty index, age, wave, death (gitignored)
-- `data/raw/ukhls_demographic_panel.parquet` — sex, education, labour-force status, hourly pay (gitignored, produced by `ingest_ukhls`)
-- `data/raw/death_data_long_panel.parquet` — death records (gitignored)
-- `data/derived/` — intermediate pipeline outputs (gitignored, regenerated each run)
+UKHLS microdata is restricted and **gitignored**:
 
-The raw `*.dta` files in `data/ukhls/` are gitignored. Once `ukhls_demographic_panel.parquet` is generated they can be deleted.
+| Path | Contents | Source |
+|------|----------|--------|
+| `data/ukhls/{a..m}_indresp.dta` | Raw UKHLS Stata files | UK Data Service licence |
+| `data/raw/frailty_long_panel.parquet` | Per-(pidp, wave) frailty index, age, death, raw `hcond*` flags | `notebooks/archive/frailty_main.ipynb` |
+| `data/raw/ukhls_demographic_panel.parquet` | Per-(pidp, wave) sex, education, labour-force status, hourly pay | `python -m src.pipeline.ingest_ukhls` |
+| `data/raw/death_data_long_panel.parquet` | Death records | UKHLS death linkage |
+| `data/derived/` | Intermediate pipeline outputs (regenerated each run) | `./run.sh` |
 
-## One-time setup: build the demographic panel
+The `.dta` directory can be deleted once `ukhls_demographic_panel.parquet`
+exists; the pipeline reads only the parquet artefacts.
 
-If `data/raw/ukhls_demographic_panel.parquet` doesn't exist yet, build it from the UKHLS Stata files:
+### One-time setup: build the demographic panel
+
+If `data/raw/ukhls_demographic_panel.parquet` doesn't exist yet, build it from
+the UKHLS Stata files:
 
 ```bash
-python -m src.pipeline.ingest_ukhls --ukhls-dir data/ukhls --output data/raw/ukhls_demographic_panel.parquet
+uv run python -m src.pipeline.ingest_ukhls \
+  --ukhls-dir data/ukhls \
+  --output data/raw/ukhls_demographic_panel.parquet
 ```
 
+This is a one-shot conversion (~30s on first run, 5 MB output). Once it's
+done, you can `rm -rf data/ukhls/`.
+
 ## Run the pipeline
 
 ```bash
 ./run.sh
-# or
-.venv/bin/python3 -m src.cli --config configs/config.yaml
+# equivalent to
+uv run python -m src.cli --config configs/config.yaml
 ```
 
-**Pipeline steps:**
-1. `ingest` — reads frailty + demographic parquets, merges on `(pidp, wave)`, computes lagged frailty
-2. `cluster` — K-Means on per-individual frailty trajectories at ages 50-60 (k=3 by default)
-3. `estimate` — OLS of frailty on demographic controls, with cluster dummies; HC3 SEs and full diagnostic suite (joint F-test, Breusch-Pagan, Durbin-Watson, VIFs)
-4. `report` — generates summary tables and figures
+**Pipeline steps** (defined in [src/cli.py](src/cli.py)):
+
+1. **`ingest`** — reads frailty + demographic parquets, merges on `(pidp, wave)`, renames keys, converts wave letters to integers, computes lagged frailty.
+2. **`cluster`** — K-Means on per-individual frailty trajectories at ages 50–60 (k=3 by default), labels mapped back to the full long panel.
+3. **`estimate`** — OLS of frailty on demographic controls, with cluster dummies. HC3 SEs, joint Wald F-test, Breusch–Pagan, Durbin–Watson, VIFs.
+4. **`report`** — generates 5 tables (CSV + booktabs `.tex`) and 11 figures, all driven by [src/analysis/figures.py](src/analysis/figures.py) and [src/analysis/tables.py](src/analysis/tables.py).
 
 ### Outputs
 
-Tables (`output/tables/`):
-- `tab01_summary_stats.csv` — counts and means by health type
-- `tab02_frailty_by_wave.csv` — frailty by wave × health type
-- `tab03_main_regression.csv` — basic vs full OLS coefficients
-- `tab04_employment_by_type.csv` — employment / unemployment / inactivity shares
-- `tab05_education_by_type.csv` — educational attainment shares (ages 50-60)
+**Tables** (`output/tables/` — both `.csv` and `.tex`):
+
+| File | Contents |
+|------|----------|
+| `tab01_summary_stats` | Counts, means, share-of-sample by health type |
+| `tab02_frailty_by_wave` | Mean frailty by wave × type |
+| `tab03_main_regression` | Stargazer-rendered basic vs. full OLS table |
+| `tab04_employment_by_type` | Employment / unemployment / inactivity shares |
+| `tab05_education_by_type` | Educational attainment shares (ages 50–60) |
+
+**Figures** (`output/figures/`, all rendered with the custom paper style in
+[src/analysis/_style.py](src/analysis/_style.py)):
+
+| File | Section | Contents |
+|------|---------|----------|
+| `fig01_frailty_trajectories.png` | Appx B | Frailty trajectories by wave |
+| `fig02_frailty_distribution.png` | Appx B | Within-cluster frailty distributions |
+| `fig03_cluster_diagnostics.png` | §3 | Elbow + silhouette twin-axis |
+| `fig04_frailty_by_age.png` | §4.1 | Binned mean frailty by age × type (headline) |
+| `fig05_employment_by_type.png` | §4.2 | Employment rate by age × type |
+| `fig06_earnings_by_type.png` | §4.2 | Mean hourly pay by age × type (95% CIs) |
+| `fig07_frailty_by_age_scatter.png` | §2 | Binned mean frailty by age, full panel |
+| `fig08_education_by_type.png` | §4.3 | Stacked-bar education shares by type |
+| `fig09_pay_by_education.png` | §4.3 | Mean hourly pay by qualification |
+| `fig10_healthcond_variables_by_wave.png` | §2 | UKHLS variable-family wave coverage |
+| `fig11_mortality_by_age.png` | §2 | Mortality (frailty=1) by age |
+
+**Metrics:** `output/metrics/metrics.json` — appended per-run with run-id, git
+commit, cluster counts, regression diagnostics.
+
+**Log:** `output/logs/pipeline.log`.
 
-Figures (`output/figures/`):
-- `fig01_frailty_trajectories.png` — frailty trajectories by wave
-- `fig02_frailty_distribution.png` — within-cluster frailty distributions
-- `fig03_cluster_diagnostics.png` — elbow and silhouette plots
-- `fig04_frailty_by_age.png` — binned mean frailty by age and health type
-- `fig05_employment_by_type.png` — employment rate by age and health type
-- `fig06_earnings_by_type.png` — mean hourly pay by age and health type with 95% CIs
+## Paper
+
+- **LaTeX sources:** [paper/tex/](paper/tex/) — working-paper layout with title page (JEL J14, J21, J31, I12, C38; keywords), abstract, body sections 1–7, lettered appendices A–C.
+- **Build:** `./paper/build.sh` → [paper/final/final.pdf](paper/final/final.pdf) (~21 pages, 1.5 MB). Auto-runs `./run.sh` first if any required figure/table is missing.
+- **Slide deck:** [paper/slides/seminar.tex](paper/slides/seminar.tex) — 24-frame Beamer deck for a ~30-min seminar talk. Built via `./paper/slides/build.sh` → [paper/final/slides/seminar.pdf](paper/final/slides/seminar.pdf). Uses the `primary-blue`/`primary-gold` palette from [Preambles/header.tex](paper/slides/header.tex).
 
-Metrics: `output/metrics/metrics.json` (run-id, git commit, cluster counts, regression diagnostics)
-Log: `output/logs/pipeline.log`
+Both build scripts run the full pipeline first if any required artefact is
+missing, so a fresh clone with UKHLS data in place is one command away from a
+compiled PDF.
 
 ## Tests
 
 ```bash
-.venv/bin/python3 -m pytest tests/test_pipeline.py -v
+uv run pytest -q
 ```
 
-Uses a 200-individual subsample from the real frailty panel.
+The test suite ([tests/test_pipeline.py](tests/test_pipeline.py)) takes a
+200-individual subsample of the frailty panel and runs the pipeline
+end-to-end against it, asserting schema, no-duplicate keys, valid frailty
+range, and expected outputs. **When the UKHLS data isn't present**
+(e.g. in CI on a public runner) the suite cleanly skips every test instead
+of erroring, so the build stays green.
 
-## Notes
+## Continuous integration
 
-- Notebooks are archived in `notebooks/archive/` for provenance only; the pipeline does not depend on them.
+[.github/workflows/ci.yml](.github/workflows/ci.yml) runs a minimal smoke
+check on every push and PR, against Python 3.11 and 3.12:
 
-## Paper
+1. `uv sync` — catches dependency drift / lock-file inconsistencies.
+2. Import smoke — imports every module under `src/` to catch syntax errors
+   and broken cross-module references.
+3. `pytest -q` — collects the suite. Tests skip in CI because UKHLS data
+   isn't available; what's verified is that pytest collection succeeds and
+   the test code itself parses.
+
+Because the data is restricted, CI checks only dependency, import, and syntax
+health; it does not attempt full reproduction.
+
+## Notes
 
-- LaTeX sources: `paper/tex/`
-- Compiled output: `paper/final/`
+- Notebooks are archived in [notebooks/archive/](notebooks/archive/) for
+  provenance only; the pipeline does not depend on them.
+- All figures are generated with the custom matplotlib rcParams in
+  [src/analysis/_style.py](src/analysis/_style.py) — no `plt.style.use(...)`
+  of any built-in theme.
+- The regression LaTeX table is rendered via Stargazer in
+  [src/pipeline/estimate.py](src/pipeline/estimate.py); descriptive tables
+  use `pandas.to_latex()` wrapped in booktabs in
+  [src/analysis/tables.py](src/analysis/tables.py).
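
The import smoke step listed under "Continuous integration" in the new README can be sketched as follows. This is a minimal illustration, not the repository's actual CI script: the helper name `import_all` is hypothetical, and it assumes `src/` and its subdirectories are regular packages with `__init__.py` files, which the diff does not confirm.

```python
import importlib
import pkgutil


def import_all(package_name: str) -> list[str]:
    """Import a package and every module beneath it; return the names imported.

    Hypothetical helper for illustration. Assumes regular packages
    (directories containing __init__.py), since pkgutil does not descend
    into namespace sub-packages.
    """
    package = importlib.import_module(package_name)
    imported = [package.__name__]
    # walk_packages yields one ModuleInfo per module or sub-package found
    # under package.__path__, recursing into sub-packages.
    for info in pkgutil.walk_packages(package.__path__, prefix=package.__name__ + "."):
        # Any syntax error or broken cross-module import raises here.
        importlib.import_module(info.name)
        imported.append(info.name)
    return imported
```

Calling `import_all("src")` from the repository root would raise on the first module that fails to import, which is the kind of failure the smoke check is meant to surface.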