Commit f54b248

committed
update readme
1 parent 4191c15 commit f54b248

1 file changed

Lines changed: 111 additions & 41 deletions

README.md

@@ -1,78 +1,148 @@
11
# predoc-coding-sample
22

3-
Clean, reproducible research pipeline for an applied micro / health & labor project on UKHLS data. The pipeline identifies latent health types via K-Means clustering on frailty trajectories, then estimates how those types relate to labour-market outcomes.
3+
Clean, reproducible research pipeline for an applied micro / health & labour
4+
project on UKHLS data. The pipeline identifies latent health types via K-Means
5+
clustering on frailty trajectories, fits HC3 OLS regressions with full
6+
diagnostics, and emits publication-ready tables and figures that the paper
7+
and slide deck consume directly.
8+
9+
[![smoke](https://github.com/gavinjqu/predoc-coding-sample/actions/workflows/ci.yml/badge.svg)](https://github.com/gavinjqu/predoc-coding-sample/actions/workflows/ci.yml)
410

511
## Setup
12+
13+
The repo is managed with [uv](https://docs.astral.sh/uv/):
14+
615
```bash
7-
python3 -m venv .venv
8-
source .venv/bin/activate
9-
python -m pip install -e .
16+
uv sync
1017
```
1118

19+
This creates `.venv/`, pins Python to `>=3.11`, and installs every dependency
20+
listed in `pyproject.toml` against the locked `uv.lock`.
21+
1222
## Data
1323

14-
- `data/raw/frailty_long_panel.parquet` — frailty index, age, wave, death (gitignored)
15-
- `data/raw/ukhls_demographic_panel.parquet` — sex, education, labour-force status, hourly pay (gitignored, produced by `ingest_ukhls`)
16-
- `data/raw/death_data_long_panel.parquet` — death records (gitignored)
17-
- `data/derived/` — intermediate pipeline outputs (gitignored, regenerated each run)
24+
UKHLS microdata is restricted and **gitignored**:
1825

19-
The raw `*.dta` files in `data/ukhls/` are gitignored. Once `ukhls_demographic_panel.parquet` is generated they can be deleted.
26+
| Path | Contents | Source |
27+
|------|----------|--------|
28+
| `data/ukhls/{a..m}_indresp.dta` | Raw UKHLS Stata files | UK Data Service licence |
29+
| `data/raw/frailty_long_panel.parquet` | Per-(pidp, wave) frailty index, age, death, raw `hcond*` flags | `notebooks/archive/frailty_main.ipynb` |
30+
| `data/raw/ukhls_demographic_panel.parquet` | Per-(pidp, wave) sex, education, labour-force status, hourly pay | `python -m src.pipeline.ingest_ukhls` |
31+
| `data/raw/death_data_long_panel.parquet` | Death records | UKHLS death linkage |
32+
| `data/derived/` | Intermediate pipeline outputs (regenerated each run) | `./run.sh` |
2033

21-
## One-time setup: build the demographic panel
34+
The `.dta` directory can be deleted once `ukhls_demographic_panel.parquet`
35+
exists; the pipeline reads only the parquet artefacts.
2236

23-
If `data/raw/ukhls_demographic_panel.parquet` doesn't exist yet, build it from the UKHLS Stata files:
37+
### One-time setup: build the demographic panel
38+
39+
If `data/raw/ukhls_demographic_panel.parquet` doesn't exist yet, build it from
40+
the UKHLS Stata files:
2441

2542
```bash
26-
python -m src.pipeline.ingest_ukhls --ukhls-dir data/ukhls --output data/raw/ukhls_demographic_panel.parquet
43+
uv run python -m src.pipeline.ingest_ukhls \
44+
--ukhls-dir data/ukhls \
45+
--output data/raw/ukhls_demographic_panel.parquet
2746
```
2847

48+
This is a one-shot conversion (~30s on first run, 5 MB output). Once it's
49+
done, you can `rm -rf data/ukhls/`.
50+
2951
## Run the pipeline
3052

3153
```bash
3254
./run.sh
33-
# or
34-
.venv/bin/python3 -m src.cli --config configs/config.yaml
55+
# equivalent to
56+
uv run python -m src.cli --config configs/config.yaml
3557
```
3658

37-
**Pipeline steps:**
38-
1. `ingest` — reads frailty + demographic parquets, merges on `(pidp, wave)`, computes lagged frailty
39-
2. `cluster` — K-Means on per-individual frailty trajectories at ages 50-60 (k=3 by default)
40-
3. `estimate` — OLS of frailty on demographic controls, with cluster dummies; HC3 SEs and full diagnostic suite (joint F-test, Breusch-Pagan, Durbin-Watson, VIFs)
41-
4. `report` — generates summary tables and figures
59+
**Pipeline steps** (defined in [src/cli.py](src/cli.py)):
60+
61+
1. **`ingest`** — reads frailty + demographic parquets, merges on `(pidp, wave)`, renames keys, converts wave letters to integers, computes lagged frailty.
62+
2. **`cluster`** — K-Means on per-individual frailty trajectories at ages 50–60 (k=3 by default), labels mapped back to the full long panel.
63+
3. **`estimate`** — OLS of frailty on demographic controls, with cluster dummies. HC3 SEs, joint Wald F-test, Breusch–Pagan, Durbin–Watson, VIFs.
64+
4. **`report`** — generates 5 tables (CSV + booktabs `.tex`) and 11 figures, all driven by [src/analysis/figures.py](src/analysis/figures.py) and [src/analysis/tables.py](src/analysis/tables.py).
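
As a rough illustration of the `ingest` step, here is a minimal standard-library sketch of the wave-letter conversion and the lagged-frailty computation. It is not the code in `src/`. The field name `frailty`, the helper names `wave_to_int` and `add_lagged_frailty`, and the choice to define the lag as the same person's value in the immediately preceding wave (rather than their previous available observation) are all assumptions made for the example.

```python
from string import ascii_lowercase


def wave_to_int(letter: str) -> int:
    """Map a UKHLS wave letter to its ordinal: 'a' -> 1, ..., 'm' -> 13."""
    return ascii_lowercase.index(letter.lower()) + 1


def add_lagged_frailty(rows: list[dict]) -> list[dict]:
    """Return rows sorted by (pidp, wave), each with a 'frailty_lag' field.

    Assumption: the lag is the same person's frailty one wave earlier,
    and None when that wave is absent (first observation or a gap).
    """
    by_key = {(r["pidp"], r["wave"]): r["frailty"] for r in rows}
    ordered = sorted(rows, key=lambda r: (r["pidp"], r["wave"]))
    return [
        {**r, "frailty_lag": by_key.get((r["pidp"], r["wave"] - 1))}
        for r in ordered
    ]


# Toy long panel keyed on (pidp, wave); the values are made up.
panel = [
    {"pidp": 2, "wave": wave_to_int("b"), "frailty": 0.30},
    {"pidp": 1, "wave": wave_to_int("c"), "frailty": 0.15},
    {"pidp": 1, "wave": wave_to_int("a"), "frailty": 0.10},
    {"pidp": 1, "wave": wave_to_int("b"), "frailty": 0.12},
]
lagged = add_lagged_frailty(panel)
for row in lagged:
    print(row)
```

Keying the lookup on `(pidp, wave)` means a person's first wave, or a wave following a gap, gets `None` rather than a value borrowed from another individual.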
4265

4366
### Outputs
4467

45-
Tables (`output/tables/`):
46-
- `tab01_summary_stats.csv` — counts and means by health type
47-
- `tab02_frailty_by_wave.csv` — frailty by wave × health type
48-
- `tab03_main_regression.csv` — basic vs full OLS coefficients
49-
- `tab04_employment_by_type.csv` — employment / unemployment / inactivity shares
50-
- `tab05_education_by_type.csv` — educational attainment shares (ages 50-60)
68+
**Tables** (`output/tables/` — both `.csv` and `.tex`):
69+
70+
| File | Contents |
71+
|------|----------|
72+
| `tab01_summary_stats` | Counts, means, share-of-sample by health type |
73+
| `tab02_frailty_by_wave` | Mean frailty by wave × type |
74+
| `tab03_main_regression` | Stargazer-rendered basic vs. full OLS table |
75+
| `tab04_employment_by_type` | Employment / unemployment / inactivity shares |
76+
| `tab05_education_by_type` | Educational attainment shares (ages 50–60) |
77+
78+
**Figures** (`output/figures/`, all rendered with the custom paper style in
79+
[src/analysis/_style.py](src/analysis/_style.py)):
80+
81+
| File | Section | Contents |
82+
|------|---------|----------|
83+
| `fig01_frailty_trajectories.png` | Appx B | Frailty trajectories by wave |
84+
| `fig02_frailty_distribution.png` | Appx B | Within-cluster frailty distributions |
85+
| `fig03_cluster_diagnostics.png` | §3 | Elbow + silhouette twin-axis |
86+
| `fig04_frailty_by_age.png` | §4.1 | Binned mean frailty by age × type (headline) |
87+
| `fig05_employment_by_type.png` | §4.2 | Employment rate by age × type |
88+
| `fig06_earnings_by_type.png` | §4.2 | Mean hourly pay by age × type (95% CIs) |
89+
| `fig07_frailty_by_age_scatter.png` | §2 | Binned mean frailty by age, full panel |
90+
| `fig08_education_by_type.png` | §4.3 | Stacked-bar education shares by type |
91+
| `fig09_pay_by_education.png` | §4.3 | Mean hourly pay by qualification |
92+
| `fig10_healthcond_variables_by_wave.png` | §2 | UKHLS variable-family wave coverage |
93+
| `fig11_mortality_by_age.png` | §2 | Mortality (frailty=1) by age |
94+
95+
**Metrics:** `output/metrics/metrics.json` — appended per-run with run-id, git
96+
commit, cluster counts, regression diagnostics.
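
One way to append a per-run record is sketched below. The actual layout of `metrics.json` is not shown in this README, so the top-level JSON list, the key names (`run_id`, `timestamp`, `git_commit`), and the helpers `current_commit` and `append_metrics` are assumptions for illustration only.

```python
import json
import subprocess
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def current_commit() -> str:
    """Short hash of HEAD, or 'unknown' when git or a checkout is unavailable."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"],
            text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def append_metrics(path: Path, record: dict) -> dict:
    """Append one run's record to a JSON list stored at `path`."""
    path.parent.mkdir(parents=True, exist_ok=True)
    runs = json.loads(path.read_text()) if path.exists() else []
    entry = {
        "run_id": uuid.uuid4().hex,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_commit": current_commit(),
        **record,
    }
    runs.append(entry)
    path.write_text(json.dumps(runs, indent=2))
    return entry


# Demonstration in a throwaway directory so no real output is touched.
with tempfile.TemporaryDirectory() as tmp:
    metrics_path = Path(tmp) / "metrics" / "metrics.json"
    append_metrics(metrics_path, {"n_clusters": 3})
    append_metrics(metrics_path, {"n_clusters": 3})
    print(len(json.loads(metrics_path.read_text())), "runs recorded")
```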
97+
98+
**Log:** `output/logs/pipeline.log`.
5199

52-
Figures (`output/figures/`):
53-
- `fig01_frailty_trajectories.png` — frailty trajectories by wave
54-
- `fig02_frailty_distribution.png` — within-cluster frailty distributions
55-
- `fig03_cluster_diagnostics.png` — elbow and silhouette plots
56-
- `fig04_frailty_by_age.png` — binned mean frailty by age and health type
57-
- `fig05_employment_by_type.png` — employment rate by age and health type
58-
- `fig06_earnings_by_type.png` — mean hourly pay by age and health type with 95% CIs
100+
## Paper
101+
102+
- **LaTeX sources:** [paper/tex/](paper/tex/) — working-paper layout with title page (JEL J14, J21, J31, I12, C38; keywords), abstract, body sections 1–7, lettered appendices A–C.
103+
- **Build:** `./paper/build.sh` produces [paper/final/final.pdf](paper/final/final.pdf) (~21 pages, 1.5 MB). It auto-runs `./run.sh` first if any required figure/table is missing.
104+
- **Slide deck:** [paper/slides/seminar.tex](paper/slides/seminar.tex) is a 24-frame Beamer deck for a ~30-min seminar talk. Built via `./paper/slides/build.sh`, which produces [paper/final/slides/seminar.pdf](paper/final/slides/seminar.pdf). Uses the `primary-blue`/`primary-gold` palette from [Preambles/header.tex](paper/slides/header.tex).
59105

60-
Metrics: `output/metrics/metrics.json` (run-id, git commit, cluster counts, regression diagnostics)
61-
Log: `output/logs/pipeline.log`
106+
Both build scripts run the full pipeline first if any required artefact is
107+
missing, so a fresh clone with UKHLS data in place is one command away from a
108+
compiled PDF.
62109

63110
## Tests
64111

65112
```bash
66-
.venv/bin/python3 -m pytest tests/test_pipeline.py -v
113+
uv run pytest -q
67114
```
68115

69-
Uses a 200-individual subsample from the real frailty panel.
116+
The test suite ([tests/test_pipeline.py](tests/test_pipeline.py)) takes a
117+
200-individual subsample of the frailty panel and runs the pipeline
118+
end-to-end against it, asserting schema, no-duplicate keys, valid frailty
119+
range, and expected outputs. **When the UKHLS data isn't present**
120+
(e.g. in CI on a public runner) the suite cleanly skips every test instead
121+
of erroring, so the build stays green.
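
The skip-when-data-is-missing behaviour can be achieved in several ways. The sketch below shows one common pattern, a module-level `pytest.mark.skipif`, and is not necessarily what `tests/test_pipeline.py` does. The helper `data_missing` and the example test are hypothetical; the parquet path is the one listed in the data table.

```python
from pathlib import Path

import pytest

FRAILTY_PANEL = Path("data/raw/frailty_long_panel.parquet")


def data_missing(*paths: Path) -> bool:
    """True if at least one of the required input files is absent."""
    return not all(p.exists() for p in paths)


# A module-level `pytestmark` applies the marker to every test in the file,
# so the whole module is reported as skipped when the data is not there.
pytestmark = pytest.mark.skipif(
    data_missing(FRAILTY_PANEL),
    reason="restricted UKHLS data not available",
)


def test_panel_file_is_nonempty():
    # Hypothetical check; only runs when the parquet file is present.
    assert FRAILTY_PANEL.stat().st_size > 0
```

Skipped tests are still collected, which is why a run without data reports skips rather than errors.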
70122

71-
## Notes
123+
## Continuous integration
72124

73-
- Notebooks are archived in `notebooks/archive/` for provenance only; the pipeline does not depend on them.
125+
[.github/workflows/ci.yml](.github/workflows/ci.yml) runs a minimal smoke
126+
check on every push and PR, against Python 3.11 and 3.12:
74127

75-
## Paper
128+
1. `uv sync` — catches dependency drift / lock-file inconsistencies.
129+
2. Import smoke — imports every module under `src/` to catch syntax errors
130+
and broken cross-module references.
131+
3. `pytest -q` — collects the suite. Tests skip in CI because UKHLS data
132+
isn't available; what's verified is that pytest collection succeeds and
133+
the test code itself parses.
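
The import smoke in step 2 could be written with the standard library along the following lines. This is a sketch, not the contents of `ci.yml`: the function name `import_all` is hypothetical, and it assumes each package directory contains an `__init__.py` so that `pkgutil` can discover it.

```python
import importlib
import pkgutil


def import_all(package_name: str) -> list[str]:
    """Import a package and every module beneath it; return the names imported.

    Any SyntaxError or ImportError in a module propagates, which is what
    makes this useful as a smoke check.
    """
    package = importlib.import_module(package_name)
    imported = [package.__name__]
    prefix = package.__name__ + "."
    for info in pkgutil.walk_packages(package.__path__, prefix=prefix):
        importlib.import_module(info.name)
        imported.append(info.name)
    return imported


# Demonstrated on a standard-library package; in this repo the argument
# would presumably be "src".
for name in import_all("json"):
    print(name)
```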
134+
135+
The CI is honest about what it can and can't check given a restricted-data
136+
project: dependency / import / syntax health, not full reproduction.
137+
138+
## Notes
76139

77-
- LaTeX sources: `paper/tex/`
78-
- Compiled output: `paper/final/`
140+
- Notebooks are archived in [notebooks/archive/](notebooks/archive/) for
141+
provenance only; the pipeline does not depend on them.
142+
- All figures are generated with the custom matplotlib rcParams in
143+
[src/analysis/_style.py](src/analysis/_style.py) — no `plt.style.use(...)`
144+
of any built-in theme.
145+
- The regression LaTeX table is rendered via Stargazer in
146+
[src/pipeline/estimate.py](src/pipeline/estimate.py); descriptive tables
147+
use `pandas.to_latex()` wrapped in booktabs in
148+
[src/analysis/tables.py](src/analysis/tables.py).
