Commit 4bc915f

feat: KPI analysis
Fetch from APIs the KPIs that should be contained in the annual reports. We use a waterfall approach to XBRL tags; this should be better tested...
1 parent 81a30de commit 4bc915f

7 files changed

Lines changed: 1396 additions & 0 deletions

KPI_analysis/README.md

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
# KPI_analysis

Fetches annual consolidated KPIs (revenue, net income, total assets, capex, …)
for the companies we've mapped in `tickers_lists/`. Designed to produce one
point per company × year × KPI so downstream scripts can compare peers.

## Design

Hybrid pipeline. We pick the data source based on the listing exchange:

| Exchange | Source | Why |
| --- | --- | --- |
| NYSE, NYSE American (AMEX), Nasdaq (GS/GM/CM), Cboe | **SEC EDGAR** companyfacts | Free, no API key, full history (XBRL back to ~2009). |
| LSE, AIM, ASX, TSX, … | **yfinance** (fallback) | No SEC filings. yfinance covers ~4 recent fiscal years. |

EDGAR is the preferred source because it publishes structured XBRL for every
10-K filing, so we can pull 6+ years of history without hitting rate limits.
For non-US listings we fall back to yfinance (the same library we already use
in `tickers_lists/scripts/map_tickers.py`); coverage is shallower and the field
labels are less stable, but it's enough for the later years in our corpus.
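
The routing rule in the table can be sketched as a small lookup. This is an illustration only, not the code in `fetch_kpis.py`: the name `pick_source` and the exchange labels in `US_EXCHANGES` are assumptions, and the mapped CSVs may spell exchanges differently.

```python
# Hypothetical sketch of the exchange -> source routing described above.
# The labels below are assumptions; adjust them to whatever the mapped
# ticker lists actually contain.
US_EXCHANGES = {"NYSE", "NYSE AMERICAN", "AMEX", "NASDAQ", "CBOE"}


def pick_source(exchange: str) -> str:
    """Return "edgar" for US listings and "yfinance" for everything else."""
    normalized = exchange.strip().upper()
    # Nasdaq tiers such as "NASDAQ GS" share the same prefix.
    if normalized in US_EXCHANGES or normalized.startswith("NASDAQ"):
        return "edgar"
    return "yfinance"
```

Defaulting to yfinance for anything unrecognised keeps the rule conservative: an unknown label never triggers an EDGAR lookup for a company that has no SEC filings.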

## Files

```
KPI_analysis/
├── tags.py            # logical KPI -> candidate XBRL tags (ordered by preference)
├── edgar.py           # SEC EDGAR client (CIK lookup, companyfacts fetch, parsing)
├── yf_fallback.py     # yfinance fallback for non-US tickers
├── fetch_kpis.py      # orchestrator CLI; writes output/raw/{TICKER}.json
├── build_dataset.py   # consolidates output/raw/*.json into long + wide CSVs
├── cache/             # ticker->CIK map and cached SEC responses (gitignored)
└── output/
    ├── raw/           # one JSON per ticker
    ├── kpis_long.csv  # (ticker, year, kpi, value) long form
    ├── kpis_wide.csv  # (ticker, year) rows × KPI columns
    └── coverage.md    # coverage per KPI per year (tickers with data / tickers processed)
```

## Setup

The only new dependency is `requests`, which is already pulled in transitively
by `yfinance`. No extra `uv add` needed.

SEC requires every request to carry a descriptive `User-Agent` header with
contact info (see https://www.sec.gov/os/accessing-edgar-data). We default to
`"ardian-dataset-bench research (charles.moslonka@artefact.com)"`. Override by
exporting:

```bash
export SEC_USER_AGENT="Your Name your@email"
```

SEC throttles at 10 req/s; we self-limit to ~2 req/s. Responses are cached in
`KPI_analysis/cache/companyfacts/CIK*.json` so re-runs are local-disk cheap.
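
A minimal sketch of how the header override, the self-imposed rate limit, and the on-disk cache could fit together. It is not the actual `edgar.py`: `sec_headers`, `Throttle`, and `cached_fetch` are hypothetical names, and the network call is injected as a `fetch` callable so the sketch stays independent of any HTTP library.

```python
import json
import os
import time
from pathlib import Path
from typing import Callable, Optional

DEFAULT_USER_AGENT = "ardian-dataset-bench research (charles.moslonka@artefact.com)"
MIN_INTERVAL_S = 0.5  # roughly 2 requests per second


def sec_headers() -> dict:
    """Build request headers, honouring the SEC_USER_AGENT override."""
    return {"User-Agent": os.environ.get("SEC_USER_AGENT", DEFAULT_USER_AGENT)}


class Throttle:
    """Sleep as needed so consecutive calls are at least min_interval_s apart."""

    def __init__(self, min_interval_s: float = MIN_INTERVAL_S) -> None:
        self.min_interval_s = min_interval_s
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        if self._last_call is not None:
            remaining = self.min_interval_s - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def cached_fetch(
    cik: str, cache_dir: Path, fetch: Callable[[str], dict], throttle: Throttle
) -> dict:
    """Return the cached payload for a CIK, fetching (throttled) on a miss."""
    path = cache_dir / f"CIK{cik}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    throttle.wait()  # only network calls count against the rate limit
    payload = fetch(cik)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload), encoding="utf-8")
    return payload
```

Because cache hits return before `throttle.wait()`, a re-run over already-fetched tickers never sleeps.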

## Usage

```bash
# All companies in the selected industries (tickers_lists/grouped/selected/):
uv run python KPI_analysis/fetch_kpis.py --selected --years 2017-2022

# A single industry:
uv run python KPI_analysis/fetch_kpis.py --industry "Consumer Cyclical / Auto Parts"

# An explicit list:
uv run python KPI_analysis/fetch_kpis.py --tickers ORLY AZO GPC --years 2017-2022

# A whole cleaned CSV:
uv run python KPI_analysis/fetch_kpis.py --csv tickers_lists/cleaned/NYSE_mapped_clean_verified.csv

# Consolidate into CSVs:
uv run python KPI_analysis/build_dataset.py
```
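
For reference, a `--years 2017-2022` style argument can be expanded into a list of years as below. This is a sketch under assumptions: `parse_years` is a hypothetical helper, and the real CLI may parse the flag differently or accept other formats.

```python
def parse_years(spec: str) -> list:
    """Expand "2017-2022" into [2017, ..., 2022]; a single year is allowed."""
    start_text, sep, end_text = spec.partition("-")
    start = int(start_text)
    # Without a "-" separator the spec is a single year.
    end = int(end_text) if sep else start
    if end < start:
        raise ValueError(f"invalid year range: {spec!r}")
    return list(range(start, end + 1))
```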

Per-ticker JSON looks like:

```json
{
  "ticker": "ORLY",
  "company_name": "O'Reilly Automotive, Inc.",
  "exchange": "NASDAQ",
  "source": "edgar",
  "cik": "0000898173",
  "years": [2017, 2018, 2019, 2020, 2021, 2022],
  "kpis": {
    "revenue": {"2017": 8977726000.0, "2018": 9535866000.0, "...": "..."},
    "net_income": {"...": "..."}
  },
  "tag_used": {"revenue": "Revenues", "net_income": "NetIncomeLoss"}
}
```

`tag_used` records which XBRL tag we actually pulled from. Different filers
use different tags for the same line item (e.g. `Revenues` vs
`RevenueFromContractWithCustomerExcludingAssessedTax`), so keeping the audit
trail is worth the extra column.
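
The waterfall over candidate tags can be sketched as follows. This is a simplified illustration rather than the logic in `tags.py` or `edgar.py`: `REVENUE_TAGS` and `first_available_tag` are hypothetical, and the input is assumed to be the `us-gaap` section of a companyfacts payload, where each tag maps to a `units` dictionary of fact lists.

```python
from typing import Optional

# Hypothetical candidate list; the real ordering lives in tags.py.
REVENUE_TAGS = [
    "Revenues",
    "RevenueFromContractWithCustomerExcludingAssessedTax",
]


def first_available_tag(us_gaap_facts: dict, candidates: list) -> Optional[str]:
    """Return the first candidate tag that has at least one reported fact."""
    for tag in candidates:
        units = us_gaap_facts.get(tag, {}).get("units", {})
        # A tag counts as available if any unit (e.g. "USD") has entries.
        if any(units.values()):
            return tag
    return None
```

The returned tag is what would end up in `tag_used`. A filer that switched tags partway through the period would need the waterfall applied per year instead of once per KPI.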

## KPIs extracted

Defined in `tags.py`. The full list:

- **Income statement**: revenue, cost_of_revenue, gross_profit, rd_expense,
  sga_expense, operating_income, interest_expense, income_tax_expense,
  net_income, eps_basic, eps_diluted
- **Balance sheet**: total_assets, total_liabilities, stockholders_equity,
  cash_and_equivalents, long_term_debt, short_term_debt, inventory,
  accounts_receivable, accounts_payable, shares_outstanding
- **Cash flow**: operating_cash_flow, investing_cash_flow, financing_cash_flow,
  capex, depreciation_amortization, dividends_paid

Derived KPIs (EBITDA, margins, ROA, ROE, leverage ratios) are intentionally
*not* computed here; they're one-liners on the wide CSV and belong in the
analysis step, not the fetch step.
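
As an example of such a derived KPI, net margin can be computed from `kpis_wide.csv` with the standard library alone. The helper names are hypothetical; the sketch assumes only the `ticker`, `year`, `revenue`, and `net_income` columns, with empty cells meaning "no data".

```python
import csv
from typing import Optional


def to_float(cell: str) -> Optional[float]:
    """Empty cells in the wide CSV mean "no data"; map them to None."""
    return float(cell) if cell else None


def net_margin(row: dict) -> Optional[float]:
    """net_income / revenue, or None when either input is missing or zero."""
    revenue = to_float(row.get("revenue", ""))
    net_income = to_float(row.get("net_income", ""))
    if revenue is None or net_income is None or revenue == 0:
        return None
    return net_income / revenue


def margins_by_ticker_year(path: str) -> dict:
    """Map (ticker, year) -> net margin for every row of the wide CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        return {(r["ticker"], int(r["year"])): net_margin(r) for r in csv.DictReader(f)}
```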

## Known limitations

- **yfinance depth**: for non-US tickers, only ~4 years of annual data are
  usually available. If we need full 2017–2022 coverage for LSE/AIM, we'll
  need a paid API (Financial Modeling Prep, Alpha Vantage Premium, or
  EODHD).
- **Tag drift**: filers choose different XBRL tags for the same line item. Our
  candidate-tag lists cover the common cases, but some smaller filers will miss
  individual KPIs. `coverage.md` shows where.
- **Fiscal year ≠ calendar year**: the `fy` field in EDGAR is the filer's
  fiscal year, so an April-year-end company's FY2019 covers May 2018 – April
  2019. We keep that convention; downstream analysis should consider aligning
  to calendar year if needed.
- **Restatements**: we keep the most recently filed value for each (ticker,
fy), so amended 10-K/A filings supersede originals.
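
The restatement rule can be illustrated with a small sketch. It assumes each fact entry carries `fy`, `form`, `filed` (an ISO date string), and `val` fields; `latest_per_fiscal_year` is a hypothetical name, and the actual parsing in `edgar.py` may apply additional filters.

```python
def latest_per_fiscal_year(entries: list) -> dict:
    """Keep, for each fiscal year, the value from the most recently filed entry."""
    best = {}
    for entry in entries:
        # Only annual filings and their amendments are considered here.
        if entry.get("form") not in {"10-K", "10-K/A"}:
            continue
        fy = entry["fy"]
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if fy not in best or entry["filed"] > best[fy]["filed"]:
            best[fy] = entry
    return {fy: e["val"] for fy, e in sorted(best.items())}
```

For example, a FY2019 value filed in a 10-K in February 2020 would be replaced by the value from a 10-K/A filed in June 2020.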

KPI_analysis/build_dataset.py

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
"""Consolidate per-ticker JSONs into long/wide CSVs + a coverage report.

Reads KPI_analysis/output/raw/*.json (produced by fetch_kpis.py) and writes:
  - output/kpis_long.csv   : one row per (ticker, year, kpi)
  - output/kpis_wide.csv   : one row per (ticker, year), columns = KPI keys
  - output/coverage.md     : per-KPI coverage per year (tickers with data /
                             tickers processed), plus source counts and errors

Usage:
    uv run python KPI_analysis/build_dataset.py
"""

from __future__ import annotations

import csv
import json
from collections import defaultdict
from pathlib import Path

from tags import KPI_DEFS

HERE = Path(__file__).resolve().parent
RAW_DIR = HERE / "output" / "raw"
OUT_DIR = HERE / "output"
LONG_CSV = OUT_DIR / "kpis_long.csv"
WIDE_CSV = OUT_DIR / "kpis_wide.csv"
COVERAGE_MD = OUT_DIR / "coverage.md"


def load_records() -> list[dict]:
    """Load every per-ticker JSON in RAW_DIR, sorted by file name."""
    if not RAW_DIR.exists():
        return []
    return [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(RAW_DIR.glob("*.json"))
    ]


def write_long(records: list[dict]) -> int:
    """Write one row per (ticker, year, kpi) and return the row count."""
    rows = []
    for rec in records:
        for kpi, by_year in rec.get("kpis", {}).items():
            tag = rec.get("tag_used", {}).get(kpi, "")
            for year, val in by_year.items():
                rows.append(
                    {
                        "ticker": rec["ticker"],
                        "company_name": rec.get("company_name", ""),
                        "exchange": rec.get("exchange", ""),
                        "industry": rec.get("industry", ""),
                        "source": rec.get("source", ""),
                        "year": int(year),
                        "kpi": kpi,
                        "value": val,
                        "tag": tag,
                    }
                )
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    with LONG_CSV.open("w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(
            f,
            fieldnames=[
                "ticker",
                "company_name",
                "exchange",
                "industry",
                "source",
                "year",
                "kpi",
                "value",
                "tag",
            ],
        )
        w.writeheader()
        w.writerows(rows)
    return len(rows)


def write_wide(records: list[dict]) -> int:
    """Write one row per (ticker, year) with one column per KPI key."""
    kpi_keys = [k.key for k in KPI_DEFS]
    # (ticker, year) -> {kpi: value}
    cells: dict[tuple[str, int], dict[str, float]] = defaultdict(dict)
    meta: dict[str, dict[str, str]] = {}
    for rec in records:
        meta[rec["ticker"]] = {
            "company_name": rec.get("company_name", ""),
            "exchange": rec.get("exchange", ""),
            "industry": rec.get("industry", ""),
            "source": rec.get("source", ""),
        }
        for kpi, by_year in rec.get("kpis", {}).items():
            for year, val in by_year.items():
                cells[(rec["ticker"], int(year))][kpi] = val
    fieldnames = [
        "ticker",
        "year",
        "company_name",
        "exchange",
        "industry",
        "source",
        *kpi_keys,
    ]
    # Create the output directory here too, so the function works on its own.
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    with WIDE_CSV.open("w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for (ticker, year), kpis in sorted(cells.items()):
            row = {
                "ticker": ticker,
                "year": year,
                **meta.get(ticker, {}),
                # Missing KPIs become empty cells.
                **{k: kpis.get(k, "") for k in kpi_keys},
            }
            w.writerow(row)
    return len(cells)


def write_coverage(records: list[dict]) -> None:
    """Write coverage.md: source counts, per-KPI/year coverage, and errors."""
    years_seen: set[int] = set()
    per_kpi_year: dict[str, dict[int, int]] = defaultdict(lambda: defaultdict(int))
    # Data points per ticker; collected but not yet written to the report.
    per_ticker: dict[str, int] = {}
    errors: list[tuple[str, str]] = []
    source_counts: dict[str, int] = defaultdict(int)

    for rec in records:
        source_counts[rec.get("source", "unknown")] += 1
        if rec.get("error"):
            errors.append((rec["ticker"], rec["error"]))
        total = 0
        for kpi, by_year in rec.get("kpis", {}).items():
            for year in by_year:
                y = int(year)
                years_seen.add(y)
                per_kpi_year[kpi][y] += 1
                total += 1
        per_ticker[rec["ticker"]] = total

    years_sorted = sorted(years_seen)
    n_tickers = len(records)
    lines: list[str] = []
    lines.append("# KPI coverage\n")
    lines.append(f"- Tickers processed: **{n_tickers}**")
    for src, n in sorted(source_counts.items()):
        lines.append(f" - {src}: {n}")
    lines.append(f"- Errors: **{len(errors)}**\n")

    lines.append("## Per-KPI coverage (tickers with data for each year)\n")
    header = "| KPI | " + " | ".join(str(y) for y in years_sorted) + " |"
    sep = "| --- |" + "|".join(["---"] * len(years_sorted)) + "|"
    lines.append(header)
    lines.append(sep)
    for kpi in KPI_DEFS:
        counts = per_kpi_year.get(kpi.key, {})
        cells = " | ".join(
            f"{counts.get(y, 0)}/{n_tickers}" for y in years_sorted
        )
        lines.append(f"| `{kpi.key}` | {cells} |")

    if errors:
        lines.append("\n## Errors\n")
        # Cap the listing so one bad run does not flood the report.
        for t, e in errors[:50]:
            lines.append(f"- `{t}`: {e}")
        if len(errors) > 50:
            lines.append(f"- ... and {len(errors) - 50} more")

    OUT_DIR.mkdir(parents=True, exist_ok=True)
    COVERAGE_MD.write_text("\n".join(lines) + "\n", encoding="utf-8")


def main() -> int:
    records = load_records()
    if not records:
        print(f"No records found in {RAW_DIR}. Run fetch_kpis.py first.")
        return 1
    n_long = write_long(records)
    n_wide = write_wide(records)
    write_coverage(records)
    print(
        f"Wrote {LONG_CSV.name} ({n_long} rows), {WIDE_CSV.name} ({n_wide} rows), "
        f"{COVERAGE_MD.name}"
    )
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
