artefactory
diff --git a/KPI_analysis/README.md b/KPI_analysis/README.md
Lines changed: 129 additions & 0 deletions
diff --git a/KPI_analysis/build_dataset.py b/KPI_analysis/build_dataset.py
Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,129 @@
+# KPI_analysis
+
+Fetches annual consolidated KPIs (revenue, net income, total assets, capex, …)
+for the companies we've mapped in `tickers_lists/`. Designed to produce one
+point per company × year × KPI so downstream scripts can compare peers.
+
+## Design
+
+Hybrid pipeline — we pick the data source based on the listing exchange:
+
+| Exchange                                              | Source                   | Why                                                  |
+| ----------------------------------------------------- | ------------------------ | ---------------------------------------------------- |
+| NYSE, NYSE American (AMEX), Nasdaq (GS/GM/CM), Cboe   | **SEC EDGAR** companyfacts | Free, no API key, full history (XBRL back to ~2009). |
+| LSE, AIM, ASX, TSX, …                                 | **yfinance** (fallback)  | No SEC filings. yfinance covers ~4 recent fiscal years. |
+
+EDGAR is the preferred source because it publishes structured XBRL for every
+10-K filing, so we can pull 6+ years of history well within SEC's rate limit.
+For non-US listings we fall back to yfinance (the same library we already use
+in `tickers_lists/scripts/map_tickers.py`); coverage is shallower and the field
+labels are less stable, but it's enough for the later years in our corpus.
+
+## Files
+
+```
+KPI_analysis/
+├── tags.py            # logical KPI -> candidate XBRL tags (ordered by preference)
+├── edgar.py           # SEC EDGAR client (CIK lookup, companyfacts fetch, parsing)
+├── yf_fallback.py     # yfinance fallback for non-US tickers
+├── fetch_kpis.py      # orchestrator CLI; writes output/raw/{TICKER}.json
+├── build_dataset.py   # consolidates output/raw/*.json into long + wide CSVs
+├── cache/             # ticker->CIK map and cached SEC responses (gitignored)
+└── output/
+    ├── raw/           # one JSON per ticker
+    ├── kpis_long.csv  # (ticker, year, kpi, value) long form
+    ├── kpis_wide.csv  # (ticker, year) rows × KPI columns
+    └── coverage.md    # per-KPI coverage per year (tickers with data / tickers processed)
+```
+
+## Setup
+
+The only new dependency is `requests`, which is already pulled in transitively
+by `yfinance`. No extra `uv add` needed.
+
+SEC requires every request to carry a descriptive `User-Agent` header with
+contact info (see https://www.sec.gov/os/accessing-edgar-data). We default to
+`"ardian-dataset-bench research (charles.moslonka@artefact.com)"`. Override by
+exporting:
+
+```bash
+export SEC_USER_AGENT="Your Name your@email"
+```
+
+SEC throttles at 10 req/s; we self-limit to ~2 req/s. Responses are cached in
+`KPI_analysis/cache/companyfacts/CIK*.json`, so re-runs read from local disk.
+
+## Usage
+
+```bash
+# All companies in the selected industries (tickers_lists/grouped/selected/):
+uv run python KPI_analysis/fetch_kpis.py --selected --years 2017-2022
+
+# A single industry:
+uv run python KPI_analysis/fetch_kpis.py --industry "Consumer Cyclical / Auto Parts"
+
+# An explicit list:
+uv run python KPI_analysis/fetch_kpis.py --tickers ORLY AZO GPC --years 2017-2022
+
+# A whole cleaned CSV:
+uv run python KPI_analysis/fetch_kpis.py --csv tickers_lists/cleaned/NYSE_mapped_clean_verified.csv
+
+# Consolidate into CSVs:
+uv run python KPI_analysis/build_dataset.py
+```
+
+Per-ticker JSON looks like:
+
+```json
+{
+  "ticker": "ORLY",
+  "company_name": "O'Reilly Automotive, Inc.",
+  "exchange": "NASDAQ",
+  "source": "edgar",
+  "cik": "0000898173",
+  "years": [2017, 2018, 2019, 2020, 2021, 2022],
+  "kpis": {
+    "revenue": {"2017": 8977726000.0, "2018": 9535866000.0, "...": "..."},
+    "net_income": {"...": "..."}
+  },
+  "tag_used": {"revenue": "Revenues", "net_income": "NetIncomeLoss"}
+}
+```
+
+`tag_used` records which XBRL tag we actually pulled from — different filers
+use different tags for the same line item (e.g. `Revenues` vs
+`RevenueFromContractWithCustomerExcludingAssessedTax`), so keeping the audit
+trail is worth the extra column.
+
+## KPIs extracted
+
+Defined in `tags.py`. The full list:
+
+- **Income statement**: revenue, cost_of_revenue, gross_profit, rd_expense,
+  sga_expense, operating_income, interest_expense, income_tax_expense,
+  net_income, eps_basic, eps_diluted
+- **Balance sheet**: total_assets, total_liabilities, stockholders_equity,
+  cash_and_equivalents, long_term_debt, short_term_debt, inventory,
+  accounts_receivable, accounts_payable, shares_outstanding
+- **Cash flow**: operating_cash_flow, investing_cash_flow, financing_cash_flow,
+  capex, depreciation_amortization, dividends_paid
+
+Derived KPIs (EBITDA, margins, ROA, ROE, leverage ratios) are intentionally
+*not* computed here — they're one-liners on the wide CSV and belong in the
+analysis step, not the fetch step.
+
+## Known limitations
+
+- **yfinance depth**: for non-US tickers, only ~4 years of annual data are
+  usually available. If we need full 2017–2022 coverage for LSE/AIM, we'll
+  need a paid API (Financial Modeling Prep, Alpha Vantage Premium, or
+  EODHD).
+- **Tag drift**: XBRL tags are filer-specific. Our candidate-tag lists cover
+  the common cases but some smaller filers will miss individual KPIs.
+  `coverage.md` shows where.
+- **Fiscal year ≠ calendar year**: the `fy` field in EDGAR is the filer's
+  fiscal year, so an April-year-end company's FY2019 covers May 2018 – April
+  2019. We keep that convention; downstream analysis should consider aligning
+  to calendar year if needed.
+- **Restatements**: we keep the most recently filed value for each (ticker,
+  fy), so amended 10-K/A filings supersede originals.
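
The restatement rule in the last bullet can be sketched as follows. This is a minimal illustration, not the code in `edgar.py`: the function name `latest_filed_by_fy` and the sample facts are hypothetical, and it assumes each fact is a dict with `fy`, `fp`, `form`, `filed`, and `val` keys. Real companyfacts payloads may also carry prior-year comparatives under the same `fy`, which would likely need extra filtering on the period `end` date.

```python
from __future__ import annotations


def latest_filed_by_fy(facts: list[dict]) -> dict[int, float]:
    """Keep, for each fiscal year, the value from the most recently filed fact.

    Only annual facts (fp == "FY") from 10-K or 10-K/A filings are considered,
    so an amended 10-K/A filed later replaces the original 10-K value.
    """
    best: dict[int, dict] = {}
    for fact in facts:
        if fact.get("form") not in {"10-K", "10-K/A"} or fact.get("fp") != "FY":
            continue
        fy = fact["fy"]
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if fy not in best or fact["filed"] > best[fy]["filed"]:
            best[fy] = fact
    return {fy: fact["val"] for fy, fact in sorted(best.items())}


# Made-up facts for illustration only (not real filings).
sample_facts = [
    {"fy": 2019, "fp": "FY", "form": "10-K", "filed": "2020-02-28", "val": 100.0},
    {"fy": 2019, "fp": "FY", "form": "10-K/A", "filed": "2020-06-15", "val": 95.0},
    {"fy": 2020, "fp": "Q1", "form": "10-Q", "filed": "2020-05-01", "val": 30.0},
    {"fy": 2020, "fp": "FY", "form": "10-K", "filed": "2021-02-26", "val": 120.0},
]
print(latest_filed_by_fy(sample_facts))  # {2019: 95.0, 2020: 120.0}
```

Comparing `filed` dates rather than form types means a second amendment would also win over the first, which matches the "most recently filed" wording above.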
@@ -0,0 +1,180 @@
+"""Consolidate per-ticker JSONs into long/wide CSVs + a coverage report.
+
+Reads KPI_analysis/output/raw/*.json (produced by fetch_kpis.py) and writes:
+  - output/kpis_long.csv   : one row per (ticker, year, kpi)
+  - output/kpis_wide.csv   : one row per (ticker, year), columns = KPI keys
+  - output/coverage.md     : per-KPI coverage counts per year, plus source counts and errors
+
+Usage:
+  uv run python KPI_analysis/build_dataset.py
+"""
+
+from __future__ import annotations
+
+import csv
+import json
+from collections import defaultdict
+from pathlib import Path
+
+from tags import KPI_DEFS
+
+HERE = Path(__file__).resolve().parent
+RAW_DIR = HERE / "output" / "raw"
+OUT_DIR = HERE / "output"
+LONG_CSV = OUT_DIR / "kpis_long.csv"
+WIDE_CSV = OUT_DIR / "kpis_wide.csv"
+COVERAGE_MD = OUT_DIR / "coverage.md"
+
+
+def load_records() -> list[dict]:
+    if not RAW_DIR.exists():
+        return []
+    return [
+        json.loads(p.read_text(encoding="utf-8"))
+        for p in sorted(RAW_DIR.glob("*.json"))
+    ]
+
+
+def write_long(records: list[dict]) -> int:
+    """Write one row per (ticker, year, kpi); return the number of rows."""
+    rows = []
+    for rec in records:
+        for kpi, by_year in rec.get("kpis", {}).items():
+            tag = rec.get("tag_used", {}).get(kpi, "")
+            for year, val in by_year.items():
+                rows.append(
+                    {
+                        "ticker": rec["ticker"],
+                        "company_name": rec.get("company_name", ""),
+                        "exchange": rec.get("exchange", ""),
+                        "industry": rec.get("industry", ""),
+                        "source": rec.get("source", ""),
+                        "year": int(year),
+                        "kpi": kpi,
+                        "value": val,
+                        "tag": tag,
+                    }
+                )
+    OUT_DIR.mkdir(parents=True, exist_ok=True)
+    with LONG_CSV.open("w", newline="", encoding="utf-8") as f:
+        w = csv.DictWriter(
+            f,
+            fieldnames=[
+                "ticker",
+                "company_name",
+                "exchange",
+                "industry",
+                "source",
+                "year",
+                "kpi",
+                "value",
+                "tag",
+            ],
+        )
+        w.writeheader()
+        w.writerows(rows)
+    return len(rows)
+
+
+def write_wide(records: list[dict]) -> int:
+    """Write one row per (ticker, year) with a column per KPI; return row count."""
+    kpi_keys = [k.key for k in KPI_DEFS]
+    # (ticker, year) -> {kpi: value}
+    cells: dict[tuple[str, int], dict[str, float]] = defaultdict(dict)
+    meta: dict[str, dict[str, str]] = {}
+    for rec in records:
+        meta[rec["ticker"]] = {
+            "company_name": rec.get("company_name", ""),
+            "exchange": rec.get("exchange", ""),
+            "industry": rec.get("industry", ""),
+            "source": rec.get("source", ""),
+        }
+        for kpi, by_year in rec.get("kpis", {}).items():
+            for year, val in by_year.items():
+                cells[(rec["ticker"], int(year))][kpi] = val
+    fieldnames = [
+        "ticker",
+        "year",
+        "company_name",
+        "exchange",
+        "industry",
+        "source",
+        *kpi_keys,
+    ]
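+    # Create the output directory here too, so this works without write_long().
+    OUT_DIR.mkdir(parents=True, exist_ok=True)
+    with WIDE_CSV.open("w", newline="", encoding="utf-8") as f: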
+        w = csv.DictWriter(f, fieldnames=fieldnames)
+        w.writeheader()
+        for (ticker, year), kpis in sorted(cells.items()):
+            row = {
+                "ticker": ticker,
+                "year": year,
+                **meta.get(ticker, {}),
+                **{k: kpis.get(k, "") for k in kpi_keys},
+            }
+            w.writerow(row)
+    return len(cells)
+
+
+def write_coverage(records: list[dict]) -> None:
+    """Write coverage.md: per-KPI ticker counts per year, source counts, errors."""
+    years_seen: set[int] = set()
+    # kpi -> year -> number of tickers with a value for that year
+    per_kpi_year: dict[str, dict[int, int]] = defaultdict(lambda: defaultdict(int))
+    errors: list[tuple[str, str]] = []
+    source_counts: dict[str, int] = defaultdict(int)
+
+    for rec in records:
+        source_counts[rec.get("source", "unknown")] += 1
+        if rec.get("error"):
+            errors.append((rec["ticker"], rec["error"]))
+        for kpi, by_year in rec.get("kpis", {}).items():
+            for year in by_year:
+                y = int(year)
+                years_seen.add(y)
+                per_kpi_year[kpi][y] += 1
+
+    years_sorted = sorted(years_seen)
+    n_tickers = len(records)
+    lines: list[str] = []
+    lines.append("# KPI coverage\n")
+    lines.append(f"- Tickers processed: **{n_tickers}**")
+    for src, n in sorted(source_counts.items()):
+        lines.append(f"  - {src}: {n}")
+    lines.append(f"- Errors: **{len(errors)}**\n")
+
+    lines.append("## Per-KPI coverage (tickers with data for each year)\n")
+    header = "| KPI | " + " | ".join(str(y) for y in years_sorted) + " |"
+    sep = "| --- | " + " | ".join(["---"] * len(years_sorted)) + " |"
+    lines.append(header)
+    lines.append(sep)
+    for kpi in KPI_DEFS:
+        counts = per_kpi_year.get(kpi.key, {})
+        cells = " | ".join(
+            f"{counts.get(y, 0)}/{n_tickers}" for y in years_sorted
+        )
+        lines.append(f"| `{kpi.key}` | {cells} |")
+
+    if errors:
+        lines.append("\n## Errors\n")
+        for t, e in errors[:50]:
+            lines.append(f"- `{t}`: {e}")
+        if len(errors) > 50:
+            lines.append(f"- ... and {len(errors) - 50} more")
+
+    OUT_DIR.mkdir(parents=True, exist_ok=True)
+    COVERAGE_MD.write_text("\n".join(lines) + "\n", encoding="utf-8")
+
+
+def main() -> int:
+    records = load_records()
+    if not records:
+        print(f"No records found in {RAW_DIR}. Run fetch_kpis.py first.")
+        return 1
+    n_long = write_long(records)
+    n_wide = write_wide(records)
+    write_coverage(records)
+    print(
+        f"Wrote {LONG_CSV.name} ({n_long} rows), {WIDE_CSV.name} ({n_wide} rows), "
+        f"{COVERAGE_MD.name}"
+    )
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
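
The README notes that derived KPIs are left to the analysis step because they are one-liners on the wide CSV. A minimal sketch of what that step might look like, using only the standard library: the helper names (`to_float`, `ratio`, `derive`) and the sample rows are hypothetical, and it assumes the wide CSV columns are named `revenue`, `net_income`, and `total_assets`, matching the KPI keys listed in the README.

```python
from __future__ import annotations

import csv
import io


def to_float(cell: str) -> float | None:
    """Parse a CSV cell; write_wide() leaves missing KPIs as empty strings."""
    return float(cell) if cell else None


def ratio(num: float | None, den: float | None) -> float | None:
    """Return num / den, or None when either side is missing or den is zero."""
    if num is None or den is None or den == 0:
        return None
    return num / den


def derive(row: dict[str, str]) -> dict[str, float | None]:
    """Compute net margin and return on assets for one (ticker, year) row."""
    net_income = to_float(row.get("net_income", ""))
    return {
        "net_margin": ratio(net_income, to_float(row.get("revenue", ""))),
        "roa": ratio(net_income, to_float(row.get("total_assets", ""))),
    }


# Made-up numbers standing in for output/kpis_wide.csv.
sample_csv = """ticker,year,revenue,net_income,total_assets
AAA,2021,200.0,50.0,400.0
BBB,2021,,10.0,80.0
"""
derived = [derive(row) for row in csv.DictReader(io.StringIO(sample_csv))]
print(derived[0])  # {'net_margin': 0.25, 'roa': 0.125}
```

Returning `None` for a missing input keeps gaps visible in the derived table instead of silently turning them into zeros.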