Add central reusable-code catalog with explicit overlap counts and repo-onboarding workflow by Copilot · Pull Request #2 · nhsengland/reusable-code-audit

Copilot · 2026-06-05T18:01:00Z

This PR restructures the audit output into a single extensible catalog model. It addresses the three gaps identified: no central index, no explicit overlap counts/groupings, and no defined process for adding new repositories against existing overlaps.

Central source of truth (catalog.yml)
- Added a root-level machine-readable catalog covering:
  - metadata (total_projects_audited, total_snippets, last_updated)
  - all 7 themes with explicit overlap_count + project lists
  - all 14 snippets with stable IDs, source attribution, overlap counts, reuse candidacy, and suggested module targets
Human-readable catalog (catalog.md)
- Added a root-level rendered catalog as primary review artifact:
  - summary stats + most-overlapped theme
  - theme overlap summary table (counts, projects, top candidate)
  - per-theme sections with project links, overlap explanation, snippet inventory
  - inline raw code for every snippet (self-contained inspection)
  - cross-project overlap matrix (21 projects × 7 themes) using numeric counts
  - flat reuse-candidate table across all snippets
Extensibility workflow (ADDING-NEW-REPOS.md)
- Added a concrete 5-step process for onboarding new repos:
  1. profile repo
  2. compare to existing catalog entries
  3. add/update snippets
  4. regenerate catalog.md sections and matrix
  5. update audit-report.md
- Included copy/paste YAML template + operational checklist
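Steps 2 and 3 of this process can be sketched in Python. This is a minimal illustration, not the procedure documented in ADDING-NEW-REPOS.md: it assumes catalog.yml has already been parsed into a plain dict, and the `register_project` helper and the `P99` project ID are hypothetical names introduced here.

```python
import copy

# Trimmed stand-in for the parsed catalog.yml; only the fields used here.
CATALOG = {
    "themes": [
        {"id": "synthetic-data", "overlap_count": 4,
         "projects": ["P12", "P21", "P41", "P11"]},
        {"id": "nlp-text-processing", "overlap_count": 6,
         "projects": ["P72", "P71", "P51", "P33", "P23", "P43"]},
    ],
}


def register_project(catalog: dict, project_id: str, theme_ids: list[str]) -> dict:
    """Return a copy of the catalog with project_id added to the given themes."""
    updated = copy.deepcopy(catalog)
    themes = {theme["id"]: theme for theme in updated["themes"]}
    unknown = [tid for tid in theme_ids if tid not in themes]
    if unknown:
        # A genuinely new theme needs its own themes entry; out of scope here.
        raise ValueError(f"themes not in catalog: {unknown}")
    for tid in theme_ids:
        theme = themes[tid]
        if project_id not in theme["projects"]:
            theme["projects"].append(project_id)
        # Keep the explicit count in step with the project list.
        theme["overlap_count"] = len(theme["projects"])
    return updated


# "P99" is a made-up project ID used only for illustration.
new_catalog = register_project(CATALOG, "P99", ["synthetic-data"])
print(new_catalog["themes"][0]["overlap_count"])  # 5
```

Returning a copy rather than mutating in place keeps the original catalog available for diffing before catalog.md is regenerated.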
Entry-point docs refresh
- Updated root README.md with a new Catalog section near the top linking catalog.md and catalog.yml with headline stats.
- Replaced code-snippets/README.md with catalog-first guidance, folder structure, and per-folder snippet counts.
Repository hygiene
- Added .gitignore rules for __pycache__/ and *.pyc.
- Removed committed __pycache__ artifacts under code-snippets/.

# catalog.yml (example entry shape now used for all snippets)
- id: SD-001
  title: "Synthetic data distribution metrics"
  theme: synthetic-data
  file: "code-snippets/synthetic-data/distribution_metrics.py"
  source_projects: [P12, P21]
  overlap_count: 2
  reuse_candidate: true
  suggested_library_module: "nhs_reuse.synthetic.metrics"
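
Tooling that consumes this entry shape may benefit from a typed wrapper. The following is a minimal sketch, assuming the entry has already been parsed from YAML into a plain dict. `SnippetEntry` and `from_dict` are hypothetical names and are not part of this PR.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class SnippetEntry:
    """Hypothetical typed view of one catalog.yml snippet entry."""
    id: str
    title: str
    theme: str
    file: str
    source_projects: tuple[str, ...]
    overlap_count: int
    reuse_candidate: bool
    suggested_library_module: str

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "SnippetEntry":
        names = [f.name for f in fields(cls)]
        missing = [name for name in names if name not in raw]
        if missing:
            raise ValueError(f"entry {raw.get('id', '?')} is missing: {missing}")
        # Extra keys (for example source_repos, reuse_notes) are ignored here.
        values = {name: raw[name] for name in names}
        values["source_projects"] = tuple(values["source_projects"])
        return cls(**values)


# The example entry above, as it would look after YAML parsing.
RAW_ENTRY = {
    "id": "SD-001",
    "title": "Synthetic data distribution metrics",
    "theme": "synthetic-data",
    "file": "code-snippets/synthetic-data/distribution_metrics.py",
    "source_projects": ["P12", "P21"],
    "overlap_count": 2,
    "reuse_candidate": True,
    "suggested_library_module": "nhs_reuse.synthetic.metrics",
}

entry = SnippetEntry.from_dict(RAW_ENTRY)
print(entry.id, entry.overlap_count)  # SD-001 2
```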

Original prompt

Context

The nhsengland/reusable-code-audit repository has already been set up with:

- audit-report.md: a themed audit report
- code-snippets/: 7 themed folders, each with 2 snippet files and a README
- A reuse opportunity matrix in the audit report (tick-based, not count-based)

The branch copilot/setup-reusable-code-audit contains all of this work.

The user has identified three gaps they want addressed:

- No central catalog: everything is split across 7 folder READMEs with no single index
- Overlap numbers are not explicit: the matrix uses ✓ marks but doesn't show counts or clearly group which projects overlap on what
- Not extensible: there is no defined process for "bring in a new repo, compare against existing catalog, add new overlaps"

Your Task

Work on the existing branch copilot/setup-reusable-code-audit. Do NOT merge to main.

1. Create `catalog.yml`

Create a machine-readable YAML catalog at the root of the repo: catalog.yml

This is the single source of truth for all code snippets. Structure it as follows:

meta:
  last_updated: "2025-06-05"
  total_projects_audited: 21
  total_snippets: 14  # update to match actual count

themes:
  - id: synthetic-data
    label: "Synthetic Data"
    overlap_count: 4  # number of projects with this theme
    projects: [P12, P21, P41, P11]
  - id: nlp-text-processing
    label: "NLP / Text Processing"
    overlap_count: 6
    projects: [P72, P71, P51, P33, P23, P43]
  # ... etc for all 7 themes

snippets:
  - id: SD-001
    title: "Synthetic data distribution metrics"
    theme: synthetic-data
    file: "code-snippets/synthetic-data/distribution_metrics.py"
    source_projects: [P12, P21]
    source_repos:
      - url: "https://github.com/nhsx/SynthVAE"
        file: "metrics.py"
    overlap_count: 2
    reuse_candidate: true
    reuse_notes: "Core evaluation pattern repeated in both SynthVAE versions. Could be a standalone nhsengland metrics package."
    suggested_library_module: "nhs_reuse.synthetic.metrics"

  - id: SD-002
    title: "Experiment I/O helpers"
    theme: synthetic-data
    file: "code-snippets/synthetic-data/experiment_io_helpers.py"
    source_projects: [P41]
    source_repos:
      - url: "https://github.com/nhsengland/NHSSynth"
        file: "src/nhssynth/common/io.py"
    overlap_count: 1
    reuse_candidate: true
    reuse_notes: "Path/file management utilities for experiments — patterns like this appear independently across many projects."
    suggested_library_module: "nhs_reuse.common.io"

  # Continue for ALL 14 snippets across all 7 themes:
  # NLP: NLP-001 (prepare_notes_for_lm.py), NLP-002 (clean_text_pipeline.R)
  # MM: MM-001 (embedding_dataset_wrappers.py), MM-002 (feature_name_mapping.py)
  # EVAL: EVAL-001 (bleu_precision_evaluator.py), EVAL-002 (retrieval_metrics.py)
  # XAI: XAI-001 (fairness_bootstrap_ci.py), XAI-002 (mcr_feature_grouping.py)
  # GRAPH: GRAPH-001 (degree_centrality.py), GRAPH-002 (overlap_coefficient.py)
  # DP: DP-001 (create_holdout_split.py), DP-002 (setup_input_json.py)

Fill in ALL snippets accurately based on the actual files already in the repo and the source attributions in their header comments.
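
A catalog in this shape lends itself to simple consistency checks. The sketch below is one way to do that, assuming the YAML has been parsed into a dict. It relies on two comments in the structure above: `total_snippets` should match the actual count, and a theme's `overlap_count` is the number of projects with that theme. The `check_catalog` function is a hypothetical helper, and the demo data is trimmed to two snippets.

```python
def check_catalog(catalog: dict) -> list[str]:
    """Return a list of human-readable consistency problems (empty if none)."""
    problems = []
    snippets = catalog["snippets"]

    declared = catalog["meta"]["total_snippets"]
    if declared != len(snippets):
        problems.append(f"meta.total_snippets is {declared}, found {len(snippets)}")

    theme_ids = set()
    for theme in catalog["themes"]:
        theme_ids.add(theme["id"])
        if theme["overlap_count"] != len(theme["projects"]):
            problems.append(f"theme {theme['id']}: overlap_count does not match projects")

    seen_ids = set()
    for snippet in snippets:
        if snippet["id"] in seen_ids:
            problems.append(f"duplicate snippet id {snippet['id']}")
        seen_ids.add(snippet["id"])
        if snippet["theme"] not in theme_ids:
            problems.append(f"snippet {snippet['id']}: unknown theme {snippet['theme']}")
    return problems


# Trimmed demo catalog (two snippets only, so total_snippets is 2 here).
DEMO = {
    "meta": {"last_updated": "2025-06-05", "total_projects_audited": 21, "total_snippets": 2},
    "themes": [
        {"id": "synthetic-data", "label": "Synthetic Data",
         "overlap_count": 4, "projects": ["P12", "P21", "P41", "P11"]},
    ],
    "snippets": [
        {"id": "SD-001", "theme": "synthetic-data", "source_projects": ["P12", "P21"]},
        {"id": "SD-002", "theme": "synthetic-data", "source_projects": ["P41"]},
    ],
}

print(check_catalog(DEMO))  # []
```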

2. Create `catalog.md`

Create a human-readable rendered catalog at the root: catalog.md

This should be the go-to document for understanding overlaps. Structure:

Header

Title, purpose, date
Summary stats box:
- Total projects audited: 21
- Total themes identified: 7
- Total snippet files: N
- Total reuse candidates: N
- Most overlapped theme: X (N projects)
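
The summary stats could be derived from the parsed catalog rather than written by hand. This is a minimal sketch under that assumption. `summary_stats` is a hypothetical helper, and the demo data contains only the two themes and two snippets spelled out in the catalog.yml structure above.

```python
def summary_stats(catalog: dict) -> dict:
    """Compute the summary-box values from a parsed catalog dict."""
    themes = catalog["themes"]
    snippets = catalog["snippets"]
    # The theme with the highest explicit overlap_count.
    top = max(themes, key=lambda theme: theme["overlap_count"])
    return {
        "total_projects_audited": catalog["meta"]["total_projects_audited"],
        "total_themes": len(themes),
        "total_snippet_files": len(snippets),
        "total_reuse_candidates": sum(1 for s in snippets if s["reuse_candidate"]),
        "most_overlapped_theme": f"{top['label']} ({top['overlap_count']} projects)",
    }


# Trimmed demo data; a real catalog would list all themes and snippets.
DEMO = {
    "meta": {"total_projects_audited": 21},
    "themes": [
        {"id": "synthetic-data", "label": "Synthetic Data", "overlap_count": 4},
        {"id": "nlp-text-processing", "label": "NLP / Text Processing", "overlap_count": 6},
    ],
    "snippets": [
        {"id": "SD-001", "reuse_candidate": True},
        {"id": "SD-002", "reuse_candidate": True},
    ],
}

stats = summary_stats(DEMO)
print(stats["most_overlapped_theme"])  # NLP / Text Processing (6 projects)
```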

Overlap Summary Table

A table with one row per theme:

| Theme | Projects with overlap | Overlap count | Snippet files | Top reuse candidate |
| --- | --- | --- | --- | --- |
| Synthetic Data | P12, P21, P41, P11 | 4 | 2 | SD-001 |
| NLP / Text Processing | P72, P71, P51, P33, P23, P43 | 6 | 2 | NLP-001 |
| ... | | | | |

Per-Theme Sections

For each theme, include:

- Which projects overlap (with links to their repos)
- What the overlap is: a plain-English description of the repeated pattern
- Snippet inventory, as a sub-table:

| ID | File | Source projects | Reuse candidate | Suggested module |
| --- | --- | --- | --- | --- |
| SD-001 | distribution_metrics.py | P12, P21 | ✅ | nhs_reuse.synthetic.metrics |

- Raw snippet inline: embed the actual code of each snippet using fenced code blocks so it can be read without navigating to the file

Cross-Project Overlap Map

A matrix table showing ALL 21 projects vs ALL 7 themes, with the count of snippets rather than just a tick mark, e.g.:

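| Project | Synthetic | NLP | Multimodal | Graph | Eval | Explainability | Pipelines | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| P82 | 0 | 1 | 0 | 0 | 2 | 0 | 1 | 4 |
| P81 | 0 | 0 | 2 | 0 | 1 | 1 | 1 | 5 |
| ... | | | | | | | | |
| Total snippets | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 14 |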
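
A count-based matrix like this could be generated from the snippet entries. The sketch below assumes that a cell is the number of snippets in a theme that list the project in `source_projects`. That reading is an assumption, not something the prompt states. `overlap_matrix` is a hypothetical helper, and the demo uses only the two synthetic-data snippets.

```python
from collections import Counter


def overlap_matrix(snippets: list[dict], projects: list[str], themes: list[str]) -> list[list]:
    """Build rows of [project, count per theme..., row total] plus a totals row."""
    cell = Counter()
    for snippet in snippets:
        for project in snippet["source_projects"]:
            cell[(project, snippet["theme"])] += 1

    rows = []
    for project in projects:
        counts = [cell[(project, theme)] for theme in themes]
        rows.append([project, *counts, sum(counts)])

    # The last row counts snippets per theme, not column sums: a snippet with
    # two source projects appears in two cells but is still one snippet.
    per_theme = Counter(snippet["theme"] for snippet in snippets)
    totals = [per_theme[theme] for theme in themes]
    rows.append(["Total snippets", *totals, sum(totals)])
    return rows


SNIPPETS = [
    {"id": "SD-001", "theme": "synthetic-data", "source_projects": ["P12", "P21"]},
    {"id": "SD-002", "theme": "synthetic-data", "source_projects": ["P41"]},
]

matrix = overlap_matrix(SNIPPETS, ["P12", "P21", "P41"], ["synthetic-data"])
for row in matrix:
    print(row)
```

Counting snippets per theme for the totals row keeps it in line with the example, where each theme totals 2 and the grand total is 14.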

Reuse Candidates Summary

A flat tab...

This pull request was created from Copilot chat.

Initial plan

a5ea07c

Copilot AI assigned Copilot and josephwilson8-nhs Jun 5, 2026

Copilot started work on behalf of josephwilson8-nhs June 5, 2026 18:01

Add central snippet catalog and extension workflow docs

4f9571d

Copilot AI changed the title ~~[WIP] Add YAML catalog for reusable code audit~~ Add central reusable-code catalog with explicit overlap counts and repo-onboarding workflow Jun 5, 2026

Copilot finished work on behalf of josephwilson8-nhs June 5, 2026 18:06

Copilot AI requested a review from josephwilson8-nhs June 5, 2026 18:06

Copilot wants to merge 2 commits into copilot/setup-reusable-code-audit from copilot/copilotsetup-reusable-code-audit


2 participants
