Add central reusable-code catalog with explicit overlap counts and repo-onboarding workflow #2

Draft
Copilot wants to merge 2 commits into copilot/setup-reusable-code-audit from copilot/copilotsetup-reusable-code-audit

Copilot AI commented Jun 5, 2026

This PR restructures the audit output into a single extensible catalog model. It addresses the three gaps identified: no central index, no explicit overlap counts/groupings, and no defined process for adding new repositories against existing overlaps.

  • Central source of truth (catalog.yml)

    • Added a root-level machine-readable catalog covering:
      • metadata (total_projects_audited, total_snippets, last_updated)
      • all 7 themes with explicit overlap_count + project lists
      • all 14 snippets with stable IDs, source attribution, overlap counts, reuse candidacy, and suggested module targets
  • Human-readable catalog (catalog.md)

    • Added a root-level rendered catalog as primary review artifact:
      • summary stats + most-overlapped theme
      • theme overlap summary table (counts, projects, top candidate)
      • per-theme sections with project links, overlap explanation, snippet inventory
      • inline raw code for every snippet (self-contained inspection)
      • cross-project overlap matrix (21 projects × 7 themes) using numeric counts
      • flat reuse-candidate table across all snippets
  • Extensibility workflow (ADDING-NEW-REPOS.md)

    • Added a concrete 5-step process for onboarding new repos:
      1. profile repo
      2. compare to existing catalog entries
      3. add/update snippets
      4. regenerate catalog.md sections and matrix
      5. update audit-report.md
    • Included copy/paste YAML template + operational checklist
  • Entry-point docs refresh

    • Updated root README.md with a new Catalog section near the top linking catalog.md and catalog.yml with headline stats.
    • Replaced code-snippets/README.md with catalog-first guidance, folder structure, and per-folder snippet counts.
  • Repository hygiene

    • Added .gitignore rules for __pycache__/ and *.pyc.
    • Removed committed __pycache__ artifacts under code-snippets/.
# catalog.yml (example entry shape now used for all snippets)
- id: SD-001
  title: "Synthetic data distribution metrics"
  theme: synthetic-data
  file: "code-snippets/synthetic-data/distribution_metrics.py"
  source_projects: [P12, P21]
  overlap_count: 2
  reuse_candidate: true
  suggested_library_module: "nhs_reuse.synthetic.metrics"
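
A minimal sketch of how an entry of this shape could be sanity-checked before it is committed. The `validate_snippet` helper and the `REQUIRED_FIELDS` set are hypothetical names, not part of the PR. The rule that `overlap_count` equals the number of distinct `source_projects` is an assumption inferred from the example entries. The entry is written as a plain Python dict standing in for parsed YAML.

```python
# Hypothetical validator; field names are taken from the example entry above.
REQUIRED_FIELDS = {
    "id", "title", "theme", "file", "source_projects",
    "overlap_count", "reuse_candidate", "suggested_library_module",
}


def validate_snippet(entry: dict) -> list:
    """Return a list of problems found in one snippet entry (empty if none)."""
    problems = []
    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    projects = entry.get("source_projects", [])
    # Assumption: overlap_count is the number of distinct source projects.
    if "overlap_count" in entry and entry["overlap_count"] != len(set(projects)):
        problems.append(
            f"overlap_count {entry['overlap_count']} does not match "
            f"{len(set(projects))} source project(s)"
        )
    return problems


# The SD-001 entry from the PR description, as a dict standing in for parsed YAML.
sd_001 = {
    "id": "SD-001",
    "title": "Synthetic data distribution metrics",
    "theme": "synthetic-data",
    "file": "code-snippets/synthetic-data/distribution_metrics.py",
    "source_projects": ["P12", "P21"],
    "overlap_count": 2,
    "reuse_candidate": True,
    "suggested_library_module": "nhs_reuse.synthetic.metrics",
}

print(validate_snippet(sd_001))  # prints []
```

A check like this could run in CI so that counts and project lists do not drift apart as entries are edited by hand.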
Original prompt

Context

The nhsengland/reusable-code-audit repository has already been set up with:

  • audit-report.md — a themed audit report
  • code-snippets/ — 7 themed folders each with 2 snippet files and a README
  • A reuse opportunity matrix in the audit report (tick-based, not count-based)

The branch copilot/setup-reusable-code-audit contains all of this work.

The user has identified three gaps they want addressed:

  1. No central catalog — everything is split across 7 folder READMEs with no single index
  2. Overlap numbers are not explicit — the matrix uses ✓ marks but doesn't show counts or clearly group which projects overlap on what
  3. Not extensible — there is no defined process for "bring in a new repo, compare against existing catalog, add new overlaps"
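
The "compare against existing catalog" step in gap 3 could look roughly like the sketch below. It is illustrative only: `find_theme_overlaps` and the project ID `P99` are hypothetical, and the theme data is copied from the catalog.yml example in this prompt. It assumes a reviewer has already decided which theme IDs the new repo touches.

```python
def find_theme_overlaps(catalog_themes, new_project_id, new_project_themes):
    """Return updated copies of the catalog themes that the new project joins.

    The input themes are not mutated, so the caller can review the proposed
    changes before writing them back to the catalog.
    """
    updated = []
    for theme in catalog_themes:
        already_listed = new_project_id in theme["projects"]
        if theme["id"] in new_project_themes and not already_listed:
            projects = theme["projects"] + [new_project_id]
            # Keep overlap_count in step with the project list.
            updated.append(
                {**theme, "projects": projects, "overlap_count": len(projects)}
            )
    return updated


themes = [
    {"id": "synthetic-data", "label": "Synthetic Data",
     "overlap_count": 4, "projects": ["P12", "P21", "P41", "P11"]},
    {"id": "nlp-text-processing", "label": "NLP / Text Processing",
     "overlap_count": 6, "projects": ["P72", "P71", "P51", "P33", "P23", "P43"]},
]

# "P99" is a made-up ID for a newly onboarded repo tagged with one theme.
changes = find_theme_overlaps(themes, "P99", {"synthetic-data"})
for theme in changes:
    print(theme["id"], theme["overlap_count"], theme["projects"])
```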

Your Task

Work on the existing branch copilot/setup-reusable-code-audit. Do NOT merge to main.

1. Create catalog.yml

Create a machine-readable YAML catalog at the root of the repo: catalog.yml

This is the single source of truth for all code snippets. Structure it as follows:

meta:
  last_updated: "2025-06-05"
  total_projects_audited: 21
  total_snippets: 14  # update to match actual count

themes:
  - id: synthetic-data
    label: "Synthetic Data"
    overlap_count: 4  # number of projects with this theme
    projects: [P12, P21, P41, P11]
  - id: nlp-text-processing
    label: "NLP / Text Processing"
    overlap_count: 6
    projects: [P72, P71, P51, P33, P23, P43]
  # ... etc for all 7 themes

snippets:
  - id: SD-001
    title: "Synthetic data distribution metrics"
    theme: synthetic-data
    file: "code-snippets/synthetic-data/distribution_metrics.py"
    source_projects: [P12, P21]
    source_repos:
      - url: "https://github.com/nhsx/SynthVAE"
        file: "metrics.py"
    overlap_count: 2
    reuse_candidate: true
    reuse_notes: "Core evaluation pattern repeated in both SynthVAE versions. Could be a standalone nhsengland metrics package."
    suggested_library_module: "nhs_reuse.synthetic.metrics"

  - id: SD-002
    title: "Experiment I/O helpers"
    theme: synthetic-data
    file: "code-snippets/synthetic-data/experiment_io_helpers.py"
    source_projects: [P41]
    source_repos:
      - url: "https://github.com/nhsengland/NHSSynth"
        file: "src/nhssynth/common/io.py"
    overlap_count: 1
    reuse_candidate: true
    reuse_notes: "Path/file management utilities for experiments — patterns like this appear independently across many projects."
    suggested_library_module: "nhs_reuse.common.io"

  # Continue for ALL 14 snippets across all 7 themes:
  # NLP: NLP-001 (prepare_notes_for_lm.py), NLP-002 (clean_text_pipeline.R)
  # MM: MM-001 (embedding_dataset_wrappers.py), MM-002 (feature_name_mapping.py)
  # EVAL: EVAL-001 (bleu_precision_evaluator.py), EVAL-002 (retrieval_metrics.py)
  # XAI: XAI-001 (fairness_bootstrap_ci.py), XAI-002 (mcr_feature_grouping.py)
  # GRAPH: GRAPH-001 (degree_centrality.py), GRAPH-002 (overlap_coefficient.py)
  # DP: DP-001 (create_holdout_split.py), DP-002 (setup_input_json.py)

Fill in ALL snippets accurately based on the actual files already in the repo and the source attributions in their header comments.
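
Because `total_snippets` carries an "update to match actual count" note, a small consistency check may help. The sketch below uses the snippet IDs listed in the comments above and groups them by ID prefix. `count_by_prefix` and `meta_matches` are hypothetical helper names, and the `meta` dict stands in for the parsed `meta:` block.

```python
from collections import Counter

# Snippet IDs as enumerated in the catalog.yml outline above.
SNIPPET_IDS = [
    "SD-001", "SD-002", "NLP-001", "NLP-002", "MM-001", "MM-002",
    "EVAL-001", "EVAL-002", "XAI-001", "XAI-002",
    "GRAPH-001", "GRAPH-002", "DP-001", "DP-002",
]

meta = {
    "last_updated": "2025-06-05",
    "total_projects_audited": 21,
    "total_snippets": 14,
}


def count_by_prefix(snippet_ids):
    """Count snippets per theme prefix, e.g. 'SD-001' is counted under 'SD'."""
    return Counter(sid.rsplit("-", 1)[0] for sid in snippet_ids)


def meta_matches(meta, snippet_ids):
    """True when meta.total_snippets equals the number of listed snippets."""
    return meta["total_snippets"] == len(snippet_ids)


print(meta_matches(meta, SNIPPET_IDS))          # prints True
print(len(count_by_prefix(SNIPPET_IDS)))        # prints 7
```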


2. Create catalog.md

Create a human-readable rendered catalog at the root: catalog.md

This should be the go-to document for understanding overlaps. Structure:

Header

  • Title, purpose, date
  • Summary stats box:
    • Total projects audited: 21
    • Total themes identified: 7
    • Total snippet files: N
    • Total reuse candidates: N
    • Most overlapped theme: X (N projects)
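
One way the summary stats could be derived from catalog data instead of typed by hand is sketched below. `summarise` is a hypothetical helper. The input covers only the two themes and two snippets spelled out in this prompt, so the printed values illustrate the calculation and are not the real catalog totals.

```python
def summarise(themes, snippets):
    """Compute headline stats for the catalog.md summary box."""
    top = max(themes, key=lambda t: t["overlap_count"])
    return {
        "total_themes": len(themes),
        "total_snippets": len(snippets),
        "total_reuse_candidates": sum(1 for s in snippets if s["reuse_candidate"]),
        "most_overlapped": f"{top['label']} ({top['overlap_count']} projects)",
    }


# Only the entries shown in the catalog.yml example; the real file has more.
themes = [
    {"id": "synthetic-data", "label": "Synthetic Data", "overlap_count": 4},
    {"id": "nlp-text-processing", "label": "NLP / Text Processing", "overlap_count": 6},
]
snippets = [
    {"id": "SD-001", "reuse_candidate": True},
    {"id": "SD-002", "reuse_candidate": True},
]

stats = summarise(themes, snippets)
print(stats["most_overlapped"])  # prints NLP / Text Processing (6 projects)
```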

Overlap Summary Table

A table with one row per theme:

| Theme | Projects with overlap | Overlap count | Snippet files | Top reuse candidate |
| --- | --- | --- | --- | --- |
| Synthetic Data | P12, P21, P41, P11 | 4 | 2 | SD-001 |
| NLP / Text Processing | P72, P71, P51, P33, P23, P43 | 6 | 2 | NLP-001 |
| ... | | | | |
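
A rough sketch of generating this table from catalog data follows. `render_overlap_table` is a hypothetical function. The rule for picking the "top reuse candidate" (highest `overlap_count`, ties broken by lowest ID) is an assumption; the prompt does not define it. Only the synthetic-data theme is rendered because it is the only one with full snippet data in the example.

```python
def render_overlap_table(themes, snippets):
    """Render the theme overlap summary as a Markdown pipe table."""
    rows = [
        "| Theme | Projects with overlap | Overlap count | Snippet files | Top reuse candidate |",
        "| --- | --- | --- | --- | --- |",
    ]
    for theme in themes:
        in_theme = [s for s in snippets if s["theme"] == theme["id"]]
        candidates = [s for s in in_theme if s["reuse_candidate"]]
        # Assumption: highest overlap_count wins, ties broken by lowest ID.
        if candidates:
            top = min(candidates, key=lambda s: (-s["overlap_count"], s["id"]))["id"]
        else:
            top = "-"
        rows.append(
            f"| {theme['label']} | {', '.join(theme['projects'])} "
            f"| {theme['overlap_count']} | {len(in_theme)} | {top} |"
        )
    return "\n".join(rows)


themes = [
    {"id": "synthetic-data", "label": "Synthetic Data",
     "overlap_count": 4, "projects": ["P12", "P21", "P41", "P11"]},
]
snippets = [
    {"id": "SD-001", "theme": "synthetic-data", "overlap_count": 2, "reuse_candidate": True},
    {"id": "SD-002", "theme": "synthetic-data", "overlap_count": 1, "reuse_candidate": True},
]

table = render_overlap_table(themes, snippets)
print(table)
```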

Per-Theme Sections

For each theme, include:

  • Which projects overlap (with links to their repos)

  • What the overlap is — a plain-English description of the repeated pattern

  • Snippet inventory — a sub-table:

    | ID | File | Source projects | Reuse candidate | Suggested module |
    | --- | --- | --- | --- | --- |
    | SD-001 | distribution_metrics.py | P12, P21 | | nhs_reuse.synthetic.metrics |
  • Raw snippet inline — embed the actual code of each snippet inline using fenced code blocks so it can be read without navigating to the file

Cross-Project Overlap Map

A matrix table showing ALL 21 projects vs ALL 7 themes, with the count of snippets rather than just a tick mark, e.g.:

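| Project | Synthetic | NLP | Multimodal | Graph | Eval | Explainability | Pipelines | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| P82 | 0 | 1 | 0 | 0 | 2 | 0 | 1 | 4 |
| P81 | 0 | 0 | 2 | 0 | 1 | 1 | 1 | 5 |
| ... | | | | | | | | |
| Total snippets | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 14 |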
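
The count-based matrix could be built from each snippet's `source_projects`, as in the sketch below. `build_overlap_matrix` is a hypothetical helper, and the data is limited to the two snippets given in the catalog.yml example. The "Total snippets" row counts snippet files per theme, not the sum of the column cells, because a snippet attributed to two projects appears in two rows.

```python
def build_overlap_matrix(snippets, project_ids, theme_ids):
    """Return {project: {theme: snippet_count}} with zeros filled in."""
    matrix = {p: {t: 0 for t in theme_ids} for p in project_ids}
    for snippet in snippets:
        for project in snippet["source_projects"]:
            # Ignore projects or themes that are not part of the requested grid.
            if project in matrix and snippet["theme"] in matrix[project]:
                matrix[project][snippet["theme"]] += 1
    return matrix


snippets = [
    {"id": "SD-001", "theme": "synthetic-data", "source_projects": ["P12", "P21"]},
    {"id": "SD-002", "theme": "synthetic-data", "source_projects": ["P41"]},
]
projects = ["P12", "P21", "P41", "P11"]
theme_ids = ["synthetic-data", "nlp-text-processing"]

matrix = build_overlap_matrix(snippets, projects, theme_ids)
for project in projects:
    row = matrix[project]
    print(project, [row[t] for t in theme_ids], "total:", sum(row.values()))
```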

Reuse Candidates Summary

A flat tab...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Add YAML catalog for reusable code audit" to "Add central reusable-code catalog with explicit overlap counts and repo-onboarding workflow" on Jun 5, 2026
Copilot AI requested a review from josephwilson8-nhs June 5, 2026 18:06