Design generic UDP publication pipeline for APEx Algorithm Catalogue

## Context

With BAIS2 now merged into the APEx Algorithm Catalogue ([#7](https://github.com/developmentseed/openeo-udp/issues/7)), we have a validated end-to-end publication path. However, the process was entirely manual: hand-crafting the OGC API record JSON, manually copying images, fixing URLs during review, running QA tests locally, and coordinating across two repositories over several weeks.

This ticket is about **designing and implementing a generic, reproducible publication workflow** — the tooling, conventions, and automation that make publishing a UDP to APEx a routine step rather than a project in itself.

## Problem

The BAIS2 publication surfaced the following pain points:

1. **Record authoring is error-prone** — the `<alg>.json` record requires ~10 interlinked URLs (application, webapp, notebook, preview, thumbnail, provider, platform, service, code, about). Getting them all right — including the `refs/heads/main` vs branch-ref distinction — took multiple review rounds with the APEx team.
2. **No structured metadata in notebooks** — title, description, keywords, license, original evalscript URL, and author attribution are written as prose in markdown cells. There is nothing machine-readable to extract from.
3. **Image generation is ad-hoc** — `preview.png` and `thumbnail.png` were manually exported. There is no convention for size, format, or which notebook output to use.
4. **Validation requires cloning a separate repo** — running `pytest qa/unittests/tests/test_records.py` from `apex_algorithms` is a manual step disconnected from our own CI.
5. **Two-repo coordination** — every publication touches both `openeo-udp` (source of truth) and `apex_algorithms` (catalogue registry). There is no automation bridging the two.

## Objective

Design a publication pipeline where, given a validated notebook with proper metadata, publishing to APEx requires running a single command (or is triggered automatically on merge).

## Design scope

### 1. Notebook metadata convention

Define a structured metadata cell or sidecar file that each notebook must include. This is the single source of truth for the APEx record. At minimum:

- `id` — algorithm identifier (e.g. `bais2`, `ndci`)
- `title` — human-readable name
- `description` — algorithm summary
- `keywords` — list of tags
- `license` — SPDX identifier (e.g. `CC-BY-4.0`)
- `original_evalscript` — URL to the Sentinel Hub custom script
- `author` — original evalscript author name and attribution
- `citation` — DOI or reference to the scientific paper
- `backend` — target openEO backend URL
- `collection_id` — the openEO collection used
- `preview_cell` / `thumbnail_cell` — which notebook output cells to use for image generation (or explicit image paths)

**Open question**: JSON sidecar file per notebook (e.g. `bais2.meta.json`) vs. a tagged cell inside the notebook? Sidecar is easier to parse; tagged cell keeps everything in one file.
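To make the sidecar option concrete, here is a minimal sketch of what loading and validating such a file could look like. The `UdpMetadata` class, the `parse_metadata` and `load_sidecar` helpers, and the choice to treat `preview_cell` / `thumbnail_cell` as optional are all assumptions for illustration, not an agreed spec.

```python
import json
from dataclasses import MISSING, dataclass, fields
from pathlib import Path
from typing import Optional


@dataclass
class UdpMetadata:
    # Hypothetical schema: one attribute per field proposed in this section.
    id: str
    title: str
    description: str
    keywords: list
    license: str
    original_evalscript: str
    author: str
    citation: str
    backend: str
    collection_id: str
    # Assumed optional, since explicit image paths are a possible alternative.
    preview_cell: Optional[str] = None
    thumbnail_cell: Optional[str] = None


def parse_metadata(raw: dict) -> UdpMetadata:
    """Reject unknown or missing keys, then build a UdpMetadata."""
    known = {f.name for f in fields(UdpMetadata)}
    required = {f.name for f in fields(UdpMetadata) if f.default is MISSING}
    unknown = sorted(set(raw) - known)
    missing = sorted(required - set(raw))
    if unknown or missing:
        raise ValueError(f"missing fields: {missing}; unknown fields: {unknown}")
    return UdpMetadata(**raw)


def load_sidecar(path) -> UdpMetadata:
    """Load a sidecar file such as `bais2.meta.json`."""
    return parse_metadata(json.loads(Path(path).read_text(encoding="utf-8")))
```

The same `parse_metadata` function would work for a tagged cell, provided the cell source is parsed into a dict first, so the choice of container does not have to block the schema discussion.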

### 2. Record generator

A Python CLI/script that:

- Reads the notebook metadata
- Reads the exported UDP JSON (already produced by notebooks)
- Generates the complete `<alg>.json` OGC API Record with all required links, correctly templated for the `apex_algorithms` directory structure
- Generates the `webapp` URL with proper openEO Editor query parameters
- Outputs the full directory tree ready to copy into `apex_algorithms`:
  ```
  algorithm_catalog/developmentseed/<ALG_ID>/
  ├── openeo_udp/<ALG_ID>.json
  └── records/
      ├── <ALG_ID>.json
      ├── preview.png
      └── thumbnail.png
  ```
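The output step could be sketched as follows. The function name `write_publication_tree` and its parameters are hypothetical; the record and UDP are treated as opaque dicts because the exact record schema still has to come out of the BAIS2 audit. Only the directory layout is taken from the tree above.

```python
import json
from pathlib import Path


def write_publication_tree(out_root, alg_id, record, udp, preview_png, thumbnail_png):
    """Write the four files expected under algorithm_catalog/developmentseed/<ALG_ID>/.

    `record` and `udp` are JSON-serialisable dicts; the two images are raw PNG bytes.
    Returns the algorithm's base directory.
    """
    base = Path(out_root) / "algorithm_catalog" / "developmentseed" / alg_id
    udp_dir = base / "openeo_udp"
    records_dir = base / "records"
    udp_dir.mkdir(parents=True, exist_ok=True)
    records_dir.mkdir(parents=True, exist_ok=True)

    # Trailing newline keeps diffs clean when the files are committed.
    (udp_dir / f"{alg_id}.json").write_text(
        json.dumps(udp, indent=2) + "\n", encoding="utf-8"
    )
    (records_dir / f"{alg_id}.json").write_text(
        json.dumps(record, indent=2) + "\n", encoding="utf-8"
    )
    (records_dir / "preview.png").write_bytes(preview_png)
    (records_dir / "thumbnail.png").write_bytes(thumbnail_png)
    return base
```

Writing into a scratch directory rather than directly into a checkout of `apex_algorithms` keeps the generator usable both locally and as a CI artifact step.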

### 3. Image extraction

Standardize preview/thumbnail generation:

- Define target dimensions and format (the APEx catalogue likely has constraints worth documenting)
- Either extract from notebook output cells automatically (e.g. via `nbconvert` or `nbformat` to pull specific cell outputs) or require them as committed files in a known location
- Resize/crop to thumbnail dimensions
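For the automatic-extraction option, a rough sketch using only the standard library is shown below. It relies on the fact that an `.ipynb` file is JSON and that nbformat 4 stores PNG outputs as base64 under `data["image/png"]`. Selecting the cell by a tag (for example `preview`) is an assumption about how `preview_cell` might be expressed, and `extract_png_by_tag` is a hypothetical helper. Resizing is left out; it would need an imaging library such as Pillow.

```python
import base64
import json
from pathlib import Path


def extract_png_by_tag(notebook_path, tag):
    """Return the first PNG output (as bytes) of the first code cell carrying `tag`."""
    nb = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        if tag not in cell.get("metadata", {}).get("tags", []):
            continue
        for output in cell.get("outputs", []):
            payload = output.get("data", {}).get("image/png")
            if payload:
                # Tolerate a payload stored as a list of lines instead of one string.
                if isinstance(payload, list):
                    payload = "".join(payload)
                return base64.b64decode(payload)
    raise LookupError(f"no PNG output found in a cell tagged {tag!r}")
```

This only works if the notebook is committed with its outputs; if outputs are stripped before commit, the "committed files in a known location" option is the only viable one.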

### 4. Local validation

Integrate APEx QA validation into our own workflow:

- Vendor or wrap the `apex_algorithms` QA tooling so it can be run from `openeo-udp` against generated records without cloning the full `apex_algorithms` repo
- Or: add a `make validate-apex ALG=bais2` target that runs the checks locally
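Independent of how the APEx QA tests end up being wired in, a cheap structural pre-check can catch incomplete output before they run. The sketch below is not a substitute for `test_records.py`; it only verifies that the expected files exist and that the JSON files parse. `precheck_tree` is a hypothetical helper, and the layout is the one described under "Record generator".

```python
import json
from pathlib import Path


def precheck_tree(out_root, alg_id):
    """Return a list of human-readable problems; an empty list means the tree looks complete."""
    base = Path(out_root) / "algorithm_catalog" / "developmentseed" / alg_id
    expected = [
        base / "openeo_udp" / f"{alg_id}.json",
        base / "records" / f"{alg_id}.json",
        base / "records" / "preview.png",
        base / "records" / "thumbnail.png",
    ]
    problems = []
    for path in expected:
        rel = path.relative_to(base)
        if not path.is_file():
            problems.append(f"missing: {rel}")
        elif path.suffix == ".json":
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError as exc:
                problems.append(f"invalid JSON in {rel}: {exc}")
    return problems
```

A `make validate-apex` target could run this first and fail fast, then hand over to the real APEx checks.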

### 5. CI/CD automation (GitHub Actions)

On merge to `main` in `openeo-udp`, when a notebook and its UDP JSON are present:

- Extract metadata and generate the APEx record files
- Run APEx QA validation
- Optionally: open a draft PR on `apex_algorithms` via GitHub Actions (cross-repo PR or bot-assisted)
- At minimum: produce the ready-to-submit files as a CI artifact that can be downloaded and used to manually open the PR
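The "notebook and its UDP JSON are present" condition could be evaluated by a small discovery script that the workflow calls before generating anything. The sketch below assumes that a notebook and its UDP export share a file stem (for example `bais2.ipynb` and `bais2.json`); the directory arguments are placeholders, since the repository layout is not fixed by this ticket, and `find_publishable` is a hypothetical name.

```python
from pathlib import Path


def find_publishable(notebook_dir, udp_dir):
    """Return sorted algorithm ids that have both a notebook and a UDP JSON export."""
    notebooks = {p.stem for p in Path(notebook_dir).glob("*.ipynb")}
    udps = {p.stem for p in Path(udp_dir).glob("*.json")}
    return sorted(notebooks & udps)
```

A workflow step could loop over the returned ids, run the generator and validation for each, and upload the resulting trees as artifacts. If sidecar files such as `bais2.meta.json` live in the same directory as the UDP exports, their stem (`bais2.meta`) will not match any notebook, so they are ignored by this check.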

### 6. Documentation

- Update `docs/publication.md` to reflect the automated workflow
- Update `CONTRIBUTING.md` to document the metadata convention
- Provide a worked example showing the full flow from notebook to catalogue entry

## Non-goals (for now)

- Fully automated merge into `apex_algorithms` without human review — the APEx team requires PR review, and the preview feature only works from their repo. The goal is to automate **our side** of the preparation.
- Multi-platform records (e.g. CDSE + TiTiler-openEO in one record) — keep it single-platform for now, extensible later.

## Approach

1. **Audit the BAIS2 record** — extract the exact schema used and identify which fields are static (provider, platform, code) vs. per-algorithm (title, description, keywords, application URL, images).
2. **Define the metadata spec** — propose the convention, get team agreement.
3. **Build the generator** — Python script, tested against BAIS2 as the reference.
4. **Retrofit BAIS2 and NDCI** — add metadata to existing notebooks, verify the generator reproduces the correct BAIS2 record.
5. **Add CI** — GitHub Actions workflow.
6. **Document** — update publication guide and contributing docs.

## References

- [docs/publication.md](https://github.com/developmentseed/openeo-udp/blob/main/docs/publication.md) — current manual publication guide
- [BAIS2 PR #316](https://github.com/ESA-APEx/apex_algorithms/pull/316) → merged via [#378](https://github.com/ESA-APEx/apex_algorithms/pull/378) — reference for record schema and review process
- [APEx QA tooling](https://github.com/ESA-APEx/apex_algorithms/tree/main/qa) — validation tests to integrate

## Acceptance Criteria

- [ ] Metadata convention defined and documented
- [ ] Generator script produces a valid APEx record from notebook metadata + UDP JSON
- [ ] Generator output passes APEx QA validation for at least BAIS2 (retrofit)
- [ ] CI workflow runs on merge and produces publication-ready artifacts
- [ ] `docs/publication.md` and `CONTRIBUTING.md` updated
- [ ] At least one new UDP published using the automated pipeline (proving it works beyond BAIS2)

