| status | done |
|---|---|
| depends | |
| specs | |
| issues | |
| pr | 109 |
After PR #107 the importer surfaces blog post bodies, but media references still point at the legacy laddr server (`https://codeforphilly.org/thumbnail/<id>/<dim>`). There are 215 such references across 138 posts. At cutover (laddr decommission) every image breaks.
Fix: capture each referenced media item's bytes at import time, store them as a gitsheets attachment scoped to the owning blog post record, and rewrite the body's media URLs to point at the local `/api/attachments/:key` route.
This is the durable-record path — original bytes land in the data repo and travel with every clone. Runtime thumbnail resizing (so a 200×200 card doesn't pull a 2 MB original) is deferred to #108; this plan ships originals only.
- behaviors/storage.md: attachments per record, served via `GET /api/attachments/:key`.
- data-model.md → BlogPost: adds an "Attachments" note documenting the convention.
Better than raw integer media IDs. Format:
`<caption-slug-or-image>-<MediaID>.<ext>`
- Caption non-empty: `slugify(caption).slice(0, 80) + '-' + mediaId + '.' + ext`
- Caption empty: `'image-' + mediaId + '.' + ext`
- Extension from response `Content-Type` (e.g., `image/jpeg` → `.jpg`)
Examples:
- `2023-launchpad-kick-off-event-at-city-hall-3349.jpg`
- `image-3127.jpg`
The MediaID suffix is the stable disambiguator — re-imports with a changed caption produce a renamed file (git tracks as add+remove, content-hash unchanged so no actual blob duplication).
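The naming rule above can be sketched as a small pure function. This is a minimal illustration, not the importer's actual code: `slugify` and `attachmentFilename` are hypothetical names, the slugifier shown is a simple ASCII-only stand-in, and trimming a trailing hyphen after the 80-character cut is an extra assumption beyond the stated rule.

```typescript
// Hypothetical stand-in for the importer's slugify helper (assumption).
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse each non-alphanumeric run into one hyphen
    .replace(/^-+|-+$/g, ''); // trim leading and trailing hyphens
}

// Builds <caption-slug-or-image>-<MediaID>.<ext>.
function attachmentFilename(
  caption: string | null | undefined,
  mediaId: number,
  ext: string,
): string {
  // Cap the slug at 80 chars; drop a hyphen left dangling by the cut (assumption).
  const slug = slugify(caption ?? '').slice(0, 80).replace(/-+$/, '');
  // Empty captions (or captions with no alphanumerics) fall back to "image".
  const stem = slug.length > 0 ? slug : 'image';
  return `${stem}-${mediaId}.${ext}`;
}
```

Because the MediaID is always the suffix, two captions that slugify to the same text still yield distinct filenames within a post.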
Fetch from https://<source-host>/media/<MediaID>/original, not the thumbnail endpoint. We're capturing the durable record; the SPA + future thumbnail service handle sizing.
Item\Media (182 occurrences):
- Compute filename from caption + mediaId + ext
- Emit a markdown image reference that points at the local attachment route
- Add a `MediaAsset` entry to the post's plan
Item\Embed (44 occurrences):
- Scan `Data` HTML for `https?://codeforphilly\.org/(thumbnail|media)/(\d+)/[^"' )]*`
- For each match: filename is `image-<mediaId>.<ext>` (embeds don't have captions)
- Rewrite the URL inline in the HTML
- Add a `MediaAsset` entry to the plan
- Third-party URLs (YouTube iframes etc.) are left alone
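The embed scan can be sketched as a single `String.prototype.replace` pass with a global regex. This is an illustration under assumptions: `rewriteEmbedHtml`, the `MediaAsset` shape shown here, and the `toLocal` callback are hypothetical, and the first regex group is made non-capturing so the media ID is the only capture.

```typescript
// Hypothetical shape; the plan's real MediaAsset may carry more fields.
interface MediaAsset {
  mediaId: number;
  sourceUrl: string;
}

// Same pattern as in the plan, with a global flag for replace-all.
const LEGACY_MEDIA_URL =
  /https?:\/\/codeforphilly\.org\/(?:thumbnail|media)\/(\d+)\/[^"' )]*/g;

function rewriteEmbedHtml(
  html: string,
  toLocal: (mediaId: number) => string, // caller decides the replacement URL
): { html: string; assets: MediaAsset[] } {
  const assets: MediaAsset[] = [];
  const rewritten = html.replace(
    LEGACY_MEDIA_URL,
    (_match: string, id: string): string => {
      const mediaId = Number(id);
      // Always fetch the original, never the thumbnail endpoint.
      assets.push({
        mediaId,
        sourceUrl: `https://codeforphilly.org/media/${mediaId}/original`,
      });
      return toLocal(mediaId);
    },
  );
  return { html: rewritten, assets };
}
```

Third-party URLs never match the host-anchored pattern, so YouTube iframes and similar embeds pass through unchanged.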
The translator stays sync. After all records translate:
- Aggregate every `{ slug, filename, sourceUrl }` into a flat list.
- Pre-fetch in parallel (with a configurable concurrency cap, default 4, and the same politeness delay as JSON page fetches).
- Inside the existing `store.transact(...)` callback (where blog-posts records are upserted): for each post, call `tx['blog-posts'].setAttachments(record, { '<filename>': blobRef })` then upsert as today.
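The capped parallel pre-fetch can be sketched as a small worker-pool helper. This is a generic pattern, not the orchestrator's actual code: `mapWithConcurrency` is a hypothetical name, and the politeness delay is left to the `worker` callback.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight,
// returning results in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // shared cursor; safe because JS runs one callback at a time

  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }

  // Start up to `limit` runners; each pulls the next index until none remain.
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run());
  await Promise.all(runners);
  return results;
}
```

A rejected `worker` call rejects the whole batch here; an importer that prefers to warn and skip a failed asset would catch inside `worker` instead.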
BlobObject.write(hologit, bytes) hashes content into the git object DB — same pattern as the avatar-upload route. Idempotent against content hash (rerunning with the same bytes is a no-op).
Defensive map:

```ts
const EXT_BY_MIME: Record<string, string> = {
  'image/jpeg': 'jpg',
  'image/png': 'png',
  'image/gif': 'gif',
  'image/webp': 'webp',
  'image/svg+xml': 'svg',
};
```

Unknown content-type → warn + skip the asset (markdown link will 404, but the post itself imports). Survey of laddr's media shows JPEGs dominate, so production data should be 99% covered.
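A lookup against such a map usually needs to normalize the header first, since a `Content-Type` value may carry parameters or mixed case. The helper below is a sketch: `extensionFor` is a hypothetical name, and the map is passed in as a parameter so the function stays independent of where the table lives.

```typescript
// Returns the file extension for a Content-Type header, or undefined
// when the type is missing or not in the table (caller warns and skips).
function extensionFor(
  contentType: string | null,
  extByMime: Readonly<Record<string, string>>,
): string | undefined {
  if (!contentType) return undefined;
  // Drop parameters such as "; charset=binary" and normalize case.
  const mime = contentType.split(';')[0].trim().toLowerCase();
  // Own-property check so inherited keys like "toString" never match.
  return Object.prototype.hasOwnProperty.call(extByMime, mime)
    ? extByMime[mime]
    : undefined;
}
```

Returning `undefined` rather than throwing keeps the "warn + skip" decision with the caller, so one odd asset does not abort the post's import.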
apps/api/tests/import-laddr.test.ts:
- Translator returns a plan with the right `{ filename, sourceUrl }` entries for a row with mixed Media + Embed items.
- Caption slugification: long caption + special chars → cleaned slug.
- Empty caption falls back to `image-<id>`.
- Embed HTML URL rewrite: codeforphilly.org URLs become `/api/attachments/...`; third-party URLs are untouched.
- Orchestrator: mock fetch covers the binary `/media/<id>/original` endpoints; after import, attachments exist on the tree under `blog-posts/<slug>/<filename>`.
- Every `Item\Media` reference in the imported `blog-posts/*.md` files resolves to `/api/attachments/blog-posts/<slug>/<filename>`.
- No `codeforphilly.org/(thumbnail|media)/...` URLs remain in any blog-post body.
- Attachment bytes land in the data repo (verified post-merge against the live pod).
- Filenames are human-readable when captions are present.
- `npm run type-check && npm run lint && npm test` clean: 340 API + all web + shared tests pass.
- Sandbox redeploy → re-import → merge to `published` → SPA renders blog posts with images served from the new pod.
- Import duration. ~215 binary fetches at ~150 ms each (serial) is ~30 sec added; with concurrency=4, ~10 sec. Fine.
- Repo size growth. ~215 originals × ~250 KB average ≈ 50 MB. Acceptable for a v1 corpus.
- Embed HTML correctness. Rewriting `<img src="...">` inside arbitrary HTML via regex is fragile if the URL appears in a weird context (alt text, data-* attributes). Spot-checked production embeds: all references appear in `src="..."` attributes inside `<img>` tags. Acceptable risk; fragile-by-spec but pragmatic.
- Hot-reload sees the new attachments. The runtime store reads attachments by their git path; once the new commit lands on `published` and the webhook fires, the next `/api/attachments/...` request resolves against the new tree. No special index work needed.
Two commits: plan-open, impl + tests.
Surprises:
- Translator return shape carried a real refactor. Going from `translateBlogPost(): BlogPost | null` to `(): { record, mediaAssets } | null` rippled into the orchestrator's call site + 9 test assertions. The `.record.` prefix everywhere is a bit verbose; future-me may want a destructured `{ record: bp, ... }` alias at the top of each test. Worth flagging if a similar refactor is needed for project-buzz.
- `?include=*` returns 28 fields per row vs. 17 without. Mostly Author/Creator/Modifier expansions (the polymorphic identity refs) plus the `items` array. The Zod schema just `.passthrough()`es them, so no shape work. But payload size doubles: 138 posts at ~30 KB each (was ~15 KB). Still trivial.
- Filename collisions don't happen. Each post has its own subdir. Same MediaID across two different posts produces two attachments (one per owner); the git object DB dedupes the bytes by content hash, so the actual repo cost is metadata overhead per reference, not bytes.
- Placeholder substitution via `String.split().join()`. Picked over regex because the placeholder string `cfp-media:<id>` is a literal: no regex-escape concern, and `split-join` is O(n) and always-safe.
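The split-join substitution can be sketched as below. This assumes the placeholders sit bare in the body with nothing delimiting the end of the ID; under that assumption, `cfp-media:31` is a literal prefix of `cfp-media:3127`, so the sketch processes longer IDs first. `substitutePlaceholders` and the map shape are hypothetical.

```typescript
// Replaces each literal "cfp-media:<id>" placeholder with its final URL.
function substitutePlaceholders(
  body: string,
  urlByMediaId: ReadonlyMap<number, string>,
): string {
  // Longest IDs first, so a short ID can never clobber the prefix of a longer one.
  const ids = Array.from(urlByMediaId.keys()).sort(
    (a, b) => String(b).length - String(a).length,
  );
  let out = body;
  for (const id of ids) {
    const url = urlByMediaId.get(id);
    if (url === undefined) continue;
    // split/join treats the placeholder as a plain literal: no regex escaping needed.
    out = out.split(`cfp-media:${id}`).join(url);
  }
  return out;
}
```

If the real placeholders are always followed by a delimiter such as `)` or `"`, including that delimiter in the search string would make the ordering step unnecessary.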
- Runtime thumbnail service: currently a 200×200 blog index card pulls a full 2 MB original. Tracked as #108.
- Wire `featuredImageKey` to use the same attachment scheme. The schema field exists but the importer doesn't surface it (laddr's JSON doesn't carry a "featured image" concept per blog post). If someone wants a hero image on the detail screen, they'd pick the first `Item\Media` from the body. Tracked as: none. Let blog content authors set it explicitly post-cutover via a future CMS surface.
- Lazy body loading. When post count grows past ~100, the full-bodies-in-memory cost becomes worth reconsidering. Deferred; #45 already tracks this.