| 1 | +# Benchmark — traditional (text) RAG vs visual RAG |
| 2 | + |
| 3 | +A like-for-like comparison of the two retrieval paths in rupixel, on the **same |
| 4 | +documents, the same queries, and the same ground truth** — only the *modality* |
| 5 | +differs: |
| 6 | + |
| 7 | +- **Traditional / text RAG** — `all-MiniLM-L6-v2` (384-d) embeds each page's |
| 8 | + **extracted text**; the text query is matched against text vectors. |
| 9 | +- **Visual RAG** — `clip-vit-base-patch32` (512-d) embeds each page's **rendered |
| 10 | + screenshot**; the text query is matched against image vectors (cross-modal). |
| 11 | + |
| 12 | +> **Honesty up front:** this corpus is small (8 documents) and **text-clean** |
| 13 | +> (Wikipedia articles with a good text layer and topically distinct subjects). |
| 14 | +> On data like this, *both* paths are expected to do well — and they do. This |
| 15 | +> benchmark is here to show the comparison is **real and reproducible**, and to |
| 16 | +> be honest about *where each modality actually wins*, not to manufacture a gap. |
| 17 | + |
| 18 | +## Setup |
| 19 | + |
| 20 | +- **Corpus:** 8 documents across 8 distinct topics (black holes, French |
| 21 | + Revolution, photosynthesis, espresso, TCP/IP, baroque music, sunflowers, the |
| 22 | + Great Barrier Reef). Each exists in **both** modalities: |
| 23 | + text in `tests/fixtures/pixelrag/compare/text/tiles/*.txt`, screenshot in |
| 24 | + `tests/fixtures/pixelrag/visual/images/*.png` (rendered with `pixelrag-render`). |
| 25 | +- **Queries:** 8 paraphrase queries (one per topic) sharing meaning but little |
| 26 | + vocabulary with their target — so retrieval must be *semantic*, not keyword. |
| 27 | +- **Ground truth:** 1 relevant document per query. **Index:** ruvector HNSW. |
| 28 | +- **Embedders run on CPU/WASM** (no GPU): MiniLM and CLIP via the same |
| 29 | + transformers.js sidecars the demos use. |
| 30 | + |
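The matching step described above (one query vector scored against one vector per page) can be sketched as follows. This is a minimal illustration, not the rupixel implementation: it assumes cosine similarity as the score, uses a brute-force scan in place of the ruvector HNSW index, and the 3-d vectors and document ids are invented for the example (the real embeddings are 384-d or 512-d).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_vec, doc_vecs):
    """Score every document against the query and sort best-first (exhaustive scan)."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, invented for illustration only.
docs = {
    "reef": [0.9, 0.1, 0.0],
    "photosynthesis": [0.88, 0.2, 0.0],
    "espresso": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]

ranking = rank(query, docs)
# Gap between the top two scores; a small gap means the ranking is fragile.
margin = ranking[0][1] - ranking[1][1]
print([doc_id for doc_id, _ in ranking], round(margin, 4))
```

The `margin` value is worth watching: when the top two scores are this close, a small change in the input (for example a different image resampling) can swap the order.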
| 31 | +## Results (measured) |
| 32 | + |
| 33 | +| Metric | Traditional text RAG (MiniLM) | Visual RAG (CLIP) | |
| 34 | +|---|---:|---:| |
| 35 | +| **top-1 accuracy** | **1.00** (8/8) | **1.00** (8/8)¹ | |
| 36 | +| recall@10 | 1.00 | 1.00 | |
| 37 | +| nDCG@10 | 1.00 | 1.00 | |
| 38 | +| MRR | 1.00 | 1.00 | |
| 39 | +| query latency p50 | 0.62 ms | 0.52 ms | |
| 40 | +| embedding dim | 384 | 512 | |
| 41 | +| model (quantized) | all-MiniLM-L6-v2 (~23 MB) | clip-vit-base-patch32 (~85 MB) | |
| 42 | +| input it needs | a clean **text layer** | a **rendered image** (pixels) | |
| 43 | +| pre-step required | text extraction / parse | page render (`pixelrag-render`) | |
| 44 | + |
| 45 | +¹ **8/8 with the native (sharp) image preprocessing used by the Rust bench.** The |
| 46 | +**in-browser** demo (canvas preprocessing) scores **7/8 top-1, MRR 0.94** — one |
| 47 | +near-tie, where *"a vibrant underwater coral ecosystem"* ranks the coral-reef |
| 48 | +page #2 behind photosynthesis (both green nature scenes; scores within 0.02). |
| 49 | +Same model, different image resampling → the tie flips. Reproduce in your browser |
| 50 | +at the [visual demo](https://ruvnet.github.io/rupixel/visual.html). |
| 51 | + |
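As a sanity check on the in-browser figures, the sketch below recomputes top-1 accuracy, MRR, and recall@10 for a run where seven queries rank their target first and one ranks it second. It assumes the standard definitions with exactly one relevant document per query; the function names and document ids are hypothetical and are not taken from the pixelrag CLI.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/rank of the relevant document, or 0.0 if it was not retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0


def summarize(runs, k=10):
    """Aggregate metrics over (ranked_ids, relevant_id) pairs, one pair per query."""
    n = len(runs)
    top1 = sum(ranked[:1] == [relevant] for ranked, relevant in runs) / n
    mrr = sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / n
    recall = sum(relevant in ranked[:k] for ranked, relevant in runs) / n
    return {"top1": top1, "mrr": mrr, "recall@k": recall}


# Seven queries with the target at rank 1 (placeholder ids), plus the
# near-tie where the reef page lands at rank 2 behind photosynthesis.
runs = [([f"doc{i}", "other"], f"doc{i}") for i in range(7)]
runs.append((["photosynthesis", "reef"], "reef"))

metrics = summarize(runs)
print(metrics)
```

With these inputs MRR is (7 × 1 + 1/2) / 8 = 0.9375, which rounds to the 0.94 quoted in the footnote, and top-1 accuracy is 7/8 = 0.875.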
| 52 | +## What this does — and doesn't — show |
| 53 | + |
| 54 | +**Accuracy ties here.** With distinct topics and a clean text layer, both |
| 55 | +modalities retrieve perfectly. Accuracy alone does **not** separate them on this |
| 56 | +corpus, and we don't pretend it does. |
| 57 | + |
| 58 | +**The real trade-off is qualitative:** |
| 59 | + |
| 60 | +| | Traditional text RAG | Visual RAG | |
| 61 | +|---|---|---| |
| 62 | +| Needs a usable text layer | **Yes** — breaks on scans, image-only PDFs, screenshots, charts | **No** — reads pixels directly | |
| 63 | +| Preserves layout / tables / figures | No — flattened to a token stream | **Yes** — the page *is* the input | |
| 64 | +| Fine-grained text understanding | **Strong** | Weaker (CLIP ViT-B/32 is a baseline) | |
| 65 | +| Cost per doc | text parse (cheap) | render + larger model (heavier) | |
| 66 | + |
| 67 | +So: **traditional RAG is the right default for clean, text-rich documents** — |
| 68 | +it's cheap, fast, and strong. **Visual RAG earns its keep where text extraction |
| 69 | +fails or loses structure** — scanned documents, complex layouts, tables, charts, |
| 70 | +forms — which *this* corpus deliberately does not stress. |
| 71 | + |
| 72 | +## Where visual RAG should win (next benchmark) |
| 73 | + |
| 74 | +The honest next step is a corpus that breaks text extraction: scanned/image-only |
| 75 | +pages, multi-column layouts, table- and chart-heavy documents. There, text RAG |
| 76 | +is expected to degrade (or return nothing) while visual RAG should still retrieve. |
| 77 | +A document-specialized visual encoder (**Qwen3-VL / ColPali**, GPU) should also |
| 78 | +outperform the CLIP baseline used here on such a corpus. None of this is measured |
| 79 | +yet: the comparison is tracked as future work, and we report only what we have measured. |
| 80 | + |
| 81 | +## Reproduce |
| 82 | + |
| 83 | +```bash |
| 84 | +# from a ruvector checkout that includes the pixelrag crates |
| 85 | +( cd crates/pixelrag-cli/sidecar && npm install ) # MiniLM + CLIP sidecars |
| 86 | + |
| 87 | +# Traditional text RAG (MiniLM over extracted page text) |
| 88 | +cargo run -p pixelrag-cli -- benchmark --mode text --embedder real \ |
| 89 | + --ground-truth tests/fixtures/pixelrag/compare/text/ground-truth.json \ |
| 90 | + --queries tests/fixtures/pixelrag/compare/text/queries.json \ |
| 91 | + --tiles tests/fixtures/pixelrag/compare/text/tiles \ |
| 92 | + --metrics ndcg,mrr,recall@10 --index-backend hnsw |
| 93 | + |
| 94 | +# Visual RAG (CLIP over rendered screenshots, same 8 docs/queries) |
| 95 | +cargo run -p pixelrag-cli -- benchmark --mode visual --index-backend hnsw |
| 96 | +``` |
| 97 | + |
| 98 | +Both write JSON reports to `bench_output/`. The visual demo |
| 99 | +([visual.html](https://ruvnet.github.io/rupixel/visual.html)) and text demo |
| 100 | +([index.html](https://ruvnet.github.io/rupixel/)) run the same two models live in |
| 101 | +your browser. |