docs: traditional vs visual RAG benchmark (BENCHMARK.md) + README comparison

ruvnet · ruvnet · commit 4cd6df1f4a06 · 2026-06-26T10:40:23.000-04:00
Same 8 docs / 8 queries, two modalities: MiniLM over text vs CLIP over screenshots.
Both 1.00 top-1 on this clean corpus (honest: ties here; visual wins on layout/
scanned docs — future work). README gains a comparison table linking docs/BENCHMARK.md.

Co-Authored-By: claude-flow &lt;ruv@ruv.net&gt;
diff --git a/README.md b/README.md
@@ -147,6 +147,34 @@ query ──embed──▶ vector ──────────search───
 
 ---
 
+## Benchmark: traditional (text) RAG vs visual RAG
+
+Same 8 documents, same 8 paraphrase queries, same ground truth — only the
+*modality* differs: **MiniLM over each page's extracted text** vs **CLIP over each
+page's rendered screenshot**. Full method, caveats, and reproduce commands in
+**[`docs/BENCHMARK.md`](./docs/BENCHMARK.md)**.
+
+| Metric | Traditional text RAG (MiniLM) | Visual RAG (CLIP) |
+|---|---:|---:|
+| top-1 accuracy | **1.00** (8/8) | **1.00** (8/8)¹ |
+| nDCG@10 / MRR | 1.00 / 1.00 | 1.00 / 1.00 |
+| query latency p50 | 0.62 ms | 0.52 ms |
+| embedding dim | 384 | 512 |
+| needs | a clean **text layer** | a **rendered image** |
+
+¹ 8/8 with native (sharp) preprocessing; the **in-browser** demo (canvas) is **7/8,
+MRR 0.94** — one near-tie.
+
+**Honest reading:** on this small, **text-clean** corpus *both* paths retrieve
+perfectly — accuracy doesn't separate them. The real trade-off is qualitative:
+traditional RAG is the cheap, strong default for **text-rich** documents; visual
+RAG earns its keep where text extraction **fails or loses structure** (scans,
+complex layouts, tables, charts) — which this corpus deliberately doesn't stress.
+A document-specialized visual model (Qwen3-VL / ColPali, GPU) would lift the
+visual numbers above the CLIP baseline. See [`docs/BENCHMARK.md`](./docs/BENCHMARK.md).
+
+---
+
 ## Benchmark harness (metaharness / darwin)
 
 The benchmark suite is **darwin-generated** (`.metaharness/bench.json`) and
diff --git a/docs/BENCHMARK.md b/docs/BENCHMARK.md
@@ -0,0 +1,101 @@
+# Benchmark — traditional (text) RAG vs visual RAG
+
+A like-for-like comparison of the two retrieval paths in rupixel, on the **same
+documents, the same queries, and the same ground truth** — only the *modality*
+differs:
+
+- **Traditional / text RAG** — `all-MiniLM-L6-v2` (384-d) embeds each page's
+  **extracted text**; the text query is matched against text vectors.
+- **Visual RAG** — `clip-vit-base-patch32` (512-d) embeds each page's **rendered
+  screenshot**; the text query is matched against image vectors (cross-modal).
+
+> **Honesty up front:** this corpus is small (8 documents) and **text-clean**
+> (Wikipedia articles with a good text layer and topically distinct subjects).
+> On data like this, *both* paths are expected to do well — and they do. This
+> benchmark is here to show the comparison is **real and reproducible**, and to
+> be honest about *where each modality actually wins*, not to manufacture a gap.
+
+## Setup
+
+- **Corpus:** 8 documents across 8 distinct topics (black holes, French
+  Revolution, photosynthesis, espresso, TCP/IP, baroque music, sunflowers, the
+  Great Barrier Reef). Each exists in **both** modalities:
+  text in `tests/fixtures/pixelrag/compare/text/tiles/*.txt`, screenshot in
+  `tests/fixtures/pixelrag/visual/images/*.png` (rendered with `pixelrag-render`).
+- **Queries:** 8 paraphrase queries (one per topic) sharing meaning but little
+  vocabulary with their target — so retrieval must be *semantic*, not keyword.
+- **Ground truth:** 1 relevant document per query. **Index:** ruvector HNSW.
+- **Embedders run on CPU/WASM** (no GPU): MiniLM and CLIP via the same
+  transformers.js sidecars the demos use.
+
+## Results (measured)
+
+| Metric | Traditional text RAG (MiniLM) | Visual RAG (CLIP) |
+|---|---:|---:|
+| **top-1 accuracy** | **1.00** (8/8) | **1.00** (8/8)¹ |
+| recall@10 | 1.00 | 1.00 |
+| nDCG@10 | 1.00 | 1.00 |
+| MRR | 1.00 | 1.00 |
+| query latency p50 | 0.62 ms | 0.52 ms |
+| embedding dim | 384 | 512 |
+| model (quantized) | all-MiniLM-L6-v2 (~23 MB) | clip-vit-base-patch32 (~85 MB) |
+| input it needs | a clean **text layer** | a **rendered image** (pixels) |
+| pre-step required | text extraction / parse | page render (`pixelrag-render`) |
+
+¹ **8/8 with the native (sharp) image preprocessing used by the Rust bench.** The
+**in-browser** demo (canvas preprocessing) scores **7/8 top-1, MRR 0.94** — one
+near-tie, where *"a vibrant underwater coral ecosystem"* ranks the coral-reef
+page #2 behind photosynthesis (both green nature scenes; scores within 0.02).
+Same model, different image resampling → the tie flips. Reproduce in your browser
+at the [visual demo](https://ruvnet.github.io/rupixel/visual.html).
+
+## What this does — and doesn't — show
+
+**Accuracy ties here.** With distinct topics and a clean text layer, both
+modalities retrieve perfectly. Accuracy alone does **not** separate them on this
+corpus, and we don't pretend it does.
+
+**The real trade-off is qualitative:**
+
+| | Traditional text RAG | Visual RAG |
+|---|---|---|
+| Needs a usable text layer | **Yes** — breaks on scans, image-only PDFs, screenshots, charts | **No** — reads pixels directly |
+| Preserves layout / tables / figures | No — flattened to a token stream | **Yes** — the page *is* the input |
+| Fine-grained text understanding | **Strong** | Weaker (CLIP ViT-B/32 is a baseline) |
+| Cost per doc | text parse (cheap) | render + larger model (heavier) |
+
+So: **traditional RAG is the right default for clean, text-rich documents** —
+it's cheap, fast, and strong. **Visual RAG earns its keep where text extraction
+fails or loses structure** — scanned documents, complex layouts, tables, charts,
+forms — which *this* corpus deliberately does not stress.
+
+## Where visual RAG should win (next benchmark)
+
+The honest next step is a corpus that breaks text extraction: scanned/image-only
+pages, multi-column layouts, table- and chart-heavy documents. There, text RAG
+degrades (or returns nothing) while visual RAG still retrieves. A
+document-specialized visual encoder (**Qwen3-VL / ColPali**, GPU) would also lift
+the visual numbers well above the CLIP-baseline used here. That comparison is
+tracked as future work — we report only what we have measured.
+
+## Reproduce
+
+```bash
+# from a ruvector checkout that includes the pixelrag crates
+( cd crates/pixelrag-cli/sidecar && npm install )   # MiniLM + CLIP sidecars
+
+# Traditional text RAG (MiniLM over extracted page text)
+cargo run -p pixelrag-cli -- benchmark --mode text --embedder real \
+  --ground-truth tests/fixtures/pixelrag/compare/text/ground-truth.json \
+  --queries      tests/fixtures/pixelrag/compare/text/queries.json \
+  --tiles        tests/fixtures/pixelrag/compare/text/tiles \
+  --metrics ndcg,mrr,recall@10 --index-backend hnsw
+
+# Visual RAG (CLIP over rendered screenshots, same 8 docs/queries)
+cargo run -p pixelrag-cli -- benchmark --mode visual --index-backend hnsw
+```
+
+Both write JSON reports to `bench_output/`. The visual demo
+([visual.html](https://ruvnet.github.io/rupixel/visual.html)) and text demo
+([index.html](https://ruvnet.github.io/rupixel/)) run the same two models live in
+your browser.
diff --git a/tests/fixtures/pixelrag/compare/text/ground-truth.json b/tests/fixtures/pixelrag/compare/text/ground-truth.json
@@ -0,0 +1,54 @@
+{
+  "dataset": "pixelrag-compare-8doc",
+  "k": 10,
+  "relevance": [
+    {
+      "query_id": "vq1",
+      "relevant": [
+        "doc-00"
+      ]
+    },
+    {
+      "query_id": "vq2",
+      "relevant": [
+        "doc-01"
+      ]
+    },
+    {
+      "query_id": "vq3",
+      "relevant": [
+        "doc-02"
+      ]
+    },
+    {
+      "query_id": "vq4",
+      "relevant": [
+        "doc-03"
+      ]
+    },
+    {
+      "query_id": "vq5",
+      "relevant": [
+        "doc-04"
+      ]
+    },
+    {
+      "query_id": "vq6",
+      "relevant": [
+        "doc-05"
+      ]
+    },
+    {
+      "query_id": "vq7",
+      "relevant": [
+        "doc-06"
+      ]
+    },
+    {
+      "query_id": "vq8",
+      "relevant": [
+        "doc-07"
+      ]
+    }
+  ]
+}
diff --git a/tests/fixtures/pixelrag/compare/text/queries.json b/tests/fixtures/pixelrag/compare/text/queries.json
@@ -0,0 +1,36 @@
+{
+  "queries": [
+    {
+      "query_id": "vq1",
+      "text": "the unseen monster lurking at a galaxy's center"
+    },
+    {
+      "query_id": "vq2",
+      "text": "the storming of the Bastille and the guillotine"
+    },
+    {
+      "query_id": "vq3",
+      "text": "how green plants turn sunlight into chemical energy"
+    },
+    {
+      "query_id": "vq4",
+      "text": "a small strong shot of Italian coffee with crema"
+    },
+    {
+      "query_id": "vq5",
+      "text": "how computers split data into packets across a network"
+    },
+    {
+      "query_id": "vq6",
+      "text": "ornate, dramatic 17th-century classical music"
+    },
+    {
+      "query_id": "vq7",
+      "text": "a tall yellow flower that follows the sun"
+    },
+    {
+      "query_id": "vq8",
+      "text": "a vibrant underwater coral ecosystem"
+    }
+  ]
+}
diff --git a/tests/fixtures/pixelrag/compare/text/tiles/doc-00.txt b/tests/fixtures/pixelrag/compare/text/tiles/doc-00.txt
@@ -0,0 +1,5 @@
+A black hole is an astronomical body so compact that its gravity prevents anything, including light, from escaping. Albert Einstein's theory of general relativity, which describes gravitation as the curvature of spacetime, predicts that any sufficiently compact mass will form a black hole. The boundary of no escape is called the event horizon. In general relativity, crossing a black hole's event horizon traps an object inside but produces no locally detectable change. General relativity also predicts that every black hole should have a central singularity, where the curvature of spacetime is infinite.
+
+Objects whose gravitational fields are too strong for light to escape were first considered in the 18th century. In 1916, the first solution of general relativity that would characterise a black hole was found. By the late 1950s, this solution began to be interpreted physically as a region of space from which nothing can escape. Black holes were long considered a mathematical curiosity; it was not until the 1960s that theoretical work showed they were a generic prediction of general relativity. The first widely accepted black hole was Cygnus X-1, identified by several researchers independently in 1971.
+
+Black holes typically form when very massive stars collapse at the end of their life cycle. After a black hole has formed, it can grow by absorbing mass from its surroundings. Supermassive black holes of millions of solar masses may form by absorbing stars and merging with other black holes, or via direct collapse of gas clouds. There is consensus that supermassive black holes
diff --git a/tests/fixtures/pixelrag/compare/text/tiles/doc-01.txt b/tests/fixtures/pixelrag/compare/text/tiles/doc-01.txt
@@ -0,0 +1,7 @@
+The French Revolution[a] was a period of political and societal change in France that began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799. Many of the revolution's ideas are considered fundamental principles of liberal democracy, and its values remain central to modern French political discourse. It was caused by a combination of social, political, and economic factors which the existing regime proved unable to manage.
+
+Financial crisis and widespread social distress led to the convocation of the Estates General in May 1789, its first meeting since 1614. The representatives of the Third Estate broke away and re-constituted themselves as a National Assembly in June. The Storming of the Bastille in Paris on 14 July led to a series of radical measures by the Assembly, including the abolition of feudalism, state control over the Catholic Church in France, and issuing the Declaration of the Rights of Man and of the Citizen.
+
+The next three years were dominated by a struggle for political control. King Louis XVI's attempted flight to Varennes in June 1791 further discredited the monarchy, and military defeats after the outbreak of the French Revolutionary Wars in April 1792 led to the insurrection of 10 August 1792. As a result, the monarchy was replaced by the French First Republic in September, followed by the execution of Louis XVI himself in January 1793.
+
+After another revolt in June 1793, the constitution was suspended, and political power passed from the National Convention to the Committee of Public Safety, dominated by radical
diff --git a/tests/fixtures/pixelrag/compare/text/tiles/doc-02.txt b/tests/fixtures/pixelrag/compare/text/tiles/doc-02.txt
@@ -0,0 +1,3 @@
+Photosynthesis[note 1] is a system of biological processes by which photopigment-bearing autotrophic organisms, such as most plants, algae and cyanobacteria, convert light energy — typically from sunlight — into the chemical energy necessary to fuel their metabolism. The term photosynthesis usually refers to oxygenic photosynthesis, a process that releases oxygen as a byproduct of water splitting. Photosynthetic organisms store the converted chemical energy within the bonds of intracellular organic compounds (complex compounds containing carbon), typically carbohydrates like sugars (mainly glucose, fructose and sucrose), starches, phytoglycogen and cellulose. When needing to use this stored energy, an organism's cells then metabolize the organic compounds through cellular respiration. Photosynthesis plays a critical role in producing and maintaining the oxygen content of the Earth's atmosphere, and it supplies most of the biological energy necessary for complex life on Earth.
+
+Some organisms also perform anoxygenic photosynthesis, which does not produce oxygen. Some bacteria (e.g. purple bacteria) use bacteriochlorophyll to split hydrogen sulfide as a reductant instead of water, releasing sulfur instead of oxygen, which was a dominant form of photosynthesis in the euxinic Canfield oceans during the Boring Billion. Archaea such as Halobacterium also perform a type of non-carbon-fixing anoxygenic photosynthesis, where the simpler photopigment retinal and its microbial rhodopsin derivatives are used to absorb green light and produce a proton (hydron) gradient across the cell m
diff --git a/tests/fixtures/pixelrag/compare/text/tiles/doc-03.txt b/tests/fixtures/pixelrag/compare/text/tiles/doc-03.txt
@@ -0,0 +1,7 @@
+Espresso (/ɛˈsprɛsoʊ/ ⓘ, Italian: [eˈsprɛsso]) is a concentrated form of coffee produced by forcing hot water under high pressure through finely ground coffee beans. Originating in Italy, espresso has become one of the most popular coffee-brewing methods worldwide. It is characterized by its small serving size, typically 25–30 ml, and its distinctive layers: a dark body topped with a lighter-colored foam called "crema".
+
+Espresso machines use pressure to extract a highly concentrated coffee with a complex flavor profile in a short time, usually 25–30 seconds. The result is a beverage with a higher concentration of suspended and dissolved solids than regular drip coffee, giving espresso its characteristic body and intensity. While espresso contains more caffeine per unit volume than most coffee beverages, its typical serving size results in less caffeine per serving compared to larger drinks such as drip coffee.
+
+Espresso serves as the base for other coffee drinks, including cappuccino, caffè latte, and americano. It can be made with various types of coffee beans and roast levels, allowing for a wide range of flavors and strengths, despite the widespread myth that it is made with dark-roast coffee beans. The quality of an espresso is influenced by factors such as the grind size, water temperature, pressure, and the barista's skill in tamping (packing and leveling) the coffee grounds.
+
+The cultural significance of espresso extends beyond its consumption, playing a central role in coffee shop culture and the third-wave coffee movement, which emphasizes artisanal production and
diff --git a/tests/fixtures/pixelrag/compare/text/tiles/doc-04.txt b/tests/fixtures/pixelrag/compare/text/tiles/doc-04.txt
@@ -0,0 +1,5 @@
+The Internet protocol suite, commonly known as TCP/IP, is a framework for organizing the communication protocols used in the Internet and similar computer networks according to functional criteria. The foundational protocols in the suite are the Transmission Control Protocol (TCP), the User Datagram Protocol (UDP), and the Internet Protocol (IP). Early versions of this networking model were known as the Department of Defense (DoD) Internet Architecture Model because the research and development were funded by the Defense Advanced Research Projects Agency (DARPA) of the United States Department of Defense.
+
+The Internet protocol suite provides end-to-end data communication specifying how data should be packetized, addressed, transmitted, routed, and received. This functionality is organized into four abstraction layers, which classify all related protocols according to each protocol's scope of networking. An implementation of the layers for a particular application forms a protocol stack. From lowest to highest, the layers are the link layer, containing communication methods for data that remains within a single network segment (link); the internet layer, providing internetworking between independent networks; the transport layer, handling host-to-host communication; and the application layer, providing process-to-process data exchange for applications.
+
+The technical standards underlying the Internet protocol suite and its constituent protocols are maintained by the Internet Engineering Task Force (IETF). The Internet protocol suite predates the OSI model, a more comprehens
diff --git a/tests/fixtures/pixelrag/compare/text/tiles/doc-05.txt b/tests/fixtures/pixelrag/compare/text/tiles/doc-05.txt
@@ -0,0 +1,5 @@
+Baroque music (UK: /bəˈrɒk/ or US: /bəˈroʊk/) refers to the period or dominant style of Western classical music composed from about 1600 to 1750. The Baroque style followed the Renaissance period, and was followed in turn by the Classical period after a short transition (the galant style). Baroque music forms a major portion of the "classical music" canon, and continues to be widely studied, performed, and listened to. Key composers of the Baroque era include Jacopo Peri, who wrote the first operas; Alessandro Stradella, who originated the concerto grosso style; and Arcangelo Corelli, who was one of the first composers to publish widely and have his music performed across Europe.
+
+The Baroque period saw the formalization of common-practice tonality, an approach to writing music in which a song or piece is written in a particular key; this type of harmony has continued to be used extensively in Western classical and popular music. Baroque composers experimented with finding a fuller sound for each instrumental part, leading to the creation of the modern orchestra; modernised musical notation, including developing figured bass; and developed new instrumental playing techniques. Baroque music expanded the size, range, and complexity of instrumental performance, and also established the mixed vocal/instrumental forms of opera, cantata and oratorio and the instrumental forms of the solo concerto and sonata as musical genres. Dense, complex polyphonic music, in which multiple independent melody lines were performed simultaneously.
+
+During the Baroque era professional musicians we
diff --git a/tests/fixtures/pixelrag/compare/text/tiles/doc-06.txt b/tests/fixtures/pixelrag/compare/text/tiles/doc-06.txt
@@ -0,0 +1,5 @@
+The common sunflower (Helianthus annuus) is a large annual forb in the daisy family Asteraceae. The domesticated form of common sunflower is harvested for its edible seeds, which come in two types: oil and confectionary seeds. Oilseed sunflowers are widely grown globally and represent the fourth most used vegetable oil in the world. They also are used widely as bird food or as food for livestock. In contrast, confectionary sunflower seeds are often eaten as a snack food or in baking. There also are horticultural sunflower varieties that are used as plantings in domestic gardens for aesthetics. Wild plants are known for their multiple flower heads, whereas the domestic sunflower often possesses a single large flower head atop an unbranched stem.
+
+The plant has an erect rough-hairy stem, reaching typical heights of 3 metres (10 feet). The tallest sunflower on record achieved 10.9 m (35 ft 9 in). Sunflower leaves are broad, coarsely toothed, rough and mostly alternate; those near the bottom are largest and commonly heart-shaped.
+
+The plant flowers in summer. What is often called the "flower" of the sunflower is actually a "flower head" (pseudanthium), 7.5–12.5 centimetres (3–5 in) wide, of numerous small individual five-petaled flowers ("florets"). The outer flowers, which resemble petals, are called ray flowers. Each "petal" consists of a ligule composed of fused petals of an asymmetrical ray flower. They are sexually sterile and may be yellow, red, orange, or other colors. The spirally arranged flowers in the center of the head are called disk flowers. These mature into frui
diff --git a/tests/fixtures/pixelrag/compare/text/tiles/doc-07.txt b/tests/fixtures/pixelrag/compare/text/tiles/doc-07.txt

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,3 @@`
	`1`	+Photosynthesis[note 1] is a system of biological processes by which photopigment-bearing autotrophic organisms, such as most plants, algae and cyanobacteria, convert light energy — typically from sunlight — into the chemical energy necessary to fuel their metabolism. The term photosynthesis usually refers to oxygenic photosynthesis, a process that releases oxygen as a byproduct of water splitting. Photosynthetic organisms store the converted chemical energy within the bonds of intracellular organic compounds (complex compounds containing carbon), typically carbohydrates like sugars (mainly glucose, fructose and sucrose), starches, phytoglycogen and cellulose. When needing to use this stored energy, an organism's cells then metabolize the organic compounds through cellular respiration. Photosynthesis plays a critical role in producing and maintaining the oxygen content of the Earth's atmosphere, and it supplies most of the biological energy necessary for complex life on Earth.
	`2`	`+`
	`3`	+Some organisms also perform anoxygenic photosynthesis, which does not produce oxygen. Some bacteria (e.g. purple bacteria) use bacteriochlorophyll to split hydrogen sulfide as a reductant instead of water, releasing sulfur instead of oxygen, which was a dominant form of photosynthesis in the euxinic Canfield oceans during the Boring Billion. Archaea such as Halobacterium also perform a type of non-carbon-fixing anoxygenic photosynthesis, where the simpler photopigment retinal and its microbial rhodopsin derivatives are used to absorb green light and produce a proton (hydron) gradient across the cell m