Commit 4cd6df1

docs: traditional vs visual RAG benchmark (BENCHMARK.md) + README comparison
Same 8 docs / 8 queries, two modalities: MiniLM over text vs CLIP over screenshots. Both 1.00 top-1 on this clean corpus (honest: ties here; visual wins on layout/ scanned docs — future work). README gains a comparison table linking docs/BENCHMARK.md. Co-Authored-By: claude-flow <ruv@ruv.net>
1 parent a6e1aa9 commit 4cd6df1

12 files changed

Lines changed: 259 additions & 0 deletions

README.md

Lines changed: 28 additions & 0 deletions
@@ -147,6 +147,34 @@ query ──embed──▶ vector ──────────search───
---

## Benchmark: traditional (text) RAG vs visual RAG

Same 8 documents, same 8 paraphrase queries, same ground truth. Only the
*modality* differs: **MiniLM over each page's extracted text** vs **CLIP over each
page's rendered screenshot**. Full method, caveats, and reproduce commands are in
**[`docs/BENCHMARK.md`](./docs/BENCHMARK.md)**.

| Metric | Traditional text RAG (MiniLM) | Visual RAG (CLIP) |
|---|---:|---:|
| top-1 accuracy | **1.00** (8/8) | **1.00** (8/8)¹ |
| nDCG@10 / MRR | 1.00 / 1.00 | 1.00 / 1.00 |
| query latency p50 | 0.62 ms | 0.52 ms |
| embedding dim | 384 | 512 |
| needs | a clean **text layer** | a **rendered image** |

¹ 8/8 with native (sharp) preprocessing; the **in-browser** demo (canvas) scores
**7/8 top-1, MRR 0.94**, with one near-tie.

**Honest reading:** on this small, **text-clean** corpus *both* paths retrieve
perfectly, so accuracy doesn't separate them. The real trade-off is qualitative:
traditional RAG is the cheap, strong default for **text-rich** documents; visual
RAG earns its keep where text extraction **fails or loses structure** (scans,
complex layouts, tables, charts), which this corpus deliberately doesn't stress.
A document-specialized visual model (Qwen3-VL / ColPali, GPU) would be expected
to lift the visual numbers above the CLIP baseline. See [`docs/BENCHMARK.md`](./docs/BENCHMARK.md).

---

## Benchmark harness (metaharness / darwin)

The benchmark suite is **darwin-generated** (`.metaharness/bench.json`) and

docs/BENCHMARK.md

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
# Benchmark: traditional (text) RAG vs visual RAG

A like-for-like comparison of the two retrieval paths in rupixel, on the **same
documents, the same queries, and the same ground truth**. Only the *modality*
differs:

- **Traditional / text RAG:** `all-MiniLM-L6-v2` (384-d) embeds each page's
  **extracted text**; the text query is matched against text vectors.
- **Visual RAG:** `clip-vit-base-patch32` (512-d) embeds each page's **rendered
  screenshot**; the text query is matched against image vectors (cross-modal).

> **Honesty up front:** this corpus is small (8 documents) and **text-clean**
> (Wikipedia articles with a good text layer and topically distinct subjects).
> On data like this, *both* paths are expected to do well, and they do. This
> benchmark is here to show the comparison is **real and reproducible**, and to
> be honest about *where each modality actually wins*, not to manufacture a gap.

## Setup

- **Corpus:** 8 documents across 8 distinct topics (black holes, French
  Revolution, photosynthesis, espresso, TCP/IP, baroque music, sunflowers, the
  Great Barrier Reef). Each exists in **both** modalities:
  text in `tests/fixtures/pixelrag/compare/text/tiles/*.txt`, screenshot in
  `tests/fixtures/pixelrag/visual/images/*.png` (rendered with `pixelrag-render`).
- **Queries:** 8 paraphrase queries (one per topic) sharing meaning but little
  vocabulary with their target, so retrieval must be *semantic*, not keyword.
- **Ground truth:** 1 relevant document per query.
- **Index:** ruvector HNSW.
- **Embedders run on CPU/WASM** (no GPU): MiniLM and CLIP via the same
  transformers.js sidecars the demos use.
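
Both paths reduce to the same ranking step: embed the query, then order the documents by similarity to it. The sketch below illustrates that step in Python with brute-force cosine similarity over toy 3-d vectors. The function names and the vectors are illustrative assumptions; the benchmark itself uses the ruvector HNSW index and real 384-d / 512-d embeddings, and this is not the project's code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return doc ids sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]

# Toy vectors standing in for real embeddings (hypothetical values).
docs = {
    "doc-00": [1.0, 0.0, 0.0],
    "doc-01": [0.0, 1.0, 0.0],
    "doc-02": [0.6, 0.8, 0.0],
}
ranking = rank_documents([0.9, 0.1, 0.0], docs)
```

The only difference between the two modalities in this picture is where `doc_vecs` comes from: text embeddings in one case, image embeddings in the other.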

## Results (measured)

| Metric | Traditional text RAG (MiniLM) | Visual RAG (CLIP) |
|---|---:|---:|
| **top-1 accuracy** | **1.00** (8/8) | **1.00** (8/8)¹ |
| recall@10 | 1.00 | 1.00 |
| nDCG@10 | 1.00 | 1.00 |
| MRR | 1.00 | 1.00 |
| query latency p50 | 0.62 ms | 0.52 ms |
| embedding dim | 384 | 512 |
| model (quantized) | all-MiniLM-L6-v2 (~23 MB) | clip-vit-base-patch32 (~85 MB) |
| input it needs | a clean **text layer** | a **rendered image** (pixels) |
| pre-step required | text extraction / parse | page render (`pixelrag-render`) |

¹ **8/8 with the native (sharp) image preprocessing used by the Rust bench.** The
**in-browser** demo (canvas preprocessing) scores **7/8 top-1, MRR 0.94**, with one
near-tie: *"a vibrant underwater coral ecosystem"* ranks the coral-reef
page #2 behind photosynthesis (both green nature scenes; scores within 0.02).
Same model, different image resampling, so the tie flips. Reproduce in your browser
at the [visual demo](https://ruvnet.github.io/rupixel/visual.html).
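
Because every query has exactly one relevant document, all four accuracy metrics follow from the rank at which that document is returned. The Python sketch below shows the arithmetic under that assumption; the function name `retrieval_metrics` is hypothetical and this is not the Rust bench's implementation. It also checks the footnote: seven queries at rank 1 and one at rank 2 give MRR = (7 + 0.5) / 8 = 0.9375, which rounds to the reported 0.94.

```python
import math

def retrieval_metrics(ranks, k=10):
    """Compute metrics from the 1-based rank of the single relevant doc per query.

    A rank of None means the relevant document was not retrieved at all.
    """
    n = len(ranks)
    hits = [r for r in ranks if r is not None and r <= k]
    return {
        "top1": sum(1 for r in ranks if r == 1) / n,
        f"recall@{k}": len(hits) / n,
        "mrr": sum(1.0 / r for r in ranks if r is not None) / n,
        # With one relevant doc the ideal DCG is 1, so nDCG is just the discounted gain.
        f"ndcg@{k}": sum(1.0 / math.log2(r + 1) for r in hits) / n,
    }

native = retrieval_metrics([1] * 8)          # every query answered at rank 1
browser = retrieval_metrics([1] * 7 + [2])   # one query answered at rank 2
```

Here `native` reproduces the 1.00 column for the native run, and `browser["mrr"]` is 0.9375, consistent with the 7/8 in-browser result.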

## What this does (and doesn't) show

**Accuracy ties here.** With distinct topics and a clean text layer, both
modalities retrieve perfectly. Accuracy alone does **not** separate them on this
corpus, and we don't pretend it does.

**The real trade-off is qualitative:**

| | Traditional text RAG | Visual RAG |
|---|---|---|
| Needs a usable text layer | **Yes**: breaks on scans, image-only PDFs, screenshots, charts | **No**: reads pixels directly |
| Preserves layout / tables / figures | No: flattened to a token stream | **Yes**: the page *is* the input |
| Fine-grained text understanding | **Strong** | Weaker (CLIP ViT-B/32 is a baseline) |
| Cost per doc | text parse (cheap) | render + larger model (heavier) |

So **traditional RAG is the right default for clean, text-rich documents**:
it's cheap, fast, and strong. **Visual RAG earns its keep where text extraction
fails or loses structure** (scanned documents, complex layouts, tables, charts,
forms), which *this* corpus deliberately does not stress.

## Where visual RAG should win (next benchmark)

The honest next step is a corpus that breaks text extraction: scanned/image-only
pages, multi-column layouts, table- and chart-heavy documents. There, text RAG
is expected to degrade (or return nothing) while visual RAG should still retrieve.
A document-specialized visual encoder (**Qwen3-VL / ColPali**, GPU) would likely
also lift the visual numbers well above the CLIP baseline used here. That
comparison is tracked as future work; we report only what we have measured.

## Reproduce

```bash
# from a ruvector checkout that includes the pixelrag crates
( cd crates/pixelrag-cli/sidecar && npm install )   # MiniLM + CLIP sidecars

# Traditional text RAG (MiniLM over extracted page text)
cargo run -p pixelrag-cli -- benchmark --mode text --embedder real \
  --ground-truth tests/fixtures/pixelrag/compare/text/ground-truth.json \
  --queries tests/fixtures/pixelrag/compare/text/queries.json \
  --tiles tests/fixtures/pixelrag/compare/text/tiles \
  --metrics ndcg,mrr,recall@10 --index-backend hnsw

# Visual RAG (CLIP over rendered screenshots, same 8 docs/queries)
cargo run -p pixelrag-cli -- benchmark --mode visual --index-backend hnsw
```

Both write JSON reports to `bench_output/`. The visual demo
([visual.html](https://ruvnet.github.io/rupixel/visual.html)) and text demo
([index.html](https://ruvnet.github.io/rupixel/)) run the same two models live in
your browser.
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
{
  "dataset": "pixelrag-compare-8doc",
  "k": 10,
  "relevance": [
    {
      "query_id": "vq1",
      "relevant": [
        "doc-00"
      ]
    },
    {
      "query_id": "vq2",
      "relevant": [
        "doc-01"
      ]
    },
    {
      "query_id": "vq3",
      "relevant": [
        "doc-02"
      ]
    },
    {
      "query_id": "vq4",
      "relevant": [
        "doc-03"
      ]
    },
    {
      "query_id": "vq5",
      "relevant": [
        "doc-04"
      ]
    },
    {
      "query_id": "vq6",
      "relevant": [
        "doc-05"
      ]
    },
    {
      "query_id": "vq7",
      "relevant": [
        "doc-06"
      ]
    },
    {
      "query_id": "vq8",
      "relevant": [
        "doc-07"
      ]
    }
  ]
}
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
{
  "queries": [
    {
      "query_id": "vq1",
      "text": "the unseen monster lurking at a galaxy's center"
    },
    {
      "query_id": "vq2",
      "text": "the storming of the Bastille and the guillotine"
    },
    {
      "query_id": "vq3",
      "text": "how green plants turn sunlight into chemical energy"
    },
    {
      "query_id": "vq4",
      "text": "a small strong shot of Italian coffee with crema"
    },
    {
      "query_id": "vq5",
      "text": "how computers split data into packets across a network"
    },
    {
      "query_id": "vq6",
      "text": "ornate, dramatic 17th-century classical music"
    },
    {
      "query_id": "vq7",
      "text": "a tall yellow flower that follows the sun"
    },
    {
      "query_id": "vq8",
      "text": "a vibrant underwater coral ecosystem"
    }
  ]
}
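
The two fixture files above join on `query_id`: the queries file supplies the text, the ground-truth file supplies the relevant document ids. The Python sketch below shows one way a harness might pair them and fail loudly on a missing entry. The helper `build_eval_set` is hypothetical, and only two of the eight entries are inlined to keep the example short; how the Rust CLI actually loads these files is not shown in this commit.

```python
import json

# Two entries copied from the fixtures; the real files hold eight each.
ground_truth = json.loads("""
{"dataset": "pixelrag-compare-8doc", "k": 10,
 "relevance": [{"query_id": "vq1", "relevant": ["doc-00"]},
               {"query_id": "vq8", "relevant": ["doc-07"]}]}
""")
queries = json.loads("""
{"queries": [{"query_id": "vq1", "text": "the unseen monster lurking at a galaxy's center"},
             {"query_id": "vq8", "text": "a vibrant underwater coral ecosystem"}]}
""")

def build_eval_set(queries, ground_truth):
    """Pair each query text with its relevant doc ids; raise if one has no ground truth."""
    relevant = {row["query_id"]: row["relevant"] for row in ground_truth["relevance"]}
    eval_set = []
    for q in queries["queries"]:
        if q["query_id"] not in relevant:
            raise KeyError(f"no ground truth for {q['query_id']}")
        eval_set.append((q["query_id"], q["text"], relevant[q["query_id"]]))
    return eval_set

eval_set = build_eval_set(queries, ground_truth)
```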
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
A black hole is an astronomical body so compact that its gravity prevents anything, including light, from escaping. Albert Einstein's theory of general relativity, which describes gravitation as the curvature of spacetime, predicts that any sufficiently compact mass will form a black hole. The boundary of no escape is called the event horizon. In general relativity, crossing a black hole's event horizon traps an object inside but produces no locally detectable change. General relativity also predicts that every black hole should have a central singularity, where the curvature of spacetime is infinite.
Objects whose gravitational fields are too strong for light to escape were first considered in the 18th century. In 1916, the first solution of general relativity that would characterise a black hole was found. By the late 1950s, this solution began to be interpreted physically as a region of space from which nothing can escape. Black holes were long considered a mathematical curiosity; it was not until the 1960s that theoretical work showed they were a generic prediction of general relativity. The first widely accepted black hole was Cygnus X-1, identified by several researchers independently in 1971.
Black holes typically form when very massive stars collapse at the end of their life cycle. After a black hole has formed, it can grow by absorbing mass from its surroundings. Supermassive black holes of millions of solar masses may form by absorbing stars and merging with other black holes, or via direct collapse of gas clouds. There is consensus that supermassive black holes
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
The French Revolution[a] was a period of political and societal change in France that began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799. Many of the revolution's ideas are considered fundamental principles of liberal democracy, and its values remain central to modern French political discourse. It was caused by a combination of social, political, and economic factors which the existing regime proved unable to manage.
Financial crisis and widespread social distress led to the convocation of the Estates General in May 1789, its first meeting since 1614. The representatives of the Third Estate broke away and re-constituted themselves as a National Assembly in June. The Storming of the Bastille in Paris on 14 July led to a series of radical measures by the Assembly, including the abolition of feudalism, state control over the Catholic Church in France, and issuing the Declaration of the Rights of Man and of the Citizen.
The next three years were dominated by a struggle for political control. King Louis XVI's attempted flight to Varennes in June 1791 further discredited the monarchy, and military defeats after the outbreak of the French Revolutionary Wars in April 1792 led to the insurrection of 10 August 1792. As a result, the monarchy was replaced by the French First Republic in September, followed by the execution of Louis XVI himself in January 1793.
After another revolt in June 1793, the constitution was suspended, and political power passed from the National Convention to the Committee of Public Safety, dominated by radical
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
Photosynthesis[note 1] is a system of biological processes by which photopigment-bearing autotrophic organisms, such as most plants, algae and cyanobacteria, convert light energy — typically from sunlight — into the chemical energy necessary to fuel their metabolism. The term photosynthesis usually refers to oxygenic photosynthesis, a process that releases oxygen as a byproduct of water splitting. Photosynthetic organisms store the converted chemical energy within the bonds of intracellular organic compounds (complex compounds containing carbon), typically carbohydrates like sugars (mainly glucose, fructose and sucrose), starches, phytoglycogen and cellulose. When needing to use this stored energy, an organism's cells then metabolize the organic compounds through cellular respiration. Photosynthesis plays a critical role in producing and maintaining the oxygen content of the Earth's atmosphere, and it supplies most of the biological energy necessary for complex life on Earth.
Some organisms also perform anoxygenic photosynthesis, which does not produce oxygen. Some bacteria (e.g. purple bacteria) use bacteriochlorophyll to split hydrogen sulfide as a reductant instead of water, releasing sulfur instead of oxygen, which was a dominant form of photosynthesis in the euxinic Canfield oceans during the Boring Billion. Archaea such as Halobacterium also perform a type of non-carbon-fixing anoxygenic photosynthesis, where the simpler photopigment retinal and its microbial rhodopsin derivatives are used to absorb green light and produce a proton (hydron) gradient across the cell m
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
Espresso (/ɛˈsprɛsoʊ/ ⓘ, Italian: [eˈsprɛsso]) is a concentrated form of coffee produced by forcing hot water under high pressure through finely ground coffee beans. Originating in Italy, espresso has become one of the most popular coffee-brewing methods worldwide. It is characterized by its small serving size, typically 25–30 ml, and its distinctive layers: a dark body topped with a lighter-colored foam called "crema".
Espresso machines use pressure to extract a highly concentrated coffee with a complex flavor profile in a short time, usually 25–30 seconds. The result is a beverage with a higher concentration of suspended and dissolved solids than regular drip coffee, giving espresso its characteristic body and intensity. While espresso contains more caffeine per unit volume than most coffee beverages, its typical serving size results in less caffeine per serving compared to larger drinks such as drip coffee.
Espresso serves as the base for other coffee drinks, including cappuccino, caffè latte, and americano. It can be made with various types of coffee beans and roast levels, allowing for a wide range of flavors and strengths, despite the widespread myth that it is made with dark-roast coffee beans. The quality of an espresso is influenced by factors such as the grind size, water temperature, pressure, and the barista's skill in tamping (packing and leveling) the coffee grounds.
The cultural significance of espresso extends beyond its consumption, playing a central role in coffee shop culture and the third-wave coffee movement, which emphasizes artisanal production and
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
The Internet protocol suite, commonly known as TCP/IP, is a framework for organizing the communication protocols used in the Internet and similar computer networks according to functional criteria. The foundational protocols in the suite are the Transmission Control Protocol (TCP), the User Datagram Protocol (UDP), and the Internet Protocol (IP). Early versions of this networking model were known as the Department of Defense (DoD) Internet Architecture Model because the research and development were funded by the Defense Advanced Research Projects Agency (DARPA) of the United States Department of Defense.
The Internet protocol suite provides end-to-end data communication specifying how data should be packetized, addressed, transmitted, routed, and received. This functionality is organized into four abstraction layers, which classify all related protocols according to each protocol's scope of networking. An implementation of the layers for a particular application forms a protocol stack. From lowest to highest, the layers are the link layer, containing communication methods for data that remains within a single network segment (link); the internet layer, providing internetworking between independent networks; the transport layer, handling host-to-host communication; and the application layer, providing process-to-process data exchange for applications.
The technical standards underlying the Internet protocol suite and its constituent protocols are maintained by the Internet Engineering Task Force (IETF). The Internet protocol suite predates the OSI model, a more comprehens
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
Baroque music (UK: /bəˈrɒk/ or US: /bəˈroʊk/) refers to the period or dominant style of Western classical music composed from about 1600 to 1750. The Baroque style followed the Renaissance period, and was followed in turn by the Classical period after a short transition (the galant style). Baroque music forms a major portion of the "classical music" canon, and continues to be widely studied, performed, and listened to. Key composers of the Baroque era include Jacopo Peri, who wrote the first operas; Alessandro Stradella, who originated the concerto grosso style; and Arcangelo Corelli, who was one of the first composers to publish widely and have his music performed across Europe.
The Baroque period saw the formalization of common-practice tonality, an approach to writing music in which a song or piece is written in a particular key; this type of harmony has continued to be used extensively in Western classical and popular music. Baroque composers experimented with finding a fuller sound for each instrumental part, leading to the creation of the modern orchestra; modernised musical notation, including developing figured bass; and developed new instrumental playing techniques. Baroque music expanded the size, range, and complexity of instrumental performance, and also established the mixed vocal/instrumental forms of opera, cantata and oratorio and the instrumental forms of the solo concerto and sonata as musical genres. Dense, complex polyphonic music, in which multiple independent melody lines were performed simultaneously.
During the Baroque era professional musicians we
