This repository isolates reranking as a system boundary in RAG. Using a frozen hybrid-retrieval candidate pool, we show that:
- Naive heuristic reranking is unstable and often degrades ranking
- Learned relevance (cross-encoders) materially improves Top-K inclusion
- Reranking improves prioritization, not recall, and saturates when evidence is poorly chunked
Conclusion: ranking is a real bottleneck — but only strong relevance signals help, and even they have limits.
The previous repository, rag-hybrid-retrieval, established a critical result:
Hybrid retrieval improves evidence surfacing — but does not reliably convert surfaced evidence into Top-K inclusion.
In other words:
- The right chunks are often present in the candidate pool
- But they are misordered relative to less decisive neighbors
- And therefore never reach the generator (Top-K = 4)
This repository exists to isolate and answer a single, narrow question:
Given a fixed candidate pool that already contains the correct evidence, can reranking reliably promote that evidence into Top-K?
This is not an optimization demo. It is a controlled experiment in ranking failure and resolution.
This repository introduces an explicit reranking stage after retrieval and measures its impact using the unchanged evaluation harness from rag-retrieval-eval, under a frozen retrieval contract.
It answers:
- Whether reranking improves Top-K inclusion
- Whether improvements depend on reranking signal strength
- Which question intents benefit from reranking
- Which failure modes reranking does not resolve
The focus is ranking quality, not answer quality.
This repository deliberately avoids:
- Changing the corpus
- Changing chunking strategy
- Changing embeddings
- Changing dense or sparse retrieval
- Prompt engineering
- LLM-based grading
- Agent behavior or tool use
- Any claim that reranking “fixes RAG”
If an improvement cannot be attributed solely to reranking, it does not belong here.
This repository builds directly on:
- rag-minimal-control: a strict, deterministic RAG control system
- rag-retrieval-eval: a retrieval observability and evaluation harness
- rag-hybrid-retrieval: dense + sparse retrieval with explicit hybrid merge logic
All upstream components remain frozen and authoritative, including:
- Corpus
- Chunking
- Embeddings
- Dense similarity function
- Sparse retriever (BM25)
- Hybrid merge logic
- Top-K passed to the generator (K = 4)
- Evaluation metrics
The only new system component is an explicit reranking stage.
Repo Contract
Inputs
- Hybrid-retrieved candidate pool (Top-N = 42)
- Deterministic evaluation questions
- Gold chunk labels (for evaluation only)
Outputs
- Reranked candidate lists
- Rank-of-first-relevant metrics
- Δ vs hybrid baseline
No retrieval decisions are altered upstream.
```
Document → Chunk → Embed
                      ↘
Query ───────────→ Dense Retriever + Sparse Retriever
                      ↓
            Explicit Hybrid Merge
                      ↓
           Candidate Pool (Top-N)
                      ↓
              Reranking Stage
                      ↓
       Top-K → Generator (unchanged)
```
Critical constraint
The candidate pool is identical before and after reranking. Reranking may only reorder, never add or remove candidates.
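The reorder-only contract can be sketched as a function that returns exactly the candidates it was given, in a new order. This is a minimal illustration; the `Candidate` class and `rerank` function are hypothetical names, not the repo's actual API.

```python
# Sketch of the reorder-only reranking contract: candidates may be
# reordered but never added or removed. Names here are illustrative.
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    text: str

def rerank(query: str,
           pool: List[Candidate],
           score: Callable[[str, str], float]) -> List[Candidate]:
    """Return the same candidates, reordered by descending score."""
    ranked = sorted(pool, key=lambda c: score(query, c.text), reverse=True)
    # Contract check: candidate membership is unchanged.
    assert {c.chunk_id for c in ranked} == {c.chunk_id for c in pool}
    return ranked
```

Any reranker in this repo, heuristic or learned, must satisfy this shape: the score function changes, the membership guarantee does not.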
This repository evaluates two reranking classes, applied only to the frozen hybrid candidate pool.
Explainable heuristic reranker
- Linear combination of interpretable signals:
  - normalized dense score
  - normalized sparse score
  - lexical overlap
  - keyphrase match
  - length penalty
- No learning
- Fully inspectable

Purpose: failure analysis and causal clarity, not performance maximization.
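The linear combination can be sketched as a weighted sum whose weight names mirror the CLI flags (`--wd`, `--wb`, `--wo`, `--wk`, `--wl`). The feature dictionary and the sign convention for the length penalty (subtracted here) are assumptions for illustration, not the repo's exact code.

```python
# Minimal sketch of the linear heuristic score. All signals are assumed
# to be pre-normalized to [0, 1]; the length penalty is subtracted
# (sign convention assumed for illustration).
def heuristic_score(features: dict, weights: dict) -> float:
    """Weighted sum of interpretable relevance signals."""
    return (weights["wd"] * features["dense"]        # normalized dense score
            + weights["wb"] * features["sparse"]     # normalized BM25 score
            + weights["wo"] * features["overlap"]    # lexical overlap
            + weights["wk"] * features["keyphrase"]  # keyphrase match
            - weights["wl"] * features["length_penalty"])

# Default weights, matching the exposed CLI flags.
weights = {"wd": 0.4, "wb": 0.3, "wo": 0.1, "wk": 0.1, "wl": 0.1}
```

Because every term is inspectable, a bad ranking can be traced to the signal that produced it, which is the whole point of this baseline.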
Cross-encoder reranker
- Jointly encodes (query, chunk) pairs
- Produces a learned relevance score
- Used strictly as a ranking signal
Constraints
- No access to gold labels
- No corpus-level statistics
- No modification of candidate membership
- Evaluated under the same frozen contract
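Cross-encoder reranking can be sketched as scoring each (query, chunk) pair jointly and sorting by the learned score. The model name below matches the repo default; the helper names are illustrative, not the repo's API.

```python
# Sketch of learned-relevance reranking. `CrossEncoder` is the real
# sentence-transformers class; the wrapper functions are illustrative.
from typing import Callable, List, Sequence

def rerank_with_scorer(query: str, pool: Sequence[str],
                       scorer: Callable) -> List[str]:
    """Reorder the pool by descending relevance; membership is unchanged."""
    scores = scorer([(query, chunk) for chunk in pool])
    order = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in order]

def make_cross_encoder_scorer(name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Wrap a sentence-transformers CrossEncoder as a pair scorer."""
    from sentence_transformers import CrossEncoder  # requires the package
    return CrossEncoder(name).predict
```

Note that the model sees the query and chunk together, which is what lets it capture interactions that separate dense embeddings miss; it never sees gold labels or corpus statistics.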
All evaluation uses the unchanged harness from rag-retrieval-eval.
Metrics (Locked)
- Rank of First Relevant Chunk
- Context Recall @ K (K = 4)
- Δ vs Hybrid Baseline
No new metrics are introduced.
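The two locked metrics are simple enough to state in code. This is a sketch of their definitions as described here (1-based rank, gold labels used for evaluation only); the function names are illustrative, not the harness's API.

```python
# Sketch of the locked evaluation metrics.
from typing import List, Optional, Set

def rank_of_first_relevant(ranked_ids: List[str], gold: Set[str]) -> Optional[int]:
    """1-based rank of the first gold chunk, or None if absent from the pool."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in gold:
            return rank
    return None

def context_recall_at_k(ranked_ids: List[str], gold: Set[str], k: int = 4) -> float:
    """Fraction of gold chunks that appear in the Top-K."""
    return len(set(ranked_ids[:k]) & gold) / len(gold) if gold else 0.0
```

Δ vs hybrid baseline is then just the difference in these values between the reranked list and the frozen hybrid ordering for the same question.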
Reranking was evaluated over 54 deterministic questions, using a fixed candidate pool (Top-N = 42) and Top-K = 4 passed to the generator.
| Metric | Observed |
|---|---|
| Total questions | 54 |
| Questions with relevant evidence in pool | ~50 |
| Median rank (heuristic reranker) | ~15–18 |
| Median rank (cross-encoder reranker) | ~2–3 |
| Top-K success rate (heuristic) | ~5–8% |
| Top-K success rate (cross-encoder) | ~35–40% |
Interpretation
- Heuristic reranking does not reliably improve ranking and often degrades it
- Cross-encoder reranking materially improves evidence prioritization
- Gains are achieved without changing retrieval, embeddings, or chunking
Reranking benefits are not uniform across question types.
Benefit most:
- Definition questions: clear lexical and semantic anchors
- Procedural questions: step-like structure and ordering cues
- Scope / inventory questions: enumerations and inclusion language

Benefit least:
- Rationale / principle questions: evidence distributed across multiple chunks; explanatory prose with weak local anchors
Implication
Reranking improves decisiveness, not semantic synthesis.
When correct evidence is:
- localized → reranking helps
- distributed → reranking saturates quickly
These failures indicate upstream limits (chunking, representation), not reranking defects.
The following failure modes persist after reranking:
- Missing evidence in the candidate pool
- Gold evidence split across multiple chunks
- Queries requiring cross-chunk reasoning
- Generator ignoring provided evidence
These remain out of scope by design.
```shell
git clone https://github.com/Arnav-Ajay/rag-reranking-playground.git
cd rag-reranking-playground
pip install -r requirements.txt
```

Run the explainable heuristic reranker (the default mode):

```shell
python reranker.py
```

This produces:
- a question-level reranked artifact
- reranking metrics computed under the unchanged evaluation harness
To evaluate learned relevance-based reranking, explicitly enable cross-encoder mode:
```shell
python reranker.py --rerank-mode cross-encoder
```

By default, this uses `cross-encoder/ms-marco-MiniLM-L-6-v2` as the relevance model.
The retrieval pipeline, candidate pool, and evaluation metrics remain fully unchanged.
The default arguments are set to match the frozen contract and do not need to be changed to reproduce reported results.
They are exposed only for controlled experimentation.
```shell
--input-csv data/chunks_and_questions/input_artifact.csv
--chunks-csv data/chunks_and_questions/chunks_output.csv
--output-csv data/results_and_summaries/rag_reranked_artifact.csv
--debug-candidates    # (optional) candidate-level debug output
--top-n 20 or 50      # candidate pool size (must match rag-hybrid-retrieval inspect_k)
--k 4                 # Top-K passed to generator (locked across previous repos)
```

Changing these values breaks direct comparability with prior repositories.
These weights control the linear heuristic reranker only:
```shell
--wd 0.4   # normalized dense score
--wb 0.3   # normalized sparse (BM25) score
--wo 0.1   # lexical overlap
--wk 0.1   # keyphrase hit rate
--wp 0.0   # pattern cues
--wl 0.1   # length penalty
```

These are intentionally exposed to support:
- ablation
- sensitivity analysis
- failure inspection
They are not tuned for optimal performance.
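A simple way to use the exposed weights for ablation is to zero out one signal at a time and re-run the frozen harness on each configuration. This is an illustrative sketch; the helper name and configuration shape are hypothetical, not part of the repo.

```python
# Illustrative one-at-a-time weight ablation: each configuration zeroes
# exactly one signal so its contribution can be measured in isolation.
BASE = {"wd": 0.4, "wb": 0.3, "wo": 0.1, "wk": 0.1, "wp": 0.0, "wl": 0.1}

def ablations(base: dict) -> dict:
    """One weight configuration per signal, with that signal zeroed."""
    return {name: {**base, name: 0.0} for name in base}

configs = ablations(BASE)  # e.g. configs["wd"] disables the dense signal
```

Each resulting configuration can be passed back through `reranker.py` via the corresponding flags; comparing the locked metrics across configurations shows which signals actually carry the heuristic's (limited) performance.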
```shell
--rerank-mode cross-encoder
--cross-encoder-model cross-encoder/ms-marco-MiniLM-L-6-v2
```

Any compatible sentence-transformers cross-encoder can be substituted, provided:
- it scores (query, chunk) pairs
- it does not access gold labels
- it does not modify candidate membership
All reported results in this repository were produced using:
- the default arguments
- a frozen retrieval contract
- an identical candidate pool before and after reranking
Changing configuration parameters is explicitly exploratory and should not be conflated with the main findings.
This repository demonstrates that:
Reranking is a first-class system boundary that can materially improve retrieval quality — but only when supplied with sufficiently strong relevance signals.
Specifically:
- Heuristic reranking is unstable and unreliable
- Learned relevance (cross-encoders) substantially improves Top-K inclusion
- Reranking does not expand recall
- Reranking does not guarantee correct answers
The remaining bottleneck is how evidence is chunked and represented, motivating rag-chunking-strategies.