A comprehensive implementation of different Retrieval-Augmented Generation (RAG) techniques for question-answering systems. This project demonstrates various embedding and retrieval strategies using the book "How to Lie with Statistics" as the knowledge base.
- Overview
- Techniques Implemented
- Project Structure
- Installation
- Usage
- Demo Scripts
- Results
- Technical Details
This project implements and compares different approaches to building a RAG system, focusing on:
- Two Embedding Techniques: Simple chunking vs. Proposition-based embedding
- Two Retrieval Techniques: Simple retrieval vs. Reliable retrieval with proof
The goal is to showcase how different techniques affect retrieval quality, answer accuracy, and system transparency.
Embedding Technique 1: Simple Embedding

Approach: Traditional chunking strategy using RecursiveCharacterTextSplitter.
How it works:
- Loads PDF document
- Splits the text into 1,000-character chunks with a 200-character overlap
- Creates vector embeddings directly from chunks
- Stores embeddings in FAISS vector database
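A minimal sketch of this pipeline, using the chunk parameters and file path from this project and the standard LangChain loader, splitter, and embedding classes. Treat it as an illustration rather than the exact contents of SimpleEmbedding.py:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_mistralai import MistralAIEmbeddings

# Load the PDF and split it into overlapping chunks
docs = PyPDFLoader("data/HowToLieWithStatistics.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed the chunks directly and index them in a FAISS vector store
vectorstore = FAISS.from_documents(chunks, MistralAIEmbeddings(model="mistral-embed"))
```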
Advantages:
- Fast setup and processing
- Straightforward implementation
- Works well for general-purpose retrieval
Use cases:
- Quick prototyping
- General document search
- When speed is prioritized over precision
Implementation: src/Embedding/SimpleEmbedding.py
Embedding Technique 2: Proposition-Based Embedding

Approach: Breaks documents into atomic, self-contained propositions using an LLM.
How it works:
- Loads and chunks PDF document (same as simple embedding)
- Uses an LLM to extract atomic facts (propositions) from each chunk
- Each proposition is a single, self-contained statement
- Applies a quality check to each proposition using 4 criteria:
  - Accuracy: how well it reflects the original text (threshold: 7/10)
  - Clarity: how understandable it is without context (threshold: 7/10)
  - Completeness: whether it includes necessary details (threshold: 7/10)
  - Conciseness: whether it stays concise without losing information (threshold: 7/10)
- Only high-quality propositions are embedded
- Creates vector embeddings from filtered propositions
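In code, the extraction step looks roughly like this, using the GeneratePropositions model listed under Technical Details and LangChain's structured-output support. The prompt wording is illustrative:

```python
from typing import List
from pydantic import BaseModel
from langchain_mistralai import ChatMistralAI

class GeneratePropositions(BaseModel):
    propositions: List[str]

llm = ChatMistralAI(model="mistral-large-latest")
extractor = llm.with_structured_output(GeneratePropositions)

propositions = []
for chunk in chunks:  # `chunks` as produced in the splitting sketch above
    result = extractor.invoke(
        "Decompose the following text into simple, self-contained propositions:\n\n"
        + chunk.page_content
    )
    propositions.extend(result.propositions)
```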
Advantages:
- More precise retrieval of specific facts
- Self-contained information units
- Better for fact-based queries
- Quality control ensures high-quality embeddings
Trade-offs:
- Slower setup due to LLM processing
- Higher API costs
- More complex implementation
Use cases:
- Fact-checking systems
- Scientific literature search
- When precision is critical
Implementation: src/Embedding/PropositionsEmbedding.py
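To make the quality gate concrete, here is a sketch of the 7/10 threshold applied to the four criteria. The GradePropositions model is the one listed under Technical Details; the helper function is hypothetical, and the real logic lives in propositions_quality_check:

```python
from pydantic import BaseModel

class GradePropositions(BaseModel):
    accuracy: int      # 1-10
    clarity: int       # 1-10
    completeness: int  # 1-10
    conciseness: int   # 1-10

grader = llm.with_structured_output(GradePropositions)  # `llm` from the sketch above
THRESHOLD = 7  # every criterion must score at least 7/10

def passes_quality_check(proposition: str, original_text: str) -> bool:
    # Grade the proposition against its source chunk on all four criteria
    grade = grader.invoke(
        "Grade this proposition from 1 to 10 for accuracy, clarity, completeness "
        f"and conciseness.\nProposition: {proposition}\nOriginal: {original_text}"
    )
    scores = (grade.accuracy, grade.clarity, grade.completeness, grade.conciseness)
    return all(score >= THRESHOLD for score in scores)
```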
Retrieval Technique 1: Simple Retrieval

Approach: Standard RAG chain with vector similarity search.
How it works:
- Receives user question
- Performs vector similarity search to retrieve relevant documents/propositions
- Passes retrieved context to LLM
- Generates a concise answer (maximum of 3 sentences)
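A minimal version of such a chain, with an illustrative prompt and a hypothetical k=4 retriever setting; the real chain lives in SimpleAsk:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# `vectorstore` and `llm` as set up in the embedding sketches above
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
prompt = ChatPromptTemplate.from_template(
    "Answer in at most 3 sentences using only this context:\n"
    "{context}\n\nQuestion: {question}"
)

def simple_ask(question: str) -> str:
    # Retrieve similar documents, stuff them into the prompt, generate
    docs = retriever.invoke(question)
    context = "\n\n".join(d.page_content for d in docs)
    return (prompt | llm | StrOutputParser()).invoke(
        {"context": context, "question": question}
    )
```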
Advantages:
- Fast response time
- Simple implementation
- Good for straightforward queries
Limitations:
- No relevance filtering
- No proof/evidence tracking
- May include irrelevant retrieved documents
Implementation: src/Retrieval/SimpleAsk.py
Retrieval Technique 2: Reliable Retrieval with Proof

Approach: Enhanced retrieval with document filtering and proof extraction.
How it works:
- Retrieval: Performs vector similarity search
- Filtering: Uses an LLM to grade each retrieved document for relevance
  - Only documents graded as "yes" are kept
  - Erroneous retrievals are filtered out
- Generation: Generates the answer using only the filtered documents
- Proof Extraction: Identifies the exact text segments from the documents that support the answer
  - Extracts verbatim snippets
  - Links each segment to its source
  - Provides transparency and verifiability
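A condensed sketch of the filtering stage, reusing the GradeDocuments model from Technical Details (prompt abbreviated; proof extraction works the same way, with a structured-output call against HighlightDocuments):

```python
from pydantic import BaseModel

class GradeDocuments(BaseModel):
    binary_score: str  # 'yes' or 'no'

# `llm`, `retriever`, and `prompt` as defined in the earlier sketches
doc_grader = llm.with_structured_output(GradeDocuments)

def reliable_ask(question: str):
    docs = retriever.invoke(question)
    # Keep only the documents the grader judges relevant to the question
    relevant = [
        d for d in docs
        if doc_grader.invoke(
            "Is this document relevant to the question? Answer yes or no.\n"
            f"Question: {question}\nDocument: {d.page_content}"
        ).binary_score == "yes"
    ]
    # Generate the answer from the filtered context only; a follow-up
    # structured-output call would then extract the proof segments
    context = "\n\n".join(d.page_content for d in relevant)
    answer = (prompt | llm | StrOutputParser()).invoke(
        {"context": context, "question": question}
    )
    return answer, relevant
```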
Advantages:
- Higher quality answers through filtering
- Provides evidence/proof for answers
- Transparent and verifiable
- Better trust and accountability
Trade-offs:
- Slower response time (additional LLM calls)
- Higher API costs
- More complex implementation
Use cases:
- High-stakes decision making
- Research and fact-checking
- When transparency is required
- Systems requiring answer verification
Output format:
```json
{
  "answer": "The generated answer...",
  "highlights": {
    "id": ["doc1", "doc2"],
    "title": ["How To Lie With Statistics", ...],
    "source": ["https://...", ...],
    "segment": ["Exact text from document 1...", "Exact text from document 2..."]
  }
}
```

Implementation: src/Retrieval/ReliableAsk.py
```
lying-stats-rag/
├── src/
│   ├── Embedding/
│   │   ├── Embedding.py               # Abstract base class for embeddings
│   │   ├── SimpleEmbedding.py         # Simple chunking implementation
│   │   ├── PropositionsEmbedding.py   # Proposition-based implementation
│   │   └── models.py                  # Pydantic models for propositions
│   ├── Retrieval/
│   │   ├── Retriever.py               # Abstract base class for retrievers
│   │   ├── SimpleAsk.py               # Simple retrieval implementation
│   │   ├── ReliableAsk.py             # Reliable retrieval with proof
│   │   └── models.py                  # Pydantic models for retrieval
│   └── utils.py                       # Utility functions
├── scripts/
│   ├── demo_comparison.py             # Compare all techniques
│   ├── demo_simple_embedding.py       # Simple embedding demo
│   ├── demo_proposition_embedding.py  # Proposition embedding demo
│   ├── README.md                      # Demo scripts documentation
│   └── results/                       # Output directory for results
├── utils/
│   ├── demo_comparison.png            # Screenshot of comparison demo
│   ├── demo_simple_embedding.png      # Screenshot of simple embedding demo
│   └── demo_proposition_embedding.png # Screenshot of proposition demo
├── data/
│   └── HowToLieWithStatistics.pdf     # Source document
├── config.py                          # LLM and embedding configuration
├── logger_config.py                   # Logging setup
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
```
Prerequisites:
- Python 3.8+
- Mistral API key
- Clone the repository:

```bash
git clone <repository-url>
cd lying-stats-rag
```

- Create a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Create a .env file in the project root:

```
MISTRAL_API_KEY=your_api_key_here
```

Basic usage (simple embedding + simple retrieval):

```python
from src.Embedding.SimpleEmbedding import SimpleEmbedding
from src.Retrieval.SimpleAsk import SimpleAsk
from config import embed_mistral, mistral
# Setup embedding
embed = SimpleEmbedding(embed_llm=embed_mistral, data_path="path/to/document.pdf")
embed.load()
embed.split()
vectorstore = embed.vectorize()
# Setup retrieval
retriever = SimpleAsk(vectorstore=vectorstore, llm=mistral)
# Ask question
answer = retriever.ask("How to choose a loose average?")
print(answer)
```

Reliable usage (proposition embedding + reliable retrieval with proof):

```python
from src.Embedding.PropositionsEmbedding import PropositionsEmbedding
from src.Retrieval.ReliableAsk import ReliableAsk
from config import embed_mistral, mistral
# Setup proposition embedding
embed = PropositionsEmbedding(embed_llm=embed_mistral, data_path="path/to/document.pdf", llm=mistral)
embed.load()
embed.split()
embed.propositions = embed.generate_propositions()
embed.propositions_quality_check()
vectorstore = embed.vectorize()
# Setup reliable retrieval
retriever = ReliableAsk(vectorstore=vectorstore, llm=mistral)
# Ask question with proof
result = retriever.ask("How to choose a loose average?")
print(result.answer)
print(result.highlights.segment)  # Proof segments
```

The project includes three demonstration scripts to showcase different techniques:
Compares all four technique combinations side-by-side:
```bash
python scripts/demo_comparison.py
```

What it shows:
- Setup time for each embedding technique
- Response time for each retrieval method
- Number of documents/propositions retrieved
- Proof segments for reliable retrieval
- Side-by-side comparison tables
- Saves detailed results to JSON
Screenshot: utils/demo_comparison.png
Deep dive into the simple embedding technique:
```bash
python scripts/demo_simple_embedding.py
```

What it shows:
- Document loading and chunking process
- Sample chunk previews
- Comparison of Simple vs Reliable retrieval
- Retrieved documents
- Proof segments with sources
Screenshot: utils/demo_simple_embedding.png
Deep dive into proposition-based embedding:
```bash
python scripts/demo_proposition_embedding.py
```

What it shows:
- Proposition generation from chunks
- Sample propositions
- Quality check statistics and filtering
- Pass rate metrics
- Comparison of retrieval methods on propositions
Screenshot: utils/demo_proposition_embedding.png
For detailed documentation on demo scripts, see scripts/README.md.
Based on testing with the "How to Lie with Statistics" document:
Setup Time:
- Simple Embedding: ~5-10 seconds
- Proposition Embedding: ~40-60 seconds (due to LLM processing)
Query Response Time:
- Simple Retrieval: ~1-2 seconds
- Reliable Retrieval: ~5-8 seconds (includes filtering and proof extraction)
Retrieval Quality:
- Simple Embedding + Simple Retrieval: Fast but may include irrelevant context
- Simple Embedding + Reliable Retrieval: Filtered results with proof
- Proposition Embedding + Simple Retrieval: More precise fact retrieval
- Proposition Embedding + Reliable Retrieval: Highest quality with verifiable evidence
Proposition Statistics (typical):
- Original chunks: ~100-150
- Propositions generated: ~400-500
- Propositions after quality check: ~350-450 (80-90% pass rate)
Key takeaways:
- Proposition-based embedding significantly improves retrieval precision for fact-based queries
- Reliable retrieval adds transparency through proof segments, crucial for trust
- Quality filtering in proposition generation maintains high embedding quality
- Trade-off: speed vs. quality; choose based on the requirements of your use case
The project uses an abstract base class pattern for extensibility:
Embedding hierarchy:

```
Embedding (ABC)
├── SimpleEmbedding
└── PropositionsEmbedding
```

Retrieval hierarchy:

```
Retriever (ABC)
├── SimpleAsk
└── ReliableAsk
```
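Stripped to their essentials, the base classes look something like this; the method signatures are inferred from the usage examples above rather than copied from the source:

```python
from abc import ABC, abstractmethod

class Embedding(ABC):
    @abstractmethod
    def load(self): ...       # load the source document

    @abstractmethod
    def split(self): ...      # split it into chunks

    @abstractmethod
    def vectorize(self): ...  # build and return the vector store

class Retriever(ABC):
    @abstractmethod
    def ask(self, question: str): ...  # answer a question from the store
```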
Key dependencies:
- LangChain: Framework for LLM applications
- Mistral AI: LLM for generation and proposition extraction
- FAISS: Vector database for similarity search
- Pydantic: Data validation and structured outputs
- PyPDF: PDF document loading
- Colorama: Terminal output formatting
- Tabulate: Table formatting for comparisons
Embedding Model: mistral-embed
- Converts text to vector embeddings
LLM Models:
- mistral-large-latest: Main model for generation and quality checks
- mistral-3-24b: Alternative model (configurable)
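config.py presumably wires these together along the following lines. This is a sketch assuming the langchain-mistralai package; the names mistral and embed_mistral match the imports in the usage examples, and the temperature setting is an assumption:

```python
from dotenv import load_dotenv
from langchain_mistralai import ChatMistralAI, MistralAIEmbeddings

load_dotenv()  # picks up MISTRAL_API_KEY from .env

mistral = ChatMistralAI(model="mistral-large-latest", temperature=0)  # temperature assumed
embed_mistral = MistralAIEmbeddings(model="mistral-embed")
```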
For Proposition Generation:

```python
class GeneratePropositions(BaseModel):
    propositions: List[str]
```

For Quality Grading:

```python
class GradePropositions(BaseModel):
    accuracy: int      # 1-10
    clarity: int       # 1-10
    completeness: int  # 1-10
    conciseness: int   # 1-10
```

For Document Grading:

```python
class GradeDocuments(BaseModel):
    binary_score: str  # 'yes' or 'no'
```

For Proof Extraction:

```python
class HighlightDocuments(BaseModel):
    id: List[str]
    title: List[str]
    source: List[str]
    segment: List[str]  # Exact text segments
```

For Reliable Answers:

```python
class ReliableAnswer(BaseModel):
    answer: str
    highlights: HighlightDocuments
```

This is a training/showcase project. Feel free to fork and experiment with:
- Different embedding models
- Alternative chunking strategies
- Additional retrieval techniques
- Different quality thresholds
- Other LLM providers
This project is for educational and demonstration purposes.
- Document source: "How to Lie with Statistics" by Darrell Huff
- Built with LangChain and Mistral AI
- Inspired by advanced RAG techniques in modern AI systems


