RAG Techniques Showcase

A comprehensive implementation of different Retrieval-Augmented Generation (RAG) techniques for question-answering systems. This project demonstrates various embedding and retrieval strategies using the book "How to Lie with Statistics" as the knowledge base.

Overview

This project implements and compares different approaches to building a RAG system, focusing on:

  1. Two Embedding Techniques: Simple chunking vs. Proposition-based embedding
  2. Two Retrieval Techniques: Simple retrieval vs. Reliable retrieval with proof

The goal is to showcase how different techniques affect retrieval quality, answer accuracy, and system transparency.

Techniques Implemented

Embedding Techniques

1. Simple Embedding

Approach: Traditional chunking strategy using RecursiveCharacterTextSplitter.

How it works:

  • Loads the PDF document
  • Splits text into 1000-character chunks with a 200-character overlap
  • Creates vector embeddings directly from the chunks
  • Stores embeddings in FAISS vector database
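
A minimal sketch of this pipeline using standard LangChain components (illustrative only; the actual steps live in SimpleEmbedding.load/split/vectorize, and import paths may differ by LangChain version):

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_mistralai import MistralAIEmbeddings

# Load the PDF into LangChain Document objects (one per page)
docs = PyPDFLoader("data/HowToLieWithStatistics.pdf").load()

# Split into 1000-character chunks with a 200-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed each chunk and index it in FAISS
vectorstore = FAISS.from_documents(chunks, MistralAIEmbeddings(model="mistral-embed"))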

Advantages:

  • Fast setup and processing
  • Straightforward implementation
  • Works well for general-purpose retrieval

Use cases:

  • Quick prototyping
  • General document search
  • When speed is prioritized over precision

Implementation: src/Embedding/SimpleEmbedding.py


2. Proposition-based Embedding

Approach: Breaks documents into atomic, self-contained propositions using an LLM.

How it works:

  • Loads and chunks the PDF document (same as simple embedding)
  • Uses an LLM to extract atomic facts (propositions) from each chunk
  • Each proposition is a single, self-contained statement
  • Applies a quality check to each proposition using four criteria:
    • Accuracy: How well it reflects the original text (threshold: 7/10)
    • Clarity: How understandable it is without context (threshold: 7/10)
    • Completeness: Whether it includes necessary details (threshold: 7/10)
    • Conciseness: Whether it's concise without losing information (threshold: 7/10)
  • Only high-quality propositions are embedded
  • Creates vector embeddings from filtered propositions
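
A sketch of the generation and filtering steps, assuming a LangChain chat model bound to the Pydantic schemas listed under Technical Details (the exact prompts live in src/Embedding/PropositionsEmbedding.py):

from typing import List
from pydantic import BaseModel
from langchain_mistralai import ChatMistralAI

class GeneratePropositions(BaseModel):
    propositions: List[str]

class GradePropositions(BaseModel):
    accuracy: int
    clarity: int
    completeness: int
    conciseness: int

llm = ChatMistralAI(model="mistral-large-latest")
generator = llm.with_structured_output(GeneratePropositions)
grader = llm.with_structured_output(GradePropositions)

def propositions_for_chunk(chunk_text: str, threshold: int = 7) -> List[str]:
    # Ask the LLM to decompose the chunk into atomic, self-contained statements
    props = generator.invoke(
        f"Decompose the following text into simple, self-contained propositions:\n\n{chunk_text}"
    ).propositions
    # Keep only propositions scoring at or above the threshold on every criterion
    kept = []
    for prop in props:
        grades = grader.invoke(
            f"Grade this proposition from 1 to 10 on accuracy, clarity, completeness, "
            f"and conciseness against the source text.\nProposition: {prop}\nSource: {chunk_text}"
        )
        if min(grades.accuracy, grades.clarity, grades.completeness, grades.conciseness) >= threshold:
            kept.append(prop)
    return kept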

Advantages:

  • More precise retrieval of specific facts
  • Self-contained information units
  • Better for fact-based queries
  • Quality control ensures high-quality embeddings

Trade-offs:

  • Slower setup due to LLM processing
  • Higher API costs
  • More complex implementation

Use cases:

  • Fact-checking systems
  • Scientific literature search
  • When precision is critical

Implementation: src/Embedding/PropositionsEmbedding.py


Retrieval Techniques

1. Simple Retrieval (SimpleAsk)

Approach: Standard RAG chain with vector similarity search.

How it works:

  • Receives a user question
  • Performs a vector similarity search to retrieve relevant documents/propositions
  • Passes the retrieved context to the LLM
  • Generates a concise answer (maximum of three sentences)
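
In LangChain terms this is a plain retrieve-then-generate chain; a minimal sketch (prompt wording assumed, not copied from SimpleAsk):

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template(
    "Answer the question in at most three sentences, using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def simple_ask(vectorstore, llm, question: str) -> str:
    # Vector similarity search over the indexed chunks/propositions
    docs = vectorstore.as_retriever().invoke(question)
    context = "\n\n".join(d.page_content for d in docs)
    # Pass the retrieved context straight to the LLM, with no filtering
    return (prompt | llm | StrOutputParser()).invoke(
        {"context": context, "question": question}
    )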

Advantages:

  • Fast response time
  • Simple implementation
  • Good for straightforward queries

Limitations:

  • No relevance filtering
  • No proof/evidence tracking
  • May include irrelevant retrieved documents

Implementation: src/Retrieval/SimpleAsk.py


2. Reliable Retrieval (ReliableAsk)

Approach: Enhanced retrieval with document filtering and proof extraction.

How it works:

  1. Retrieval: Performs a vector similarity search
  2. Filtering: Uses an LLM to grade each retrieved document for relevance
    • Only documents graded as "yes" are kept
    • Filters out erroneous retrievals
  3. Generation: Generates answer using only filtered documents
  4. Proof Extraction: Identifies exact text segments from documents that support the answer
    • Extracts verbatim snippets
    • Links segments to their sources
    • Provides transparency and verifiability
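
The filtering step can be expressed with the GradeDocuments schema listed under Technical Details; a sketch, with the prompt wording assumed:

def filter_relevant(llm, question: str, docs: list) -> list:
    # Bind the LLM to the binary yes/no grading schema
    grader = llm.with_structured_output(GradeDocuments)
    kept = []
    for doc in docs:
        verdict = grader.invoke(
            f"Does this document help answer the question? Answer 'yes' or 'no'.\n"
            f"Question: {question}\nDocument: {doc.page_content}"
        )
        if verdict.binary_score == "yes":
            kept.append(doc)
    return kept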

Advantages:

  • Higher quality answers through filtering
  • Provides evidence/proof for answers
  • Transparent and verifiable
  • Better trust and accountability

Trade-offs:

  • Slower response time (additional LLM calls)
  • Higher API costs
  • More complex implementation

Use cases:

  • High-stakes decision making
  • Research and fact-checking
  • When transparency is required
  • Systems requiring answer verification

Output format:

{
  "answer": "The generated answer...",
  "highlights": {
    "id": ["doc1", "doc2"],
    "title": ["How To Lie With Statistics", ...],
    "source": ["https://...", ...],
    "segment": ["Exact text from document 1...", "Exact text from document 2..."]
  }
}
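
The proof-extraction step maps onto the HighlightDocuments schema listed under Technical Details; a sketch of how the verbatim segments might be requested (prompt wording assumed):

def extract_proof(llm, answer: str, formatted_docs: str):
    # Bind the LLM to the proof schema so segments come back structured
    highlighter = llm.with_structured_output(HighlightDocuments)
    return highlighter.invoke(
        f"Given the answer and the source documents, return the exact verbatim "
        f"segments that support the answer, with each segment's document id, "
        f"title, and source.\nAnswer: {answer}\nDocuments: {formatted_docs}"
    )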

Implementation: src/Retrieval/ReliableAsk.py


Project Structure

lying-stats-rag/
├── src/
│   ├── Embedding/
│   │   ├── Embedding.py                 # Abstract base class for embeddings
│   │   ├── SimpleEmbedding.py           # Simple chunking implementation
│   │   ├── PropositionsEmbedding.py     # Proposition-based implementation
│   │   └── models.py                    # Pydantic models for propositions
│   ├── Retrieval/
│   │   ├── Retriever.py                 # Abstract base class for retrievers
│   │   ├── SimpleAsk.py                 # Simple retrieval implementation
│   │   ├── ReliableAsk.py               # Reliable retrieval with proof
│   │   └── models.py                    # Pydantic models for retrieval
│   └── utils.py                         # Utility functions
├── scripts/
│   ├── demo_comparison.py               # Compare all techniques
│   ├── demo_simple_embedding.py         # Simple embedding demo
│   ├── demo_proposition_embedding.py    # Proposition embedding demo
│   ├── README.md                        # Demo scripts documentation
│   └── results/                         # Output directory for results
├── utils/
│   ├── demo_comparison.png              # Screenshot of comparison demo
│   ├── demo_simple_embedding.png        # Screenshot of simple embedding demo
│   └── demo_proposition_embedding.png   # Screenshot of proposition demo
├── data/
│   └── HowToLieWithStatistics.pdf       # Source document
├── config.py                            # LLM and embedding configuration
├── logger_config.py                     # Logging setup
├── requirements.txt                     # Python dependencies
└── README.md                            # This file

Installation

Prerequisites

  • Python 3.8+
  • Mistral API key

Setup

  1. Clone the repository:
git clone <repository-url>
cd lying-stats-rag
  2. Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Create a .env file in the project root:
MISTRAL_API_KEY=your_api_key_here

Usage

Basic Usage

from src.Embedding.SimpleEmbedding import SimpleEmbedding
from src.Retrieval.SimpleAsk import SimpleAsk
from config import embed_mistral, mistral

# Setup embedding
embed = SimpleEmbedding(embed_llm=embed_mistral, data_path="path/to/document.pdf")
embed.load()
embed.split()
vectorstore = embed.vectorize()

# Setup retrieval
retriever = SimpleAsk(vectorstore=vectorstore, llm=mistral)

# Ask question
answer = retriever.ask("How to choose a loose average?")
print(answer)

Using Proposition-based Embedding

from src.Embedding.PropositionsEmbedding import PropositionsEmbedding
from src.Retrieval.ReliableAsk import ReliableAsk
from config import embed_mistral, mistral

# Setup proposition embedding
embed = PropositionsEmbedding(embed_llm=embed_mistral, data_path="path/to/document.pdf", llm=mistral)
embed.load()
embed.split()
embed.propositions = embed.generate_propositions()
embed.propositions_quality_check()
vectorstore = embed.vectorize()

# Setup reliable retrieval
retriever = ReliableAsk(vectorstore=vectorstore, llm=mistral)

# Ask question with proof
result = retriever.ask("How to choose a loose average?")
print(result.answer)
print(result.highlights.segment)  # Proof segments

Demo Scripts

The project includes three demonstration scripts to showcase different techniques:

1. Comparison Demo

Compares all four technique combinations side-by-side:

python scripts/demo_comparison.py

What it shows:

  • Setup time for each embedding technique
  • Response time for each retrieval method
  • Number of documents/propositions retrieved
  • Proof segments for reliable retrieval
  • Side-by-side comparison tables
  • Saves detailed results to JSON

Screenshot: utils/demo_comparison.png


2. Simple Embedding Demo

Deep dive into the simple embedding technique:

python scripts/demo_simple_embedding.py

What it shows:

  • Document loading and chunking process
  • Sample chunk previews
  • Comparison of Simple vs Reliable retrieval
  • Retrieved documents
  • Proof segments with sources

Screenshot: utils/demo_simple_embedding.png


3. Proposition Embedding Demo

Deep dive into proposition-based embedding:

python scripts/demo_proposition_embedding.py

What it shows:

  • Proposition generation from chunks
  • Sample propositions
  • Quality check statistics and filtering
  • Pass rate metrics
  • Comparison of retrieval methods on propositions

Screenshot: utils/demo_proposition_embedding.png


For detailed documentation on demo scripts, see scripts/README.md.

Results

Performance Comparison

Based on testing with the "How to Lie with Statistics" document:

Setup Time:

  • Simple Embedding: ~5-10 seconds
  • Proposition Embedding: ~40-60 seconds (due to LLM processing)

Query Response Time:

  • Simple Retrieval: ~1-2 seconds
  • Reliable Retrieval: ~5-8 seconds (includes filtering and proof extraction)

Retrieval Quality:

  • Simple Embedding + Simple Retrieval: Fast but may include irrelevant context
  • Simple Embedding + Reliable Retrieval: Filtered results with proof
  • Proposition Embedding + Simple Retrieval: More precise fact retrieval
  • Proposition Embedding + Reliable Retrieval: Highest quality with verifiable evidence

Proposition Statistics (typical):

  • Original chunks: ~100-150
  • Propositions generated: ~400-500
  • Propositions after quality check: ~350-450 (80-90% pass rate)

Key Findings

  1. Proposition-based embedding significantly improves retrieval precision for fact-based queries
  2. Reliable retrieval adds transparency through proof segments, crucial for trust
  3. Quality filtering in proposition generation maintains high embedding quality
  4. Trade-off: speed vs. quality; choose based on use-case requirements

Technical Details

Architecture

The project uses an abstract base class pattern for extensibility:

Embedding hierarchy:

Embedding (ABC)
├── SimpleEmbedding
└── PropositionsEmbedding

Retrieval hierarchy:

Retriever (ABC)
├── SimpleAsk
└── ReliableAsk
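
A sketch of what the Retriever base class might look like (the real signatures are in src/Retrieval/Retriever.py; Embedding follows the same pattern):

from abc import ABC, abstractmethod

class Retriever(ABC):
    """Hold a vector store and an LLM; defer `ask` to concrete subclasses."""

    def __init__(self, vectorstore, llm):
        self.vectorstore = vectorstore
        self.llm = llm

    @abstractmethod
    def ask(self, question: str):
        """Answer a question against the vector store."""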

Technologies Used

  • LangChain: Framework for LLM applications
  • Mistral AI: LLM for generation and proposition extraction
  • FAISS: Vector database for similarity search
  • Pydantic: Data validation and structured outputs
  • PyPDF: PDF document loading
  • Colorama: Terminal output formatting
  • Tabulate: Table formatting for comparisons

Models

Embedding Model: mistral-embed

  • Converts text to vector embeddings

LLM Models:

  • mistral-large-latest: Main model for generation and quality checks
  • mistral-3-24b: Alternative model (configurable)
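
A plausible sketch of config.py, assuming the LangChain Mistral integrations (the names embed_mistral and mistral match the imports used in Usage; the temperature setting is an assumption):

# config.py (sketch)
from dotenv import load_dotenv
from langchain_mistralai import ChatMistralAI, MistralAIEmbeddings

load_dotenv()  # reads MISTRAL_API_KEY from .env

mistral = ChatMistralAI(model="mistral-large-latest", temperature=0)
embed_mistral = MistralAIEmbeddings(model="mistral-embed")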

Pydantic Models

For Proposition Generation:

from typing import List
from pydantic import BaseModel

class GeneratePropositions(BaseModel):
    propositions: List[str]

For Quality Grading:

class GradePropositions(BaseModel):
    accuracy: int      # 1-10
    clarity: int       # 1-10
    completeness: int  # 1-10
    conciseness: int   # 1-10

For Document Grading:

class GradeDocuments(BaseModel):
    binary_score: str  # 'yes' or 'no'

For Proof Extraction:

class HighlightDocuments(BaseModel):
    id: List[str]
    title: List[str]
    source: List[str]
    segment: List[str]  # Exact text segments

For Reliable Answers:

class ReliableAnswer(BaseModel):
    answer: str
    highlights: HighlightDocuments

Contributing

This is a training/showcase project. Feel free to fork and experiment with:

  • Different embedding models
  • Alternative chunking strategies
  • Additional retrieval techniques
  • Different quality thresholds
  • Other LLM providers

License

This project is for educational and demonstration purposes.

Acknowledgments

  • Document source: "How to Lie with Statistics" by Darrell Huff
  • Built with LangChain and Mistral AI
  • Inspired by advanced RAG techniques in modern AI systems
