# GOT RAG Chatbot

A Knowledge Graph and Retrieval-Augmented Generation (RAG) system for Game of Thrones lore. This project combines web scraping, heuristic entity extraction, LLM-powered validation, and graph construction to build an intelligent, context-aware knowledge base.

## 🌟 Features

- **Intelligent Web Scraping**: Automated extraction from the Game of Thrones Fandom Wiki with resume capability
- **Hybrid Knowledge Graph Construction**:
  - Heuristic-based entity extraction and type classification
  - Optional LLM validation (Gemini) for improved accuracy
  - Schema-aware relationship building with business-logic constraints
- **Advanced Text Processing**: WikiText parsing, infobox extraction, and intelligent text cleaning
- **Resumable Pipeline**: Checkpoint system allows interrupted processes to resume without data loss
- **Batch Processing**: Efficient handling of large datasets with configurable batch sizes
- **Error Handling**: Robust retry logic with exponential backoff for API rate limits
- **Extensible Architecture**: Modular design with clear separation of concerns

## 📁 Project Structure
```
got-rag-chatbot/
│
├── .env                    # Environment variables (GOOGLE_API_KEY for Gemini)
├── .gitignore
├── README.md
├── requirements.txt
├── pyproject.toml          # Project metadata and dependencies
│
├── cfg/                    # Configuration files
│   └── config.json         # LLM settings, prompts, and schema definitions
│
├── data/                   # Data storage (gitignored)
│   ├── raw/
│   │   └── wiki_dump.jsonl # Raw scraped data from wiki
│   ├── processed/          # Generated knowledge graph files
│   │   ├── nodes.jsonl                # Heuristic nodes
│   │   ├── nodes_validated.jsonl      # LLM-validated nodes
│   │   ├── nodes_llm_checkpoint.jsonl # Validation checkpoint
│   │   ├── edges.jsonl                # Graph relationships
│   │   └── documents.jsonl            # Text documents for RAG
│   └── chromadb/           # Vector database (future feature)
│
├── src/                    # Main application source code
│   ├── __init__.py
│   ├── config.py           # Configuration loader
│   │
│   ├── core/               # Core components
│   │   ├── database.py     # Vector DB connection logic
│   │   └── llm.py          # LLM client setup
│   │
│   ├── ingestion/          # Data extraction and processing
│   │   ├── scraper.py      # Fandom Wiki scraper with resume capability
│   │   ├── processor.py    # Text processing and cleaning
│   │   └── loader.py       # Database loading logic
│   │
│   ├── graph/              # Knowledge graph construction
│   │   ├── builder.py      # Heuristic node extraction and typing
│   │   ├── validator.py    # LLM-powered node validation
│   │   └── edge_builder.py # Schema-aware relationship extraction
│   │
│   ├── utils/              # Utility functions
│   │   └── text.py         # Text cleaning and normalization
│   │
│   ├── rag/                # RAG system (future enhancement)
│   │   ├── retriever.py
│   │   ├── chain.py
│   │   └── engine.py
│   │
│   ├── schemas/            # Pydantic models
│   │   ├── chat.py
│   │   └── document.py
│   │
│   └── api/                # FastAPI web layer (future feature)
│       ├── main.py
│       ├── routes.py
│       └── dependencies.py
│
└── main.py                 # CLI orchestrator
```

## 🚀 Getting Started

### Prerequisites

- Python 3.8+
- uv (recommended) or pip
- Google Gemini API key (for the LLM validation feature)

### Installation

1. Clone the repository:

```bash
git clone https://github.com/DiegoPaezA/got-rag-chatbot.git
cd got-rag-chatbot
```

2. Install dependencies, using uv (recommended):

```bash
# Install dependencies and create a virtual environment
uv sync

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

3. Set up environment variables by creating a `.env` file in the project root:

```
GOOGLE_API_KEY=your_gemini_api_key_here
```

### Configuration

Edit `cfg/config.json` to customize:

- LLM settings (model, temperature, retries)
- Entity types and validation prompts
- Schema constraints for relationships

## 📖 Usage

### 1. Scrape Wiki Data

Download raw data from the Game of Thrones Fandom Wiki:

```bash
python main.py scrape
```

Features:

- Automatically resumes if interrupted
- Avoids duplicates using ID tracking
- Progress bar with real-time statistics
- Polite rate limiting (0.1s delay between requests)

Output: `data/raw/wiki_dump.jsonl`
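The duplicate-avoidance and resume behavior can be sketched roughly as below. Function and field names here are illustrative assumptions; the real implementation lives in `src/ingestion/scraper.py`.

```python
import json
import time
from pathlib import Path

def load_seen_ids(dump_path):
    """Collect page IDs already written to the JSONL dump, so a re-run skips them."""
    seen = set()
    if Path(dump_path).exists():
        with open(dump_path, encoding="utf-8") as f:
            for line in f:
                seen.add(json.loads(line)["id"])
    return seen

def scrape(pages, dump_path, delay=0.1):
    """Append only unseen pages to the dump, sleeping briefly between writes."""
    seen = load_seen_ids(dump_path)
    with open(dump_path, "a", encoding="utf-8") as f:
        for page in pages:
            if page["id"] in seen:
                continue  # already scraped on a previous run
            f.write(json.dumps(page) + "\n")
            seen.add(page["id"])
            time.sleep(delay)  # polite rate limiting
```

Because the output file itself doubles as the checkpoint, interrupting and re-running the scrape never writes the same article twice.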

### 2. Build the Knowledge Graph

Extract entities and relationships from the raw data:

```bash
# Build with heuristics only (fast)
python main.py build --use_heuristic

# Build with LLM validation (more accurate, requires API key)
python main.py build --use-llm --clean-llm
```

### 3. Run the RAG Chatbot

3.1. Start the Neo4j instance:

```bash
docker-compose -f docker-compose.yml up -d
```

3.2. Start the chatbot, either as the Streamlit app:

```bash
uv run streamlit run src/app.py
```

or as the CLI chatbot:

```bash
uv run python test_rag.py
```

### Pipeline Steps

**Step 1: Heuristic Node Extraction**

- Parses WikiText and extracts infoboxes
- Applies a scoring system to classify entities (Character, House, Location, etc.)
- Assigns confidence levels (High/Medium/Low)
- Output: `data/processed/nodes.jsonl`
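A minimal sketch of how infobox-keyword scoring could classify an entity. The signal sets, thresholds, and function names here are invented for illustration; the real scoring rules live in `src/graph/builder.py` and are more elaborate.

```python
# Hypothetical keyword signals per entity type (not the project's real rules).
TYPE_SIGNALS = {
    "Character": {"actor", "born", "died", "allegiance"},
    "House": {"seat", "words", "sigil"},
    "Location": {"region", "rulers", "geography"},
}

def classify(infobox_keys):
    """Score each type by overlap with the infobox keys; derive a confidence band."""
    scores = {t: len(keys & infobox_keys) for t, keys in TYPE_SIGNALS.items()}
    best = max(scores, key=scores.get)
    top = scores[best]
    confidence = "High" if top >= 3 else "Medium" if top == 2 else "Low"
    return best, scores, confidence
```

An article whose infobox has `actor`, `born`, and `allegiance` keys would score 3 for Character and classify as `("Character", ..., "High")`, while sparser infoboxes fall into the Medium/Low bands that Step 2 sends to the LLM.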

**Step 2: LLM Validation (Optional)**

- Validates low-confidence and ambiguous nodes
- Uses Google Gemini with structured output
- Batch processing with a checkpoint system
- Handles API rate limits with exponential backoff
- Output: `data/processed/nodes_validated.jsonl`

**Step 3: Schema-Aware Edge Generation**

- Extracts relationships from node properties
- Validates edges against schema constraints
- Prevents semantic errors (e.g., "Sword is father of Character")
- Deduplicates relationships
- Output: `data/processed/edges.jsonl`

πŸ—οΈ Architecture

Knowledge Graph Pipeline

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Wiki API   β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚ Scraper (resumable)
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ wiki_dump.jsonl β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚ Builder (heuristic scoring)
         β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  nodes  │────────▢│  documents   β”‚
    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                  (for RAG)
         β”‚ Validator (LLM)
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ nodes_validated  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚ EdgeBuilder (schema-aware)
         β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  edges β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜

### Key Components

**1. FandomScraper** (`src/ingestion/scraper.py`)

- MediaWiki API integration
- Checkpoint-based resume capability
- WikiText parsing with `mwparserfromhell`
- Infobox extraction

**2. GraphBuilder** (`src/graph/builder/builder.py`)

- Heuristic type classification using a scoring system
- Property extraction and cleaning
- Document generation for RAG

**3. GraphValidator** (`src/graph/validator.py`)

- LLM-powered validation with Gemini
- Batch processing with configurable size
- Checkpoint system for long-running validations
- Intelligent retry logic for API failures
- Validates nodes with:
  - Low confidence scores
  - Ambiguous types (Lore, Organization, Object)
  - Close type scores (within 2 points)

**4. EdgeBuilder** (`src/graph/edge_builder.py`)

- Property-to-relationship mapping
- Schema constraint validation
- Supports multiple relationship types:
  - Family (`CHILD_OF`, `PARENT_OF`, `SIBLING_OF`, `MARRIED_TO`)
  - Loyalty (`BELONGS_TO`, `SWORN_TO`, `VASSAL_OF`)
  - Geography (`LOCATED_IN`, `SEATED_AT`)
  - Culture (`HAS_CULTURE`, `FOLLOWS_RELIGION`)
  - War (`PARTICIPANT_IN`, `COMMANDED_BY`)
  - Objects (`OWNED_BY`, `WIELDED_BY`, `CREATED_BY`)
  - Meta (`PLAYED_BY`, `APPEARED_IN_SEASON`)

### Schema Constraints

The system enforces semantic correctness through schema constraints. For example:

```python
{
    "CHILD_OF": ["Character", "Creature"],  # Only these types can have parents
    "MARRIED_TO": ["Character"],            # Only characters can marry
    "SEATED_AT": ["House"],                 # Only houses have seats
    "OWNS_WEAPON": ["Character"]            # Only characters own weapons
}
```

This prevents illogical relationships like "Ice (sword) is the father of Jon Snow".
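A rough sketch of how such a constraint table could filter and deduplicate candidate edges. The function and variable names are assumptions for illustration; the real logic lives in `src/graph/edge_builder.py`.

```python
# Hypothetical subset of the constraint table shown above.
ALLOWED_SOURCE_TYPES = {
    "CHILD_OF": ["Character", "Creature"],
    "MARRIED_TO": ["Character"],
    "SEATED_AT": ["House"],
}

def build_edges(candidate_edges, node_types):
    """Keep an edge only if its source node's type is allowed for the relation,
    dropping schema violations and exact duplicates."""
    kept, skipped, seen = [], 0, set()
    for edge in candidate_edges:
        key = (edge["source"], edge["relation"], edge["target"])
        allowed = ALLOWED_SOURCE_TYPES.get(edge["relation"], [])
        if node_types.get(edge["source"]) not in allowed or key in seen:
            skipped += 1
            continue
        seen.add(key)
        kept.append(edge)
    return kept, skipped
```

With this check in place, a candidate edge `Ice -CHILD_OF-> Jon_Snow` is rejected because `Ice` is typed as an Object, which is not in the `CHILD_OF` allow-list.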

## 📊 Data Formats

### Node Schema

```json
{
    "id": "Jon_Snow",
    "type": "Character",
    "confidence": "High (LLM)",
    "reason": "Has actor, born properties",
    "type_scores": {"Character": 5, "House": 0, ...},
    "properties": {
        "Father": "Rhaegar Targaryen",
        "House": "Stark",
        "Actor": "Kit Harington"
    },
    "url": "https://gameofthrones.fandom.com/wiki/Jon_Snow"
}
```

### Edge Schema

```json
{
    "source": "Jon_Snow",
    "relation": "CHILD_OF",
    "target": "Rhaegar_Targaryen"
}
```

### Document Schema

```json
{
    "id": "Jon_Snow",
    "text": "Jon Snow is a Character. Jon Snow is the son of...",
    "metadata": {
        "type": "Character",
        "source": "wiki_dump"
    }
}
```
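As a sketch, the edge and document records above map naturally onto small typed models. The project keeps its real models as Pydantic schemas under `src/schemas/`; the dataclasses below are only an illustrative stand-in.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    """One relationship row from edges.jsonl."""
    source: str
    relation: str
    target: str

@dataclass
class Document:
    """One RAG document row from documents.jsonl."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)

def parse_edge(record):
    """Build an Edge from a parsed JSONL record (raises TypeError on missing keys)."""
    return Edge(**record)
```

Parsing each JSONL line through a typed model like this surfaces malformed records early instead of letting them propagate into the graph.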

πŸ› οΈ Advanced Features

Resume Capability

Both scraper and validator support resuming interrupted runs:

  • Scraper: Tracks processed article IDs in the output file
  • Validator: Uses checkpoint file to track validated nodes

Simply re-run the same command to resume where it left off.

### Batch Processing

The validator processes nodes in configurable batches (default: 10):

- Reduces API calls
- Enables checkpoint saves between batches
- Better error recovery
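The batch-plus-checkpoint loop can be sketched as follows. The `validate` callable stands in for the Gemini call, and all names here are hypothetical; see `src/graph/validator.py` for the actual implementation.

```python
import json
from pathlib import Path

def validate_in_batches(nodes, checkpoint_path, batch_size=10, validate=lambda b: b):
    """Validate nodes in batches, appending each finished batch to a checkpoint
    file so an interrupted run skips node ids it has already processed."""
    done = set()
    if Path(checkpoint_path).exists():
        with open(checkpoint_path, encoding="utf-8") as f:
            done = {json.loads(line)["id"] for line in f}
    pending = [n for n in nodes if n["id"] not in done]
    with open(checkpoint_path, "a", encoding="utf-8") as f:
        for i in range(0, len(pending), batch_size):
            for node in validate(pending[i:i + batch_size]):
                f.write(json.dumps(node) + "\n")
            f.flush()  # persist the checkpoint after every batch
```

Flushing after each batch means at most one batch of API work is repeated after a crash, which is the trade-off batching buys over per-node checkpointing.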

### Error Handling

- **429 Rate Limits**: Exponential backoff with jitter (4s → 8s → 16s → ...)
- **Network Errors**: Automatic retry with delay
- **Parse Errors**: Graceful skip with logging
- **Schema Violations**: Tracked and reported in statistics
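The backoff schedule above (4s → 8s → 16s, plus jitter) amounts to something like this minimal sketch. The exception type and function name are placeholders, not the project's actual error classes.

```python
import random
import time

def with_backoff(call, base_delay=4.0, max_retries=5):
    """Retry `call` on rate-limit errors, doubling the delay each attempt
    (4s -> 8s -> 16s ...) plus random jitter to avoid synchronized retries."""
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:  # stand-in for an HTTP 429 / rate-limit error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
    return call()  # final attempt; the exception propagates if it fails again
```

The jitter term matters when many batches hit the rate limit at once: without it, every retry lands at the same instant and triggers the limit again.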

## 🔧 Configuration

### `cfg/config.json` Structure

```json
{
    "llm_settings": {
        "model_name": "gemini-2.5-flash",
        "temperature": 0.0,
        "max_retries": 1
    },
    "graph_settings": {
        "allowed_types": [
            "Character", "House", "Location", "Battle",
            "Object", "Creature", "Religion", "Episode",
            "Organization", "Event", "Culture", "Lore"
        ]
    },
    "prompts": {
        "validator_system": "You are an expert...",
        "validator_human": "{input_data}"
    }
}
```
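Loading this file presumably reduces to reading the JSON and filling in defaults for any missing settings, along these lines (a sketch with assumed defaults, not the actual `src/config.py`):

```python
import json
from pathlib import Path

# Hypothetical fallback values; the real defaults may differ.
DEFAULT_LLM_SETTINGS = {"temperature": 0.0, "max_retries": 1}

def load_config(path="cfg/config.json"):
    """Read config.json and merge defaults into llm_settings for missing keys."""
    cfg = json.loads(Path(path).read_text(encoding="utf-8"))
    cfg["llm_settings"] = dict(DEFAULT_LLM_SETTINGS, **cfg.get("llm_settings", {}))
    return cfg
```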

## 📈 Statistics & Output

After running the build pipeline, you'll see output like:

```
✅ Heuristic Build Done: 2847 nodes, 2521 docs.
📊 Total: 2847. Processed: 0. To Validate: 1234
✅ Batch 1 processed (10 nodes)
...
💾 Validated nodes saved to data/processed/nodes_validated.jsonl
✅ Edges built: 4562. Skipped by Schema: 237
```

## 🚧 Future Enhancements

- Vector database integration (ChromaDB)
- RAG query engine
- FastAPI REST API
- Neo4j export capability
- Interactive chatbot interface

πŸ“ License

This project is for educational purposes.

πŸ™ Acknowledgments

  • Game of Thrones Fandom Wiki for the data source
  • LangChain for LLM orchestration
  • Google Gemini for entity validation

## 👤 Author

Diego Páez A. ([GitHub](https://github.com/DiegoPaezA))
