# GoT RAG Chatbot

A sophisticated Knowledge Graph and Retrieval-Augmented Generation (RAG) system for Game of Thrones lore. This project combines web scraping, heuristic entity extraction, LLM-powered validation, and graph construction to create an intelligent, context-aware knowledge base.
## Features

- Intelligent Web Scraping: automated extraction from the Game of Thrones Fandom Wiki with resume capability
- Hybrid Knowledge Graph Construction:
  - Heuristic-based entity extraction and type classification
  - Optional LLM validation (Gemini) for improved accuracy
  - Schema-aware relationship building with business-logic constraints
- Advanced Text Processing: WikiText parsing, infobox extraction, and intelligent text cleaning
- Resumable Pipeline: a checkpoint system lets interrupted processes resume without data loss
- Batch Processing: efficient handling of large datasets with configurable batch sizes
- Error Handling: robust retry logic with exponential backoff for API rate limits
- Extensible Architecture: modular design with a clear separation of concerns
## 📁 Project Structure

```
got-rag-chatbot/
│
├── .env                           # Environment variables (GOOGLE_API_KEY for Gemini)
├── .gitignore
├── README.md
├── requirements.txt
├── pyproject.toml                 # Project metadata and dependencies
│
├── cfg/                           # Configuration files
│   └── config.json                # LLM settings, prompts, and schema definitions
│
├── data/                          # Data storage (gitignored)
│   ├── raw/
│   │   └── wiki_dump.jsonl        # Raw scraped data from wiki
│   ├── processed/                 # Generated knowledge graph files
│   │   ├── nodes.jsonl            # Heuristic nodes
│   │   ├── nodes_validated.jsonl  # LLM-validated nodes
│   │   ├── nodes_llm_checkpoint.jsonl  # Validation checkpoint
│   │   ├── edges.jsonl            # Graph relationships
│   │   └── documents.jsonl        # Text documents for RAG
│   └── chromadb/                  # Vector database (future feature)
│
├── src/                           # Main application source code
│   ├── __init__.py
│   ├── config.py                  # Configuration loader
│   │
│   ├── core/                      # Core components
│   │   ├── database.py            # Vector DB connection logic
│   │   └── llm.py                 # LLM client setup
│   │
│   ├── ingestion/                 # Data extraction and processing
│   │   ├── scraper.py             # Fandom Wiki scraper with resume capability
│   │   ├── processor.py           # Text processing and cleaning
│   │   └── loader.py              # Database loading logic
│   │
│   ├── graph/                     # Knowledge Graph construction
│   │   ├── builder.py             # Heuristic node extraction and typing
│   │   ├── validator.py           # LLM-powered node validation
│   │   └── edge_builder.py        # Schema-aware relationship extraction
│   │
│   ├── utils/                     # Utility functions
│   │   └── text.py                # Text cleaning and normalization
│   │
│   ├── rag/                       # RAG system (future enhancement)
│   │   ├── retriever.py
│   │   ├── chain.py
│   │   └── engine.py
│   │
│   ├── schemas/                   # Pydantic models
│   │   ├── chat.py
│   │   └── document.py
│   │
│   └── api/                       # FastAPI web layer (future feature)
│       ├── main.py
│       ├── routes.py
│       └── dependencies.py
│
└── main.py                        # CLI orchestrator
```

## Prerequisites

- Python 3.8+
- uv (recommended) or pip
- Google Gemini API Key (for LLM validation feature)
## Installation

- Clone the repository:

```bash
git clone https://github.com/DiegoPaezA/got-rag-chatbot.git
cd got-rag-chatbot
```

- Install dependencies:

```bash
# Install dependencies and create virtual environment
uv sync

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

- Set up environment variables. Create a `.env` file in the project root:

```
GOOGLE_API_KEY=your_gemini_api_key_here
```

## Configuration

Edit `cfg/config.json` to customize:
- LLM settings (model, temperature, retries)
- Entity types and validation prompts
- Schema constraints for relationships
## Usage

### 1. Scrape the Wiki

Download raw data from the Game of Thrones Fandom Wiki:

```bash
python main.py scrape
```

Features:
- Automatically resumes if interrupted
- Avoids duplicates using ID tracking
- Progress bar with real-time statistics
- Polite rate limiting (0.1s delay between requests)
Output: `data/raw/wiki_dump.jsonl`
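The resume and deduplication behavior boils down to replaying the dump file before scraping: any article ID already written to `wiki_dump.jsonl` is skipped on the next run. A minimal sketch of that pattern (function names and the `fetch` hook are illustrative, not the project's actual API, which lives in `src/ingestion/scraper.py`):

```python
import json
from pathlib import Path


def load_processed_ids(dump_path: str) -> set:
    """Collect the IDs of articles already written to the JSONL dump."""
    path = Path(dump_path)
    if not path.exists():
        return set()
    ids = set()
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                ids.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                continue  # tolerate a truncated last line from an interrupted run
    return ids


def scrape(article_ids, dump_path="wiki_dump.jsonl", fetch=lambda i: {"id": i}):
    """Append only articles not yet present in the dump (resumable, no duplicates)."""
    done = load_processed_ids(dump_path)
    with open(dump_path, "a", encoding="utf-8") as out:
        for article_id in article_ids:
            if article_id in done:
                continue  # already scraped on a previous run
            record = fetch(article_id)  # in the real scraper: a MediaWiki API call
            out.write(json.dumps(record) + "\n")
            done.add(article_id)
```

Because every record is flushed to the same file that seeds the skip-set, killing the process mid-run loses at most the article in flight.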
### 2. Build the Knowledge Graph

Extract entities and relationships from the raw data:

```bash
# Build with heuristics only (fast)
python main.py build --use_heuristic

# Build with LLM validation (more accurate, requires API key)
python main.py build --use-llm --clean-llm
```

### 3. Run the Chatbot

3.1. Run the Neo4j instance:

```bash
docker-compose -f docker-compose.yml up -d
```

3.2. Start the RAG chatbot.

Run the Streamlit app:

```bash
uv run streamlit run src/app.py
```

Run the CLI chatbot:

```bash
uv run python test_rag.py
```

**Pipeline Steps:**
**Heuristic build (`builder.py`):**

- Parses WikiText and extracts infoboxes
- Applies a scoring system to classify entities (Character, House, Location, etc.)
- Generates confidence scores (High/Medium/Low)
- Output: `data/processed/nodes.jsonl`
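The scoring idea can be sketched as a toy classifier: count how many of a type's signature infobox keys appear on the page, pick the highest-scoring type, and derive confidence from the margin over the runner-up. The keyword sets and thresholds below are invented for illustration; the project's real heuristics live in `src/graph/builder.py`:

```python
# Toy heuristic: each candidate type has a set of "signature" infobox keys.
# These keyword sets are illustrative, not the project's actual lists.
TYPE_SIGNALS = {
    "Character": {"actor", "born", "died", "father", "mother", "culture"},
    "House": {"seat", "sigil", "words", "lord", "vassals"},
    "Location": {"region", "rulers", "religion", "population"},
}


def classify(infobox_keys):
    """Return (best_type, confidence, per-type scores) for a set of infobox keys."""
    keys = {k.lower() for k in infobox_keys}
    scores = {t: len(keys & signals) for t, signals in TYPE_SIGNALS.items()}
    best = max(scores, key=scores.get)
    top = scores[best]
    runner_up = max(v for t, v in scores.items() if t != best)
    # A clear winner gets High confidence; a near-tie is only Medium or Low.
    if top - runner_up >= 2:
        confidence = "High"
    elif top > runner_up:
        confidence = "Medium"
    else:
        confidence = "Low"
    return best, confidence, scores
```

A page whose infobox has `Actor`, `Born`, and `Father` keys scores 3 for Character and 0 elsewhere, so it classifies as a high-confidence Character.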
**LLM validation (`validator.py`):**

- Validates low-confidence and ambiguous nodes
- Uses Google Gemini with structured output
- Batch processing with a checkpoint system
- Handles API rate limits with exponential backoff
- Output: `data/processed/nodes_validated.jsonl`
**Edge building (`edge_builder.py`):**

- Extracts relationships from node properties
- Validates edges against schema constraints
- Prevents semantic errors (e.g., "Sword is father of Character")
- Deduplicates relationships
- Output: `data/processed/edges.jsonl`
## Data Flow

```
┌──────────────┐
│   Wiki API   │
└──────┬───────┘
       │  Scraper (resumable)
       ▼
┌──────────────────┐
│  wiki_dump.jsonl │
└────────┬─────────┘
         │  Builder (heuristic scoring)
         ▼
    ┌─────────┐        ┌─────────────┐
    │  nodes  │───────▶│  documents  │
    └────┬────┘        └─────────────┘
         │                 (for RAG)
         │  Validator (LLM)
         ▼
┌──────────────────┐
│  nodes_validated │
└────────┬─────────┘
         │  EdgeBuilder (schema-aware)
         ▼
    ┌─────────┐
    │  edges  │
    └─────────┘
```

## Components

**Scraper (`scraper.py`):**

- MediaWiki API integration
- Checkpoint-based resume capability
**Builder (`builder.py`):**

- WikiText parsing with mwparserfromhell
- Infobox extraction
- Heuristic type classification using scoring system
- Property extraction and cleaning
- Document generation for RAG
**Validator (`validator.py`):**

- LLM-powered validation with Gemini
- Batch processing with configurable size
- Checkpoint system for long-running validations
- Intelligent retry logic for API failures
- Validates nodes with:
- Low confidence scores
- Ambiguous types (Lore, Organization, Object)
- Close type scores (within 2 points)
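Those three triggers can be combined into a single predicate. Field names follow the example node shown later in this README; the `<= 2` comparison encodes the "within 2 points" rule (the exact field names and threshold handling in `validator.py` may differ):

```python
# Types the heuristics often confuse, so the LLM double-checks them.
AMBIGUOUS_TYPES = {"Lore", "Organization", "Object"}


def needs_llm_validation(node: dict) -> bool:
    """True if a heuristic node should be re-checked by the LLM."""
    if node.get("confidence") == "Low":
        return True  # trigger 1: low confidence score
    if node.get("type") in AMBIGUOUS_TYPES:
        return True  # trigger 2: ambiguous type
    # Trigger 3: winner and runner-up type scores within 2 points of each other.
    scores = sorted(node.get("type_scores", {}).values(), reverse=True)
    if len(scores) >= 2 and scores[0] - scores[1] <= 2:
        return True
    return False
```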
**EdgeBuilder (`edge_builder.py`):**

- Property-to-relationship mapping
- Schema constraint validation
- Supports multiple relationship types:
- Family (CHILD_OF, PARENT_OF, SIBLING_OF, MARRIED_TO)
- Loyalty (BELONGS_TO, SWORN_TO, VASSAL_OF)
- Geography (LOCATED_IN, SEATED_AT)
- Culture (HAS_CULTURE, FOLLOWS_RELIGION)
- War (PARTICIPANT_IN, COMMANDED_BY)
- Objects (OWNED_BY, WIELDED_BY, CREATED_BY)
- Meta (PLAYED_BY, APPEARED_IN_SEASON)
The system enforces semantic correctness through schema constraints. For example:
```python
{
    "CHILD_OF": ["Character", "Creature"],  # Only these types can have parents
    "MARRIED_TO": ["Character"],            # Only characters can marry
    "SEATED_AT": ["House"],                 # Only houses have seats
    "OWNS_WEAPON": ["Character"]            # Only characters own weapons
}
```

This prevents illogical relationships like "Ice (sword) is the father of Jon Snow".
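Enforcing such constraints at edge-build time is essentially a lookup on the source node's type, plus deduplication. A minimal sketch under that assumption (the dictionary and function names are illustrative, not the actual `edge_builder.py` API):

```python
# Relation -> set of node types allowed as the edge's source.
SCHEMA = {
    "CHILD_OF": {"Character", "Creature"},
    "MARRIED_TO": {"Character"},
    "SEATED_AT": {"House"},
    "OWNS_WEAPON": {"Character"},
}


def build_edges(candidates, node_types):
    """Keep only schema-valid, deduplicated (source, relation, target) edges."""
    kept, skipped, seen = [], 0, set()
    for source, relation, target in candidates:
        allowed = SCHEMA.get(relation)  # unknown relations pass through unchecked
        if allowed is not None and node_types.get(source) not in allowed:
            skipped += 1  # e.g. a sword cannot be CHILD_OF anything
            continue
        key = (source, relation, target)
        if key in seen:
            continue  # deduplicate repeated relationships
        seen.add(key)
        kept.append({"source": source, "relation": relation, "target": target})
    return kept, skipped
```

The `skipped` counter corresponds to the "Skipped by Schema" figure reported in the build output.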
### Example Node

```json
{
  "id": "Jon_Snow",
  "type": "Character",
  "confidence": "High (LLM)",
  "reason": "Has actor, born properties",
  "type_scores": {"Character": 5, "House": 0, ...},
  "properties": {
    "Father": "Rhaegar Targaryen",
    "House": "Stark",
    "Actor": "Kit Harington"
  },
  "url": "https://gameofthrones.fandom.com/wiki/Jon_Snow"
}
```

### Example Edge

```json
{
  "source": "Jon_Snow",
  "relation": "CHILD_OF",
  "target": "Rhaegar_Targaryen"
}
```

### Example Document

```json
{
  "id": "Jon_Snow",
  "text": "Jon Snow is a Character. Jon Snow is the son of...",
  "metadata": {
    "type": "Character",
    "source": "wiki_dump"
  }
}
```

## Resumability

Both the scraper and the validator support resuming interrupted runs:
- Scraper: Tracks processed article IDs in the output file
- Validator: Uses checkpoint file to track validated nodes
Simply re-run the same command to resume where it left off.
## Batch Processing

The validator processes nodes in configurable batches (default: 10):
- Reduces API calls
- Enables checkpoint saves between batches
- Better error recovery
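The batching loop amounts to slicing the pending nodes and flushing a checkpoint after each slice, so a crash loses at most one batch of work. A sketch of that shape, with `validate_batch` standing in for the Gemini call and `save_checkpoint` for the checkpoint-file write (both names hypothetical):

```python
def process_in_batches(nodes, validate_batch, save_checkpoint, batch_size=10):
    """Validate nodes in slices, checkpointing after every batch."""
    results = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        results.extend(validate_batch(batch))  # one LLM call per batch
        save_checkpoint(results)               # safe point to resume from
    return results
```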
## Error Handling

- 429 Rate Limits: Exponential backoff with jitter (4s → 8s → 16s → ...)
- Network Errors: Automatic retry with delay
- Parse Errors: Graceful skip with logging
- Schema Violations: Tracked and reported in statistics
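The 4s → 8s → 16s schedule above doubles a base delay per attempt and adds random jitter so that parallel retries do not hammer the API in lockstep. A sketch (parameter names are illustrative; the project's actual retry code may differ):

```python
import random


def backoff_delays(retries, base=4.0, jitter=1.0):
    """Yield exponential backoff delays: base * 2**attempt, plus uniform jitter."""
    for attempt in range(retries):
        yield base * (2 ** attempt) + random.uniform(0.0, jitter)
```

With `jitter=0.0` the first three delays come out to exactly 4, 8, and 16 seconds, matching the schedule quoted above.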
Example `cfg/config.json`:

```json
{
  "llm_settings": {
    "model_name": "gemini-2.5-flash",
    "temperature": 0.0,
    "max_retries": 1
  },
  "graph_settings": {
    "allowed_types": [
      "Character", "House", "Location", "Battle",
      "Object", "Creature", "Religion", "Episode",
      "Organization", "Event", "Culture", "Lore"
    ]
  },
  "prompts": {
    "validator_system": "You are an expert...",
    "validator_human": "{input_data}"
  }
}
```

After running the build pipeline, you'll see output like:
```
✅ Heuristic Build Done: 2847 nodes, 2521 docs.
📊 Total: 2847. Processed: 0. To Validate: 1234
✅ Batch 1 processed (10 nodes)
...
💾 Validated nodes saved to data/processed/nodes_validated.jsonl
✅ Edges built: 4562. Skipped by Schema: 237
```
## Future Enhancements

- Vector database integration (ChromaDB)
- RAG query engine
- FastAPI REST API
- Neo4j export capability
- Interactive chatbot interface
## License

This project is for educational purposes.
## Acknowledgments

- Game of Thrones Fandom Wiki for the data source
- LangChain for LLM orchestration
- Google Gemini for entity validation
Diego Páez A. - GitHub