Telegram RAG Chatbot

An intelligent Telegram chatbot powered by Retrieval-Augmented Generation (RAG) that provides contextual responses by searching through a knowledge base of crawled website content. Built with FastAPI, ChromaDB, and Google Gemini AI.

Overview

This project implements a production-ready chatbot system that:

- Crawls and indexes website content into a vector database
- Uses semantic search to retrieve relevant context
- Generates intelligent responses using Google's Gemini AI
- Maintains conversation history for context-aware interactions
- Integrates seamlessly with Telegram via webhooks

Features

- Web Crawling: Automated crawling of websites with configurable depth and domain restrictions
- Vector Search: Semantic search using ChromaDB with sentence transformers for accurate context retrieval
- RAG Architecture: Combines retrieved context with LLM generation for accurate, source-backed responses
- Conversation Memory: SQLite-based conversation history for maintaining context across sessions
- Telegram Integration: Native Telegram bot support with webhook-based message handling
- RESTful API: FastAPI endpoints for programmatic access and testing
- Production Ready: Environment-based configuration, error handling, and scalable architecture

Architecture

```
┌─────────────┐
│   Telegram  │
│     Bot     │
└──────┬──────┘
       │ Webhook
       ▼
┌─────────────────┐
│   FastAPI       │
│   Server        │
└──────┬──────────┘
       │
       ├──► ChromaDB (Vector Store)
       ├──► SQLite (Conversations)
       └──► Google Gemini API
```

Components

- Crawler (app/crawl.py): Extracts HTML content from websites
- Ingestion (app/ingest.py): Processes documents, chunks text, and creates embeddings
- Embeddings (app/embeddings.py): Thread-safe sentence transformer wrapper
- Server (app/server.py): FastAPI application with RAG logic and Telegram webhook

Prerequisites

- Python 3.11 or higher
- Google Gemini API key (available from Google AI Studio)
- Telegram Bot Token (created via @BotFather)

Installation

Clone the repository

```
git clone <repository-url>
cd telegram-chat-bot
```

Create a virtual environment

```
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install dependencies
```
pip install -r requirements.txt
```
Configure environment variables
```
cp ENV_EXAMPLE.txt .env
```
Edit .env and set:
- GEMINI_API_KEY: Your Google Gemini API key
- TELEGRAM_BOT_TOKEN: Your Telegram bot token (optional, for Telegram integration)
- TELEGRAM_WEBHOOK_SECRET: A random string for webhook security (optional)

Usage

Initial Setup

Crawl the target website
```
python -m app.crawl
```
This will crawl the website specified in CRAWL_BASE_URL (default: https://witas.fi/) and save HTML files to the data directory.
Ingest content into vector database
```
python -m app.ingest
```
This processes the crawled content, creates text chunks, generates embeddings, and stores them in ChromaDB.
Reset and rebuild (optional)
```
python reset_and_rebuild.py
```
This script clears all data stores and rebuilds from scratch.
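
The crawl step can be pictured as a breadth-first walk that honors the page limit, the same-domain restriction, and the request delay from the configuration. The sketch below is illustrative only and is not the code in app/crawl.py: the `crawl` function, the `LinkParser` class, and the injectable `fetch` callable are hypothetical names chosen so the logic can run without network access.

```python
import time
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(
    base_url: str,
    fetch: Callable[[str], str],  # hypothetical: returns the HTML for a URL
    max_pages: int = 60,
    same_domain_only: bool = True,
    delay_seconds: float = 0.5,
) -> dict[str, str]:
    """Breadth-first crawl; returns {url: html} for at most max_pages pages."""
    base_host = urlparse(base_url).netloc
    queue = deque([base_url])
    seen = {base_url}
    pages: dict[str, str] = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).scheme not in ("http", "https"):
                continue
            if same_domain_only and urlparse(link).netloc != base_host:
                continue
            if link not in seen:
                seen.add(link)
                queue.append(link)
        if queue and delay_seconds > 0:
            time.sleep(delay_seconds)  # be polite between requests
    return pages
```

In real use, `fetch` would wrap an HTTP client and the returned pages would be written to the data directory; both parts are left out here.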

Running the Server

```
python main.py
```

By default the server listens on 0.0.0.0:8000, so it is reachable locally at http://localhost:8000. Set HOST and PORT in your .env to change this.

API Endpoints

GET /health - Health check endpoint
GET /api/ask?question=<text>&session_id=<id> - Ask a question via GET

POST /api/ask - Ask a question via POST

```
{
  "question": "What services do you offer?",
  "session_id": "user123"
}
```

POST /chat - Simplified chat endpoint

```
{
  "message": "Hello",
  "session_id": "user123"
}
```

POST /telegram/webhook/<secret> - Telegram webhook endpoint
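
For programmatic access, a client for POST /api/ask might look like the following minimal sketch, which uses only the standard library. The helper names `build_ask_request` and `ask` are hypothetical, and the assumption that the endpoint replies with a JSON body is not confirmed by this README.

```python
import json
from urllib.request import Request, urlopen


def build_ask_request(
    question: str,
    session_id: str,
    base_url: str = "http://localhost:8000",
) -> Request:
    """Build (but do not send) a JSON POST request for /api/ask."""
    payload = json.dumps({"question": question, "session_id": session_id})
    return Request(
        f"{base_url}/api/ask",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, session_id: str) -> dict:
    """Send the request; requires the server to be running.

    Assumes the response body is JSON, which is not documented here.
    """
    request = build_ask_request(question, session_id)
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.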

Telegram Integration

Set up webhook (replace with your domain and secret):

```
curl -X POST "https://api.telegram.org/bot<YOUR_BOT_TOKEN>/setWebhook" \
  -d "url=https://your-domain.com/telegram/webhook/<YOUR_SECRET>"
```

Test locally using a tool like ngrok:

```
ngrok http 8000
# Use the ngrok URL in the webhook setup
```
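
The same webhook registration can be prepared from Python, and the secret in the webhook path can be compared in constant time on the receiving side. This is a sketch under assumptions: `set_webhook_request` and `secret_matches` are hypothetical helpers, and whether app/server.py validates the secret this way is not stated in this README.

```python
import hmac
from urllib.parse import urlencode


def set_webhook_request(
    bot_token: str, public_base_url: str, secret: str
) -> tuple[str, bytes]:
    """Return the setWebhook URL and form-encoded body (nothing is sent)."""
    api_url = f"https://api.telegram.org/bot{bot_token}/setWebhook"
    webhook_url = f"{public_base_url}/telegram/webhook/{secret}"
    return api_url, urlencode({"url": webhook_url}).encode("ascii")


def secret_matches(path_secret: str, expected_secret: str) -> bool:
    """Compare the path secret with TELEGRAM_WEBHOOK_SECRET in constant time."""
    return hmac.compare_digest(
        path_secret.encode("utf-8"), expected_secret.encode("utf-8")
    )
```

A constant-time comparison avoids leaking how many leading characters of a guessed secret were correct.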

Configuration

All configuration is done via environment variables in .env:

| Variable | Description | Default |
| --- | --- | --- |
| `GEMINI_API_KEY` | Google Gemini API key | Required |
| `GEMINI_MODEL` | Gemini model to use | `gemini-pro` |
| `TELEGRAM_BOT_TOKEN` | Telegram bot token | Optional |
| `TELEGRAM_WEBHOOK_SECRET` | Webhook path secret | Optional |
| `CHROMA_PATH` | ChromaDB storage path | `./chroma` |
| `DATA_PATH` | Data directory for crawled content | `./data` |
| `DB_PATH` | SQLite database path | `./conversations.db` |
| `COLLECTION_NAME` | ChromaDB collection name | `docs` |
| `HOST` | Server host | `0.0.0.0` |
| `PORT` | Server port | `8000` |
| `CRAWL_BASE_URL` | Base URL to crawl | `https://witas.fi/` |
| `CRAWL_MAX_PAGES` | Maximum pages to crawl | `60` |
| `CRAWL_SAME_DOMAIN_ONLY` | Only crawl same domain | `true` |
| `CRAWL_DELAY_SECONDS` | Delay between requests (seconds) | `0.5` |
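
To illustrate how these variables and defaults could be read and type-converted, here is a minimal sketch covering a subset of the table. The `Settings` dataclass and `load_settings` function are hypothetical names, not the project's actual configuration code, and the sketch reads the process environment only; loading the .env file itself is assumed to happen elsewhere.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    gemini_model: str
    host: str
    port: int
    crawl_base_url: str
    crawl_max_pages: int
    crawl_same_domain_only: bool
    crawl_delay_seconds: float


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (os.environ by default), applying defaults."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "")
    if not api_key:
        # The only variable the table marks as required.
        raise RuntimeError("GEMINI_API_KEY is not set")
    same_domain = env.get("CRAWL_SAME_DOMAIN_ONLY", "true").strip().lower()
    return Settings(
        gemini_api_key=api_key,
        gemini_model=env.get("GEMINI_MODEL", "gemini-pro"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        crawl_base_url=env.get("CRAWL_BASE_URL", "https://witas.fi/"),
        crawl_max_pages=int(env.get("CRAWL_MAX_PAGES", "60")),
        crawl_same_domain_only=same_domain in {"1", "true", "yes"},
        crawl_delay_seconds=float(env.get("CRAWL_DELAY_SECONDS", "0.5")),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to exercise in isolation.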

Project Structure

```
telegram-chat-bot/
├── app/
│   ├── __init__.py          # Package initialization
│   ├── crawl.py             # Web crawler module
│   ├── ingest.py            # Document ingestion and vectorization
│   ├── embeddings.py        # Embedding function wrapper
│   └── server.py            # FastAPI application and RAG logic
├── main.py                  # Application entry point
├── reset_and_rebuild.py     # Utility script for data reset
├── requirements.txt         # Python dependencies
├── ENV_EXAMPLE.txt          # Environment variable template
├── general-info.txt         # Static contact information
└── README.md                # This file
```

How It Works

- Content Collection: The crawler visits pages on the target website, extracts HTML, and saves it locally.
- Processing: The ingestion module:
  - Extracts text from HTML (removes scripts, styles)
  - Splits content into chunks of 1200 words with a 150-word overlap
  - Generates embeddings using sentence transformers
  - Stores everything in ChromaDB with metadata (source URLs, file paths)
- Query Processing: When a user asks a question:
  - The question is embedded using the same model
  - ChromaDB performs a cosine similarity search to find relevant chunks
  - The five most relevant chunks are retrieved
  - Conversation history is loaded from SQLite
  - A prompt is constructed with context, history, and system instructions
  - Google Gemini generates a response
  - The conversation is saved to SQLite
- Response Delivery: The response is sent back via Telegram or returned via the API.
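
The chunking and retrieval steps above can be sketched with the standard library alone. This is an illustration, not the project's code: `chunk_words`, `cosine_similarity`, and `top_k` are hypothetical names, and in the real system the vectors come from the sentence transformer model while ChromaDB performs the similarity search.

```python
import math


def chunk_words(text: str, chunk_size: int = 1200, overlap: int = 150) -> list[str]:
    """Split text into word windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Indices of the k chunk vectors most similar to the query, best first."""
    scores = [cosine_similarity(query_vec, vec) for vec in chunk_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.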

Technologies Used

- FastAPI: Modern, fast web framework for building APIs
- ChromaDB: Open-source vector database for embeddings
- Sentence Transformers: State-of-the-art sentence embeddings
- Google Gemini: Large language model for text generation
- SQLite: Lightweight database for conversation storage
- BeautifulSoup4: HTML parsing and text extraction
- Uvicorn: ASGI server for FastAPI

Development

Running Tests

The project currently has no automated test suite; testing is done manually through the API endpoints. To test locally:

```
# Start the server
python main.py

# In another terminal, test the API
curl "http://localhost:8000/api/ask?question=What%20services%20do%20you%20offer?"
```

Code Style

The project follows PEP 8 style guidelines. Consider using:

- black for code formatting
- flake8 or pylint for linting
- mypy for type checking

Troubleshooting

Issue: "GEMINI_API_KEY is not set"

Solution: Ensure your .env file exists and contains a valid GEMINI_API_KEY

Issue: No documents found during ingestion

Solution: Run the crawler first (python -m app.crawl) to populate the data directory

Issue: Telegram webhook not receiving messages

Solution: Verify the webhook URL is correct and accessible. Check that TELEGRAM_BOT_TOKEN is set correctly.

Issue: Poor response quality

Solution: Ensure the vector database is properly populated. Try resetting and rebuilding with python reset_and_rebuild.py

Future Enhancements

- Add support for multiple knowledge bases
- Implement streaming responses for better UX
- Add authentication for API endpoints
- Support file uploads (PDFs, docs) for the knowledge base
- Add an admin dashboard for monitoring and analytics
- Add multi-language support
- Add rate limiting and usage tracking

Author

Developed as part of a portfolio project demonstrating expertise in:

- RAG (Retrieval-Augmented Generation) systems
- Vector databases and semantic search
- API development with FastAPI
- Telegram bot integration
- Web scraping and data processing

For questions or issues, please open an issue on the repository.
