This project is a modular Retrieval-Augmented Generation (RAG) system built with Google DeepMind's Gemma 3, served locally using Ollama. It allows users to upload documents (PDF, TXT, Markdown, etc.) and then chat with their content using natural language queries, all processed through a local setup for privacy and full control.

Designed with modularity and performance in mind, the system handles end-to-end workflows including file ingestion, vector embedding, history summarization, document retrieval, context-aware response generation, and streaming replies to a frontend. It supports multi-file embeddings per user, persistent session history and document storage, and offers live document previews, making it a complete end-to-end RAG pipeline useful for educational or personal assistants.
Check out the live project deployment:
- RAG with Gemma-3
- Project Details
- Tech Stack
- Installation
- Extra Measures
- Future Work
- Contributions
- License
- Contact
The core objective of this project is to build a robust RAG system with modern components, a clean modular design, and proper error handling.

- Make a responsive UI in `Streamlit` that allows users to upload documents, preview them to ensure correctness, and interact with them.
- Use `FastAPI` to build a backend that handles file uploads, document processing, user authentication, and streaming LLM responses.
- Code a modular `LLM System` using `LangChain` components for chains, embeddings, retrievers, vector storage, history management, output parsers, and overall LLM orchestration.
- Integrate the locally hosted `Gemma-3` LLM using `Ollama` for local inference.
- Use `FAISS` for efficient vector storage, similarity search, and user-specific document storage and retrieval.
- Use `SQLite-3` for user management, authentication, and data control.
- Create a dynamic `Docker` setup that works as either a development or deployment environment.
- Deploy the project on `Hugging Face Spaces` for easy access and demonstration.
Note
Due to hosting limitations of Gemma3, the Hugging Face Space deployment uses Google Gemini-2.0-Flash-Lite as the LLM backend.
- **User Authentication:**
- **UI and User Controls:**
  - Build a responsive UI as a `Streamlit` app. Provide a chat interface for users to ask questions about their documents, get file previews, and receive context-aware responses.
  - User-uploaded files and corresponding data are tracked in a SQLite-3 database.
  - Allow users to delete their uploaded documents and manage their session history.
    - Note: File previews are cached for 10 minutes, so a preview may remain available for that duration even after deletion.
  - Works with FastAPI SSE to show real-time responses from the LLM, along with the retrieved documents and metadata for verification.
  - The UI also supports thinking models, showing the LLM's thought process while it generates responses.
- **User-wise Document Management:**
  - Support multi-file embeddings per user, allowing users to upload multiple documents and retrieve relevant information based on their queries.
  - Some documents can also be added as public documents accessible to all users (e.g., shared rulebooks, manuals, or documentation).
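The per-user plus public retrieval described above boils down to a metadata filter at search time. Here is a minimal sketch in plain Python; the field names (`owner`, `is_public`) are illustrative, not the project's actual schema:

```python
# Sketch of the "own documents + public documents" visibility rule.
# Metadata field names here are hypothetical.

def is_visible(doc_meta: dict, user_id: str) -> bool:
    """A document is retrievable if the user owns it or it is public."""
    return doc_meta.get("is_public", False) or doc_meta.get("owner") == user_id

docs = [
    {"id": "d1", "owner": "alice", "is_public": False},
    {"id": "d2", "owner": "bob",   "is_public": False},
    {"id": "d3", "owner": "bob",   "is_public": True},   # shared rulebook
]

visible = [d["id"] for d in docs if is_visible(d, "alice")]
print(visible)  # ['d1', 'd3']
```

LangChain's FAISS store accepts a similar callable (or a metadata dict) as the `filter` argument to `similarity_search`, which is one way such per-user filtering can be wired in.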
- **Embeddings, Vector Storage and Retrieval:**
  - Implement vector embeddings using `LangChain` components to convert documents into vector representations.
  - The open-source `mxbai-embed-large` model, a lightweight and efficient embedding model, is used to generate embeddings.
  - Use `FAISS` for efficient vector storage and retrieval of user-specific and public documents.
  - Integrate similarity search and document retrieval with Gemma-based LLM responses.
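Conceptually, the retrieval described above embeds documents and queries into vectors and returns the stored chunks closest to the query. A toy, dependency-free illustration of that idea: here `toy_embed` (a bag-of-letters counter) stands in for `mxbai-embed-large`, and a linear scan stands in for FAISS:

```python
# Toy illustration of embedding + similarity search. The real system
# uses mxbai-embed-large (via Ollama) and FAISS instead of these stand-ins.
import math

def toy_embed(text: str) -> list[float]:
    # Hypothetical stand-in embedding: letter-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# "Vector store": map each chunk to its embedding.
store = {t: toy_embed(t) for t in ["refund policy", "setup guide"]}

# Retrieval: embed the query, return the nearest stored chunk.
query = toy_embed("how do refunds work")
best = max(store, key=lambda t: cosine(store[t], query))
print(best)  # prints: refund policy
```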
- **FastAPI Backend:**
  - Build a FastAPI backend to handle file uploads, document processing, user authentication, and streaming LLM responses.
  - Integrate with the 'LLM System' module to handle LLM tasks.
  - Provide status updates to the UI for long-running tasks.
  - Implement Server-Sent Events (`SSE`) for real-time streaming of LLM responses to the frontend, using the NDJSON format for data transfer.
  - Provide the UI with retrieved documents and metadata for verification of responses.
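The NDJSON framing used for streaming is simple to picture: each event is one JSON object per line. A minimal sketch (event names like `token` and `sources` are illustrative, not the backend's exact schema):

```python
# Sketch of NDJSON streaming: one JSON object per line.
import json

def ndjson_stream(events):
    """Serialize events as newline-delimited JSON, one per line."""
    for event in events:
        yield json.dumps(event) + "\n"

events = [
    {"type": "sources", "docs": ["manual.pdf"]},   # retrieved docs first
    {"type": "token", "text": "Hello"},            # then streamed tokens
    {"type": "token", "text": " world"},
]
payload = "".join(ndjson_stream(events))

# The client splits on newlines and json.loads each line:
decoded = [json.loads(line) for line in payload.splitlines()]
```

On the FastAPI side, such a generator can be returned via `StreamingResponse(..., media_type="application/x-ndjson")`; the exact event schema used by the backend may differ.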
- **LLM System:**
  - Modular `LLM System` using `LangChain` components for:
    - Document Ingestion: Load files and process them into document chunks.
    - Vector Embedding: Convert documents into vector representations.
    - History Summarization: Summarize user session history for querying vector embeddings and retrieving relevant documents.
    - Document Retrieval: Fetch relevant documents based on the standalone query and the user's metadata filters.
    - History Management: Maintain session history for context-aware interactions.
    - Response Generation: Generate context-aware responses using the LLM.
    - Tracing: Enable tracing of LLM interactions using `LangSmith` for debugging and monitoring.
    - Models: Use `Ollama` to run the Gemma-3 LLM and mxbai embeddings locally for inference, ensuring low latency and privacy.
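The History Summarization step above can be pictured as condensing the chat history plus the new message into one standalone retrieval query. A plain-Python sketch of the prompt-building half (the wording and function name are hypothetical, not the project's actual prompt):

```python
# Hypothetical sketch: build the prompt that asks the LLM to rewrite a
# follow-up question as a standalone query for the retriever.

def build_condense_prompt(history: list[tuple[str, str]], question: str) -> str:
    lines = [f"{role}: {msg}" for role, msg in history]
    return (
        "Rewrite the follow-up question as a standalone question.\n"
        "Chat history:\n" + "\n".join(lines) +
        f"\nFollow-up: {question}\nStandalone question:"
    )

prompt = build_condense_prompt(
    [("user", "What is FAISS?"), ("ai", "A vector index.")],
    "How do I install it?",
)
```

In the real system this prompt would be sent to the LLM, and the rewritten standalone question would then drive the vector-store retrieval.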
- **Dockerization:**
  - Create a dynamic `Docker` setup that works as either a development or deployment environment.
  - Use a `Dockerfile` to manage both the FastAPI and Streamlit servers in a single container (mainly due to Hugging Face Spaces limitations).
- LangChain
- FastAPI
- Streamlit
- Docker
- Ollama
- Gemma-3
- mxbai-embed-large
- FAISS
- SQLite-3
- LangSmith
- bcrypt
Others:
- Hugging Face Spaces:
  - Deploy the project in a Docker container using the Dockerfile.
- GitHub Actions and Branch Protection:
  - Auto-deploy the repository to Hugging Face Spaces.
  - Check the code for any secret leaks.
  - Fail the commit on any secret leak.
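The secret-leak check amounts to scanning committed text for strings that look like credentials. This toy sketch uses a single illustrative regex; a real pipeline would typically rely on a dedicated scanner (e.g. gitleaks) wired into GitHub Actions:

```python
# Toy illustration of a secret scan. The pattern below is a hypothetical
# example, not the rule set actually used by this repository's CI.
import re

SECRET_PATTERNS = [
    # e.g. API_KEY = "abcd1234abcd1234abcd"
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]"),
]

def find_leaks(text: str) -> list[str]:
    """Return every line that matches a known secret pattern."""
    return [line for line in text.splitlines()
            if any(p.search(line) for p in SECRET_PATTERNS)]
```

A CI job would run such a scan over the diff and exit non-zero when `find_leaks` returns anything, failing the commit.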
There are two ways to run this project: directly using a virtual environment, or using the Dockerfile.
- Clone the repository:

  ```bash
  git clone --depth 1 https://github.com/Bbs1412/rag-with-gemma3.git
  ```

- Create a virtual environment and install dependencies:

  ```bash
  # Create environment:
  python -m venv venv

  # Activate environment:
  source venv/bin/activate      # On Linux/Mac
  # venv\Scripts\activate       # On Windows

  # Install dependencies:
  pip install -r requirements.txt
  ```

- (Optional) If you want to use LangSmith tracing, create a `.env` file in the `server` directory and add these credentials:

  ```bash
  # ./server/.env
  LANGCHAIN_TRACING_V2=true
  LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
  LANGCHAIN_API_KEY="<paste_your_api_key_here>"
  LANGCHAIN_PROJECT="rag-with-gemma3"
  ```

- Start the FastAPI server:

  ```bash
  cd server
  uvicorn server:app

  # For development with hot-reloading:
  # uvicorn server:app --reload --port 8000
  ```

- Start the Streamlit server:

  ```bash
  cd ..
  streamlit run app.py
  ```

- You can now access these servers:
  - FastAPI backend at http://localhost:8000
  - Streamlit frontend at http://localhost:8501
  - FastAPI Swagger UI at http://localhost:8000/docs
The Dockerfile is coded dynamically to support both development and deployment environments.

- Development:
  - The project uses `http://host.docker.internal:11434` as the Ollama server for local inference, so that Ollama models already present on the host machine are accessible from the Docker container.
  - In this environment, all three ports {8000: FastAPI, 8501: Streamlit, 11434: Ollama} are exposed for easy access.
- Deployment:
  - The project uses Google `Gemini-2.0-Flash-Lite` as the LLM and `text-embedding-004` as the embedding model, primarily due to deployment and API limitations of the Gemma3 model.
  - In this environment, only port 7860 is exposed, for the Streamlit frontend.
- Build the Docker image:

  ```bash
  docker build -t bbs/rag-with-gemma3:dev --build-arg ENV_TYPE=dev .
  ```

- Create a Docker container:

  ```bash
  # The four LANGCHAIN_* variables are optional, for LangSmith tracing.
  # Ports: 8000 (FastAPI), 8501 (Streamlit), 11434 (Ollama).
  docker create --name rag-gemma-cont-dev \
      -e ENV_TYPE=dev \
      -e LANGCHAIN_TRACING_V2=true \
      -e LANGCHAIN_ENDPOINT="https://api.smith.langchain.com" \
      -e LANGCHAIN_API_KEY="<paste_your_api_key_here>" \
      -e LANGCHAIN_PROJECT="rag-with-gemma3" \
      -p 8000:8000 -p 8501:8501 -p 11434:11434 \
      bbs/rag-with-gemma3:dev
  ```

- Start the Docker container:

  ```bash
  docker start -a rag-gemma-cont-dev
  ```

- You can now access these servers:
  - FastAPI backend at http://localhost:8000
  - Streamlit frontend at http://localhost:8501
- Build the Docker image:

  ```bash
  docker build -t bbs/rag-with-gemma3:prod --build-arg ENV_TYPE=deploy .
  ```

- Create a Docker container:

  ```bash
  # The four LANGCHAIN_* variables are optional, for LangSmith tracing.
  # GOOGLE_API_KEY is required in this environment.
  # Only port 7860 is exposed.
  docker create --name rag-gemma-cont-prod \
      -e ENV_TYPE=deploy \
      -e LANGCHAIN_TRACING_V2=true \
      -e LANGCHAIN_ENDPOINT="https://api.smith.langchain.com" \
      -e LANGCHAIN_API_KEY="<paste_your_api_key_here>" \
      -e LANGCHAIN_PROJECT=deployed-rag-gemma3 \
      -e GOOGLE_API_KEY="<paste_your_google_api_key_here>" \
      -p 7860:7860 \
      bbs/rag-with-gemma3:prod
  ```

- Start the Docker container:

  ```bash
  docker start -a rag-gemma-cont-prod
  ```

- You can now access the project at http://localhost:7860
- To ensure that user data persists when the container is stopped or removed, you can mount a local directory to the container's storage directory.
- Do this by adding the `-v` flag to the `docker create` command:

  ```bash
  docker create --name rag-gemma-cont-dev \
      -e ENV_TYPE=dev \
      -v /path/to/local/storage:/app/storage \
      # Other flags...
  ```
- Remove all cache files:
  - Linux/Mac:

    ```bash
    find . -type d -name "__pycache__" -exec rm -r {} +
    ```

  - Windows:

    ```powershell
    Get-ChildItem -Recurse -Directory -Filter "__pycache__" | Remove-Item -Recurse -Force
    ```

- Clear the SQLite database:

  ```bash
  python sq_db.py
  ```

- Clear all user data:

  ```bash
  # Delete all database indices:
  rm -rf ./user_faiss/

  # Delete all user data:
  rm -rf ./user_data/
  ```
- The Ollama server is configured to run on `http://host.docker.internal:11434` by default, which works out of the box on Windows and macOS.
- On Linux, Docker does not support `host.docker.internal` automatically. To fix this, add the following flag to the `docker create` command:

  ```bash
  --add-host=host.docker.internal:host-gateway
  ```
- To change the LLM or embedding model:
  - Go to the `./server/llm_system/config.py` file. It is the central configuration file for the project; any constant can be changed there and is picked up throughout the project.
  - Two different models are saved in the config, but the same model is used for both response generation and summarization. If you want to change this, update the summarization model in `server.py` (line 63).
- To change the inference device:
  - The LLM model is configured to run on the GPU and the embedding model on the CPU.
  - If you want to use the GPU for embeddings too, change the `num_gpu` parameter in `./server/llm_system/core/database.py` (line 58).
  - `0` means 100% CPU, `-1` means 100% GPU, and any other number specifies how many of the model's layers to offload to the GPU.
  - Delete this parameter if you are unsure of these values or your hardware capabilities; Ollama dynamically offloads layers to the GPU based on available resources.
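For reference, this is roughly where such a parameter could appear, assuming the embeddings are built with `langchain-ollama`'s `OllamaEmbeddings` (which exposes Ollama's `num_gpu` option). Treat this as a hedged configuration sketch, not the project's exact code:

```python
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model="mxbai-embed-large",
    num_gpu=-1,  # 0 = 100% CPU, -1 = 100% GPU, N = offload N layers to GPU
)
# Omitting num_gpu entirely lets Ollama decide based on available resources.
```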
Note
If you are using Docker, make sure to make these changes in the ./docker/dev_* files.
- Run the loader as a module; this ensures that relative imports work correctly in the project:

  ```bash
  cd server
  python -m llm_system.utils.loader
  ```
- Add support for more file formats like DOCX, PPTX, etc.
- Add web-based loading so that any website can be loaded and queried on the go.
- Create docker-compose setup for easier management of multiple containers.
Any contributions or suggestions are welcome!
- This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.
- You may use the code with proper credit to the author.
- Email - bhushanbsongire@gmail.com


