This project is a modular Retrieval-Augmented Generation (RAG) system built with Google DeepMind's Gemma 3, served locally using Ollama. It allows users to upload documents (PDF, TXT, Markdown, etc.) and then chat with their content using natural language queries, all processed through a local setup for privacy and full control.

Designed with modularity and performance in mind, the system handles end-to-end workflows including file ingestion, vector embedding, history summarization, document retrieval, context-aware response generation, and streaming replies to a frontend. It supports multi-file embeddings per user, persistent session history and document storage, and offers live document previews, making it a complete end-to-end RAG pipeline useful for educational or personal assistants.
Check out the live project deployment:
- RAG with Gemma-3
- Project Details
- Tech Stack
- Installation
- Extra Measures
- Future Work
- Contributions
- License
- Contact
The core objective of this project is to build a robust RAG system with modern components, a clean modular design, and proper error handling.

- Make a responsive UI in `Streamlit` that allows users to upload documents, preview them to ensure correctness, and interact with them.
- Use `FastAPI` to build a backend that handles file uploads, document processing, user authentication, and streaming LLM responses.
- Code a modular `LLM System` using `LangChain` components for chains, embeddings, retrievers, vector storage, history management, output parsers, and overall LLM orchestration.
- Integrate the locally hosted `Gemma-3` LLM using `Ollama` for local inference.
- Use `FAISS` for efficient vector storage, similarity search, and user-specific document storage and retrieval.
- Use `SQLite-3` for user management, authentication, and data control.
- Create a dynamic `Docker` setup that works as either a development or deployment environment.
- Deploy the project on `Hugging Face Spaces` for easy access and demonstration.
Note
Due to hosting limitations of Gemma3, the Hugging Face Space deployment uses Google Gemini-2.0-Flash-Lite as the LLM backend.
- **User Authentication:**
- **UI and User Controls:**
  - Build a responsive UI as a `Streamlit` app. Provide a chat interface for users to ask questions about their documents, get file previews, and receive context-aware responses.
  - User-uploaded files and corresponding data are tracked in a SQLite-3 database.
  - Allow users to delete their uploaded documents and manage their session history.
    - Note: File previews are cached for 10 minutes, so a preview may remain available for that duration even after deletion.
  - Works with FastAPI SSE to show real-time responses from the LLM, along with the retrieved documents and metadata for verification.
  - The UI also supports thinking models, showing the LLM's thought process while it generates responses.
- **User-wise Document Management:**
  - Support multi-file embeddings per user, allowing users to upload multiple documents and retrieve relevant information based on their queries.
  - Some documents can also be added as public documents accessible to all users (e.g., shared rulebooks, manuals, or documentation).
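The per-user plus public retrieval described above boils down to a metadata filter at search time. Here is a minimal sketch in plain Python; the field names (`owner`, `is_public`) are illustrative, not the project's actual schema:

```python
# Sketch of the "own documents + public documents" visibility rule.
# Metadata field names here are hypothetical.

def is_visible(doc_meta: dict, user_id: str) -> bool:
    """A document is retrievable if the user owns it or it is public."""
    return doc_meta.get("is_public", False) or doc_meta.get("owner") == user_id

docs = [
    {"id": "d1", "owner": "alice", "is_public": False},
    {"id": "d2", "owner": "bob",   "is_public": False},
    {"id": "d3", "owner": "bob",   "is_public": True},   # shared rulebook
]

visible = [d["id"] for d in docs if is_visible(d, "alice")]
print(visible)  # ['d1', 'd3']
```

LangChain's FAISS store accepts a similar callable (or a metadata dict) as the `filter` argument to `similarity_search`, which is one way such per-user filtering can be wired in.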
- **Embeddings, Vector Storage and Retrieval:**
  - Implement vector embeddings using `LangChain` components to convert documents into vector representations.
  - The open-source `mxbai-embed-large` model, a lightweight and efficient embedding model, is used to generate embeddings.
  - Use `FAISS` for efficient vector storage and retrieval of user-specific and public documents.
  - Integrate similarity search and document retrieval with Gemma-based LLM responses.
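Conceptually, the retrieval described above embeds documents and queries into vectors and returns the stored chunks closest to the query. A toy, dependency-free illustration of that idea: here `toy_embed` (a bag-of-letters counter) stands in for `mxbai-embed-large`, and a linear scan stands in for FAISS:

```python
# Toy illustration of embedding + similarity search. The real system
# uses mxbai-embed-large (via Ollama) and FAISS instead of these stand-ins.
import math

def toy_embed(text: str) -> list[float]:
    # Hypothetical stand-in embedding: letter-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# "Vector store": map each chunk to its embedding.
store = {t: toy_embed(t) for t in ["refund policy", "setup guide"]}

# Retrieval: embed the query, return the nearest stored chunk.
query = toy_embed("how do refunds work")
best = max(store, key=lambda t: cosine(store[t], query))
print(best)  # prints: refund policy
```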
- **FastAPI Backend:**
  - Build a FastAPI backend to handle file uploads, document processing, user authentication, and streaming LLM responses.
  - Integrate with the 'LLM System' module to handle LLM tasks.
  - Provide status updates to the UI for long-running tasks.
  - Implement Server-Sent Events (`SSE`) for real-time streaming of LLM responses to the frontend, using the NDJSON format for data transfer.
  - Provide the UI with retrieved documents and metadata for verification of responses.
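The NDJSON framing used for streaming is simple to picture: each event is one JSON object per line. A minimal sketch (event names like `token` and `sources` are illustrative, not the backend's exact schema):

```python
# Sketch of NDJSON streaming: one JSON object per line.
import json

def ndjson_stream(events):
    """Serialize events as newline-delimited JSON, one per line."""
    for event in events:
        yield json.dumps(event) + "\n"

events = [
    {"type": "sources", "docs": ["manual.pdf"]},   # retrieved docs first
    {"type": "token", "text": "Hello"},            # then streamed tokens
    {"type": "token", "text": " world"},
]
payload = "".join(ndjson_stream(events))

# The client splits on newlines and json.loads each line:
decoded = [json.loads(line) for line in payload.splitlines()]
```

On the FastAPI side, such a generator can be returned via `StreamingResponse(..., media_type="application/x-ndjson")`; the exact event schema used by the backend may differ.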
- **LLM System:**
  - Modular `LLM System` using `LangChain` components for:
    - Document Ingestion: Load files and process them into document chunks.
    - Vector Embedding: Convert documents into vector representations.
    - History Summarization: Summarize user session history for querying vector embeddings and retrieving relevant documents.
    - Document Retrieval: Fetch relevant documents based on the standalone query and the user's metadata filters.
    - History Management: Maintain session history for context-aware interactions.
    - Response Generation: Generate context-aware responses using the LLM.
    - Tracing: Enable tracing of LLM interactions using `LangSmith` for debugging and monitoring.
    - Models: Use `Ollama` to run the Gemma-3 LLM and mxbai embeddings locally for inference, ensuring low latency and privacy.
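The History Summarization step above can be pictured as condensing the chat history plus the new message into one standalone retrieval query. A plain-Python sketch of the prompt-building half (the wording and function name are hypothetical, not the project's actual prompt):

```python
# Hypothetical sketch: build the prompt that asks the LLM to rewrite a
# follow-up question as a standalone query for the retriever.

def build_condense_prompt(history: list[tuple[str, str]], question: str) -> str:
    lines = [f"{role}: {msg}" for role, msg in history]
    return (
        "Rewrite the follow-up question as a standalone question.\n"
        "Chat history:\n" + "\n".join(lines) +
        f"\nFollow-up: {question}\nStandalone question:"
    )

prompt = build_condense_prompt(
    [("user", "What is FAISS?"), ("ai", "A vector index.")],
    "How do I install it?",
)
```

In the real system this prompt would be sent to the LLM, and the rewritten standalone question would then drive the vector-store retrieval.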
- **Dockerization:**
  - Create a dynamic `Docker` setup that works as either a development or deployment environment.
  - Use a `Dockerfile` to manage both the FastAPI and Streamlit servers in a single container (mainly due to Hugging Face Spaces limitations).
- LangChain
- FastAPI
- Streamlit
- Docker
- Ollama
- Gemma-3
- mxbai-embed-large
- FAISS
- SQLite-3
- LangSmith
- bcrypt
Others:
- Hugging Face Spaces:
  - Deploy the project in a Docker container using the Dockerfile.
- GitHub Actions and Branch Protection:
  - Auto-deploy the repository to Hugging Face Spaces.
  - Check the code for any secret leaks.
  - Fail the commit on any secret leak.
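The secret-leak check amounts to scanning committed text for strings that look like credentials. This toy sketch uses a single illustrative regex; a real pipeline would typically rely on a dedicated scanner (e.g. gitleaks) wired into GitHub Actions:

```python
# Toy illustration of a secret scan. The pattern below is a hypothetical
# example, not the rule set actually used by this repository's CI.
import re

SECRET_PATTERNS = [
    # e.g. API_KEY = "abcd1234abcd1234abcd"
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]"),
]

def find_leaks(text: str) -> list[str]:
    """Return every line that matches a known secret pattern."""
    return [line for line in text.splitlines()
            if any(p.search(line) for p in SECRET_PATTERNS)]
```

A CI job would run such a scan over the diff and exit non-zero when `find_leaks` returns anything, failing the commit.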
There are two ways to run this project: directly using a virtual environment, or using the Dockerfile.
- Clone the repository:

  ```bash
  git clone --depth 1 https://github.com/Bbs1412/rag-with-gemma3.git
  ```

- Create a virtual environment and install dependencies:

  ```bash
  # Create environment:
  python -m venv venv

  # Activate environment:
  source venv/bin/activate      # On Linux/Mac
  # venv\Scripts\activate       # On Windows

  # Install dependencies:
  pip install -r requirements.txt
  ```

- (Optional) If you want to use LangSmith tracing, create a `.env` file in the `server` directory and add these credentials:

  ```bash
  # ./server/.env
  LANGCHAIN_TRACING_V2=true
  LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
  LANGCHAIN_API_KEY="<paste_your_api_key_here>"
  LANGCHAIN_PROJECT="rag-with-gemma3"
  ```

- Start the FastAPI server:

  ```bash
  cd server
  uvicorn server:app

  # For development with hot-reloading:
  # uvicorn server:app --reload --port 8000
  ```

- Start the Streamlit server:

  ```bash
  cd ..
  streamlit run app.py
  ```

- You can now access these servers:
  - FastAPI backend at http://localhost:8000
  - Streamlit frontend at http://localhost:8501
  - FastAPI Swagger UI at http://localhost:8000/docs
The Dockerfile is coded dynamically to support both development and deployment environments.

- Development:
  - The project uses `http://host.docker.internal:11434` as the Ollama server for local inference, so that Ollama models already present on the host machine are accessible from the Docker container.
  - In this environment, all three ports {8000: FastAPI, 8501: Streamlit, 11434: Ollama} are exposed for easy access.
- Deployment:
  - The project uses Google `Gemini-2.0-Flash-Lite` as the LLM and `text-embedding-004` as the embedding model, primarily due to deployment and API limitations of the Gemma3 model.
  - In this environment, only port 7860 is exposed, for the Streamlit frontend.
- Build the Docker image:

  ```bash
  docker build -t bbs/rag-with-gemma3:dev --build-arg ENV_TYPE=dev .
  ```

- Create a Docker container:

  ```bash
  # The four LANGCHAIN_* variables are optional, for LangSmith tracing.
  # Ports: 8000 (FastAPI), 8501 (Streamlit), 11434 (Ollama).
  docker create --name rag-gemma-cont-dev \
      -e ENV_TYPE=dev \
      -e LANGCHAIN_TRACING_V2=true \
      -e LANGCHAIN_ENDPOINT="https://api.smith.langchain.com" \
      -e LANGCHAIN_API_KEY="<paste_your_api_key_here>" \
      -e LANGCHAIN_PROJECT="rag-with-gemma3" \
      -p 8000:8000 -p 8501:8501 -p 11434:11434 \
      bbs/rag-with-gemma3:dev
  ```

- Start the Docker container:

  ```bash
  docker start -a rag-gemma-cont-dev
  ```

- You can now access these servers:
  - FastAPI backend at http://localhost:8000
  - Streamlit frontend at http://localhost:8501
- Build the Docker image:

  ```bash
  docker build -t bbs/rag-with-gemma3:prod --build-arg ENV_TYPE=deploy .
  ```

- Create a Docker container:

  ```bash
  # The four LANGCHAIN_* variables are optional, for LangSmith tracing.
  # GOOGLE_API_KEY is required in this environment.
  # Only port 7860 is exposed.
  docker create --name rag-gemma-cont-prod \
      -e ENV_TYPE=deploy \
      -e LANGCHAIN_TRACING_V2=true \
      -e LANGCHAIN_ENDPOINT="https://api.smith.langchain.com" \
      -e LANGCHAIN_API_KEY="<paste_your_api_key_here>" \
      -e LANGCHAIN_PROJECT=deployed-rag-gemma3 \
      -e GOOGLE_API_KEY="<paste_your_google_api_key_here>" \
      -p 7860:7860 \
      bbs/rag-with-gemma3:prod
  ```

- Start the Docker container:

  ```bash
  docker start -a rag-gemma-cont-prod
  ```

- You can now access the project at http://localhost:7860
- To ensure that user data persists when the container is stopped or removed, you can mount a local directory to the container's storage directory.
- Do this by adding the `-v` flag to the `docker create` command:

  ```bash
  docker create --name rag-gemma-cont-dev \
      -e ENV_TYPE=dev \
      -v /path/to/local/storage:/app/storage \
      # Other flags...
  ```
- Remove all cache files:
  - Linux/Mac:

    ```bash
    find . -type d -name "__pycache__" -exec rm -r {} +
    ```

  - Windows:

    ```powershell
    Get-ChildItem -Recurse -Directory -Filter "__pycache__" | Remove-Item -Recurse -Force
    ```

- Clear the SQLite database:

  ```bash
  python sq_db.py
  ```

- Clear all user data:

  ```bash
  # Delete all database indices:
  rm -rf ./user_faiss/

  # Delete all user data:
  rm -rf ./user_data/
  ```
- The Ollama server is configured to run on `http://host.docker.internal:11434` by default, which works out of the box on Windows and macOS.
- On Linux, Docker does not support `host.docker.internal` automatically. To fix this, add the following flag to the `docker create` command:

  ```bash
  --add-host=host.docker.internal:host-gateway
  ```
- To change the LLM or embedding model:
  - Go to the `./server/llm_system/config.py` file. It is the central configuration file for the project; any constant can be changed there and is picked up throughout the project.
  - Two different models are saved in the config, but the same model is used for both response generation and summarization. If you want to change this, update the summarization model in `server.py` (line 63).
- To change the inference device:
  - The LLM model is configured to run on the GPU and the embedding model on the CPU.
  - If you want to use the GPU for embeddings too, change the `num_gpu` parameter in `./server/llm_system/core/database.py` (line 58).
  - `0` means 100% CPU, `-1` means 100% GPU, and any other number specifies how many of the model's layers to offload to the GPU.
  - Delete this parameter if you are unsure of these values or your hardware capabilities; Ollama dynamically offloads layers to the GPU based on available resources.
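For reference, this is roughly where such a parameter could appear, assuming the embeddings are built with `langchain-ollama`'s `OllamaEmbeddings` (which exposes Ollama's `num_gpu` option). Treat this as a hedged configuration sketch, not the project's exact code:

```python
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model="mxbai-embed-large",
    num_gpu=-1,  # 0 = 100% CPU, -1 = 100% GPU, N = offload N layers to GPU
)
# Omitting num_gpu entirely lets Ollama decide based on available resources.
```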
Note
If you are using Docker, make sure to make these changes in the ./docker/dev_* files.
- Run the loader as a module; this ensures that relative imports work correctly in the project:

  ```bash
  cd server
  python -m llm_system.utils.loader
  ```
- Add support for more file formats like DOCX, PPTX, etc.
- Add web-based loading so that any website can be loaded and queried on the go.
- Create docker-compose setup for easier management of multiple containers.
Any contributions or suggestions are welcome!
- This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.
- You may use the code with proper credit to the author.
- Email - bhushanbsongire@gmail.com


