RAG with Gemma-3

This project is a modular Retrieval-Augmented Generation (RAG) system built with Google DeepMind's Gemma 3, served locally using Ollama. It allows users to upload documents (PDF, TXT, Markdown, etc.) and then chat with the content using natural language queries, all processed through a local setup for privacy and full control.

Designed with modularity and performance in mind, the system handles end-to-end workflows including file ingestion, vector embedding, history summarization, document retrieval, context-aware response generation, and streaming replies to a frontend. It supports multi-file embeddings per user, persistent session history and document storage, and live document previews, making it a complete end-to-end RAG pipeline useful for educational or personal assistants.

Check out the live project deployment: HuggingFace Space Deployment Link

🎯 Project Details:

Aim

The core objective of this project is to build a robust RAG system with modern components, a clean modular design, and proper error handling.

Methodology

  1. Build a responsive UI in Streamlit that allows users to upload documents, preview them to verify correctness, and interact with them.
  2. Use FastAPI to build a backend that handles file uploads, document processing, user authentication, and streaming LLM responses.
  3. Build a modular LLM system using LangChain components for chains, embeddings, retrievers, vector storage, history management, output parsers, and overall LLM orchestration.
  4. Integrate the locally hosted Gemma-3 LLM using Ollama for inference.
  5. Use FAISS for efficient vector storage, similarity search, and user-specific document storage and retrieval.
  6. Use SQLite-3 for user management, authentication, and data control.
  7. Create a dynamic Docker setup that can be built for either a development or a deployment environment.
  8. Deploy the project on Hugging Face Spaces for easy access and demonstration.

Note

Due to hosting limitations of Gemma3, the Hugging Face Space deployment uses Google Gemini-2.0-Flash-Lite as the LLM backend.

RAG Samples:

  • Q: Highest possible grade: RAG Sample Q1 RAG Sample A1
  • Q: Formatted Output: RAG Sample Q2

Features

  • User Authentication:

    • Authenticate users using a SQLite-3 database with bcrypt-based password hashing and salting. User Registration Screenshot
    • Store user data securely and automatically clear stale session data.
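The salted-hashing flow above can be sketched as follows. The project uses bcrypt; this dependency-free stand-in uses `hashlib.pbkdf2_hmac` from the standard library so the sketch runs anywhere (function names are illustrative, not the project's actual API):

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    # Stand-in for bcrypt.hashpw(password, bcrypt.gensalt()):
    # a fresh random salt per user, stored alongside the digest.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    # Stand-in for bcrypt.checkpw(): recompute and compare in constant time.
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)
```

Storing the salt and digest (rather than the password) in the SQLite users table is what keeps credentials safe even if the database file leaks.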
  • UI and User Controls:

    • A responsive UI built with Streamlit. It provides a chat interface for users to ask questions about their documents, get file previews, and receive context-aware responses. User File Preview
    • User-uploaded files and their corresponding data are tracked in a SQLite-3 database.
    • Allow users to delete their uploaded documents and manage their session history. User Chat Screenshot
    • Note: File previews are cached for 10 minutes, so even after deletion, a file preview may remain available for that duration.
    • Works with FastAPI SSE to show real-time LLM responses, along with retrieved documents and metadata for verification. Source Documents Screenshot
    • The UI also supports thinking models, showing the LLM's thought process while it generates responses. Thinking Model Screenshot
  • User-wise document management:

    • Support multi-file embeddings per user, allowing users to upload multiple documents and retrieve relevant information based on their queries.
    • Some documents can also be added as public documents, which can be accessed by all users (like shared rulebooks, manuals, or documentation).
  • Embeddings, Vector Storage and Retrieval:

    • Implement vector embeddings using LangChain components to convert documents into vector representations.
    • The open-source mxbai-embed-large model, a lightweight and efficient embedding model, is used for generating embeddings.
    • Use FAISS for efficient vector storage and retrieval of user-specific + public documents.
    • Integrate similarity search and document retrieval with Gemma-based LLM responses.
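A minimal, dependency-free sketch of the retrieval idea above (the project uses FAISS via LangChain; here plain cosine similarity over a tiny in-memory index stands in for the FAISS store, and a metadata filter enforces the user-specific + public split):

```python
from math import sqrt

# Tiny in-memory "index": each entry has an embedding vector and metadata,
# mirroring the user-specific + public document split (illustrative data).
index = [
    {"text": "alice notes", "vec": [1.0, 0.0], "meta": {"user": "alice", "public": False}},
    {"text": "bob notes",   "vec": [0.9, 0.1], "meta": {"user": "bob",   "public": False}},
    {"text": "rulebook",    "vec": [0.0, 1.0], "meta": {"user": "admin", "public": True}},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def retrieve(query_vec, user_id, k=2):
    # Only the user's own documents and public ones are searchable.
    visible = [d for d in index
               if d["meta"]["user"] == user_id or d["meta"]["public"]]
    visible.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return visible[:k]
```

For example, `retrieve([1.0, 0.0], "alice")` returns alice's own notes and the public rulebook, never bob's documents.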
  • FastAPI Backend:

    • Build a FastAPI backend to handle file uploads, document processing, user authentication, and streaming LLM responses.
    • Integrate with the 'LLM System' module to handle LLM tasks.
    • Provide status updates to the UI for long-running tasks: Step By Step Updates Screenshot
    • Implement Server-Sent Events (SSE) for real-time streaming of LLM responses to the frontend, using the NDJSON format for data transfer. SSE Streaming Screenshot
    • Provide the UI with retrieved documents and metadata for verification of responses.
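The NDJSON framing used for the streamed responses can be sketched like this (the field names are assumptions for illustration, not the project's exact wire format):

```python
import json

def ndjson_stream(tokens, sources):
    # Server side: yield one JSON object per line -- token chunks as they
    # arrive from the LLM, then the retrieved source documents, then a
    # done marker. Each line is independently parseable.
    for t in tokens:
        yield json.dumps({"type": "token", "data": t}) + "\n"
    yield json.dumps({"type": "sources", "data": sources}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"

def parse_ndjson(stream):
    # Client side: split on newlines and decode each event separately.
    return [json.loads(line) for line in stream if line.strip()]
```

NDJSON keeps each event self-contained, so the frontend can render tokens immediately without waiting for the full response body.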
  • LLM System:

    • Modular LLM System using LangChain components for:
      1. Document Ingestion: Load files and process them into document chunks.
      2. Vector Embedding: Convert documents into vector representations.
      3. History Summarization: Summarize user session history for querying vector embeddings and retrieving relevant documents.
      4. Document Retrieval: Fetch relevant documents based on standalone query and user's metadata filters.
      5. History Management: Maintain session history for context-aware interactions.
      6. Response Generation: Generate context-aware responses using the LLM.
      7. Tracing: Enable tracing of LLM interactions using LangSmith for debugging and monitoring.
      8. Models: Use Ollama to run the Gemma-3 LLM and mxbai embeddings locally for inference, ensuring low latency and privacy.
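Condensed into plain Python, the orchestration above looks roughly like this (hypothetical stubs for illustration; the project wires these steps together with LangChain chains instead):

```python
def rag_answer(question, history, retriever, llm):
    # History summarization: condense history + question into a
    # standalone query suitable for vector search.
    standalone = llm(f"Rewrite as a standalone query: {history} | {question}")
    # Document retrieval on the standalone query.
    docs = retriever(standalone)
    # Response generation with the retrieved context.
    context = "\n\n".join(docs)
    return llm(f"Answer from context:\n{context}\n\nQuestion: {question}")

# Stubs standing in for the Ollama-served LLM and the FAISS retriever:
fake_llm = lambda prompt: f"LLM[{prompt}]"
fake_retriever = lambda query: ["chunk-A", "chunk-B"]
```

The standalone-query rewrite is what lets follow-up questions ("what about its second clause?") still retrieve the right chunks from the vector store.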
  • Dockerization:

    • Create a dynamic Docker setup for easy deployment as either a development or deployment environment.
    • Use a single Dockerfile to manage both the FastAPI and Streamlit servers in one container (mainly due to Hugging Face Spaces limitations).

🧑‍💻 Tech Stack

  • 🦜️ LangChain
  • ⚡ FastAPI
  • 👑 Streamlit
  • 🐋 Docker
  • 🦙 Ollama
    • Gemma-3
    • mxbai-embed-large
  • ♾️ FAISS
  • 🪶 SQLite-3
  • 🛠️ LangSmith
  • 🔐 bcrypt

Others:

  • 🤗 Hugging Face Spaces:
    • Deploy the project in a Docker container using Dockerfile.
  • :octocat: GitHub actions and Branch Protection:
    • Auto-deploy the repository to Hugging Face Spaces.
    • Scan the code for secret leaks.
    • Fail the commit if any secret leak is found.

🛠️ Installation

There are two ways to run this project: either directly in a virtual environment or via the Dockerfile.

Virtual Environment

  1. Clone the repository:

    git clone --depth 1 https://github.com/Bbs1412/rag-with-gemma3.git
  2. Create virtual environment and install dependencies:

    # Create environment:
    python -m venv venv
    
    # Activate environment:
    source venv/bin/activate  # On Linux/Mac
    # or
    venv\Scripts\activate  # On Windows
    
    # Install dependencies:
    pip install -r requirements.txt
  3. (Optional) If you want to use LangSmith tracing, create a .env file in the server directory and add these credentials:

    # ./server/.env
    LANGCHAIN_TRACING_V2=true
    LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
    LANGCHAIN_API_KEY="<paste_your_api_key_here>"
    LANGCHAIN_PROJECT="rag-with-gemma3"
  4. Start the FastAPI server:

    cd server
    uvicorn server:app
    # For development with hot-reloading
    # uvicorn server:app --reload --port 8000
  5. Start the Streamlit server:

    cd ..
    streamlit run app.py
  6. You can now access the servers: FastAPI at http://localhost:8000 and Streamlit at http://localhost:8501.

πŸ‹ Docker

The Dockerfile is written dynamically to support both development and deployment environments.

  1. Development:
    • The project uses http://host.docker.internal:11434 as the Ollama server for local inference.
    • This ensures that Ollama models already present on the host machine are accessible from the Docker container.
    • In this environment, all three ports {8000: FastAPI, 8501: Streamlit, 11434: Ollama} are exposed for easy access.
  2. Deployment:
    • Project uses Google Gemini-2.0-Flash-Lite as the LLM and text-embedding-004 as embedding model.
    • Primarily due to deployment and API limitations of Gemma3 model.
    • In this env, only port 7860 is exposed for the Streamlit frontend.

Development:

  1. Build the Docker image:

    docker build -t BBS/rag-with-gemma3:dev --build-arg ENV_TYPE=dev .
  2. Create a Docker container:

    # The four LANGCHAIN_* vars are optional (LangSmith tracing); the port
    # mappings cover FastAPI (8000), Streamlit (8501), and Ollama (11434).
    docker create --name rag-gemma-cont-dev \
        -e ENV_TYPE=dev \
        -e LANGCHAIN_TRACING_V2=true \
        -e LANGCHAIN_ENDPOINT="https://api.smith.langchain.com" \
        -e LANGCHAIN_API_KEY="<paste_your_api_key_here>" \
        -e LANGCHAIN_PROJECT="rag-with-gemma3" \
        -p 8000:8000 -p 8501:8501 -p 11434:11434 \
        BBS/rag-with-gemma3:dev
  3. Start the Docker container:

    docker start -a rag-gemma-cont-dev
  4. You can now access the servers: FastAPI at http://localhost:8000 and Streamlit at http://localhost:8501.

Deployment:

  1. Build the Docker image:

    docker build -t BBS/rag-with-gemma3:prod --build-arg ENV_TYPE=deploy .
  2. Create a Docker container:

    # GOOGLE_API_KEY is required; the four LANGCHAIN_* vars are optional
    # (LangSmith tracing). Only port 7860 is exposed.
    docker create --name rag-gemma-cont-prod \
        -e ENV_TYPE=deploy \
        -e LANGCHAIN_TRACING_V2=true \
        -e LANGCHAIN_ENDPOINT="https://api.smith.langchain.com" \
        -e LANGCHAIN_API_KEY="<paste_your_api_key_here>" \
        -e LANGCHAIN_PROJECT=deployed-rag-gemma3 \
        -e GOOGLE_API_KEY="<paste_your_google_api_key_here>" \
        -p 7860:7860 \
        BBS/rag-with-gemma3:prod
  3. Start the Docker container:

    docker start -a rag-gemma-cont-prod
  4. You can now access the Project at http://localhost:7860

🛡️ Extra Measures

Mount Storage:

  • To ensure that user data is persistent and not lost when the container is stopped or removed, you can mount a local directory to the container's storage directory.
  • You can do this by adding the -v flag to the docker create command:
  • docker create --name rag-gemma-cont-dev \
        -e ENV_TYPE=dev \
        -v /path/to/local/storage:/app/storage \
        BBS/rag-with-gemma3:dev   # plus the other flags shown earlier

Reset Project:

  1. Remove all cache files:

    • Linux/Mac:
      find . -type d -name "__pycache__" -exec rm -r {} +
    • Windows:
      Get-ChildItem -Recurse -Directory -Filter "__pycache__" | Remove-Item -Recurse -Force
  2. Clear the SQLite database:

    python sq_db.py
  3. Clear all user data:

    # Delete all database indices:
    rm -rf ./user_faiss/
    # Delete all user data:
    rm -rf ./user_data/

Using the Linux host machine's Ollama in the container:

  • The Ollama server is configured to run on http://host.docker.internal:11434 by default, which works out-of-the-box on Windows and macOS.
  • On Linux, Docker does not support host.docker.internal automatically.
  • To fix this, add the following flag to the docker create command:
    --add-host=host.docker.internal:host-gateway

Ollama Models:

  • To change LLM or Embedding model:

    • Go to the ./server/llm_system/config.py file.
    • It is the central configuration file for the project.
    • Any constant used in the project can be changed there.
    • Two different models are saved in the config, but the same model is used for both response generation and summarization; to change this, update the summarization model in server.py (≈ line 63).
  • To change inference device:

    • The LLM model is configured to run on the GPU and the embedding model on the CPU.
    • 0 means 100% CPU, -1 means 100% GPU, and any other number specifies how many of the model's layers are offloaded to the GPU.
    • Delete this parameter if you are unsure of these values or your hardware capabilities; Ollama dynamically offloads layers to the GPU based on available resources.
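For orientation, the relevant constants in config.py might look roughly like this (the names and values below are assumptions for illustration; check the actual file):

```python
# ./server/llm_system/config.py (illustrative sketch)
LLM_MODEL = "gemma3"                 # response-generation model served by Ollama
EMBEDDING_MODEL = "mxbai-embed-large"
LLM_NUM_GPU = -1                     # -1 = run fully on GPU
EMBEDDING_NUM_GPU = 0                # 0 = run fully on CPU
```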

Note

If you are using Docker, make these changes in the ./docker/dev_* files instead.

To test some sub-components:

  • Run them as modules from the server directory so that relative imports resolve correctly:
    cd server
    python -m llm_system.utils.loader

🚀 Future Work

  • Add support for more file formats like DOCX, PPTX, etc.
  • Add web-based loading so that any website can be loaded and queried on the go.
  • Create docker-compose setup for easier management of multiple containers.

🤝 Contributions

Any contributions or suggestions are welcome!

📜 License

Code-License

  • This project is licensed under the GNU General Public License v3.0
  • See the LICENSE file for details.
  • You can use the code with proper credits to the author.

📧 Contact
