Form Processing RAG App

A Retrieval-Augmented Generation (RAG) pipeline built with Streamlit, PostgreSQL (with pgvector), and Llama 3 via the Groq API. Users upload documents; the app extracts and chunks the text, embeds the chunks with sentence-transformers, stores the embeddings in PostgreSQL, and retrieves the most relevant chunks as context for LLM-based Q&A.

Features

Document Upload: Supports PDF, DOCX, images, and text files.
Text Extraction: OCR for images, parsing for PDFs/DOCX/text.
Chunking & Embedding: Splits text and generates 384-dim embeddings using sentence-transformers.
Vector Database: Stores embeddings in PostgreSQL with pgvector extension.
Semantic Search: Retrieves relevant chunks using vector similarity.
LLM Q&A: Uses Llama 3 via Groq API for question answering over retrieved context.
Named Entity Recognition: Extracts and displays entities in both JSON and human-readable formats.
Data Management: Sidebar button to clear all data (for research phase).
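
The chunking step can be pictured with a minimal sketch. It assumes word-based windows with a fixed overlap; `chunk_text`, `chunk_size`, and `overlap` are hypothetical names and defaults, not taken from this repository, and the app's actual splitter may work differently.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (illustrative only).

    chunk_size and overlap are counted in words; both defaults are assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlapping windows keep a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.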

Tech Stack

Frontend: Streamlit
Backend: Python
Database: PostgreSQL (with pgvector, e.g., Supabase)
Embeddings: sentence-transformers
LLM: Llama 3 via Groq API
OCR: Mistral API (for images)

Setup

Clone the repository:

git clone https://github.com/Daramanohar/rag-docs-engine.git
cd rag-docs-engine

Install dependencies:
```
pip install -r requirements.txt
```

Configure environment variables:

Add your API keys and database credentials to .streamlit/secrets.toml or the Streamlit Cloud secrets UI:

MISTRAL_API_KEY = "your-mistral-key"
GROQ_API_KEY = "your-groq-key"
DB_HOST = "your-db-host"
DB_NAME = "your-db-name"
DB_USER = "your-db-user"
DB_PASSWORD = "your-db-password"
DB_PORT = "5432"
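
A minimal sketch of how these secrets might be turned into database connection parameters. `build_db_params` is a hypothetical helper, not part of this repository; it assumes the key names listed above and returns keyword arguments in the shape that a driver such as psycopg2 accepts in `psycopg2.connect(**params)`.

```python
from collections.abc import Mapping

# Key names as listed in the secrets file above.
REQUIRED_KEYS = ("DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD", "DB_PORT")


def build_db_params(secrets: Mapping) -> dict:
    """Validate the secrets and map them to connection keyword arguments."""
    missing = [key for key in REQUIRED_KEYS if key not in secrets]
    if missing:
        raise KeyError("Missing secrets: " + ", ".join(missing))
    return {
        "host": secrets["DB_HOST"],
        "dbname": secrets["DB_NAME"],
        "user": secrets["DB_USER"],
        "password": secrets["DB_PASSWORD"],
        "port": int(secrets["DB_PORT"]),  # stored as a string in the TOML above
    }
```

Inside the app this could be called as `build_db_params(st.secrets)`, since `st.secrets` supports dictionary-style access. Failing early on a missing key tends to give a clearer error than a connection failure later on.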

Enable pgvector extension in your PostgreSQL database (if not already enabled):
```
CREATE EXTENSION IF NOT EXISTS vector;
```
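
Once the extension is enabled, the chunks need a table with a `vector` column. The sketch below is an assumption about how that might look, not the schema this app actually creates: the table name `document_chunks`, its columns, and the helper `to_pgvector_literal` are hypothetical. The dimension 384 comes from the feature list, and `<=>` is pgvector's cosine distance operator.

```python
EMBEDDING_DIM = 384  # matches the 384-dim embeddings mentioned in Features

# Hypothetical table layout for storing chunks and their embeddings.
CREATE_TABLE_SQL = f"""
CREATE TABLE IF NOT EXISTS document_chunks (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector({EMBEDDING_DIM}) NOT NULL
);
"""

# Parameterized nearest-neighbour query; smaller cosine distance = more similar.
SEARCH_SQL = """
SELECT content, embedding <=> %s::vector AS distance
FROM document_chunks
ORDER BY distance
LIMIT %s;
"""


def to_pgvector_literal(embedding, dim: int = EMBEDDING_DIM) -> str:
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    values = [float(x) for x in embedding]
    if len(values) != dim:
        raise ValueError(f"expected {dim} values, got {len(values)}")
    return "[" + ",".join(repr(v) for v in values) + "]"
```

With a DB-API driver, the search would be run roughly as `cursor.execute(SEARCH_SQL, (to_pgvector_literal(query_embedding), 5))`, where `query_embedding` is the embedded question.
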
Run the app:
```
streamlit run streamlit_app.py
```

Usage

Upload a document (PDF, DOCX, image, or text).
The app extracts and chunks the text, generates embeddings, and stores them in the database.
Ask questions about the document in the chatbot tab.
View named entities in both JSON and human-readable formats.
Use the sidebar to clear all data (for research/testing phase).
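
The retrieve-then-ask step behind the chatbot tab can be sketched in plain Python. This is an in-memory illustration under stated assumptions: in the app the ranking is done by pgvector inside PostgreSQL, and `cosine_similarity`, `top_k_chunks`, `build_prompt`, and the prompt wording are all hypothetical rather than taken from the repository.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, indexed_chunks, k: int = 3) -> list[str]:
    """Return the text of the k chunks most similar to the query.

    indexed_chunks is a list of (text, embedding) pairs.
    """
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks and the user's question into one LLM prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The resulting prompt string is what would be sent to Llama 3 through the Groq API; numbering the chunks makes it easier to ask the model to cite which passage it used.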

Deployment

Deploy on Streamlit Cloud or your own server.
Set secrets in the Streamlit Cloud UI for API keys and DB credentials.
Make sure your database is accessible from the deployment environment.
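
A quick way to check database reachability from the deployment environment is a plain TCP connection test. This is a minimal sketch using only the standard library; `can_reach` is a hypothetical helper, and a successful connection only shows that the port is open, not that the credentials are valid.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `can_reach("your-db-host", 5432)` run from the deployed environment indicates whether a firewall or network policy is blocking the database port.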

Notes

Research Phase: The "Clear All Data" button wipes the entire database for all users. This is simple for prototyping, but not suitable for production.
Production Recommendation: For multi-user support, store a user/session ID with each chunk and filter queries/clearing by this ID.
OCR Model: Requires Mistral API for image OCR.
LLM Model: Uses Llama 3 via Groq API for Q&A.
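
The production recommendation above could look roughly like the following sketch. It assumes a hypothetical `document_chunks` table with `content`, `embedding`, and an added `session_id` column; none of these names, nor the helper functions, come from this repository. Each helper returns a parameterized SQL string plus its parameters, ready for a DB-API `cursor.execute(sql, params)` call.

```python
import uuid


def new_session_id() -> str:
    """Generate a random identifier to tag one user's session."""
    return str(uuid.uuid4())


def scoped_search_query(session_id: str, query_vector_literal: str, k: int = 5):
    """Similarity search restricted to chunks stored by one session."""
    sql = (
        "SELECT content FROM document_chunks "
        "WHERE session_id = %s "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    return sql, (session_id, query_vector_literal, k)


def scoped_clear_query(session_id: str):
    """Delete only the chunks that belong to one session."""
    return "DELETE FROM document_chunks WHERE session_id = %s", (session_id,)
```

In a Streamlit app the identifier could be kept in `st.session_state` so that it persists across reruns within the same browser session. Scoping the delete this way means "Clear All Data" removes only the current user's chunks.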

Future Improvements

User/session-specific chunk storage and clearing
User authentication
More robust error handling and logging
Support for additional file types and languages
