RAG Implementation for Document Processing

This repository implements a Retrieval-Augmented Generation (RAG) system for document processing in two variants: one built on the LangChain framework and one without it. It includes scripts for creating a vector database, processing documents, and answering questions. Documents are processed with Docling, and the embeddings are stored in PostgreSQL using the PGVector extension.

Project Structure

The project structure is as follows:

.env: Stores environment variables for database connection and other configurations.
.gitignore: Specifies intentionally untracked files that Git should ignore.
create_vector_db_without_langchain.py: Script to create the vector database without LangChain.
failed_files_log.csv: Log file for files that failed during processing.
rag_db_without_langchain_100_rows.csv: Example output of the vector database when created without LangChain.
testing_langchain_rag.py: Script to create the vector database with LangChain.
annotated_rag_documents/: Directory containing annotated PDF documents.
test_rag_documents/: Directory containing documents used for testing.

Key Components

Configuration: The Config class in create_vector_db_without_langchain.py handles configuration parameters loaded from environment variables.
Document Processing: The DocumentProcessor class in testing_langchain_rag.py handles loading models, processing files, chunking, embedding, and storing documents in the vector store.
Vector Database: The project uses PGVector, a PostgreSQL extension for vector similarity search.
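
The chunking and similarity-search steps named above can be illustrated with a small, dependency-free sketch. It is not the repository's code: the character-window chunking, the function names chunk_text, cosine_distance, and rank_chunks, and the toy vectors are all assumptions made for illustration. In the real pipeline the embeddings come from the configured model, and ranking happens inside PostgreSQL, where PGVector's `<=>` operator computes the same cosine distance.

```python
import math
from typing import List, Sequence


def chunk_text(text: str, size: int, overlap: int) -> List[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_chunks(query: Sequence[float],
                chunk_vectors: Sequence[Sequence[float]],
                k: int) -> List[int]:
    """Return the indices of the k chunk vectors closest to the query."""
    order = sorted(range(len(chunk_vectors)),
                   key=lambda i: cosine_distance(query, chunk_vectors[i]))
    return order[:k]
```

For example, chunk_text("abcdefghij", 4, 2) yields the windows "abcd", "cdef", "efgh", and "ghij". The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing some text twice.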

Usage

Create the vector database:
```
python create_vector_db_without_langchain.py
```
This script reads documents from the directory specified by the DOCUMENT_DIRECTORY environment variable (default: test_rag_documents), processes them, and stores the embeddings in the PGVector database.
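
The flow described above (process each file, store rows in batches, record failures) might be organized like the sketch below. It is an illustration under stated assumptions, not the script's actual code: process_file and insert_rows are hypothetical callables standing in for the Docling/embedding step and the database insert, and the two-column log layout (file, error) is assumed.

```python
import csv
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, str]  # hypothetical row shape: (source file, chunk text)


def process_all(paths: Iterable[str],
                process_file: Callable[[str], List[Row]],
                insert_rows: Callable[[List[Row]], None],
                batch_size: int,
                failed_log: str) -> int:
    """Process every file, insert rows in batches, and log failures to CSV.

    Returns the number of rows handed to `insert_rows`.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    pending: List[Row] = []
    inserted = 0
    with open(failed_log, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "error"])  # assumed log columns
        for path in paths:
            try:
                pending.extend(process_file(path))
            except Exception as exc:
                # One unreadable document should not abort the whole run.
                writer.writerow([path, repr(exc)])
                continue
            # Flush full batches as soon as enough rows have accumulated.
            while len(pending) >= batch_size:
                insert_rows(pending[:batch_size])
                inserted += batch_size
                pending = pending[batch_size:]
        if pending:
            # Flush the final, possibly smaller, batch.
            insert_rows(pending)
            inserted += len(pending)
    return inserted
```

Passing the two steps in as callables keeps the loop testable without a database or a model; in practice batch_size would come from DB_INSERT_BATCH_SIZE and failed_log from FAILED_FILES_LOG.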
Test the RAG implementation:
```
python testing_langchain_rag.py
```
This script tests the RAG system: it queries the database and evaluates the results.
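
Querying the database for the chunks nearest to a question embedding could look roughly like the sketch below. The table name document_chunks and the columns content and embedding are hypothetical, since the repository's schema is not shown here; the `<=>` operator is PGVector's cosine-distance operator, and the bracketed text form is how PGVector accepts a vector as a string.

```python
from typing import Sequence, Tuple


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding in PGVector's text form, e.g. '[1.0,0.5,-2.0]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


def build_search_query(embedding: Sequence[float],
                       k: int = 5,
                       table: str = "document_chunks") -> Tuple[str, tuple]:
    """Return (sql, params) selecting the k chunks closest to `embedding`.

    The table name is interpolated into the SQL, so it is restricted to a
    plain identifier; the vector and the limit are passed as parameters.
    """
    if not table.isidentifier():
        raise ValueError("table must be a plain identifier")
    if k <= 0:
        raise ValueError("k must be positive")
    sql = (
        f"SELECT content, embedding <=> %s::vector AS distance "
        f"FROM {table} ORDER BY distance LIMIT %s"
    )
    return sql, (to_vector_literal(embedding), k)
```

With a driver such as psycopg 3, the returned pair could be passed to conn.execute(sql, params) and the rows read with fetchall(). The question must be embedded with the same model used for the documents, otherwise the distances are not meaningful.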

Environment Variables

The following environment variables are used to configure the application:

DB_HOST: Hostname of the PostgreSQL database server (default: localhost).
DB_PORT: Port number of the PostgreSQL database server (default: 5432).
DB_NAME: Name of the PostgreSQL database (default: rag_lyme_docs).
DB_USER: Username for connecting to the PostgreSQL database (default: postgres).
DB_PASSWORD: Password for connecting to the PostgreSQL database.
DOCUMENT_DIRECTORY: Directory containing the documents to be processed (default: rag_documents).
DOCUMENT_IMAGE_DIRECTORY: Directory to store document images (default: rag_document_images).
MODEL_ID: Identifier of the embedding model to use (default: intfloat/multilingual-e5-large-instruct).
ENCODING_BATCH_SIZE: Batch size for encoding documents (default: 32).
DB_INSERT_BATCH_SIZE: Batch size for inserting data into the database (default: 100).
FAILED_FILES_LOG: Log file for failed files (default: failed_files_log.csv).
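
As a rough illustration of how these variables and defaults could be loaded, the sketch below reads them into a frozen dataclass. It is not the repository's Config class: the field names, the from_env and connection_uri helpers, and the empty-string fallback for DB_PASSWORD (which has no documented default) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional
from urllib.parse import quote


@dataclass(frozen=True)
class Config:
    db_host: str
    db_port: int
    db_name: str
    db_user: str
    db_password: str
    document_directory: str
    document_image_directory: str
    model_id: str
    encoding_batch_size: int
    db_insert_batch_size: int
    failed_files_log: str

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Config":
        """Build a Config from `env` (os.environ if omitted), applying defaults."""
        env = os.environ if env is None else env
        return cls(
            db_host=env.get("DB_HOST", "localhost"),
            db_port=int(env.get("DB_PORT", "5432")),
            db_name=env.get("DB_NAME", "rag_lyme_docs"),
            db_user=env.get("DB_USER", "postgres"),
            # No default is documented; the empty string is an assumption.
            db_password=env.get("DB_PASSWORD", ""),
            document_directory=env.get("DOCUMENT_DIRECTORY", "rag_documents"),
            document_image_directory=env.get(
                "DOCUMENT_IMAGE_DIRECTORY", "rag_document_images"),
            model_id=env.get(
                "MODEL_ID", "intfloat/multilingual-e5-large-instruct"),
            encoding_batch_size=int(env.get("ENCODING_BATCH_SIZE", "32")),
            db_insert_batch_size=int(env.get("DB_INSERT_BATCH_SIZE", "100")),
            failed_files_log=env.get("FAILED_FILES_LOG", "failed_files_log.csv"),
        )

    def connection_uri(self) -> str:
        """Return a libpq-style URI; user and password are percent-encoded."""
        user = quote(self.db_user, safe="")
        password = quote(self.db_password, safe="")
        return (f"postgresql://{user}:{password}"
                f"@{self.db_host}:{self.db_port}/{self.db_name}")
```

Calling Config.from_env() with no argument reads the process environment, so values from a loaded .env file take effect; passing a plain dict instead makes the defaults easy to check in isolation.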
