This repository implements a Retrieval-Augmented Generation (RAG) system for document processing, both with and without the LangChain framework. It includes scripts for creating a vector database, processing documents, and performing question answering. Documents are processed with Docling, and the vector database is built on the PGVector extension for PostgreSQL.
The project structure is as follows:
- .env: Stores environment variables for the database connection and other configuration.
- .gitignore: Specifies intentionally untracked files that Git should ignore.
- create_vector_db_without_langchain.py: Script to create the vector database without LangChain.
- failed_files_log.csv: Log file for files that failed during processing.
- rag_db_without_langchain_100_rows.csv: Example output of the vector database when created without LangChain.
- testing_langchain_rag.py: Script to create the vector database with LangChain.
- annotated_rag_documents/: Directory containing annotated PDF documents.
- test_rag_documents/: Directory containing documents used for testing.
- Configuration: The Config class in create_vector_db_without_langchain.py handles configuration parameters loaded from environment variables.
- Document Processing: The DocumentProcessor class in testing_langchain_rag.py handles loading models, processing files, chunking, embedding, and storing documents in the vector store.
- Vector Database: The project uses PGVector, a PostgreSQL extension for vector similarity search.
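
As an illustration of the vector similarity search that PGVector provides, the sketch below builds a parameterised nearest-neighbour query using pgvector's cosine-distance operator `<=>`. The table name `document_chunks` and the column names `content` and `embedding` are hypothetical and are not taken from this repository. The query is only constructed here, not executed.

```python
from typing import Sequence, Tuple


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_similarity_query(table: str = "document_chunks") -> str:
    """Return SQL that orders rows by cosine distance to a query vector.

    `table`, `content`, and `embedding` are assumed names for illustration.
    `<=>` is pgvector's cosine-distance operator (smaller is more similar).
    The table name is interpolated directly, so it must never come from
    untrusted input; the vector and the limit are passed as parameters.
    """
    return (
        f"SELECT content FROM {table} "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )


def build_query_params(embedding: Sequence[float], top_k: int) -> Tuple[str, int]:
    """Bundle the parameters in the order the placeholders appear."""
    return (to_vector_literal(embedding), top_k)


if __name__ == "__main__":
    print(build_similarity_query())
    print(build_query_params([0.1, 0.2, 0.3], 5))
```

With a driver that uses `%s` placeholders, such as psycopg2, the two pieces would typically be passed together, for example `cursor.execute(build_similarity_query(), build_query_params(vec, 5))`.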
- Create the vector database:

      python create_vector_db_without_langchain.py

  This script reads documents from the directory specified by the DOCUMENT_DIRECTORY environment variable (default: test_rag_documents), processes them, and stores the embeddings in the PGVector database.
- Test the RAG implementation:

      python testing_langchain_rag.py

  This script provides functionality to test the RAG system, query the database, and evaluate the results.
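
The first step above walks a document directory and, judging by failed_files_log.csv, records files that could not be processed. A minimal sketch of that pattern follows. It is not the repository's actual code: the `process_file` callback, the function name `process_directory`, and the CSV column names `file` and `error` are all assumptions made for illustration.

```python
import csv
from pathlib import Path
from typing import Callable, List, Tuple


def process_directory(
    directory: str,
    process_file: Callable[[Path], List[str]],
    failed_log: str = "failed_files_log.csv",
) -> List[str]:
    """Run `process_file` on every file under `directory`.

    Returns the chunks produced by files that succeeded. A file that raises
    is recorded in `failed_log` with its error message, and the run continues.
    """
    chunks: List[str] = []
    failures: List[Tuple[str, str]] = []

    # Sort for a deterministic processing order.
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        try:
            chunks.extend(process_file(path))
        except Exception as exc:  # record the failure and keep going
            failures.append((str(path), str(exc)))

    # The log is rewritten on every run; it holds only a header row
    # when nothing failed. Column names are hypothetical.
    with open(failed_log, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "error"])
        writer.writerows(failures)

    return chunks
```

In a real pipeline, `process_file` would wrap the document conversion and chunking step, and the returned chunks would then be embedded and inserted into the database.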
The following environment variables are used to configure the application:
- DB_HOST: Hostname of the PostgreSQL database server (default: localhost).
- DB_PORT: Port number of the PostgreSQL database server (default: 5432).
- DB_NAME: Name of the PostgreSQL database (default: rag_lyme_docs).
- DB_USER: Username for connecting to the PostgreSQL database (default: postgres).
- DB_PASSWORD: Password for connecting to the PostgreSQL database.
- DOCUMENT_DIRECTORY: Directory containing the documents to be processed (default: rag_documents).
- DOCUMENT_IMAGE_DIRECTORY: Directory to store document images (default: rag_document_images).
- MODEL_ID: Identifier of the embedding model to use (default: intfloat/multilingual-e5-large-instruct).
- ENCODING_BATCH_SIZE: Batch size for encoding documents (default: 32).
- DB_INSERT_BATCH_SIZE: Batch size for inserting data into the database (default: 100).
- FAILED_FILES_LOG: Log file for failed files (default: failed_files_log.csv).
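
The variables above could be loaded along the lines of the following sketch. It is not the repository's actual Config class: the field names and the `from_env` helper are assumptions, and because no default is documented for DB_PASSWORD, an empty string is used here as a placeholder.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Config:
    """Illustrative settings container; defaults mirror the list above."""

    db_host: str
    db_port: int
    db_name: str
    db_user: str
    db_password: str
    document_directory: str
    document_image_directory: str
    model_id: str
    encoding_batch_size: int
    db_insert_batch_size: int
    failed_files_log: str

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Config":
        """Build a Config from `env`, falling back to os.environ."""
        env = os.environ if env is None else env
        return cls(
            db_host=env.get("DB_HOST", "localhost"),
            db_port=int(env.get("DB_PORT", "5432")),
            db_name=env.get("DB_NAME", "rag_lyme_docs"),
            db_user=env.get("DB_USER", "postgres"),
            # No documented default; the empty string is an assumption.
            db_password=env.get("DB_PASSWORD", ""),
            document_directory=env.get("DOCUMENT_DIRECTORY", "rag_documents"),
            document_image_directory=env.get(
                "DOCUMENT_IMAGE_DIRECTORY", "rag_document_images"
            ),
            model_id=env.get(
                "MODEL_ID", "intfloat/multilingual-e5-large-instruct"
            ),
            encoding_batch_size=int(env.get("ENCODING_BATCH_SIZE", "32")),
            db_insert_batch_size=int(env.get("DB_INSERT_BATCH_SIZE", "100")),
            failed_files_log=env.get("FAILED_FILES_LOG", "failed_files_log.csv"),
        )


if __name__ == "__main__":
    print(Config.from_env())
```

Passing an explicit mapping, as in `Config.from_env({"DB_PORT": "6543"})`, makes the loader easy to exercise without touching the real environment.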