Smart RAGs (Retrieval-Augmented Generation) enhance LLM requests with relevant content from your document collections. Index your documents once, then every request automatically retrieves and injects relevant context.
- Documents indexed into ChromaDB as vector embeddings
- Request arrives with model name set to your Smart RAG
- Query embedded using the same embedding model
- Semantic search finds relevant document chunks
- Reranking improves relevance ordering with cross-encoder
- Context injection adds chunks to the system prompt
- Target model receives the augmented request
Client Request → Smart RAG → Embed Query → ChromaDB Search → Rerank
                                 ↓               ↓              ↓
                         nomic-embed-text  Vector Search  Cross-Encoder
                                                                ↓
                                                    Inject into System Prompt
                                                                ↓
                                                          Target Model
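The flow above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not LLM Relay's actual implementation: `augment_request`, `embed`, `search`, and `rerank` are hypothetical names, and the three callables stand in for the embedding model, the ChromaDB query, and the cross-encoder.

```python
from typing import Callable, Sequence

Message = dict[str, str]


def augment_request(
    messages: Sequence[Message],
    embed: Callable[[str], list[float]],
    search: Callable[[list[float]], list[str]],
    rerank: Callable[[str, list[str]], list[str]],
) -> list[Message]:
    """Return a copy of `messages` with retrieved chunks in a system prompt."""
    # Use the most recent user message as the retrieval query.
    query = next((m["content"] for m in reversed(messages) if m["role"] == "user"), "")
    chunks = rerank(query, search(embed(query)))
    if not chunks:
        return list(messages)  # nothing relevant: forward the request unchanged
    context = "<document_context>\n" + "\n---\n".join(chunks) + "\n</document_context>"
    return [{"role": "system", "content": context}, *messages]


# Toy stand-ins so the sketch runs without any external service.
CORPUS = ["Revenue increased 15% year-over-year.", "The office moved in May."]


def embed(text: str) -> list[float]:
    return [float(len(text))]  # placeholder, not a real embedding


def search(vector: list[float]) -> list[str]:
    return list(CORPUS)  # pretend every chunk was returned by the vector search


def rerank(query: str, chunks: list[str]) -> list[str]:
    return [c for c in chunks if "revenue" in c.lower()]  # toy relevance rule


augmented = augment_request(
    [{"role": "user", "content": "What about revenue?"}], embed, search, rerank
)
print(augmented[0]["content"])
```

Passing the three stages in as callables keeps the orchestration separate from whichever embedding provider or reranker is configured.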
- ChromaDB: Required for vector storage (CHROMA_URL environment variable)
In the Admin UI:
- Go to Smart RAGs in the sidebar
- Click New RAG
- Configure:
| Field | Description |
|---|---|
| Name | The model name clients will use (e.g., docs-assistant) |
| Target Model | The underlying model to call with document context |
| Source Path | Path to your documents folder (inside container) |
| Embedding Provider | Local, Ollama, or cloud provider |
| Embedding Model | Model for creating embeddings |
| Max Results | Number of chunks to retrieve |
| Similarity Threshold | Minimum relevance score (0-1) |
| Max Context Tokens | Token limit for injected context |
- Click Save
- Click Index Now to process your documents
RAG Name: docs-assistant
Target Model: anthropic/claude-sonnet-4-5
Source Path: /data/documents
Embedding Provider: ollama:winpc
Embedding Model: nomic-embed-text:latest
Max Results: 5
Similarity Threshold: 0.7
Max Context Tokens: 4000
Now use it:
curl http://localhost:11434/api/chat \
  -d '{"model": "docs-assistant", "messages": [{"role": "user", "content": "What does the Q3 report say about revenue?"}]}'

The RAG will:
- Embed your query
- Search for relevant chunks in indexed documents
- Rerank results for better relevance
- Inject top chunks into Claude's context
- Return Claude's response informed by your documents
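The same request can be sent from Python using only the standard library. This is a minimal sketch: the URL and model name come from the curl example above, while `build_chat_request`, `ask`, and the `"stream": False` field (an Ollama-style option) are assumptions for illustration.

```python
import json
import urllib.request

RELAY_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl example


def build_chat_request(model: str, question: str) -> urllib.request.Request:
    """Build the POST request; `model` is the Smart RAG name, not the target model."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": False,  # assumption: request a single JSON reply instead of a stream
    }
    return urllib.request.Request(
        RELAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, question: str) -> dict:
    """Send the request and return the decoded JSON body (needs a running relay)."""
    with urllib.request.urlopen(build_chat_request(model, question), timeout=60) as resp:
        return json.load(resp)


request = build_chat_request("docs-assistant", "What does the Q3 report say about revenue?")
print(request.get_method(), request.full_url)
```

Calling `ask("docs-assistant", "...")` performs the network call. The client never names the target model, only the Smart RAG.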
Mount your document folder into the container:
# docker-compose.yml
services:
  llm-relay:
    volumes:
      - ./my-documents:/data/documents

Then set Source Path to /data/documents.
LLM Relay uses Docling for document parsing:
- PDF - Full text extraction with layout preservation
- DOCX - Microsoft Word documents
- PPTX - PowerPoint presentations
- HTML - Web pages
- Markdown - .md files
- Images - PNG, JPG (requires vision model)
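Before indexing, it can help to check which files in the source folder fall into the formats listed above. The sketch below is a hypothetical helper, not part of LLM Relay. The extension set is inferred from the list and may not match exactly what Docling accepts in your deployment.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Extensions inferred from the format list above (an assumption).
SUPPORTED = {".pdf", ".docx", ".pptx", ".html", ".md", ".png", ".jpg"}


def summarize_folder(source_path: str) -> tuple[Counter, list[Path]]:
    """Count listed file types by extension and collect everything else."""
    counts: Counter = Counter()
    other: list[Path] = []
    for path in sorted(Path(source_path).rglob("*")):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext in SUPPORTED:
            counts[ext] += 1
        else:
            other.append(path)
    return counts, other


# Demo on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "report.PDF").write_text("placeholder")
    (Path(folder) / "notes.md").write_text("placeholder")
    (Path(folder) / "data.csv").write_text("placeholder")
    counts, other = summarize_folder(folder)
    other_names = [p.name for p in other]

print(dict(counts), other_names)
```

Pointing `summarize_folder` at the host folder you mount (for example `./my-documents`) gives a quick view of what the indexer is likely to pick up.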
Uses bundled sentence-transformers. No external dependencies.
Pros: Fast, free, private
Cons: Less accurate than larger models
Uses your Ollama instance for embeddings.
Setup:
# On your Ollama instance
ollama pull nomic-embed-text

Configuration:
- Embedding Provider: ollama:<instance-name>
- Embedding Model: nomic-embed-text:latest
- Ollama URL: http://your-ollama:11434

Pros: Good balance of quality and speed
Cons: Requires Ollama instance
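To confirm the Ollama instance can serve embeddings before indexing, a request like the following can be used. It is a sketch that assumes a recent Ollama version exposing the `/api/embed` endpoint. `http://your-ollama:11434` is the placeholder host from the configuration above, and the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://your-ollama:11434"  # placeholder host, replace with your instance


def build_embed_request(
    text: str,
    model: str = "nomic-embed-text:latest",
    base_url: str = OLLAMA_URL,
) -> urllib.request.Request:
    """Build a POST request for Ollama's embedding endpoint."""
    payload = {"model": model, "input": text}
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embedding_dimension(text: str = "hello") -> int:
    """Call Ollama and return the length of the first returned vector."""
    with urllib.request.urlopen(build_embed_request(text), timeout=30) as resp:
        body = json.load(resp)
    return len(body["embeddings"][0])


request = build_embed_request("hello")
print(request.full_url)
```

If `embedding_dimension()` returns a positive number, the model is pulled and reachable from wherever the script runs.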
Use any configured provider with embedding support (OpenAI, etc.):
- Embedding Provider: openai
- Embedding Model: text-embedding-3-small

Pros: High quality embeddings
Cons: API costs, data leaves your network
For complex PDFs with tables, charts, or images, configure a vision model to improve parsing:
- In Admin UI, go to Settings
- In Web Search & Scraping section, configure:
  - Vision Provider: ollama or cloud provider
  - Vision Model: granite3.2-vision:latest or similar
The vision model will be used during indexing to extract content from complex document pages.
Smart RAGs automatically rerank retrieved chunks using a cross-encoder model. This significantly improves relevance by comparing query-document pairs directly.
Default Model: cross-encoder/ms-marco-MiniLM-L-6-v2 (~48MB)
The reranker:
- Takes initial results from vector search
- Scores each chunk against the query
- Reorders by relevance
- Returns top results
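The four reranker steps can be sketched as follows. To keep the example runnable without downloading a model, it uses a toy word-overlap scorer in place of the cross-encoder. `rerank` and `word_overlap` are hypothetical names, not LLM Relay functions.

```python
import re
from typing import Callable


def rerank(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[tuple[float, str]]:
    """Score every (query, chunk) pair, sort by score, keep the best `top_n`."""
    scored = [(score(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def word_overlap(query: str, chunk: str) -> float:
    """Toy scorer: fraction of query words that also appear in the chunk."""
    query_words = set(re.findall(r"\w+", query.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    return len(query_words & chunk_words) / max(len(query_words), 1)


candidates = [
    "The office relocated in Q3.",
    "Q3 revenue grew 15% year-over-year.",
    "Hiring plans for next year.",
]
ranked = rerank("q3 revenue", candidates, word_overlap, top_n=2)
print(ranked)
```

With the sentence-transformers library installed, a real scorer could wrap `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict(...)`, which takes a list of (query, chunk) pairs and returns one score per pair.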
For API-based reranking:
- Rerank Provider: jina
- Set JINA_API_KEY environment variable
- Go to Smart RAGs in Admin UI
- Click on your RAG
- Click Index Now
Indexing processes all documents in the source folder and creates embeddings.
Configure automatic re-indexing with cron syntax:
| Schedule | Cron Expression |
|---|---|
| Every hour | 0 * * * * |
| Daily at midnight | 0 0 * * * |
| Weekly on Sunday | 0 0 * * 0 |
Set the Index Schedule field in the RAG configuration.
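To see when the expressions in the table fire, here is a small matcher for the five cron fields (minute, hour, day of month, month, day of week). It is an illustrative subset that supports only `*` and comma-separated integers. It is not the scheduler LLM Relay uses, and it does not implement cron's special handling when both day fields are restricted.

```python
from datetime import datetime


def _field_matches(field: str, value: int) -> bool:
    """Match one cron field; supports only `*` and comma-separated integers."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if `moment` falls on a minute selected by `expression`."""
    minute, hour, day, month, weekday = expression.split()
    # Cron counts Sunday as 0; isoweekday() returns Monday=1 .. Sunday=7.
    cron_weekday = moment.isoweekday() % 7
    return (
        _field_matches(minute, moment.minute)
        and _field_matches(hour, moment.hour)
        and _field_matches(day, moment.day)
        and _field_matches(month, moment.month)
        and _field_matches(weekday, cron_weekday)
    )


sunday_midnight = datetime(2024, 1, 7, 0, 0)  # a Sunday
monday_midnight = datetime(2024, 1, 8, 0, 0)
for expr in ("0 * * * *", "0 0 * * *", "0 0 * * 0"):
    print(expr, cron_matches(expr, sunday_midnight), cron_matches(expr, monday_midnight))
```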
| Status | Meaning |
|---|---|
| ready | Indexed and available |
| indexing | Currently processing |
| error | Indexing failed (check logs) |
| pending | Not yet indexed |
Control how documents are split:
- Chunk Size: Characters per chunk (default: 512)
- Chunk Overlap: Overlap between chunks (default: 50)
Smaller chunks = more precise retrieval but less context per chunk. Larger chunks = more context but may include irrelevant content.
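The effect of the two settings is easiest to see with a plain character-based splitter. This is a sketch of the general technique using the defaults above. The real splitter may also respect sentence or layout boundaries, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split `text` into windows of `chunk_size` characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(1000))  # 1000 varied characters
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

With the defaults, each window starts 462 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.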
Minimum relevance score for chunks to be included.
| Value | Behavior |
|---|---|
| 0.9 | Only very relevant chunks |
| 0.7 | Moderately relevant (recommended) |
| 0.5 | Loosely relevant |
Start with 0.7 and adjust based on retrieval quality.
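As a rough illustration of how the threshold changes what gets through, the sketch below scores three toy vectors with cosine similarity and filters them at the values from the table. It assumes the relevance score is a cosine similarity. The exact scoring LLM Relay applies is not specified here, and the function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_threshold(
    query_vec: list[float],
    chunk_vecs: dict[str, list[float]],
    threshold: float = 0.7,
) -> dict[str, float]:
    """Keep only the chunks whose similarity to the query meets the threshold."""
    scored = {name: cosine_similarity(query_vec, vec) for name, vec in chunk_vecs.items()}
    return {name: score for name, score in scored.items() if score >= threshold}


query = [1.0, 0.0]
chunk_vectors = {
    "close": [0.9, 0.1],      # nearly parallel to the query (about 0.99)
    "medium": [0.6, 0.8],     # similarity of about 0.6
    "unrelated": [0.0, 1.0],  # orthogonal to the query (0.0)
}
for threshold in (0.9, 0.7, 0.5):
    print(threshold, sorted(filter_by_threshold(query, chunk_vectors, threshold)))
```

Only the lowest threshold lets the "medium" chunk through, which is why lowering the value is the first thing to try when no context is being injected.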
Number of chunks to retrieve and inject.
| Use Case | Recommended |
|---|---|
| Quick answers | 3-5 |
| Detailed research | 5-10 |
| Comprehensive context | 10-15 |
More chunks = more context but higher token usage.
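Max Results interacts with Max Context Tokens: whichever limit is reached first caps what is injected. The sketch below shows one plausible way to apply both limits. The 4-characters-per-token estimate and the function names are assumptions, not LLM Relay's actual accounting.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def select_chunks(
    ranked_chunks: list[str],
    max_results: int = 5,
    max_context_tokens: int = 4000,
) -> list[str]:
    """Take chunks in rank order until either limit would be exceeded."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks[:max_results]:
        cost = estimate_tokens(chunk)
        if used + cost > max_context_tokens:
            break  # the next chunk would overflow the token budget
        selected.append(chunk)
        used += cost
    return selected


ranked = ["x" * 512] * 12  # twelve 512-character chunks, about 128 tokens each
print(len(select_chunks(ranked)))  # capped by max_results
print(len(select_chunks(ranked, max_results=10, max_context_tokens=1000)))  # capped by tokens
```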
Retrieved chunks are injected into the system prompt:
<document_context>
The following information was retrieved from the document collection to help answer the user's question.
Use this information to provide an accurate response. Cite the source files when relevant.
## Relevant Document Context
[Source: quarterly-report-q3.pdf]
Revenue increased 15% year-over-year to $2.3M...
---
[Source: financial-summary.docx]
The Q3 results exceeded analyst expectations...
</document_context>

X-LLM-Relay-RAG: docs-assistant
X-LLM-Relay-Chunks: 5
X-LLM-Relay-Sources: quarterly-report-q3.pdf, financial-summary.docx
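The context block and the three `X-LLM-Relay-*` values above can both be derived from the same list of retrieved chunks. The sketch below mirrors the template and header names shown. `build_context` and `build_headers` are hypothetical helpers, and treating the `X-LLM-Relay-*` lines as response headers is an assumption.

```python
def build_context(chunks: list[tuple[str, str]]) -> str:
    """Format (source, text) pairs following the template shown above."""
    body = "\n---\n".join(f"[Source: {source}]\n{text}" for source, text in chunks)
    return (
        "<document_context>\n"
        "The following information was retrieved from the document collection "
        "to help answer the user's question.\n"
        "Use this information to provide an accurate response. "
        "Cite the source files when relevant.\n"
        "## Relevant Document Context\n"
        f"{body}\n"
        "</document_context>"
    )


def build_headers(rag_name: str, chunks: list[tuple[str, str]]) -> dict[str, str]:
    """Derive the three metadata values from the injected chunks."""
    sources = list(dict.fromkeys(source for source, _ in chunks))  # de-duplicate, keep order
    return {
        "X-LLM-Relay-RAG": rag_name,
        "X-LLM-Relay-Chunks": str(len(chunks)),
        "X-LLM-Relay-Sources": ", ".join(sources),
    }


retrieved = [
    ("quarterly-report-q3.pdf", "Revenue increased 15% year-over-year to $2.3M..."),
    ("financial-summary.docx", "The Q3 results exceeded analyst expectations..."),
]
context = build_context(retrieved)
headers = build_headers("docs-assistant", retrieved)
print(context)
print(headers)
```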
The Smart RAG detail page shows:
- Total requests
- Context injection rate
- Document count
- Chunk count
- Index status
- Organize documents - Keep related documents together
- Use descriptive filenames - Helps with source attribution
- Start with default settings - Tune after testing
- Monitor injection rate - Low rate may indicate threshold too high
- Re-index when documents change - Or use scheduled indexing
- Use vision model for PDFs - Improves table/chart extraction
- Lower the similarity threshold (try 0.5)
- Check if documents were indexed successfully
- Verify embedding model is working
- Check ChromaDB connection
- Raise the similarity threshold
- Reduce chunk size for more precise matching
- Check if documents contain the expected content
- Check container logs for errors
- Verify source path is accessible
- Ensure embedding model is available
- Check ChromaDB is running
- Reduce max_results
- Use local embedding model
- Check ChromaDB performance
- Docling may time out on very large files
- Split into smaller documents
- Check container memory limits