A web-based tool for bulk metadata management in Paperless-ngx. Clean up unused tags, correspondents, and document types, merge similar items, and maintain a tidy document management system.
- 📊 Metadata Overview: View all tags, correspondents, and document types with document counts
- 🧹 Cleanup: Identify and bulk-delete items with zero or few documents
- 🔀 Smart Merge: Auto-suggest similar items for merging based on:
  - `.*` Prefix matching - Groups items starting with the same word (e.g., "account-personal", "account-business")
  - `~` Spelling similarity - Groups items with similar spelling using Levenshtein distance (catches typos)
  - `≈` Semantic similarity - Groups items with related meanings using word associations (e.g., "invoice" and "bill")
  - ⚡ AI grouping - Optional LLM-powered grouping using OpenAI, Anthropic, or Ollama
- ⚡ Fast: Client-side grouping for instant filtering and responsive UI
- 🔒 Safe: Confirmation dialogs for all destructive operations
- 🐳 Docker Ready: Simple deployment with Docker Compose
Coming soon
In Paperless-ngx, go to Settings → Administration → API Tokens and create a new token.
Alternatively, create one via CLI:
```bash
docker exec paperless python manage.py generate_api_token <username>
```

Then start the container:

```bash
docker run -d \
  --name paperless-metadata-manager \
  -p 8080:8000 \
  -e PAPERLESS_URL=http://your-paperless-url:8000 \
  -e PAPERLESS_API_TOKEN=your_api_token_here \
  --restart unless-stopped \
  ghcr.io/benhumphry/paperless-metadata-manager:latest
```

Access the web UI at http://localhost:8080
Create a docker-compose.yml:
```yaml
services:
  paperless-metadata-manager:
    image: ghcr.io/benhumphry/paperless-metadata-manager:latest
    environment:
      - PAPERLESS_URL=http://your-paperless-url:8000
      - PAPERLESS_API_TOKEN=your_api_token_here
    ports:
      - "8080:8000"
    restart: unless-stopped
```

Then run:

```bash
docker compose up -d
```

All configuration is via environment variables (set in .env file):
| Variable | Required | Default | Description |
|---|---|---|---|
| `PAPERLESS_URL` | ✅ | - | Base URL of your Paperless-ngx instance |
| `PAPERLESS_API_TOKEN` | ✅ | - | API token for authentication |
| `PORT` | ❌ | `8000` | Port for the web UI |
| `LOG_LEVEL` | ❌ | `info` | Logging level (debug, info, warning, error) |
| `EXCLUDE_PATTERNS` | ❌ | `new,inbox,todo,review` | Comma-separated list of tag patterns to exclude from cleanup suggestions |
| `LLM_TYPE` | ❌ | - | LLM provider: openai, anthropic, or ollama |
| `LLM_API_URL` | ❌ | varies | API URL (required for Ollama, optional for others) |
| `LLM_API_TOKEN` | ❌ | - | API token (required for OpenAI/Anthropic) |
| `LLM_MODEL` | ❌ | varies | Model name (e.g., gpt-5-mini, claude-3-haiku-20240307, llama3) |
| `LLM_LANGUAGE` | ❌ | `English` | Language for LLM responses |
| `LLM_PROMPT` | ❌ | - | Custom prompt template (advanced, see below) |
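To make the required/optional split in the table concrete, here is a minimal sketch of reading a few of these variables in Python. The `Settings` class and `load_settings` function are hypothetical names for illustration only; the project's own `config.py` may be structured differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for some of the variables in the table above."""
    paperless_url: str
    paperless_api_token: str
    port: int = 8000
    log_level: str = "info"
    exclude_patterns: tuple[str, ...] = ("new", "inbox", "todo", "review")


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env

    # The two required variables have no default, so fail early if absent.
    required = ("PAPERLESS_URL", "PAPERLESS_API_TOKEN")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ValueError("Missing required variables: " + ", ".join(missing))

    # EXCLUDE_PATTERNS is a comma-separated list; drop empty entries.
    raw_patterns = env.get("EXCLUDE_PATTERNS", "new,inbox,todo,review")
    patterns = tuple(p.strip() for p in raw_patterns.split(",") if p.strip())

    return Settings(
        paperless_url=env["PAPERLESS_URL"],
        paperless_api_token=env["PAPERLESS_API_TOKEN"],
        port=int(env.get("PORT", "8000")),
        log_level=env.get("LOG_LEVEL", "info").lower(),
        exclude_patterns=patterns,
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check without touching the real environment.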
The LLM_PROMPT setting allows advanced users to customize the prompt sent to the LLM. This is useful if the default prompt doesn't work well with your specific model.
Available variables:
- `{language}` - The configured language (from `LLM_LANGUAGE`)
- `{item_type}` - The item type being grouped (e.g., "tags", "correspondents")
- `{item_type_upper}` - Uppercase version (e.g., "TAGS", "CORRESPONDENTS")
- `{items}` - The list of items to group, one per line with "- " prefix
Example:
```bash
LLM_PROMPT=Analyze these {item_type_upper} and group similar ones. Respond in {language} with JSON: {{"groups": {{"GroupName": ["item1", "item2"]}}}}. Items:\n{items}
```

Note: Use `{{` and `}}` to escape literal braces in the JSON format instruction.
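The brace-escaping rule matches how Python's `str.format` treats `{{` and `}}`. The sketch below assumes the template is filled in that way, which this README does not state outright; `render_prompt` is a hypothetical helper, not the project's function.

```python
# Same template as the example above, written as a Python string.
template = (
    "Analyze these {item_type_upper} and group similar ones. "
    "Respond in {language} with JSON: "
    '{{"groups": {{"GroupName": ["item1", "item2"]}}}}. Items:\n{items}'
)


def render_prompt(template: str, language: str, item_type: str, names: list[str]) -> str:
    """Fill the four documented placeholders (assumes str.format semantics)."""
    # Items are listed one per line with a "- " prefix, as described above.
    items = "\n".join(f"- {name}" for name in names)
    return template.format(
        language=language,
        item_type=item_type,
        item_type_upper=item_type.upper(),
        items=items,
    )


prompt = render_prompt(template, "English", "tags", ["invoice", "bill"])
print(prompt)
```

In the rendered prompt the doubled braces collapse to single ones, so the model sees plain JSON in the format instruction. An unescaped `{` in a custom template would instead be read as a placeholder.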
The EXCLUDE_PATTERNS setting allows you to protect important tags from being suggested for deletion, even if they have few documents. This is useful for:
- Workflow tags (new, inbox, todo, review)
- Project tags you want to keep empty
- Placeholder tags for future use
Examples:
```bash
# Simple patterns (case-insensitive substring match)
EXCLUDE_PATTERNS=new,inbox,todo,review

# Regex patterns for more control
EXCLUDE_PATTERNS=^important.*,^keep-.*,archived-\d+

# Mixed patterns
EXCLUDE_PATTERNS=new,inbox,^project-.*,review,^archive-
```

Choose from Tags, Correspondents, or Document Types using the dropdown at the top.
View all items with:
- Document count (color-coded: red=0, green=1+)
- Match type (None, Any, All, Literal, Regex, Fuzzy, Auto)
- Tag color preview (for tags only)
Click column headers to sort. Filter by name using the search box.
Find items that are candidates for deletion:
- Filter by maximum document count (0, 1, 2, or 5)
- Toggle "Exclude auto-match" to show/hide auto-matching items
- Auto-excludes items matching configured patterns (tags only)
- Select individual items or use "Select All"
- Bulk delete with confirmation
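The cleanup filter can be sketched as a document-count threshold combined with the exclusion patterns. This is an illustration under assumptions: every pattern is treated as a case-insensitive regular expression (a plain word then behaves like a substring match), and `is_excluded` and `cleanup_candidates` are hypothetical names. The app's actual matching rules may differ.

```python
import re


def is_excluded(name: str, patterns: list[str]) -> bool:
    """Return True if any pattern matches the name, ignoring case."""
    for pattern in patterns:
        try:
            if re.search(pattern, name, re.IGNORECASE):
                return True
        except re.error:
            # Not a valid regex: fall back to a plain substring check.
            if pattern.lower() in name.lower():
                return True
    return False


def cleanup_candidates(tags: list[dict], max_count: int, patterns: list[str]) -> list[str]:
    """Names of tags with at most max_count documents that no pattern protects."""
    return [
        tag["name"]
        for tag in tags
        if tag["document_count"] <= max_count
        and not is_excluded(tag["name"], patterns)
    ]


tags = [
    {"name": "inbox", "document_count": 0},
    {"name": "old-receipts", "document_count": 0},
    {"name": "project-garden", "document_count": 1},
    {"name": "taxes", "document_count": 42},
    {"name": "Renewal", "document_count": 0},
]
print(cleanup_candidates(tags, 1, ["new", "inbox", "^project-.*"]))
# ['old-receipts']
```

Note that "Renewal" is protected here because the simple pattern `new` matches anywhere in the name. Anchoring a pattern (for example `^new$`) narrows it.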
Consolidate similar items with intelligent grouping:
Suggestion Groups:
- Groups are displayed alphabetically with a type indicator:
  - `.*` (blue) = Prefix match - items sharing a common starting word
  - `~` (green) = Spelling similarity - items with similar spelling (Levenshtein distance)
  - `≈` (purple) = Semantic similarity - items with related meanings (word associations)
  - ⚡ (amber) = AI suggested - LLM-powered grouping (requires LLM configuration)
- Items can appear in multiple groups if they match multiple criteria
- Use the checkboxes to enable/disable each grouping type
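The first two grouping types can be sketched in a few lines of Python. The separator set used to find the first word and the distance threshold of 2 are assumptions chosen for illustration, and the function names are hypothetical. The tool itself computes groups client-side and may use different rules.

```python
import re
from collections import defaultdict


def prefix_groups(names: list[str]) -> dict[str, list[str]]:
    """Group names by their first word; keep only groups with 2+ members."""
    groups = defaultdict(list)
    for name in names:
        # Assumed separators: whitespace, "-", "_", ":" and "/".
        first_word = re.split(r"[\s\-_:/]+", name.strip().lower())[0]
        if first_word:
            groups[first_word].append(name)
    return {prefix: members for prefix, members in groups.items() if len(members) > 1}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def spelling_pairs(names: list[str], max_distance: int = 2) -> list[tuple[str, str]]:
    """Pairs of names whose lowercase forms are within max_distance edits."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if levenshtein(first.lower(), second.lower()) <= max_distance:
                pairs.append((first, second))
    return pairs


names = ["account-personal", "account-business", "invoice", "invoce", "bill"]
print(prefix_groups(names))   # {'account': ['account-personal', 'account-business']}
print(spelling_pairs(names))  # [('invoice', 'invoce')]
```

The pairwise comparison is quadratic in the number of items, which is acceptable for a few hundred names but worth keeping in mind for very large lists.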
AI Grouping (Optional):
- Configure `LLM_TYPE` and `LLM_API_TOKEN` to enable the "⚡ AI" checkbox
- Supports OpenAI, Anthropic, and local Ollama models
- Sends all item names to the LLM in a single request for intelligent grouping
- Results are cached per session - merging items updates the cache without re-querying
- Great for finding semantic relationships the other methods might miss
Note: Cloud APIs (OpenAI, Anthropic) typically respond in under a minute. Local models via Ollama may take significantly longer for large datasets (500+ items), depending on your hardware.
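One way to update cached AI groups after a merge without calling the LLM again is to rewrite the cached group members in place. The sketch below is an assumption about how that could work, not the project's code. It uses the `{"GroupName": [items]}` shape from the prompt example, and `apply_merge_to_groups` is a hypothetical name.

```python
def apply_merge_to_groups(
    groups: dict[str, list[str]], sources: list[str], target: str
) -> dict[str, list[str]]:
    """Return a copy of the cached groups with merged items renamed to target."""
    source_set = set(sources)
    updated = {}
    for label, members in groups.items():
        renamed = []
        for name in members:
            new_name = target if name in source_set else name
            # Skip duplicates created by renaming several sources to one target.
            if new_name not in renamed:
                renamed.append(new_name)
        # A group with fewer than two members no longer suggests a merge.
        if len(renamed) > 1:
            updated[label] = renamed
    return updated


cached = {
    "Billing": ["invoice", "invoices", "bill"],
    "Travel": ["flight", "flights"],
}
print(apply_merge_to_groups(cached, ["invoices"], "invoice"))
# {'Billing': ['invoice', 'bill'], 'Travel': ['flight', 'flights']}
```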
How to Merge:
- Click "Find Suggestions" to load all items and compute groups
- Click on suggestion groups to add all items to selection
- Or use custom search to find specific items
- Click individual items to add/remove from selection
- Enter target name and click "Merge Selected"
Live Filtering:
- Type in the prefix filter box for instant filtering (no server round-trip)
- Groups are recomputed client-side as you type
The merge process:
- Creates the target item (if it doesn't exist)
- Reassigns all documents to the target
- Deletes the source items
- Automatically refreshes suggestions to reflect the changes
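The first three merge steps can be illustrated with an in-memory model. This sketch deliberately avoids the Paperless-ngx API: items are a set of names, documents map to sets of item names, and `merge_items` is a hypothetical function. The real tool performs these steps through API calls.

```python
def merge_items(
    items: set[str],
    documents: dict[int, set[str]],
    sources: list[str],
    target: str,
) -> int:
    """Merge sources into target in place; return how many documents changed."""
    # Never delete the target, even if it was also selected as a source.
    to_remove = set(sources) - {target}

    # 1. Create the target item if it does not exist yet.
    items.add(target)

    # 2. Reassign every document that carries a source item to the target.
    changed = 0
    for assigned in documents.values():
        if assigned & to_remove:
            assigned.difference_update(to_remove)
            assigned.add(target)
            changed += 1

    # 3. Delete the source items.
    items.difference_update(to_remove)
    return changed


items = {"invoices", "bill", "taxes"}
documents = {1: {"invoices"}, 2: {"bill", "taxes"}, 3: {"taxes"}}
changed = merge_items(items, documents, ["invoices", "bill"], "invoice")
print(changed, sorted(items))  # 2 ['invoice', 'taxes']
```

Reassigning before deleting matters: removing a source item first would drop it from its documents, and the link to the target would be lost.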
If Paperless-ngx is running in Docker, connect to the same network for internal communication:
```yaml
services:
  paperless-metadata-manager:
    image: ghcr.io/benhumphry/paperless-metadata-manager:latest
    environment:
      - PAPERLESS_URL=http://paperless:8000
      - PAPERLESS_API_TOKEN=your_api_token_here
    ports:
      - "8080:8000"
    networks:
      - paperless_default
    restart: unless-stopped

networks:
  paperless_default:
    external: true
```

Replace `paperless` with your Paperless-ngx container name.
The app runs on port 8000 by default. Configure your reverse proxy (nginx, Caddy, Traefik, etc.) to proxy to this port. No special headers are required.
If you prefer to build locally:
```bash
git clone https://github.com/benhumphry/paperless-metadata-manager.git
cd paperless-metadata-manager
docker build -t paperless-metadata-manager .
```

For local development you need:

- Python 3.12+
- pip

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install dependencies
pip install -r requirements.txt

# Copy and configure .env
cp example.env .env
nano .env

# Run development server
uvicorn app.main:app --reload --port 8000
```

```
paperless-metadata-manager/
├── app/
│   ├── __init__.py
│   ├── main.py              # FastAPI application
│   ├── config.py            # Settings from environment
│   ├── paperless_client.py  # Async Paperless API client
│   ├── routers/
│   │   ├── health.py        # Health check endpoints
│   │   └── tags.py          # Tag management endpoints
│   ├── templates/
│   │   ├── base.html        # Base template
│   │   └── index.html       # Main UI
│   └── static/
│       └── js/
├── tests/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── example.env
└── README.md
```
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Web UI |
| `/health` | GET | Basic health check |
| `/health/full` | GET | Health check with Paperless connection test |
| `/api/tags` | GET | List all tags |
| `/api/tags/low-usage` | GET | List low-usage tags |
| `/api/tags/all` | GET | Get all tags (for client-side processing) |
| `/api/tags/delete` | POST | Delete tags |
| `/api/tags/merge/preview` | POST | Preview merge operation |
| `/api/tags/merge` | POST | Execute merge operation |
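As a small usage sketch, the two health endpoints can be polled from Python with the standard library. The response body format is not documented here, so the hypothetical `check_health` helper returns it as raw text. The base URL `http://localhost:8080` in the comment assumes the port mapping from the Docker examples.

```python
import urllib.request


def health_url(base_url: str, full: bool = False) -> str:
    """Build the URL for /health, or /health/full when full is True."""
    path = "/health/full" if full else "/health"
    return base_url.rstrip("/") + path


def check_health(base_url: str, full: bool = False, timeout: float = 5.0) -> tuple[int, str]:
    """Return (HTTP status, raw body). urlopen raises HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(health_url(base_url, full), timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body


# Example (requires a running instance):
# status, body = check_health("http://localhost:8080", full=True)
```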
Support for managing custom field values is planned for a future release. Unlike tags, correspondents, and document types (which are standalone entities), custom fields work differently in Paperless-ngx:
Challenges:
- Custom field values are stored on documents, not as separate entities
- There's no API endpoint to list all unique values for a custom field
- Aggregating values requires scanning all documents (performance impact)
- Only "select" type fields have discrete, mergeable values
- API format for select options changed in Paperless-ngx API v7+
Planned Approach:
- Support only "select" type custom fields initially
- Build a value aggregation layer that scans documents
- Use the `bulk_edit` API with `modify_custom_fields` for merging values
- Implement "merge" as reassigning documents from one value to another
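The value aggregation layer could look roughly like the sketch below, which counts how often each value of one custom field appears across already-fetched documents. The document shape (a `custom_fields` list of `{"field": id, "value": ...}` entries) is an assumption modeled on Paperless-ngx document JSON, and `aggregate_field_values` is a hypothetical name. Since the feature is only planned, nothing here reflects existing project code.

```python
from collections import Counter


def aggregate_field_values(documents: list[dict], field_id: int) -> Counter:
    """Count documents per value for one custom field (assumed document shape)."""
    counts = Counter()
    for document in documents:
        for entry in document.get("custom_fields", []):
            # Skip other fields and entries where the value is unset.
            if entry.get("field") == field_id and entry.get("value") is not None:
                counts[entry["value"]] += 1
    return counts


documents = [
    {"id": 1, "custom_fields": [{"field": 3, "value": "paid"}]},
    {"id": 2, "custom_fields": [{"field": 3, "value": "Paid"}, {"field": 4, "value": "x"}]},
    {"id": 3, "custom_fields": [{"field": 3, "value": "paid"}]},
    {"id": 4, "custom_fields": []},
]
print(aggregate_field_values(documents, 3))
# Counter({'paid': 2, 'Paid': 1})
```

A result like this makes near-duplicate values visible as merge candidates, at the cost of scanning every document once.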
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Paperless-ngx - The amazing document management system
- FastAPI - Modern Python web framework
- HTMX - High power tools for HTML
- Alpine.js - Lightweight JavaScript framework
- Tailwind CSS - Utility-first CSS framework
