Paperless Metadata Manager

A web-based tool for bulk metadata management in Paperless-ngx. Clean up unused tags, correspondents, and document types, merge similar items, and maintain a tidy document management system.

Features

📊 Metadata Overview: View all tags, correspondents, and document types with document counts
🧹 Cleanup: Identify and bulk-delete items with zero or few documents
🔀 Smart Merge: Auto-suggest similar items for merging based on:
- Prefix matching .* - Groups items starting with the same word (e.g., "account-personal", "account-business")
- Spelling similarity ~ - Groups items with similar spelling using Levenshtein distance (catches typos)
- Semantic similarity ≈ - Groups items with related meanings using word associations (e.g., "invoice" and "bill")
- AI grouping ⚡ - Optional LLM-powered grouping using OpenAI, Anthropic, or Ollama
⚡ Fast: Client-side grouping for instant filtering and responsive UI
🔒 Safe: Confirmation dialogs for all destructive operations
🐳 Docker Ready: Simple deployment with Docker Compose

Screenshots

Coming soon

Quick Start

1. Get your Paperless-ngx API Token

In Paperless-ngx, go to Settings → Administration → API Tokens and create a new token.

Alternatively, create one via CLI:

docker exec paperless python manage.py generate_api_token <username>
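
Before starting the container, it can help to confirm that the token is accepted. The following is a minimal sketch, not part of this project. It assumes Paperless-ngx accepts the token in an `Authorization: Token <token>` header and returns a paginated JSON body with a `count` field from `/api/tags/`. The function names are hypothetical.

```python
import json
import urllib.request


def build_tags_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for the Paperless-ngx tag list, authenticated by token."""
    url = base_url.rstrip("/") + "/api/tags/"
    return urllib.request.Request(
        url,
        headers={
            # Assumption: token authentication uses the "Token <value>" scheme.
            "Authorization": f"Token {token}",
            "Accept": "application/json",
        },
    )


def count_tags(base_url: str, token: str) -> int:
    """Send the request and return the tag total reported by the API."""
    request = build_tags_request(base_url, token)
    with urllib.request.urlopen(request, timeout=10) as response:
        # Assumption: the response is paginated JSON with a "count" field.
        return json.load(response)["count"]
```

Calling `count_tags("http://your-paperless-url:8000", "your_api_token_here")` should raise an HTTP error if the token is rejected, and return a number otherwise.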

2. Run with Docker

docker run -d \
  --name paperless-metadata-manager \
  -p 8080:8000 \
  -e PAPERLESS_URL=http://your-paperless-url:8000 \
  -e PAPERLESS_API_TOKEN=your_api_token_here \
  --restart unless-stopped \
  ghcr.io/benhumphry/paperless-metadata-manager:latest

Access the web UI at http://localhost:8080

Alternative: Docker Compose

Create a docker-compose.yml:

services:
  paperless-metadata-manager:
    image: ghcr.io/benhumphry/paperless-metadata-manager:latest
    environment:
      - PAPERLESS_URL=http://your-paperless-url:8000
      - PAPERLESS_API_TOKEN=your_api_token_here
    ports:
      - "8080:8000"
    restart: unless-stopped

Then run:

docker compose up -d

Configuration

All configuration is done via environment variables, passed with -e flags, listed under environment: in Docker Compose, or set in a .env file:

Variable	Required	Default	Description
`PAPERLESS_URL`	✅	-	Base URL of your Paperless-ngx instance
`PAPERLESS_API_TOKEN`	✅	-	API token for authentication
`PORT`	❌	`8000`	Port for the web UI
`LOG_LEVEL`	❌	`info`	Logging level (debug, info, warning, error)
`EXCLUDE_PATTERNS`	❌	`new,inbox,todo,review`	Comma-separated list of tag patterns to exclude from cleanup suggestions
`LLM_TYPE`	❌	-	LLM provider: `openai`, `anthropic`, or `ollama`
`LLM_API_URL`	❌	varies	API URL (required for Ollama, optional for others)
`LLM_API_TOKEN`	❌	-	API token (required for OpenAI/Anthropic)
`LLM_MODEL`	❌	varies	Model name (e.g., `gpt-5-mini`, `claude-3-haiku-20240307`, `llama3`)
`LLM_LANGUAGE`	❌	`English`	Language for LLM responses
`LLM_PROMPT`	❌	-	Custom prompt template (advanced, see below)

Custom LLM Prompt

The LLM_PROMPT setting allows advanced users to customize the prompt sent to the LLM. This is useful if the default prompt doesn't work well with your specific model.

Available variables:

{language} - The configured language (from LLM_LANGUAGE)
{item_type} - The item type being grouped (e.g., "tags", "correspondents")
{item_type_upper} - Uppercase version (e.g., "TAGS", "CORRESPONDENTS")
{items} - The list of items to group, one per line with "- " prefix

Example:

LLM_PROMPT=Analyze these {item_type_upper} and group similar ones. Respond in {language} with JSON: {{"groups": {{"GroupName": ["item1", "item2"]}}}}. Items:\n{items}

Note: Use {{ and }} to escape literal braces in the JSON format instruction.
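
The placeholder and brace-escaping rules above match Python's `str.format` syntax, so a template can be checked locally before deploying it. This is a sketch under that assumption. The `render_prompt` helper is hypothetical, and the conversion of a literal `\n` into a newline is a guess at how the value is handled, not confirmed behavior.

```python
def render_prompt(template: str, language: str, item_type: str, items: list[str]) -> str:
    """Fill a prompt template using the four documented variables."""
    # Assumption: an env file delivers "\n" as two characters, so convert it
    # to a real newline before formatting.
    template = template.replace("\\n", "\n")
    return template.format(
        language=language,
        item_type=item_type,
        item_type_upper=item_type.upper(),
        # Items are rendered one per line with a "- " prefix, as documented.
        items="\n".join(f"- {name}" for name in items),
    )


template = (
    r"Analyze these {item_type_upper} and group similar ones. "
    r'Respond in {language} with JSON: {{"groups": {{"GroupName": ["item1", "item2"]}}}}. '
    r"Items:\n{items}"
)
print(render_prompt(template, "English", "tags", ["invoice", "bill"]))
```

If a template contains an unescaped `{` or `}`, `str.format` raises `KeyError`, `IndexError`, or `ValueError`, which makes a quick local run a cheap way to catch mistakes.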

Exclude Patterns

The EXCLUDE_PATTERNS setting allows you to protect important tags from being suggested for deletion, even if they have few documents. This is useful for:

Workflow tags (new, inbox, todo, review)
Project tags you want to keep empty
Placeholder tags for future use

Examples:

# Simple patterns (case-insensitive substring match)
EXCLUDE_PATTERNS=new,inbox,todo,review

# Regex patterns for more control
EXCLUDE_PATTERNS=^important.*,^keep-.*,archived-\d+

# Mixed patterns
EXCLUDE_PATTERNS=new,inbox,^project-.*,review,^archive-
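
To see which tag names a pattern list would protect, the matching can be approximated locally. This sketch assumes every entry is treated as a case-insensitive regular expression searched anywhere in the name, which makes a plain word behave like a substring match. The actual matching logic in the app may differ, and both function names are hypothetical.

```python
import re


def parse_patterns(raw: str) -> list[re.Pattern[str]]:
    """Split a comma-separated pattern list and compile each entry."""
    # Note: splitting on commas means a regex containing a comma,
    # such as \d{1,3}, cannot be expressed in this simplified parser.
    return [
        re.compile(part.strip(), re.IGNORECASE)
        for part in raw.split(",")
        if part.strip()
    ]


def is_excluded(name: str, patterns: list[re.Pattern[str]]) -> bool:
    """Return True if any pattern matches somewhere in the tag name."""
    return any(pattern.search(name) for pattern in patterns)


patterns = parse_patterns("new,inbox,^project-.*,review,^archive-")
for tag in ["Inbox", "renewal", "project-alpha", "my-project-x", "taxes"]:
    print(tag, is_excluded(tag, patterns))
```

Under this assumption, the simple pattern `new` also protects a tag named "renewal" because it matches as a substring. Anchoring the pattern (for example `^new$`) narrows it to the exact name.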

Usage

Metadata Type Selector

Choose from Tags, Correspondents, or Document Types using the dropdown at the top.

All Tab

View all items with:

Document count (color-coded: red=0, green=1+)
Match type (None, Any, All, Literal, Regex, Fuzzy, Auto)
Tag color preview (for tags only)

Click column headers to sort. Filter by name using the search box.

Cleanup Tab

Find items that are candidates for deletion:

Filter by maximum document count (0, 1, 2, or 5)
Toggle "Exclude auto-match" to show/hide auto-matching items
Automatically excludes items matching the configured EXCLUDE_PATTERNS (tags only)
Select individual items or use "Select All"
Bulk delete with confirmation
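
The filters above can be thought of as a simple predicate over each item. The sketch below is illustrative only. The `Item` class is hypothetical, and the value `6` for the Auto matching algorithm is an assumption based on the order of match types listed in the All tab.

```python
from dataclasses import dataclass

MATCH_AUTO = 6  # Assumption: numeric code for the "Auto" match type.


@dataclass
class Item:
    name: str
    document_count: int
    matching_algorithm: int = 0  # 0 is assumed to mean "None".


def cleanup_candidates(
    items: list[Item], max_docs: int = 0, exclude_auto_match: bool = True
) -> list[Item]:
    """Return items at or below the document threshold, optionally skipping auto-match items."""
    return [
        item
        for item in items
        if item.document_count <= max_docs
        and not (exclude_auto_match and item.matching_algorithm == MATCH_AUTO)
    ]


items = [
    Item("taxes-2019", 0),
    Item("receipts", 42),
    Item("bank", 1, MATCH_AUTO),
    Item("misc", 1),
]
print([item.name for item in cleanup_candidates(items, max_docs=1)])
```

Skipping auto-match items by default is a cautious choice, since an auto-matching tag with few documents may still be assigned to future documents.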

Merge Tab

Consolidate similar items with intelligent grouping:

Suggestion Groups:

Groups are displayed alphabetically with a type indicator:
- .* (blue) = Prefix match - items sharing a common starting word
- ~ (green) = Spelling similarity - items with similar spelling (Levenshtein distance)
- ≈ (purple) = Semantic similarity - items with related meanings (word associations)
- ⚡ (amber) = AI suggested - LLM-powered grouping (requires LLM configuration)
Items can appear in multiple groups if they match multiple criteria
Use the checkboxes to enable/disable each grouping type
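
The prefix and spelling groupings can be illustrated with a short sketch. This is not the project's implementation (the app computes groups client-side). The tokenization rule and the distance threshold of 2 are assumptions chosen for the example, and all function names are hypothetical.

```python
import re
from collections import defaultdict


def prefix_groups(names: list[str]) -> dict[str, list[str]]:
    """Group names by their first word, keeping only groups with 2+ members."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        # Assumption: words are runs of letters and digits, compared in lowercase.
        tokens = re.findall(r"[a-z0-9]+", name.lower())
        if tokens:
            groups[tokens[0]].append(name)
    return {prefix: members for prefix, members in groups.items() if len(members) > 1}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def similar_pairs(names: list[str], max_distance: int = 2) -> list[tuple[str, str]]:
    """Return pairs of names whose lowercase edit distance is within the threshold."""
    pairs = []
    for index, first in enumerate(names):
        for second in names[index + 1:]:
            if levenshtein(first.lower(), second.lower()) <= max_distance:
                pairs.append((first, second))
    return pairs


print(prefix_groups(["account-personal", "account-business", "taxes"]))
print(similar_pairs(["invoice", "invioce", "receipt"]))
```

A fixed threshold is the simplest choice. Short names may need a smaller one, since two edits can turn one three-letter tag into an unrelated one.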

AI Grouping (Optional):

Configure LLM_TYPE together with LLM_API_TOKEN (OpenAI, Anthropic) or LLM_API_URL (Ollama) to enable the "⚡ AI" checkbox
Supports OpenAI, Anthropic, and local Ollama models
Sends all item names to the LLM in a single request for intelligent grouping
Results are cached per session - merging items updates the cache without re-querying
Great for finding semantic relationships the other methods might miss

Note: Cloud APIs (OpenAI, Anthropic) typically respond in under a minute. Local models via Ollama may take significantly longer for large datasets (500+ items), depending on your hardware.

How to Merge:

Click "Find Suggestions" to load all items and compute groups
Click on suggestion groups to add all items to selection
Or use custom search to find specific items
Click individual items to add/remove from selection
Enter target name and click "Merge Selected"

Live Filtering:

Type in the prefix filter box for instant filtering (no server round-trip)
Groups are recomputed client-side as you type

The merge process:

Creates the target item (if it doesn't exist)
Reassigns all documents to the target
Deletes the source items
Automatically refreshes suggestions to reflect the changes
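
The first three steps of the merge process can be modeled in memory to show why the order matters. This sketch uses plain dictionaries in place of the Paperless-ngx API, so the data shapes and the `merge_tags` name are hypothetical.

```python
def merge_tags(
    tags: dict[str, int],
    doc_tags: dict[int, set[int]],
    sources: list[str],
    target: str,
) -> int:
    """Merge source tags into a target tag and return the target's id.

    tags maps tag name to tag id; doc_tags maps document id to its set of tag ids.
    Both are modified in place.
    """
    # Step 1: create the target if it does not exist yet.
    if target not in tags:
        tags[target] = max(tags.values(), default=0) + 1
    target_id = tags[target]

    # Never treat the target itself as a source.
    source_ids = {tags[name] for name in sources if name in tags and name != target}

    # Step 2: reassign documents before deleting anything, so no document
    # is left without the merged tag if a later step fails.
    for tag_ids in doc_tags.values():
        if tag_ids & source_ids:
            tag_ids -= source_ids
            tag_ids.add(target_id)

    # Step 3: delete the source tags.
    for name in [n for n, tag_id in tags.items() if tag_id in source_ids]:
        del tags[name]

    return target_id


tags = {"invoice": 1, "invoices": 2, "bill": 3}
doc_tags = {10: {1}, 11: {2, 3}, 12: {3}}
merge_tags(tags, doc_tags, ["invoices", "bill"], "invoice")
print(tags, doc_tags)
```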

Deployment Options

Same Network as Paperless-ngx

If Paperless-ngx is running in Docker, connect to the same network for internal communication:

services:
  paperless-metadata-manager:
    image: ghcr.io/benhumphry/paperless-metadata-manager:latest
    environment:
      - PAPERLESS_URL=http://paperless:8000
      - PAPERLESS_API_TOKEN=your_api_token_here
    ports:
      - "8080:8000"
    networks:
      - paperless_default
    restart: unless-stopped

networks:
  paperless_default:
    external: true

Replace paperless with your Paperless-ngx container name.

Behind a Reverse Proxy

The app runs on port 8000 by default. Configure your reverse proxy (nginx, Caddy, Traefik, etc.) to proxy to this port. No special headers are required.

Building from Source

If you prefer to build locally:

git clone https://github.com/benhumphry/paperless-metadata-manager.git
cd paperless-metadata-manager
docker build -t paperless-metadata-manager .

Development

Prerequisites

Python 3.12+
pip

Local Setup

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install dependencies
pip install -r requirements.txt

# Copy and configure .env
cp example.env .env
nano .env

# Run development server
uvicorn app.main:app --reload --port 8000

Project Structure

paperless-metadata-manager/
├── app/
│   ├── __init__.py
│   ├── main.py              # FastAPI application
│   ├── config.py            # Settings from environment
│   ├── paperless_client.py  # Async Paperless API client
│   ├── routers/
│   │   ├── health.py        # Health check endpoints
│   │   └── tags.py          # Tag management endpoints
│   ├── templates/
│   │   ├── base.html        # Base template
│   │   └── index.html       # Main UI
│   └── static/
│       └── js/
├── tests/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── example.env
└── README.md

API Endpoints

Endpoint	Method	Description
`/`	GET	Web UI
`/health`	GET	Basic health check
`/health/full`	GET	Health check with Paperless connection test
`/api/tags`	GET	List all tags
`/api/tags/low-usage`	GET	List low-usage tags
`/api/tags/all`	GET	Get all tags (for client-side processing)
`/api/tags/delete`	POST	Delete tags
`/api/tags/merge/preview`	POST	Preview merge operation
`/api/tags/merge`	POST	Execute merge operation

Future Enhancements

Custom Field Value Management

Support for managing custom field values is planned for a future release. Unlike tags, correspondents, and document types (which are standalone entities), custom fields work differently in Paperless-ngx:

Challenges:

Custom field values are stored on documents, not as separate entities
There's no API endpoint to list all unique values for a custom field
Aggregating values requires scanning all documents (performance impact)
Only "select" type fields have discrete, mergeable values
The API format for select options differs from Paperless-ngx API v7 onward

Planned Approach:

Support only "select" type custom fields initially
Build a value aggregation layer that scans documents
Use the bulk_edit API with modify_custom_fields for merging values
Implement "merge" as reassigning documents from one value to another
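
The value aggregation layer mentioned above could look roughly like the following. This is a sketch of the idea only, since the feature is not implemented. It assumes each document carries a `custom_fields` list of `{"field": <id>, "value": <value>}` entries, and the function name is hypothetical.

```python
from collections import Counter


def aggregate_values(documents: list[dict], field_id: int) -> Counter:
    """Count how many documents use each value of one custom field."""
    counts: Counter = Counter()
    for document in documents:
        # Assumption: custom field entries are dicts with "field" and "value" keys.
        for entry in document.get("custom_fields", []):
            if entry.get("field") == field_id and entry.get("value") is not None:
                counts[entry["value"]] += 1
    return counts


documents = [
    {"id": 1, "custom_fields": [{"field": 5, "value": "paid"}]},
    {"id": 2, "custom_fields": [{"field": 5, "value": "Paid"}, {"field": 7, "value": "x"}]},
    {"id": 3, "custom_fields": [{"field": 5, "value": "paid"}]},
    {"id": 4, "custom_fields": []},
]
print(aggregate_values(documents, field_id=5))
```

Because this requires reading every document, a real implementation would likely need pagination and caching to limit the performance impact noted in the challenges.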

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Paperless-ngx - The amazing document management system
FastAPI - Modern Python web framework
HTMX - High power tools for HTML
Alpine.js - Lightweight JavaScript framework
Tailwind CSS - Utility-first CSS framework
