DocForge is a modular, privacy-conscious document conversion API built with Flask for teams and individuals who need reliable file conversion without depending on opaque third-party web tools.
Document conversion is a routine task, but many existing tools create unnecessary friction. Online platforms often put basic features behind paywalls, require sensitive files to be uploaded to third-party servers, provide little clarity around file retention, and produce inconsistent output quality. They also tend to solve only one narrow use case at a time, which makes them a poor fit for repeatable business workflows and low-connectivity environments.
DocForge was created to offer a more practical alternative: a local-first, extensible conversion system that gives users more privacy, control, and transparency in how documents are processed. It is designed to support real workflows across technical and non-technical use cases, whether the goal is to convert a single file safely or integrate document processing into a larger internal system.
- Privacy-first conversion for sensitive files
- Lower dependence on paid conversion platforms
- Support for multiple document workflows in one service
- Merge multiple documents into a single PDF
- Extensible architecture for adding new converters over time
- Better transparency for debugging and operational use
- A strong fit for self-hosted and internal business environments

DocForge is intended for:

- Developers building document processing pipelines
- Businesses handling internal reports and client files
- Students and administrators working across multiple formats
- Teams with confidentiality, compliance, or connectivity constraints
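
Before setup, confirm that Python 3 is available on your system.

macOS and Linux:

```bash
python3 --version
```

Windows:

```powershell
py -3 --version
```

If Python is missing, install it with your platform's package manager.

Debian and Ubuntu:

```bash
sudo apt-get update
sudo apt-get install -y python3 python3-venv python3-pip
```

Fedora, RHEL, and other dnf-based distributions:

```bash
sudo dnf install -y python3 python3-pip
```

Arch Linux:

```bash
sudo pacman -Sy python python-pip
```

macOS (Homebrew):

```bash
brew install python
```

Windows, using winget:

```powershell
winget install -e --id Python.Python.3.11
```

Windows, using choco:

```powershell
choco install python --yes
```

DocForge requires Python 3.11+ and a few external tools depending on the conversion path you use: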
- LibreOffice for DOCX → PDF, PPTX → PDF, and PDF → DOC
- Pandoc for DOCX → Markdown, HTML → Markdown, and Markdown → PDF
- WeasyPrint system libraries for HTML → PDF, Markdown → PDF, and TXT → PDF

On Debian and Ubuntu:

```bash
sudo apt-get update
sudo apt-get install -y libreoffice pandoc
sudo apt-get install -y libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev libcairo2
```

DocForge validates required external tools at conversion time and returns a clear error if a dependency such as Pandoc or LibreOffice is missing.
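
A conversion-time dependency check of this kind can be sketched with the standard library. This is a minimal illustration, not DocForge's actual code: the `REQUIRED_TOOLS` mapping, the `MissingDependencyError` class, and the `find_tool` helper are hypothetical names introduced here.

```python
import shutil

# Hypothetical mapping from an engine name to candidate executables.
# LibreOffice is commonly installed as "soffice" or "libreoffice".
REQUIRED_TOOLS = {
    "pandoc": ("pandoc",),
    "libreoffice": ("soffice", "libreoffice"),
}


class MissingDependencyError(RuntimeError):
    """Raised when an external tool needed for a conversion is not on PATH."""


def find_tool(engine, which=shutil.which):
    """Return the path of the first available executable for `engine`.

    `which` is injectable so the lookup can be exercised without the
    tools being installed.
    """
    candidates = REQUIRED_TOOLS[engine]
    for name in candidates:
        path = which(name)
        if path:
            return path
    raise MissingDependencyError(
        f"{engine} is required for this conversion but none of "
        f"{', '.join(candidates)} was found on PATH"
    )
```

Checking at conversion time rather than at startup means the service can still run when only some tools are installed; only the conversion paths that need a missing tool fail.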
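
Create and activate a virtual environment.

macOS and Linux:

```bash
python3 -m venv venv
source venv/bin/activate
```

Windows (PowerShell):

```powershell
py -3 -m venv venv
.\venv\Scripts\Activate.ps1
```

Install the Python dependencies.

macOS and Linux:

```bash
python3 -m pip install -r requirements.txt
```

Windows:

```powershell
py -3 -m pip install -r requirements.txt
```

Start the server.

macOS and Linux:

```bash
python3 run.py
```

Windows:

```powershell
py -3 run.py
```

The server starts on http://127.0.0.1:5002.
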
| Input | Output | Engine |
|---|---|---|
| DOCX | HTML | Pandoc |
| DOCX | MD | Pandoc |
| DOCX | PDF | LibreOffice headless |
| PPTX | PDF | LibreOffice headless |
| HTML | MD | Pandoc |
| MD | PDF | Pandoc + WeasyPrint |
| PDF | DOC | pdf2docx + LibreOffice |
| PDF | DOCX | pdf2docx |
| PDF | HTML | PyMuPDF |
| PDF | MD | PyMuPDF4LLM |
| HTML | PDF | WeasyPrint |
| TXT | PDF | WeasyPrint |

GET /health

```bash
curl http://127.0.0.1:5002/health
```

Response:

```json
{"service": "docforge", "status": "healthy"}
```

POST /convert

| Field | Type | Description |
|---|---|---|
| `file` | file | One or more documents to convert |
| `target_format` | text | Desired output extension (e.g. `pdf`) |

If you upload a single file, DocForge returns the converted file directly.
If you upload multiple files with the same target_format, DocForge returns a
ZIP archive containing all converted outputs.
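
On the client side, the two response shapes can be handled with a small helper. The sketch below is an assumption-laden illustration: `save_convert_response` and its parameters are hypothetical. It decides based on how many files were uploaded instead of sniffing the body, because a single converted DOCX is itself a ZIP container and would be misdetected.

```python
import io
import zipfile
from pathlib import Path


def save_convert_response(
    body: bytes, uploaded_count: int, out_dir: Path, single_name: str
) -> list[Path]:
    """Write a /convert response body to disk and return the files written.

    One uploaded file: the body is the converted file, saved as `single_name`.
    Several uploaded files: the body is a ZIP archive, which is unpacked.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    if uploaded_count <= 1:
        target = out_dir / single_name
        target.write_bytes(body)
        return [target]

    written = []
    with zipfile.ZipFile(io.BytesIO(body)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Keep only the base name so an entry cannot escape out_dir.
            target = out_dir / Path(info.filename).name
            target.write_bytes(archive.read(info))
            written.append(target)
    return written
```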
POST /merge

| Field | Type | Description |
|---|---|---|
| `file` | file | Two or more supported documents to merge to PDF |

The merged PDF preserves the upload order from the request. Non-PDF inputs are converted to PDF first using the existing conversion pipeline.
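
The ordering rule can be sketched as a small pure function. This is only an illustration of the behavior described above: `pdf_inputs_in_order` and the `convert_to_pdf` callback are hypothetical stand-ins for the real merge workflow and conversion pipeline.

```python
from pathlib import Path
from typing import Callable


def pdf_inputs_in_order(
    filenames: list[str], convert_to_pdf: Callable[[str], str]
) -> list[str]:
    """Return one PDF path per upload, in the order the files were uploaded.

    PDF inputs pass through unchanged; everything else is converted first
    by the supplied `convert_to_pdf` callback.
    """
    if len(filenames) < 2:
        raise ValueError("merge needs at least two documents")
    ordered = []
    for name in filenames:
        if Path(name).suffix.lower() == ".pdf":
            ordered.append(name)
        else:
            ordered.append(convert_to_pdf(name))
    return ordered
```

Building the list in a single pass over the uploads keeps the final page order identical to the request order, regardless of which inputs needed conversion.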

DOCX → PDF:

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.docx" \
  -F "target_format=pdf" \
  -o converted.pdf
```

DOCX → Markdown:

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.docx" \
  -F "target_format=md" \
  -o converted.md
```

DOCX → HTML:

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.docx" \
  -F "target_format=html" \
  -o converted.html
```

Multiple DOCX files → PDF (ZIP archive):

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@chapter1.docx" \
  -F "file=@chapter2.docx" \
  -F "target_format=pdf" \
  -o docforge_batch_pdf.zip
```

PPTX → PDF:

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@slides.pptx" \
  -F "target_format=pdf" \
  -o slides.pdf
```

Merge mixed documents into one PDF:

```bash
curl -X POST http://127.0.0.1:5002/merge \
  -F "file=@intro.docx" \
  -F "file=@appendix.pdf" \
  -F "file=@notes.md" \
  -o docforge_merged.pdf
```

HTML → Markdown:

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@page.html" \
  -F "target_format=md" \
  -o converted.md
```

Markdown → PDF:

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.md" \
  -F "target_format=pdf" \
  -o converted.pdf
```

PDF → DOCX:

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.pdf" \
  -F "target_format=docx" \
  -o converted.docx
```

PDF → DOC:

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.pdf" \
  -F "target_format=doc" \
  -o converted.doc
```

PDF → HTML:

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.pdf" \
  -F "target_format=html" \
  -o converted.html
```

HTML → PDF:

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@page.html" \
  -F "target_format=pdf" \
  -o converted.pdf
```

PDF → Markdown:

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.pdf" \
  -F "target_format=md" \
  -o converted.md
```

TXT → PDF:

```bash
curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@notes.txt" \
  -F "target_format=pdf" \
  -o converted.pdf
```

Project layout:

```
├── app/
│   ├── __init__.py              # App factory
│   ├── config.py                # Configuration
│   ├── routes.py                # Flask API routes
│   ├── converters/
│   │   ├── base.py              # Abstract base converter
│   │   ├── docx_to_html.py      # Pandoc converter
│   │   ├── docx_to_md.py        # Pandoc converter
│   │   ├── docx_to_pdf.py       # LibreOffice converter
│   │   ├── html_to_md.py        # Pandoc converter
│   │   ├── md_to_pdf.py         # Pandoc + WeasyPrint converter
│   │   ├── pdf_to_doc.py        # pdf2docx + LibreOffice converter
│   │   ├── pdf_to_docx.py       # pdf2docx converter
│   │   ├── pdf_to_html.py       # PyMuPDF converter
│   │   ├── pdf_to_md.py         # PyMuPDF4LLM converter
│   │   ├── html_to_pdf.py       # WeasyPrint converter
│   │   ├── pptx_to_pdf.py       # LibreOffice converter
│   │   └── txt_to_pdf.py        # WeasyPrint converter
│   ├── services/
│   │   ├── registry.py          # Converter registry
│   │   ├── engine.py            # Conversion orchestration
│   │   ├── file_manager.py      # Upload handling
│   │   └── merger.py            # Document merge workflow
│   └── utils/
│       ├── exceptions.py        # Custom exceptions
│       └── helpers.py           # Filename utilities
├── run.py                       # Dev server entry point
├── requirements.txt
└── README.md
```

To add a new converter:

- Create `app/converters/my_format_to_other.py`
- Subclass `BaseConverter`, set `input_format` and `output_format`
- Implement the `convert()` method
- Register it in `app/__init__.py`: `registry.register(MyFormatToOtherConverter())`
- Add the input extension to `ALLOWED_INPUT_EXTENSIONS` in `config.py`

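
The steps above can be sketched end to end. The `BaseConverter` and `ConverterRegistry` classes below are simplified stand-ins written for this example; the real classes in `app/converters/base.py` and `app/services/registry.py` may use different method signatures, so treat the `convert(input_path, output_path)` shape and the `get()` lookup as assumptions.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseConverter(ABC):
    """Stand-in for app/converters/base.py (signature is an assumption)."""

    input_format: str
    output_format: str

    @abstractmethod
    def convert(self, input_path: Path, output_path: Path) -> Path:
        """Convert `input_path` and return the path of the written output."""


class ConverterRegistry:
    """Stand-in registry keyed by (input_format, output_format)."""

    def __init__(self):
        self._converters = {}

    def register(self, converter: BaseConverter) -> None:
        key = (converter.input_format, converter.output_format)
        self._converters[key] = converter

    def get(self, input_format: str, output_format: str) -> BaseConverter:
        return self._converters[(input_format, output_format)]


class MyFormatToOtherConverter(BaseConverter):
    input_format = "myformat"
    output_format = "other"

    def convert(self, input_path: Path, output_path: Path) -> Path:
        # Placeholder transformation: copy the text through unchanged.
        text = input_path.read_text(encoding="utf-8")
        output_path.write_text(text, encoding="utf-8")
        return output_path


registry = ConverterRegistry()
registry.register(MyFormatToOtherConverter())
```

Keying the registry by the format pair is what lets a new converter be added without touching the routing code: the engine only needs to look up the pair requested by the client.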
PDF → DOC uses an intermediate DOCX file under the hood. This keeps the implementation compatible with the existing converter architecture and is generally more stable than attempting direct PDF → DOC export. Legacy DOC output may still lose some layout fidelity compared with DOCX, depending on the source PDF.
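
A two-stage pipeline like this might look roughly as follows. The function names are hypothetical and the sketch assumes LibreOffice is available as `soffice`; it uses the `pdf2docx` `Converter` class for the first stage and LibreOffice's `--convert-to` option for the second. DocForge's own converter may be organized differently.

```python
import subprocess
from pathlib import Path


def libreoffice_doc_command(
    docx_path: Path, out_dir: Path, binary: str = "soffice"
) -> list[str]:
    """Build the headless LibreOffice call that exports a DOCX as DOC."""
    return [
        binary,
        "--headless",
        "--convert-to", "doc",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def pdf_to_doc(pdf_path: Path, out_dir: Path) -> Path:
    """Convert PDF -> DOCX with pdf2docx, then DOCX -> DOC with LibreOffice."""
    from pdf2docx import Converter  # third-party; imported lazily

    out_dir.mkdir(parents=True, exist_ok=True)
    docx_path = out_dir / (pdf_path.stem + ".docx")

    converter = Converter(str(pdf_path))
    try:
        converter.convert(str(docx_path))
    finally:
        converter.close()

    subprocess.run(libreoffice_doc_command(docx_path, out_dir), check=True)
    # LibreOffice writes <stem>.doc into the directory given by --outdir.
    return out_dir / (pdf_path.stem + ".doc")
```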
DOCX → Markdown uses Pandoc to produce GitHub-Flavored Markdown. Rich Word features that do not map cleanly to Markdown may be simplified during export, which keeps the output practical for plain-text workflows.
DOCX → HTML uses Pandoc to produce standalone HTML. The output is convenient for browser viewing and downstream HTML-based conversions, though Word-specific layout or styling may be simplified during export.
PPTX → PDF uses LibreOffice in headless mode. Complex slide animations, embedded media behavior, or highly custom fonts may render differently in the exported PDF depending on the host LibreOffice installation.
HTML → Markdown uses Pandoc to produce GitHub-Flavored Markdown. Complex HTML structures or raw embedded elements may be simplified during export so the result stays practical for Markdown-based workflows.
Markdown → PDF uses Pandoc plus WeasyPrint through an intermediate standalone HTML file. This keeps PDF generation reliable without requiring a LaTeX engine, though the final appearance will follow HTML/CSS rendering rather than a Word-processor layout model.
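
As a rough sketch of that route, the example below renders Markdown to standalone HTML with the Pandoc CLI and then prints the HTML with WeasyPrint's `HTML(...).write_pdf(...)`. The function names and the choice to place the intermediate file next to the output are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def pandoc_html_command(md_path: Path, html_path: Path) -> list[str]:
    """Build the Pandoc call that renders Markdown to standalone HTML."""
    return [
        "pandoc",
        str(md_path),
        "--standalone",
        "--to", "html",
        "--output", str(html_path),
    ]


def markdown_to_pdf(md_path: Path, pdf_path: Path) -> Path:
    """Render Markdown -> HTML with Pandoc, then HTML -> PDF with WeasyPrint."""
    from weasyprint import HTML  # third-party; imported lazily

    # Assumption: the intermediate HTML is written beside the PDF.
    html_path = pdf_path.with_suffix(".html")
    subprocess.run(pandoc_html_command(md_path, html_path), check=True)
    HTML(filename=str(html_path)).write_pdf(str(pdf_path))
    return pdf_path
```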
PDF → HTML is extracted page-by-page with PyMuPDF and combined into a single HTML document. The output is suitable for inspection and downstream HTML-based processing, but complex PDFs may still produce positioned markup that reflects the original page layout rather than semantic HTML.
TXT → PDF renders plain text through a simple HTML template using WeasyPrint. Common text encodings are attempted automatically, and the output preserves line breaks and spacing in a print-friendly monospace layout.
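
The decoding and templating steps can be illustrated with the standard library alone. The encoding list, the function names, and the template below are assumptions for this sketch and are not taken from the DocForge source; the resulting HTML string would then be handed to WeasyPrint.

```python
import html

# Assumed fallback order. latin-1 maps every byte, so it must come last.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def decode_text(raw: bytes) -> str:
    """Decode bytes with the first candidate encoding that accepts them."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Not reachable while latin-1 is in the list; kept as a guard.
    raise ValueError("could not decode input with any candidate encoding")


def text_to_html(text: str) -> str:
    """Wrap plain text in a minimal monospace HTML page.

    The text is escaped so characters such as '<' stay literal, and the
    <pre> element preserves line breaks and spacing.
    """
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        "<style>pre { font-family: monospace; white-space: pre-wrap; }</style>"
        "</head><body><pre>" + html.escape(text) + "</pre></body></html>"
    )
```

Trying strict encodings first and a permissive one last avoids silently mis-decoding valid UTF-8, while still guaranteeing that some output is produced for legacy files.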
MIT