DocForge

DocForge is a modular, privacy-conscious document conversion API built with Flask for teams and individuals who need reliable file conversion without depending on opaque third-party web tools.

Why DocForge

Document conversion is a routine task, but many existing tools create unnecessary friction. Online platforms often put basic features behind paywalls, require sensitive files to be uploaded to third-party servers, provide little clarity around file retention, and produce inconsistent output quality. They also tend to solve only one narrow use case at a time, which makes them a poor fit for repeatable business workflows and low-connectivity environments.

DocForge was created to offer a more practical alternative: a local-first, extensible conversion system that gives users more privacy, control, and transparency in how documents are processed. It is designed to support real workflows across technical and non-technical use cases, whether the goal is to convert a single file safely or integrate document processing into a larger internal system.

What It Delivers

Privacy-first conversion for sensitive files
Lower dependence on paid conversion platforms
Support for multiple document workflows in one service
Merge multiple documents into a single PDF
Extensible architecture for adding new converters over time
Better transparency for debugging and operational use
A strong fit for self-hosted and internal business environments

Who It Helps

Developers building document processing pipelines
Businesses handling internal reports and client files
Students and administrators working across multiple formats
Teams with confidentiality, compliance, or connectivity constraints

Step 1: Check Python 3

Before setup, confirm that Python 3 is available on your system.

macOS / Linux

python3 --version

Windows

py -3 --version

Step 2: Install Python 3

Install Python 3 on Ubuntu / Debian

sudo apt-get update
sudo apt-get install -y python3 python3-venv python3-pip

Install Python 3 on Fedora

sudo dnf install -y python3 python3-pip

Install Python 3 on RHEL / CentOS Stream

sudo dnf install -y python3 python3-pip

Install Python 3 on Arch Linux

sudo pacman -Sy python python-pip

Install Python 3 on macOS

brew install python

Install Python 3 on Windows

Using winget:

winget install -e --id Python.Python.3.11

Using choco:

choco install python --yes

Step 3: Install System Dependencies

DocForge requires Python 3.11+ and a few external tools depending on the conversion path you use:

LibreOffice for DOCX → PDF, PPTX → PDF, and PDF → DOC
Pandoc for DOCX → Markdown, HTML → Markdown, and Markdown → PDF
WeasyPrint system libraries for HTML → PDF, Markdown → PDF, and TXT → PDF

Ubuntu / Debian

sudo apt-get update
sudo apt-get install -y libreoffice pandoc
sudo apt-get install -y libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev libcairo2

DocForge validates required external tools at conversion time and returns a clear error if a dependency such as Pandoc or LibreOffice is missing.

Step 4: Set Up The Virtual Environment

macOS / Linux

python3 -m venv venv
source venv/bin/activate

Windows PowerShell

py -3 -m venv venv
.\venv\Scripts\Activate.ps1

Step 5: Install Python Requirements

macOS / Linux

python3 -m pip install -r requirements.txt

Windows PowerShell

py -3 -m pip install -r requirements.txt

Step 6: Run The Program

macOS / Linux

python3 run.py

Windows

py -3 run.py

The server starts on http://127.0.0.1:5002.

Supported Conversions

Input	Output	Engine
DOCX	HTML	Pandoc
DOCX	MD	Pandoc
DOCX	PDF	LibreOffice headless
PPTX	PDF	LibreOffice headless
HTML	MD	Pandoc
MD	PDF	Pandoc + WeasyPrint
PDF	DOC	pdf2docx + LibreOffice
PDF	DOCX	pdf2docx
PDF	HTML	PyMuPDF
PDF	MD	PyMuPDF4LLM
HTML	PDF	WeasyPrint
TXT	PDF	WeasyPrint

API Endpoints

Health Check

GET /health

curl http://127.0.0.1:5002/health

Response:

{"service": "docforge", "status": "healthy"}

Convert a File

POST /convert

Field	Type	Description
`file`	file	One or more documents to convert
`target_format`	text	Desired output extension (e.g. `pdf`)

If you upload a single file, DocForge returns the converted file directly. If you upload multiple files with the same target_format, DocForge returns a ZIP archive containing all converted outputs.

Merge Documents Into One PDF

POST /merge

Field	Type	Description
`file`	file	Two or more supported documents to merge to PDF

The merged PDF preserves the upload order from the request. Non-PDF inputs are converted to PDF first using the existing conversion pipeline.

DOCX → PDF

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.docx" \
  -F "target_format=pdf" \
  -o converted.pdf

DOCX → Markdown

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.docx" \
  -F "target_format=md" \
  -o converted.md

DOCX → HTML

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.docx" \
  -F "target_format=html" \
  -o converted.html

Batch conversion to PDF

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@chapter1.docx" \
  -F "file=@chapter2.docx" \
  -F "target_format=pdf" \
  -o docforge_batch_pdf.zip

PPTX → PDF

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@slides.pptx" \
  -F "target_format=pdf" \
  -o slides.pdf

Merge multiple documents into one PDF

curl -X POST http://127.0.0.1:5002/merge \
  -F "file=@intro.docx" \
  -F "file=@appendix.pdf" \
  -F "file=@notes.md" \
  -o docforge_merged.pdf

HTML → Markdown

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@page.html" \
  -F "target_format=md" \
  -o converted.md

Markdown → PDF

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.md" \
  -F "target_format=pdf" \
  -o converted.pdf

PDF → DOCX

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.pdf" \
  -F "target_format=docx" \
  -o converted.docx

PDF → DOC

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.pdf" \
  -F "target_format=doc" \
  -o converted.doc

PDF → HTML

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.pdf" \
  -F "target_format=html" \
  -o converted.html

HTML → PDF

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@page.html" \
  -F "target_format=pdf" \
  -o converted.pdf

PDF → Markdown

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@document.pdf" \
  -F "target_format=md" \
  -o converted.md

TXT → PDF

curl -X POST http://127.0.0.1:5002/convert \
  -F "file=@notes.txt" \
  -F "target_format=pdf" \
  -o converted.pdf

Project Structure

├── app/
│   ├── __init__.py          # App factory
│   ├── config.py            # Configuration
│   ├── routes.py            # Flask API routes
│   ├── converters/
│   │   ├── base.py          # Abstract base converter
│   │   ├── docx_to_html.py  # Pandoc converter
│   │   ├── docx_to_md.py    # Pandoc converter
│   │   ├── docx_to_pdf.py   # LibreOffice converter
│   │   ├── html_to_md.py    # Pandoc converter
│   │   ├── md_to_pdf.py     # Pandoc + WeasyPrint converter
│   │   ├── pdf_to_doc.py    # pdf2docx + LibreOffice converter
│   │   ├── pdf_to_docx.py   # pdf2docx converter
│   │   ├── pdf_to_html.py   # PyMuPDF converter
│   │   ├── pdf_to_md.py     # PyMuPDF4LLM converter
│   │   ├── html_to_pdf.py   # WeasyPrint converter
│   │   ├── pptx_to_pdf.py   # LibreOffice converter
│   │   └── txt_to_pdf.py    # WeasyPrint converter
│   ├── services/
│   │   ├── registry.py      # Converter registry
│   │   ├── engine.py        # Conversion orchestration
│   │   ├── file_manager.py  # Upload handling
│   │   └── merger.py        # Document merge workflow
│   └── utils/
│       ├── exceptions.py    # Custom exceptions
│       └── helpers.py       # Filename utilities
├── run.py                   # Dev server entry point
├── requirements.txt
└── README.md

Adding a New Converter

Create app/converters/my_format_to_other.py
Subclass BaseConverter, set input_format and output_format
Implement the convert() method

Register it in app/__init__.py:

registry.register(MyFormatToOtherConverter())

Add the input extension to ALLOWED_INPUT_EXTENSIONS in config.py

Notes on PDF → DOC

PDF → DOC uses an intermediate DOCX file under the hood. This keeps the implementation compatible with the existing converter architecture and is generally more stable than attempting direct PDF → DOC export. Legacy DOC output may still lose some layout fidelity compared with DOCX, depending on the source PDF.

Notes on DOCX → Markdown

DOCX → Markdown uses Pandoc to produce GitHub-Flavored Markdown. Rich Word features that do not map cleanly to Markdown may be simplified during export, which keeps the output practical for plain-text workflows.

Notes on DOCX → HTML

DOCX → HTML uses Pandoc to produce standalone HTML. The output is convenient for browser viewing and downstream HTML-based conversions, though Word-specific layout or styling may be simplified during export.

Notes on PPTX → PDF

PPTX → PDF uses LibreOffice in headless mode. Complex slide animations, embedded media behavior, or highly custom fonts may render differently in the exported PDF depending on the host LibreOffice installation.

Notes on HTML → Markdown

HTML → Markdown uses Pandoc to produce GitHub-Flavored Markdown. Complex HTML structures or raw embedded elements may be simplified during export so the result stays practical for Markdown-based workflows.

Notes on Markdown → PDF

Markdown → PDF uses Pandoc plus WeasyPrint through an intermediate standalone HTML file. This keeps PDF generation reliable without requiring a LaTeX engine, though the final appearance will follow HTML/CSS rendering rather than a Word-processor layout model.

Notes on PDF → HTML

PDF → HTML is extracted page-by-page with PyMuPDF and combined into a single HTML document. The output is suitable for inspection and downstream HTML-based processing, but complex PDFs may still produce positioned markup that reflects the original page layout rather than semantic HTML.

Notes on TXT → PDF

TXT → PDF renders plain text through a simple HTML template using WeasyPrint. Common text encodings are attempted automatically, and the output preserves line breaks and spacing in a print-friendly monospace layout.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
app		app
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
run.py		run.py

Folders and files

Latest commit

History

Repository files navigation

DocForge

Why DocForge

What It Delivers

Who It Helps

Step 1: Check Python 3

macOS / Linux

Windows

Step 2: Install Python 3

Install Python 3 on Ubuntu / Debian

Install Python 3 on Fedora

Install Python 3 on RHEL / CentOS Stream

Install Python 3 on Arch Linux

Install Python 3 on macOS

Install Python 3 on Windows

Step 3: Install System Dependencies

Ubuntu / Debian

Step 4: Set Up The Virtual Environment

macOS / Linux

Windows PowerShell

Step 5: Install Python Requirements

macOS / Linux

Windows PowerShell

Step 6: Run The Program

macOS / Linux

Windows

Supported Conversions

API Endpoints

Health Check

Convert a File

Merge Documents Into One PDF

DOCX → PDF

DOCX → Markdown

DOCX → HTML

Batch conversion to PDF

PPTX → PDF

Merge multiple documents into one PDF

HTML → Markdown

Markdown → PDF

PDF → DOCX

PDF → DOC

PDF → HTML

HTML → PDF

PDF → Markdown

TXT → PDF

Project Structure

Adding a New Converter

Notes on PDF → DOC

Notes on DOCX → Markdown

Notes on DOCX → HTML

Notes on PPTX → PDF

Notes on HTML → Markdown

Notes on Markdown → PDF

Notes on PDF → HTML

Notes on TXT → PDF

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages