Farmers for Forests - Document Automation POC

Cisco x F4F Hackathon | March 2026

Automation toolkit for Farmers for Forests (F4F) farmer onboarding workflows. It consists of two Streamlit apps that replace manual document verification with AI-powered pipelines.

Quick Start

# Activate the virtual environment
venv312\Scripts\activate

# Run Use Case 1 - Land Record OCR
streamlit run usecase1_land_record_ocr.py

# Run Use Case 2 - CC Photo Verification
streamlit run usecase2_photo_verification.py

Requirement: Set CXAI_API_KEY in .env for Vision and Combined modes.

Use Case 1 - Land Record OCR and Extraction

File: usecase1_land_record_ocr.py

Extracts structured data from Maharashtra 7/12 (Saat-Baara) land record documents (PDFs or images) into a standardised JSON schema.

Three Extraction Modes

PaddleOCR (Offline) - Runs paddleocr_pdf_to_json_demo.py as a subprocess. No API key needed.
Vision (Online) - Sends the document image to the GPT-4 Vision API for direct extraction. Requires CXAI_API_KEY.
Combined (Recommended) - PaddleOCR + Vision run in parallel. GPT-4o-mini merges both results into a single output, resolving conflicts between sources.
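
As a rough illustration of Combined mode, the sketch below runs two extractors concurrently and hands both results to a later merge step. The functions `run_paddle` and `run_vision` are hypothetical stand-ins returning placeholder values; the real app calls the PaddleOCR subprocess and the Vision API, and the merge itself is done by GPT-4o-mini, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_paddle(path: str) -> dict:
    # Hypothetical stand-in for the PaddleOCR subprocess call (placeholder values).
    return {"village": "Example", "survey_number": "12"}


def run_vision(path: str) -> dict:
    # Hypothetical stand-in for the Vision API request (placeholder values).
    return {"village": "Example", "taluka": "Sample"}


def extract_combined(path: str) -> dict:
    # Submit both extractors at once so the slower one sets the wall-clock time.
    with ThreadPoolExecutor(max_workers=2) as pool:
        paddle_future = pool.submit(run_paddle, path)
        vision_future = pool.submit(run_vision, path)
        paddle = paddle_future.result()
        vision = vision_future.result()
    # Both raw results are kept so a merge step can resolve conflicts.
    return {"paddle": paddle, "vision": vision}


if __name__ == "__main__":
    print(extract_combined("sample.pdf"))
```

Threads suit this case because both workers spend most of their time waiting on a subprocess or a network call.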

How It Works (Single Document)

Upload a PDF or image
Quality Gate checks blur, brightness, contrast, resolution, and skew
- If PASS: skip preprocessing and go straight to extraction
- If FAIL: interactive enhancement panel (contrast, denoise, deskew, threshold)
Run extraction on raw document (and preprocessed version if applicable)
Comparative analysis via GPT-4o-mini (raw vs preprocessed accuracy)
Semantic and Knowledge Graph - infers ownership chain, maps encumbrances, and renders a visual relationship graph
Final output saved as structured JSON
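
The Quality Gate decision can be pictured as a set of threshold checks over the five measured properties. The sketch below is an assumption about its shape, not the app's code: the `QualityMetrics` fields and every threshold value are hypothetical, and the real metrics are computed with OpenCV.

```python
from dataclasses import dataclass


@dataclass
class QualityMetrics:
    blur_score: float    # e.g. variance of the Laplacian; higher means sharper
    brightness: float    # mean grey level, 0-255
    contrast: float      # grey-level standard deviation
    min_side_px: int     # shorter image side in pixels
    skew_degrees: float  # estimated rotation away from horizontal


def quality_gate(m: QualityMetrics) -> list[str]:
    """Return the names of failed checks; an empty list means PASS."""
    failures = []
    # All thresholds below are illustrative guesses, not the app's values.
    if m.blur_score < 100.0:
        failures.append("blur")
    if not 60.0 <= m.brightness <= 200.0:
        failures.append("brightness")
    if m.contrast < 30.0:
        failures.append("contrast")
    if m.min_side_px < 1000:
        failures.append("resolution")
    if abs(m.skew_degrees) > 3.0:
        failures.append("skew")
    return failures


if __name__ == "__main__":
    scan = QualityMetrics(20.0, 30.0, 55.0, 1600, 7.0)
    failed = quality_gate(scan)
    print("PASS" if not failed else f"FAIL: {failed}")
```

Returning the list of failed checks, rather than a bare boolean, lets the enhancement panel suggest only the relevant fixes.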

Semantic and Knowledge Graph

After extraction, GPT-4o-mini performs semantic analysis on the structured JSON to produce:

Original owner identification from mutation reference numbers
Ownership chain - who transferred land to whom, via which mutation, transfer type (inheritance / sale / partition)
Current owners with account numbers, area shares, and assessment amounts
Encumbrance mapping - each loan/mortgage linked to its specific owner and bank
Water resources - wells with owners and mutation references
Interactive Graphviz graph visualising the full ownership and encumbrance network

Graph node legend:

Node	Colour	Meaning
Yellow box	`#FFF9C4`	Original / historical owner
Green box (bold)	`#C8E6C9`	Current owner
Red hexagon	`#FFCDD2`	Bank encumbrance
Blue diamond	`#B3E5FC`	Well
Green ellipse	`#C8E6C9`	Land parcel

The semantic knowledge graph is included in the final saved JSON under the semantic_knowledge_graph key.
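
To show how the legend above could map onto Graphviz, here is a minimal sketch that turns nodes and edges into a DOT string. The input layout (`nodes` as id to label/kind pairs, `edges` as triples) and the kind names are assumptions for illustration; the actual structure under semantic_knowledge_graph may differ. Streamlit can render a DOT string with `st.graphviz_chart`.

```python
# Node styles taken from the legend; the kind names are hypothetical.
NODE_STYLES = {
    "original_owner": 'shape=box, style=filled, fillcolor="#FFF9C4"',
    "current_owner": 'shape=box, style="filled,bold", fillcolor="#C8E6C9"',
    "bank": 'shape=hexagon, style=filled, fillcolor="#FFCDD2"',
    "well": 'shape=diamond, style=filled, fillcolor="#B3E5FC"',
    "land": 'shape=ellipse, style=filled, fillcolor="#C8E6C9"',
}


def to_dot(nodes: dict, edges: list) -> str:
    """Build a DOT digraph from {id: (label, kind)} and [(src, dst, label)]."""
    lines = ["digraph ownership {"]
    for node_id, (label, kind) in nodes.items():
        lines.append(f'  "{node_id}" [label="{label}", {NODE_STYLES[kind]}];')
    for src, dst, label in edges:
        lines.append(f'  "{src}" -> "{dst}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    nodes = {
        "o1": ("Owner A", "original_owner"),
        "o2": ("Owner B", "current_owner"),
    }
    print(to_dot(nodes, [("o1", "o2", "inheritance")]))
```

Labels containing double quotes would need escaping before being written into the DOT string; the sketch leaves that out.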

Batch Processing

Upload multiple PDFs/images or scan a local folder
Queue-based processing with live progress table
Choose extraction mode per batch (combined / paddle / vision)
CSV export with key fields: district, taluka, village, survey numbers, owners, area
Expand any row to inspect full extraction JSON or run semantic analysis per document
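
The CSV export amounts to flattening each extraction result into one row of key fields. The sketch below assumes a nested layout (`location`, `land`, `owners`, `area`) and column names that are hypothetical; only the set of exported fields comes from the list above.

```python
import csv
import io

# Column names are assumptions based on the key fields listed above.
CSV_COLUMNS = [
    "file", "district", "taluka", "village",
    "survey_number", "owners", "total_area_hectare",
]


def to_row(filename: str, record: dict) -> dict:
    location = record.get("location", {})
    # Several owners are joined into one cell so each document stays on one row.
    owners = "; ".join(o.get("name", "") for o in record.get("owners", []))
    return {
        "file": filename,
        "district": location.get("district", ""),
        "taluka": location.get("taluka", ""),
        "village": location.get("village", ""),
        "survey_number": record.get("land", {}).get("survey_number", ""),
        "owners": owners,
        "total_area_hectare": record.get("area", {}).get("total_area_hectare", ""),
    }


def export_csv(results: dict) -> str:
    """Serialise {filename: extraction_record} to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS)
    writer.writeheader()
    for filename, record in results.items():
        writer.writerow(to_row(filename, record))
    return buffer.getvalue()
```

Missing fields become empty cells, so a partially failed extraction still produces a row that can be inspected later.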

Extracted Fields

Category	Fields
Location	`state`, `district`, `taluka`, `village`, `village_code`, `pu_id`
Land	`survey_number`, `sub_division`, `local_name`, `tenure`
Owners	`name`, `account_number`, `area_hectare`, `assessment_rupees`, `mutation_ref`
Area	cultivable (jirayat / bagayat), uncultivable (class A / B), `pot_kharab`, `total_area_hectare`
Assessment	`base_rupees`, `special_rupees`, `total_rupees`
Mutation	`last_number`, `last_date`, `pending`, `all_numbers`
Encumbrances	`type`, `bank_name`, `branch`, `amount_rupees`, `borrower_name`, `mutation_ref`
Water	wells (owner, mutation), irrigation
Signature	`date`, `verification_url`, `reference_number`
Comparison	`fields_differing`, `paddle_only`, `vision_only`
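
The three Comparison fields can be derived from the two extraction results with plain set operations. This is a sketch under the assumption that both results are flat dictionaries; nested records would have to be flattened first, and the app may compute these fields differently.

```python
def compare_extractions(paddle: dict, vision: dict) -> dict:
    """Summarise where the two extraction sources agree and disagree."""
    shared = paddle.keys() & vision.keys()
    return {
        # Present in both sources but with different values.
        "fields_differing": sorted(k for k in shared if paddle[k] != vision[k]),
        # Present in only one of the two sources.
        "paddle_only": sorted(paddle.keys() - vision.keys()),
        "vision_only": sorted(vision.keys() - paddle.keys()),
    }


if __name__ == "__main__":
    paddle = {"village": "Example", "survey_number": "12", "tenure": "X"}
    vision = {"village": "Example", "survey_number": "21", "taluka": "Sample"}
    print(compare_extractions(paddle, vision))
```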

Use Case 2 - CC Training Photo Verification

File: usecase2_photo_verification.py

Verifies photographic evidence submitted for carbon credit training sessions. Each photo must prove a training event occurred with identifiable participants, location, and timestamp.

Three Verification Checks

Image Quality (OpenCV) - Blur score, brightness, contrast
Scene Analysis (GPT-4 Vision) - People count, F4F representative present, training context, outdoor/rural setting
Metadata Extraction (GPT-4 Vision) - GPS coordinates and date/time from photo overlay (e.g. GPS Map Camera app)

Accept / Reject Logic

A photo is ACCEPTED only if ALL of these pass:

Image quality is acceptable (not blurry, not too dark/bright)
Multiple people are visible in the frame
Scene looks like a training session (not a selfie, not indoors in a city)
GPS coordinates found in photo overlay
Date/time found in photo overlay

If any check fails, the photo is REJECTED with specific reasons.
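
The all-checks-must-pass rule can be sketched as below. The result keys (`quality_ok`, `people_count`, `is_training_scene`, `gps`, `timestamp`) and the reason strings are hypothetical, and "multiple people" is read here as two or more, which is an assumption.

```python
def decide(result: dict) -> tuple[str, list[str]]:
    """Return ("ACCEPTED" | "REJECTED", reasons) for one photo."""
    checks = [
        (result.get("quality_ok", False),
         "Image quality is not acceptable"),
        (result.get("people_count", 0) >= 2,  # assumed meaning of "multiple"
         "Fewer than two people visible"),
        (result.get("is_training_scene", False),
         "Scene does not look like a training session"),
        (result.get("gps") is not None,
         "GPS coordinates not found in photo overlay"),
        (result.get("timestamp") is not None,
         "Date/time not found in photo overlay"),
    ]
    # Collect every failure so the reviewer sees all reasons, not just the first.
    reasons = [message for passed, message in checks if not passed]
    return ("ACCEPTED" if not reasons else "REJECTED", reasons)


if __name__ == "__main__":
    photo = {
        "quality_ok": True,
        "people_count": 1,
        "is_training_scene": True,
        "gps": None,
        "timestamp": "2026-03-01 10:30",  # placeholder value
    }
    print(decide(photo))
```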

Single Photo Mode

Upload a JPEG/PNG and walk through 4 tabs:

Upload Photo
Quality Check results
Scene and Metadata analysis (with GPS map)
Final Verdict (Accept/Reject with full breakdown)

Batch PDF Mode

Upload CC training PDFs (filename pattern: {surrogate_key}-{FID}-{LID}.pdf)
Automatically extracts the training photo from page 3 of each PDF
Live-updating results table with Accept / Reject / Error counts
CSV export with all verification fields
Click any row to see the extracted photo and full result JSON
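
The filename pattern {surrogate_key}-{FID}-{LID}.pdf can be split into its parts with a regular expression. The sketch below assumes that FID and LID contain no hyphens while the surrogate key may; that assumption, the lowercase group names, and the function name are not taken from the app.

```python
import re
from pathlib import PurePath
from typing import Optional

# The last two hyphen-separated parts are FID and LID; the rest is the key.
PATTERN = re.compile(r"^(?P<surrogate_key>.+)-(?P<fid>[^-]+)-(?P<lid>[^-]+)$")


def parse_cc_filename(filename: str) -> Optional[dict]:
    """Return the three filename parts, or None if the name does not match."""
    path = PurePath(filename)
    if path.suffix.lower() != ".pdf":
        return None
    match = PATTERN.match(path.stem)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_cc_filename("abc123-F001-L002.pdf"))
    print(parse_cc_filename("notes.pdf"))
```

Files that return None could be listed under the Error count instead of stopping the batch.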

Project Structure

usecase1_land_record_ocr.py          --> UC1: Land Record OCR (Streamlit app)
usecase2_photo_verification.py       --> UC2: CC Photo Verification (Streamlit app)
paddleocr_pdf_to_json_demo.py        --> PaddleOCR subprocess worker (Python 3.12)
.env                                 --> CXAI_API_KEY (not committed)

venv312/                             --> Python 3.12 virtual environment

cc_data_final/                       --> Sample CC training PDFs
uploads/                             --> Uploaded documents (auto-created)
output/                              --> Extraction results (auto-created)
  ocr_output.json                        PaddleOCR raw output
  raw_combined.json                      UC1 raw pipeline result
  prep_combined.json                     UC1 preprocessed pipeline result
  comparative_output.json                UC1 final merged output + semantic graph
  cc_verification_results.csv            UC2 batch results
  uc1_extraction_results.csv             UC1 batch results

Architecture

UC1 Pipeline Flow

PDF/Image
    |
    v
Quality Gate (OpenCV)
    |
    +---> [Enhancement if needed]
    |                   |
    v                   v
PaddleOCR (subprocess)  +  GPT-4 Vision (API)
    |                           |
    +-------------+-------------+
                  |
                  v
          GPT-4o-mini Merge
                  |
                  v
          Structured JSON Output
                  |
                  v
          GPT-4o-mini Semantic Analysis
                  |
                  v
          Ownership Chain + Knowledge Graph

UC2 Pipeline Flow

PDF
    |
    v
Extract Page 3 (training photo)
    |
    v
OpenCV Quality Check
    |
    v
GPT-4 Vision (scene analysis + overlay extraction)
    |
    v
Accept / Reject Decision

Tech Stack

Component	Technology
UI	Streamlit
OCR (offline)	PaddleOCR via Python 3.12 subprocess
Vision AI	GPT-4 Vision via CXAI Playground API
Data merge	GPT-4o-mini
Image processing	OpenCV, Pillow
PDF rendering	pypdfium2
Graph visualisation	Graphviz (built-in Streamlit support)
Languages	Marathi, Hindi, English (Devanagari script)

Setup

Prerequisites

Python 3.12+ (for PaddleOCR subprocess)
Python 3.10+ (for Streamlit UI)
CXAI_API_KEY in .env

Install Dependencies

python -m venv venv312
venv312\Scripts\activate
pip install streamlit pandas opencv-python-headless numpy Pillow pypdfium2 requests python-dotenv paddleocr paddlepaddle

Design Decisions

Self-contained files - Each use case is a single .py file with zero cross-imports. Easy to deploy independently.
Quality gate before extraction - Prevents wasting API calls on unreadable documents.
Parallel extraction - PaddleOCR and Vision API run concurrently in Combined mode to cut wall-clock time.
CSV with file-lock retry - Handles Windows file-locking when CSV is open in Excel during batch runs.
Preprocessing is optional - The quality gate decides whether enhancement is offered. Over-processing clean documents can degrade extraction accuracy.
Semantic analysis is on-demand - Runs only when the user clicks the button, avoiding unnecessary API costs.
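
The file-lock retry for CSV writes might look like the sketch below. On Windows, writing to a file that is open in Excel typically raises `PermissionError`, so the write is retried a few times before giving up. The function name, the retry count, the delay, and the `opener` parameter (added so the behaviour can be exercised without a real lock) are all assumptions.

```python
import time


def write_with_retry(path, text, attempts=5, delay_seconds=1.0, opener=open):
    """Write text to path, retrying while the file is locked.

    Returns True on success and False if every attempt hit a lock.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(path, "w", newline="", encoding="utf-8") as handle:
                handle.write(text)
            return True
        except PermissionError:
            # Typically means another program (e.g. Excel) holds the file open.
            if attempt < attempts:
                time.sleep(delay_seconds)
    return False


if __name__ == "__main__":
    ok = write_with_retry("results_demo.csv", "file,status\na.pdf,ACCEPTED\n")
    print("saved" if ok else "file is locked; close it and retry")
```

Returning False instead of raising lets a batch run continue and report the failed save at the end.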
