virajmandlik/POC-Test

Farmers for Forests - Document Automation POC

Cisco x F4F Hackathon | March 2026

Automation toolkit for Farmers for Forests (F4F) farmer onboarding workflows. Two Streamlit apps that replace manual document verification with AI-powered pipelines.


Quick Start

# Activate the virtual environment (Windows; on macOS/Linux use: source venv312/bin/activate)
venv312\Scripts\activate

# Run Use Case 1 - Land Record OCR
streamlit run usecase1_land_record_ocr.py

# Run Use Case 2 - CC Photo Verification
streamlit run usecase2_photo_verification.py

Requirement: Set CXAI_API_KEY in .env for Vision and Combined modes.
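
A minimal sketch of how the key requirement could be enforced per mode. The function name `require_api_key` and the mode strings are assumptions for illustration; the actual apps may check the key differently (for example after loading `.env` with python-dotenv, which is in the install list).

```python
import os
from typing import Mapping, Optional

# Modes that call the hosted Vision API. Assumption: the offline PaddleOCR
# mode is the only one that works without a key, as the README states.
MODES_NEEDING_KEY = {"vision", "combined"}


def require_api_key(mode: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the CXAI key for online modes, or None for the offline mode."""
    if mode not in MODES_NEEDING_KEY:
        return None
    key = env.get("CXAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            f"CXAI_API_KEY is not set; add it to .env before using {mode!r} mode"
        )
    return key
```

Failing early with a clear message avoids starting a batch run that would only error on the first API call.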


Use Case 1 - Land Record OCR and Extraction

File: usecase1_land_record_ocr.py

Extracts structured data from Maharashtra 7/12 (Saat-Baara) land record documents (PDFs or images) into a standardised JSON schema.

Three Extraction Modes

  • PaddleOCR (Offline) - Runs paddleocr_pdf_to_json_demo.py as a subprocess. No API key needed.
  • Vision (Online) - Sends document image to GPT-4 Vision API for direct extraction.
  • Combined (Recommended) - PaddleOCR + Vision run in parallel. GPT-4o-mini merges both results into a single output, resolving conflicts between sources.
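
The Combined mode can be sketched roughly as follows. Everything here is a stand-in: `run_paddle`, `run_vision`, and `merge` are hypothetical placeholders, and the merge rule (Vision wins on conflict) only substitutes for the GPT-4o-mini merge so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def run_paddle(path: str) -> dict:
    # Hypothetical stand-in for launching paddleocr_pdf_to_json_demo.py.
    return {"village": "Example", "survey_number": "12"}


def run_vision(path: str) -> dict:
    # Hypothetical stand-in for the Vision API request.
    return {"village": "Example", "survey_number": "12/1", "tenure": "Class 1"}


def merge(paddle: dict, vision: dict) -> dict:
    # Placeholder for the model-based merge: Vision simply overrides PaddleOCR.
    return {**paddle, **vision}


def extract_combined(path: str) -> dict:
    """Run both extractors concurrently, then merge their results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        paddle_future = pool.submit(run_paddle, path)
        vision_future = pool.submit(run_vision, path)
        return merge(paddle_future.result(), vision_future.result())
```

Threads suit this case because both branches mostly wait, one on a subprocess and one on a network call, so wall-clock time approaches that of the slower branch.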

How It Works (Single Document)

  1. Upload a PDF or image
  2. Quality Gate checks blur, brightness, contrast, resolution, and skew
    • If PASS: skip preprocessing and go straight to extraction
    • If FAIL: interactive enhancement panel (contrast, denoise, deskew, threshold)
  3. Run extraction on raw document (and preprocessed version if applicable)
  4. Comparative analysis via GPT-4o-mini (raw vs preprocessed accuracy)
  5. Semantic and Knowledge Graph - infers ownership chain, maps encumbrances, and renders a visual relationship graph
  6. Final output saved as structured JSON
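
As an illustration of step 2, here is a pure-Python sketch of a quality gate over a grayscale image given as a list of pixel rows. It is not the app's OpenCV code: the thresholds are invented for the example, the Laplacian-variance blur score is a common technique assumed here, and the resolution and skew checks are omitted.

```python
from statistics import mean, pstdev, pvariance

# Illustrative thresholds only; the app's real cut-offs are not documented here.
MIN_BLUR_SCORE = 100.0
BRIGHTNESS_RANGE = (40.0, 220.0)
MIN_CONTRAST = 20.0


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over interior pixels (needs >= 3x3)."""
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def quality_gate(gray):
    """Return pass/fail plus the list of failed checks."""
    pixels = [p for row in gray for p in row]
    failures = []
    if laplacian_variance(gray) < MIN_BLUR_SCORE:
        failures.append("blurry")
    if not BRIGHTNESS_RANGE[0] <= mean(pixels) <= BRIGHTNESS_RANGE[1]:
        failures.append("bad brightness")
    if pstdev(pixels) < MIN_CONTRAST:
        failures.append("low contrast")
    return {"passed": not failures, "failures": failures}
```

A uniformly grey page fails as both blurry and low contrast, while a sharp black-and-white pattern passes; only a failing document would be routed to the enhancement panel.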

Semantic and Knowledge Graph

After extraction, GPT-4o-mini performs semantic analysis on the structured JSON to produce:

  • Original owner identification from mutation reference numbers
  • Ownership chain - who transferred land to whom, via which mutation, transfer type (inheritance / sale / partition)
  • Current owners with account numbers, area shares, and assessment amounts
  • Encumbrance mapping - each loan/mortgage linked to its specific owner and bank
  • Water resources - wells with owners and mutation references
  • Interactive Graphviz graph visualising the full ownership and encumbrance network

Graph node legend:

Node                Colour     Meaning
Yellow box          #FFF9C4    Original / historical owner
Green box (bold)    #C8E6C9    Current owner
Red hexagon         #FFCDD2    Bank encumbrance
Blue diamond        #B3E5FC    Well
Green ellipse       #C8E6C9    Land parcel

The semantic knowledge graph is included in the final saved JSON under the semantic_knowledge_graph key.
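
To make the legend concrete, the sketch below builds Graphviz DOT text using the shapes and colours listed above. The input structure (`nodes`, `edges`), the node kinds, and the sample names are hypothetical; the real shape of `semantic_knowledge_graph` is not shown in this README.

```python
# Shapes and fill colours taken from the node legend above.
STYLES = {
    "original_owner": 'shape=box, style=filled, fillcolor="#FFF9C4"',
    "current_owner": 'shape=box, style="filled,bold", fillcolor="#C8E6C9"',
    "bank": 'shape=hexagon, style=filled, fillcolor="#FFCDD2"',
    "well": 'shape=diamond, style=filled, fillcolor="#B3E5FC"',
    "parcel": 'shape=ellipse, style=filled, fillcolor="#C8E6C9"',
}


def to_dot(graph: dict) -> str:
    """Render a small ownership graph as Graphviz DOT text."""
    lines = ["digraph ownership {"]
    for node_id, (kind, label) in graph["nodes"].items():
        lines.append(f'  {node_id} [label="{label}", {STYLES[kind]}];')
    for src, dst, label in graph["edges"]:
        lines.append(f'  {src} -> {dst} [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical sample data, not taken from any real land record.
example = {
    "nodes": {
        "o1": ("original_owner", "Owner A"),
        "c1": ("current_owner", "Owner B"),
        "b1": ("bank", "Example Bank"),
        "p1": ("parcel", "Survey 12"),
    },
    "edges": [
        ("o1", "c1", "inheritance (mutation 101)"),
        ("c1", "p1", "owns"),
        ("b1", "c1", "loan"),
    ],
}
dot_source = to_dot(example)
```

In a Streamlit app, a DOT string like `dot_source` can be passed to `st.graphviz_chart` for rendering.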

Batch Processing

  • Upload multiple PDFs/images or scan a local folder
  • Queue-based processing with live progress table
  • Choose extraction mode per batch (combined / paddle / vision)
  • CSV export with key fields: district, taluka, village, survey numbers, owners, area
  • Expand any row to inspect full extraction JSON or run semantic analysis per document
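
The CSV export could look something like the sketch below. The column names follow the key fields listed above, but the nested layout of each result (`location`, `survey_numbers`, `owners`) and the helper names are assumptions made for this example.

```python
import csv
import io

CSV_COLUMNS = ["file", "district", "taluka", "village", "survey_numbers", "owners", "area"]


def to_row(filename: str, result: dict) -> dict:
    """Flatten one extraction result (assumed shape) into a CSV row."""
    location = result.get("location", {})
    return {
        "file": filename,
        "district": location.get("district", ""),
        "taluka": location.get("taluka", ""),
        "village": location.get("village", ""),
        "survey_numbers": "; ".join(result.get("survey_numbers", [])),
        "owners": "; ".join(o.get("name", "") for o in result.get("owners", [])),
        "area": result.get("total_area_hectare", ""),
    }


def export_csv(results: dict) -> str:
    """Return CSV text for a mapping of filename -> extraction result."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS)
    writer.writeheader()
    for filename, result in results.items():
        writer.writerow(to_row(filename, result))
    return buffer.getvalue()
```

Joining multi-valued fields with "; " keeps one document per row, which is usually easier to review in a spreadsheet.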

Extracted Fields

Category      Fields
Location      state, district, taluka, village, village_code, pu_id
Land          survey_number, sub_division, local_name, tenure
Owners        name, account_number, area_hectare, assessment_rupees, mutation_ref
Area          cultivable (jirayat / bagayat), uncultivable (class A / B), pot_kharab, total_area_hectare
Assessment    base_rupees, special_rupees, total_rupees
Mutation      last_number, last_date, pending, all_numbers
Encumbrances  type, bank_name, branch, amount_rupees, borrower_name, mutation_ref
Water         wells (owner, mutation), irrigation
Signature     date, verification_url, reference_number
Comparison    fields_differing, paddle_only, vision_only
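
One plausible reading of the Comparison fields is a key-level diff between the two extractors' outputs. The sketch below assumes flat dictionaries keyed by field name; the function name and the exact semantics used by the app are assumptions.

```python
def compare_extractions(paddle: dict, vision: dict) -> dict:
    """Summarise where two flat extraction results disagree."""
    shared = paddle.keys() & vision.keys()
    return {
        # Present in both results but with different values.
        "fields_differing": sorted(k for k in shared if paddle[k] != vision[k]),
        # Found by only one of the two extractors.
        "paddle_only": sorted(paddle.keys() - vision.keys()),
        "vision_only": sorted(vision.keys() - paddle.keys()),
    }
```

For example, if both sources report `survey_number` with different values, it would appear under `fields_differing`, which flags it for the merge step to resolve.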

Use Case 2 - CC Training Photo Verification

File: usecase2_photo_verification.py

Verifies photographic evidence submitted for carbon credit training sessions. Each photo must prove a training event occurred with identifiable participants, location, and timestamp.

Three Verification Checks

  1. Image Quality (OpenCV) - Blur score, brightness, contrast
  2. Scene Analysis (GPT-4 Vision) - People count, F4F representative present, training context, outdoor/rural setting
  3. Metadata Extraction (GPT-4 Vision) - GPS coordinates and date/time from photo overlay (e.g. GPS Map Camera app)

Accept / Reject Logic

A photo is ACCEPTED only if ALL of these pass:

  • Image quality is acceptable (not blurry, not too dark/bright)
  • Multiple people are visible in the frame
  • Scene looks like a training session (not a selfie, not indoors in a city)
  • GPS coordinates found in photo overlay
  • Date/time found in photo overlay

If any check fails, the photo is REJECTED with specific reasons.
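
The rule above can be sketched as a single function that collects every failure reason. The `PhotoChecks` fields and the reading of "multiple people" as at least two are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PhotoChecks:
    # Hypothetical container for the results of the three verification checks.
    quality_ok: bool
    people_count: int
    is_training_scene: bool
    gps: Optional[Tuple[float, float]]
    timestamp: Optional[str]


def verdict(checks: PhotoChecks) -> dict:
    """ACCEPTED only if every check passes; otherwise list all reasons."""
    reasons = []
    if not checks.quality_ok:
        reasons.append("image quality is not acceptable")
    if checks.people_count < 2:
        reasons.append("fewer than two people visible")
    if not checks.is_training_scene:
        reasons.append("scene does not look like a training session")
    if checks.gps is None:
        reasons.append("GPS coordinates not found in overlay")
    if checks.timestamp is None:
        reasons.append("date/time not found in overlay")
    return {"decision": "REJECTED" if reasons else "ACCEPTED", "reasons": reasons}
```

Collecting all reasons, rather than stopping at the first failure, lets a reviewer see everything wrong with a photo in one pass.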

Single Photo Mode

Upload a JPEG/PNG and walk through 4 tabs:

  1. Upload Photo
  2. Quality Check results
  3. Scene and Metadata analysis (with GPS map)
  4. Final Verdict (Accept/Reject with full breakdown)

Batch PDF Mode

  • Upload CC training PDFs (filename pattern: {surrogate_key}-{FID}-{LID}.pdf)
  • Automatically extracts training photo from page 3 of each PDF
  • Live-updating results table with Accept / Reject / Error counts
  • CSV export with all verification fields
  • Click any row to see the extracted photo and full result JSON
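
A small sketch of parsing the filename pattern `{surrogate_key}-{FID}-{LID}.pdf`. It assumes FID and LID contain no hyphens (the surrogate key may), which is why the stem is split from the right; the names `TrainingPdf` and `parse_training_filename` are hypothetical.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class TrainingPdf(NamedTuple):
    surrogate_key: str
    fid: str
    lid: str


def parse_training_filename(name: str) -> Optional[TrainingPdf]:
    """Split '{surrogate_key}-{FID}-{LID}.pdf'; return None if it does not match."""
    path = Path(name)
    if path.suffix.lower() != ".pdf":
        return None
    # Split from the right so hyphens inside the surrogate key are preserved.
    parts = path.stem.rsplit("-", 2)
    if len(parts) != 3 or not all(parts):
        return None
    return TrainingPdf(*parts)
```

Files that do not match could be reported in the Error count instead of being silently skipped.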

Project Structure

usecase1_land_record_ocr.py          --> UC1: Land Record OCR (Streamlit app)
usecase2_photo_verification.py       --> UC2: CC Photo Verification (Streamlit app)
paddleocr_pdf_to_json_demo.py        --> PaddleOCR subprocess worker (Python 3.12)
.env                                 --> CXAI_API_KEY (not committed)

venv312/                             --> Python 3.12 virtual environment

cc_data_final/                       --> Sample CC training PDFs
uploads/                             --> Uploaded documents (auto-created)
output/                              --> Extraction results (auto-created)
  ocr_output.json                        PaddleOCR raw output
  raw_combined.json                      UC1 raw pipeline result
  prep_combined.json                     UC1 preprocessed pipeline result
  comparative_output.json                UC1 final merged output + semantic graph
  cc_verification_results.csv            UC2 batch results
  uc1_extraction_results.csv             UC1 batch results

Architecture

UC1 Pipeline Flow

PDF/Image
    |
    v
Quality Gate (OpenCV)
    |
    +---> [Enhancement if needed]
    |                   |
    v                   v
PaddleOCR (subprocess)  +  GPT-4 Vision (API)
    |                           |
    +-------------+-------------+
                  |
                  v
          GPT-4o-mini Merge
                  |
                  v
          Structured JSON Output
                  |
                  v
          GPT-4o-mini Semantic Analysis
                  |
                  v
          Ownership Chain + Knowledge Graph

UC2 Pipeline Flow

PDF
    |
    v
Extract Page 3 (training photo)
    |
    v
OpenCV Quality Check
    |
    v
GPT-4 Vision (scene analysis + overlay extraction)
    |
    v
Accept / Reject Decision

Tech Stack

Component            Technology
UI                   Streamlit
OCR (offline)        PaddleOCR via Python 3.12 subprocess
Vision AI            GPT-4 Vision via CXAI Playground API
Data merge           GPT-4o-mini
Image processing     OpenCV, Pillow
PDF rendering        pypdfium2
Graph visualisation  Graphviz (built-in Streamlit support)
Languages            Marathi, Hindi, English (Devanagari script)

Setup

Prerequisites

  • Python 3.12+ (for PaddleOCR subprocess)
  • Python 3.10+ (for Streamlit UI)
  • CXAI_API_KEY in .env

Install Dependencies

python -m venv venv312
venv312\Scripts\activate
pip install streamlit pandas opencv-python-headless numpy Pillow pypdfium2 requests python-dotenv paddleocr paddlepaddle

Design Decisions

  • Self-contained files - Each use case is a single .py file with zero cross-imports. Easy to deploy independently.
  • Quality gate before extraction - Prevents wasting API calls on unreadable documents.
  • Parallel extraction - PaddleOCR and Vision API run concurrently in Combined mode to cut wall-clock time.
  • CSV with file-lock retry - Handles Windows file-locking when CSV is open in Excel during batch runs.
  • Preprocessing is optional - The quality gate decides. Over-processing clean documents can degrade extraction accuracy.
  • Semantic analysis is on-demand - Runs only when the user clicks the button, avoiding unnecessary API costs.
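
The file-lock retry mentioned above might be implemented along these lines. This is a sketch under assumptions: the function name, attempt count, and delay are invented, and it relies on Windows raising `PermissionError` when the CSV is open in Excel.

```python
import csv
import time


def append_row_with_retry(path, row, attempts=5, delay=0.5, sleep=time.sleep):
    """Append one CSV row, retrying while the file is locked.

    Returns the attempt number that succeeded; re-raises after the last try.
    """
    for attempt in range(1, attempts + 1):
        try:
            with open(path, "a", newline="", encoding="utf-8") as handle:
                csv.writer(handle).writerow(row)
            return attempt
        except PermissionError:
            if attempt == attempts:
                raise
            sleep(delay)  # give the user time to close the file
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.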
