ThaDeveloper/smart-ocr

Smart OCR

smart-ocr is a Node.js OCR library for:

  • text-based PDFs
  • scanned PDFs
  • mixed PDFs with both text-native and scanned pages
  • PNG and other common raster image formats
  • optional AI-assisted structured output from extracted OCR text

For PDFs, each page is handled independently. If a page already contains selectable text, Smart OCR extracts it directly. If a page is image-only, it renders the page and falls back to OCR.
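
The per-page decision described above can be sketched as follows. This is a minimal illustration, not the library's internal code: `planPages`, the `embeddedText` field, and the `minChars` threshold are hypothetical names chosen for this example, and the real check for "selectable text" may differ.

```javascript
// Hypothetical helper, not part of smart-ocr's API.
// Decides, page by page, whether embedded text can be used directly
// or whether the page should be rendered and sent to OCR.
function planPages(pages, minChars = 1) {
  return pages.map((page, index) => {
    // Treat whitespace-only text as "no selectable text" (an assumption).
    const text = (page.embeddedText ?? "").trim();
    return {
      pageNumber: index + 1,
      strategy: text.length >= minChars ? "extract-text" : "render-and-ocr",
    };
  });
}

// A mixed document: one text-native page followed by two image-only pages.
const plan = planPages([
  { embeddedText: "Invoice #1234" },
  { embeddedText: "" },
  { embeddedText: "   " },
]);

console.log(plan.map((p) => p.strategy));
// [ 'extract-text', 'render-and-ocr', 'render-and-ocr' ]
```

Because each page is planned on its own, a single scanned page does not force the whole document through OCR.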

Requirements

  • Node.js >=20.6.0

This package is designed for Node.js. It is not set up for browser use.

Installation

npm install smart-ocr

Quick Start

import { SmartOCR } from "smart-ocr";
import path from "node:path";
import { fileURLToPath } from "node:url";

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

const ocr = new SmartOCR({ language: "eng", workerCount: 2 });

try {
  const pdfText = await ocr.processPDF(path.join(__dirname, "sample-scanned.pdf"));
  console.log(pdfText);
} finally {
  await ocr.terminate();
}

Structured Output

Smart OCR can optionally turn extracted text into structured JSON.

  • OCR still runs first
  • the extracted text is then sent to an AI model to produce structured output

When structuredOutputOptions.ai is configured, processFile(), processPDF(), and processImage() return a JSON object instead of a plain text string.
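
Since the return type depends on configuration, code that is shared between both modes may want to branch on it. A minimal sketch, assuming only what is stated above (a string in plain mode, an object in AI mode); `describeResult` is a hypothetical helper, not part of the library:

```javascript
// Hypothetical helper: wraps a processing result in a tagged shape
// so callers can handle plain text and structured JSON uniformly.
function describeResult(result) {
  if (typeof result === "string") {
    return { kind: "text", text: result };
  }
  return { kind: "structured", data: result };
}

console.log(describeResult("Hello from page 1").kind); // text
console.log(describeResult({ fullName: "Jane Doe", idNumber: null }).kind); // structured
```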

Supported providers:

  • openai - uses structured outputs (response_format: json_schema)
  • anthropic - uses tool use to enforce schema-shaped output
  • gemini - uses responseMimeType: "application/json" with responseSchema

Example (OpenAI):

import { SmartOCR } from "smart-ocr";

const ocr = new SmartOCR({
  language: "eng",
  structuredOutputOptions: {
    ai: {
      provider: "openai",
      model: "gpt-4.1-mini",
      apiKey: process.env.OPENAI_API_KEY,
      prompt: "Extract the document fields. Use null when a value is missing or unclear.",
    },
    schema: {
      type: "object",
      properties: {
        fullName: { type: ["string", "null"] },
        idNumber: { type: ["string", "null"] },
        dateOfBirth: { type: ["string", "null"] },
        sex: { type: ["string", "null"] },
      },
      required: ["fullName", "idNumber", "dateOfBirth", "sex"],
      additionalProperties: false,
    },
  },
});

try {
  const result = await ocr.processFile("./id.pdf");
  console.log(result);
} finally {
  await ocr.terminate();
}

Example (Anthropic):

import { SmartOCR } from "smart-ocr";

const ocr = new SmartOCR({
  structuredOutputOptions: {
    ai: {
      provider: "anthropic",
      model: "claude-opus-4-5",
      apiKey: process.env.ANTHROPIC_API_KEY,
    },
    schema: {
      type: "object",
      properties: {
        fullName: { type: ["string", "null"] },
        idNumber: { type: ["string", "null"] },
      },
      required: ["fullName", "idNumber"],
    },
  },
});

Example (Gemini):

import { SmartOCR } from "smart-ocr";

const ocr = new SmartOCR({
  structuredOutputOptions: {
    ai: {
      provider: "gemini",
      model: "gemini-2.0-flash",
      apiKey: process.env.GOOGLE_API_KEY,
    },
    schema: {
      type: "object",
      properties: {
        fullName: { type: ["string", "null"] },
        idNumber: { type: ["string", "null"] },
      },
      required: ["fullName", "idNumber"],
    },
  },
});

Notes for AI mode:

  • apiKey is required for all providers
  • prompt overrides the default extraction instruction
  • schema should be a JSON schema describing the object you want back
  • for OpenAI strict mode, required must list every key in properties
  • Gemini schemas are automatically normalized: array type values (e.g. ["string", "null"]) are converted to nullable: true, and unsupported fields like additionalProperties are stripped
  • when AI mode is enabled, the raw OCR text is not returned by these methods
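
The OpenAI strict-mode rule above (every key in properties must also appear in required) is easy to check before sending a request. A minimal sketch for top-level properties only; `findMissingRequiredKeys` is a hypothetical helper and not something the library exports:

```javascript
// Hypothetical helper: returns the property names that are declared in
// `properties` but missing from `required` (top level only).
function findMissingRequiredKeys(schema) {
  const propertyKeys = Object.keys(schema.properties ?? {});
  const required = new Set(schema.required ?? []);
  return propertyKeys.filter((key) => !required.has(key));
}

const schema = {
  type: "object",
  properties: {
    fullName: { type: ["string", "null"] },
    idNumber: { type: ["string", "null"] },
    sex: { type: ["string", "null"] },
  },
  required: ["fullName", "idNumber"], // "sex" is missing on purpose
  additionalProperties: false,
};

console.log(findMissingRequiredKeys(schema)); // [ 'sex' ]
```

Optional values are still expressible in strict mode by keeping the key in required and allowing "null" in its type, as the examples above do.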

Reference

new SmartOCR(options?)

Creates an OCR processor.

Options:

  • language: Tesseract language or language list. Default: "eng"
  • pdfRenderScale: render scale used before OCR on scanned PDF pages. Default: 2
  • workerOptions: options passed to the Tesseract worker, such as langPath, cachePath, or logger
  • workerCount: number of OCR workers to run in parallel
  • structuredOutputOptions: optional AI configuration for returning structured JSON instead of plain text
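
As an illustration, the options above might be combined like this. The values are examples only, and the logger body is an assumption about how you might want to log worker messages:

```javascript
// Illustrative options object; values are examples, not recommendations.
const options = {
  language: ["eng", "spa"], // Tesseract language codes
  pdfRenderScale: 2,        // the documented default
  workerCount: 2,           // OCR workers running in parallel
  workerOptions: {
    // Passed through to the Tesseract worker.
    logger: (message) => console.log(message),
  },
};

// Would be passed to the constructor:
// const ocr = new SmartOCR(options);
```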

Language codes use Tesseract traineddata identifiers, not 2-letter locale codes. For example:

  • "eng" for English
  • "spa" for Spanish
  • "fra" for French
  • ["eng", "spa"] for multilingual OCR

Use "eng", not "en".
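
If your application stores 2-letter locale codes, a small lookup can translate them before they reach the constructor. This is a sketch, not a library feature: `toTesseractLanguage` is a hypothetical helper, and the table covers only the three languages listed above.

```javascript
// Hypothetical mapping from 2-letter locale codes to Tesseract codes.
// Only the languages mentioned in this README are included.
const LOCALE_TO_TESSERACT = new Map([
  ["en", "eng"],
  ["es", "spa"],
  ["fr", "fra"],
]);

function toTesseractLanguage(code) {
  const normalized = code.trim().toLowerCase();
  // Already a known Tesseract code: pass it through unchanged.
  if ([...LOCALE_TO_TESSERACT.values()].includes(normalized)) {
    return normalized;
  }
  // Strip a region suffix such as "en-US" or "en_GB" before the lookup.
  const base = normalized.split(/[-_]/)[0];
  const mapped = LOCALE_TO_TESSERACT.get(base);
  if (!mapped) {
    throw new Error(`No Tesseract language mapping for "${code}"`);
  }
  return mapped;
}

console.log(toTesseractLanguage("en"));    // eng
console.log(toTesseractLanguage("fr-FR")); // fra
console.log(toTesseractLanguage("spa"));   // spa
```

Throwing on unknown codes is a design choice for this sketch: it surfaces a missing mapping early instead of silently falling back to English.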

structuredOutputOptions shape:

  • ai.provider: AI provider name. One of "openai", "anthropic", or "gemini"
  • ai.model: model name to call for structured extraction
  • ai.apiKey: API key for the chosen provider
  • ai.prompt: optional custom extraction prompt
  • schema: JSON schema describing the expected response object. Gemini schemas are automatically normalized from JSON Schema to Gemini's OpenAPI 3.0 subset.
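
To make the Gemini normalization concrete, here is a rough sketch of the transformation described above: array type values become a single type plus nullable: true, and additionalProperties is dropped. `normalizeForGemini` is a hypothetical function written for this example. The library's actual implementation is not shown here and may handle more cases.

```javascript
// Hypothetical sketch of the normalization; not the library's own code.
function normalizeForGemini(schema) {
  // Pull out the fields that need special handling; drop additionalProperties.
  const { additionalProperties, type, properties, items, ...rest } = schema;
  const result = { ...rest };

  if (Array.isArray(type)) {
    // ["string", "null"] -> type: "string", nullable: true
    // Simplification: if several non-null types are listed, keep the first.
    const nonNull = type.filter((t) => t !== "null");
    if (nonNull.length > 0) result.type = nonNull[0];
    if (nonNull.length < type.length) result.nullable = true;
  } else if (type !== undefined) {
    result.type = type;
  }

  // Recurse into nested object properties and array items.
  if (properties) {
    result.properties = Object.fromEntries(
      Object.entries(properties).map(([name, sub]) => [name, normalizeForGemini(sub)]),
    );
  }
  if (items) result.items = normalizeForGemini(items);

  return result;
}

const normalized = normalizeForGemini({
  type: "object",
  properties: {
    fullName: { type: ["string", "null"] },
    idNumber: { type: "string" },
  },
  required: ["fullName", "idNumber"],
  additionalProperties: false,
});

console.log(JSON.stringify(normalized, null, 2));
```

In the output, fullName carries type "string" with nullable set to true, idNumber is unchanged, and additionalProperties no longer appears.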

processFile(filePath)

Routes a supported file to the correct handler based on file extension.

Returns:

  • extracted text by default
  • structured JSON when structuredOutputOptions.ai is configured

Supported extensions:

  • .pdf
  • .png
  • .jpg
  • .jpeg
  • .tif
  • .tiff
  • .bmp
  • .webp
  • .gif
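
If you want to filter inputs before calling processFile(), the list above can be turned into a simple check. This is a sketch under assumptions: `isSupportedFile` and `getExtension` are hypothetical helpers, and the case-insensitive comparison is this example's choice rather than documented library behavior.

```javascript
// Extensions copied from the list above.
const SUPPORTED_EXTENSIONS = new Set([
  ".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".webp", ".gif",
]);

// Hypothetical helper: returns the lowercased extension (with the dot),
// or "" when the file name has none.
function getExtension(filePath) {
  const fileName = filePath.split(/[\\/]/).pop() ?? "";
  const dot = fileName.lastIndexOf(".");
  return dot > 0 ? fileName.slice(dot).toLowerCase() : "";
}

// Hypothetical helper: true when the extension is in the supported list.
function isSupportedFile(filePath) {
  return SUPPORTED_EXTENSIONS.has(getExtension(filePath));
}

console.log(isSupportedFile("./scans/id.PDF")); // true
console.log(isSupportedFile("notes.txt"));      // false
```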

processPDF(pdfPath)

Extracts text from a PDF. Text-native pages are read directly. Scanned pages are rendered to images and OCRed.

The OCR language only affects scanned/image-only pages. If a PDF page already contains selectable text, Smart OCR returns that embedded text directly instead of re-OCRing it.

Returns:

  • extracted text by default
  • structured JSON when structuredOutputOptions.ai is configured

processImage(imagePath)

Runs OCR on an image file.

Returns:

  • extracted text by default
  • structured JSON when structuredOutputOptions.ai is configured

init(language?)

Eagerly initializes the Tesseract worker. This is optional because processing methods initialize on demand.

If you pass a language to init(language), Smart OCR keeps using that language for later OCR calls until you switch it again or create a new instance.

terminate()

Terminates the Tesseract worker and frees resources.

Notes

  • Smart OCR is optimized for Node.js workloads, not browser runtimes.
  • Rendering uses @napi-rs/canvas, which avoids the extra Cairo system setup required by canvas.
  • Scanned PDFs are preprocessed before OCR so sparse content, such as ID cards on large blank pages, is easier to detect.
  • Structured output is an optional post-processing step on top of OCR, not a replacement for OCR itself.
  • AI mode supports OpenAI, Anthropic, and Gemini.
  • OCR quality still depends on the source document quality, scan resolution, and language data.
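
The README does not describe the preprocessing step in detail. As a generic illustration of why sparse pages benefit from it, the sketch below finds the bounding box of non-white pixels so that a small ID card on a large blank page could be cropped before OCR. `findContentBounds`, the grayscale 2D array input, and the `whiteThreshold` value are all assumptions made for this example and are not the library's actual approach.

```javascript
// Hypothetical helper: scans a grayscale image (rows of 0-255 values)
// and returns the bounding box of pixels darker than the threshold.
function findContentBounds(pixels, whiteThreshold = 240) {
  let top = Infinity;
  let left = Infinity;
  let bottom = -1;
  let right = -1;

  for (let y = 0; y < pixels.length; y++) {
    for (let x = 0; x < pixels[y].length; x++) {
      if (pixels[y][x] < whiteThreshold) {
        if (y < top) top = y;
        if (y > bottom) bottom = y;
        if (x < left) left = x;
        if (x > right) right = x;
      }
    }
  }

  // No dark pixels at all: the page is blank.
  if (bottom === -1) return null;
  return { left, top, width: right - left + 1, height: bottom - top + 1 };
}

// A mostly white 5x4 "page" with a small dark region.
const page = [
  [255, 255, 255, 255, 255],
  [255, 255, 30, 40, 255],
  [255, 255, 35, 255, 255],
  [255, 255, 255, 255, 255],
];

console.log(findContentBounds(page)); // { left: 2, top: 1, width: 2, height: 2 }
```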

Development

npm run typecheck
npm run lint
npm test
npm run build
npm run sample

npm run sample builds the library and runs it against the bundled sample files in src/.

License

MIT

About

OCR library for scanned and text-based PDFs and for images, built on tesseract.js, with optional AI-powered structured output.
