Smart OCR

smart-ocr is a Node.js OCR library for:

text-based PDFs
scanned PDFs
mixed PDFs with both text-native and scanned pages
PNG and other common raster image formats
optional AI-assisted structured output from extracted OCR text

For PDFs, each page is handled independently. If a page already contains selectable text, Smart OCR extracts that text directly. If a page is image-only, Smart OCR renders it to an image and runs OCR on it instead.
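The per-page decision can be sketched roughly as follows. This is a hypothetical illustration, not smart-ocr's internal code. The page objects ({ number, text }) and the rule that any non-whitespace text counts as "text-native" are assumptions made for the example.

```javascript
// Hypothetical sketch of the per-page strategy described above.
// Page objects and the threshold are illustrative assumptions, not smart-ocr API.
const MIN_TEXT_CHARS = 1; // assumed: any non-whitespace text means "text-native"

function choosePageStrategy(page) {
  const visibleText = (page.text ?? "").trim();
  return visibleText.length >= MIN_TEXT_CHARS ? "extract-text" : "render-and-ocr";
}

function planPdf(pages) {
  // Each page is classified on its own, so a mixed PDF yields a mixed plan.
  return pages.map((page) => ({ page: page.number, strategy: choosePageStrategy(page) }));
}

const plan = planPdf([
  { number: 1, text: "Invoice #123\nTotal: $40" },
  { number: 2, text: "" }, // scanned page: no selectable text
  { number: 3, text: "   \n" }, // whitespace only, treated as scanned
]);
console.log(plan);
```

Because the check is per page, only pages 2 and 3 in this example would pay the cost of rendering and OCR.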

Requirements

Node.js >=20.6.0

This package is designed for Node.js. It is not set up for browser use.

Installation

npm install smart-ocr

Quick Start

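import { SmartOCR } from "smart-ocr";
import path from "node:path";
import { fileURLToPath } from "node:url";

// ES modules have no __dirname, so derive it from import.meta.url.
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

// English OCR with up to two OCR workers running in parallel.
const ocr = new SmartOCR({ language: "eng", workerCount: 2 });

try {
  // Top-level await requires an ES module (.mjs, or "type": "module" in package.json).
  const pdfText = await ocr.processPDF(path.join(__dirname, "sample-scanned.pdf"));
  console.log(pdfText);
} finally {
  // Always release the Tesseract workers, even if processing throws.
  await ocr.terminate();
}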

Structured Output

Smart OCR can optionally turn extracted text into structured JSON. This happens in two steps:

OCR still runs first, exactly as in plain-text mode
the extracted text is then sent to an AI model, which returns an object shaped by your schema

When structuredOutputOptions.ai is configured, processFile(), processPDF(), and processImage() return a JSON object instead of a plain text string.
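Since the same methods return a string in plain mode and an object in AI mode, calling code that supports both configurations may want to branch on the result's runtime type. A minimal sketch; describeResult is a hypothetical helper, not part of smart-ocr:

```javascript
// Hypothetical helper: plain mode yields a string, AI mode yields an object,
// so downstream code can branch on the runtime type of the result.
function describeResult(result) {
  if (typeof result === "string") {
    return { mode: "text", preview: result.slice(0, 40) };
  }
  if (result !== null && typeof result === "object") {
    return { mode: "structured", keys: Object.keys(result) };
  }
  throw new TypeError(`Unexpected OCR result type: ${typeof result}`);
}

console.log(describeResult("JOHN DOE\nID 12345"));
console.log(describeResult({ fullName: "John Doe", idNumber: "12345" }));
```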

Supported providers:

openai - uses structured outputs (response_format: json_schema)
anthropic - uses tool use to enforce schema-shaped output
gemini - uses responseMimeType: "application/json" with responseSchema

Example (OpenAI):

import { SmartOCR } from "smart-ocr";

const ocr = new SmartOCR({
  language: "eng",
  structuredOutputOptions: {
    ai: {
      provider: "openai",
      model: "gpt-4.1-mini",
      apiKey: process.env.OPENAI_API_KEY,
      prompt: "Extract the document fields. Use null when a value is missing or unclear.",
    },
    schema: {
      type: "object",
      properties: {
        fullName: { type: ["string", "null"] },
        idNumber: { type: ["string", "null"] },
        dateOfBirth: { type: ["string", "null"] },
        sex: { type: ["string", "null"] },
      },
      required: ["fullName", "idNumber", "dateOfBirth", "sex"],
      additionalProperties: false,
    },
  },
});

try {
  const result = await ocr.processFile("./id.pdf");
  console.log(result);
} finally {
  await ocr.terminate();
}

Example (Anthropic):

const ocr = new SmartOCR({
  structuredOutputOptions: {
    ai: {
      provider: "anthropic",
      model: "claude-opus-4-5",
      apiKey: process.env.ANTHROPIC_API_KEY,
    },
    schema: {
      type: "object",
      properties: {
        fullName: { type: ["string", "null"] },
        idNumber: { type: ["string", "null"] },
      },
      required: ["fullName", "idNumber"],
    },
  },
});

Example (Gemini):

const ocr = new SmartOCR({
  structuredOutputOptions: {
    ai: {
      provider: "gemini",
      model: "gemini-2.0-flash",
      apiKey: process.env.GOOGLE_API_KEY,
    },
    schema: {
      type: "object",
      properties: {
        fullName: { type: ["string", "null"] },
        idNumber: { type: ["string", "null"] },
      },
      required: ["fullName", "idNumber"],
    },
  },
});

Notes for AI mode:

apiKey is required for all providers
prompt overrides the default extraction instruction
schema should be a JSON schema describing the object you want back
for OpenAI strict mode, required must list every key in properties, and every object schema must set additionalProperties: false (as the OpenAI example does)
Gemini schemas are automatically normalized: array type values (e.g. ["string", "null"]) are converted to nullable: true, and unsupported fields like additionalProperties are stripped
when AI mode is enabled, the raw OCR text is not returned by these methods
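Because a strict-mode schema is rejected when a property is missing from required, it can help to check this locally before any API call. The sketch below is a hypothetical pre-flight check, not a smart-ocr feature, and it only inspects the top-level object (nested objects are not checked).

```javascript
// Hypothetical pre-flight check for OpenAI strict mode: every key in
// `properties` must also appear in `required`. Runs locally; no API calls.
// Only the top-level object is inspected.
function missingRequiredKeys(schema) {
  const propertyKeys = Object.keys(schema.properties ?? {});
  const required = new Set(schema.required ?? []);
  return propertyKeys.filter((key) => !required.has(key));
}

const schema = {
  type: "object",
  properties: {
    fullName: { type: ["string", "null"] },
    idNumber: { type: ["string", "null"] },
    sex: { type: ["string", "null"] },
  },
  required: ["fullName", "idNumber"],
  additionalProperties: false,
};

console.log(missingRequiredKeys(schema)); // [ 'sex' ]
```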

Reference

`new SmartOCR(options?)`

Creates an OCR processor.

Options:

language: Tesseract language or language list. Default: "eng"
pdfRenderScale: scale factor applied when rendering scanned PDF pages to images before OCR. Default: 2
workerOptions: options passed to the Tesseract worker, such as langPath, cachePath, or logger
workerCount: number of OCR workers to run in parallel
structuredOutputOptions: optional AI configuration for returning structured JSON instead of plain text

Language codes use Tesseract traineddata identifiers, not 2-letter locale codes. For example:

"eng" for English
"spa" for Spanish
"fra" for French
["eng", "spa"] for multilingual OCR

Use "eng", not "en".
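If your application stores 2-letter locale codes, one option is to translate them before building the options object. This is a hypothetical helper, not part of smart-ocr, and its mapping table covers only the examples above; extend it for the traineddata you actually install.

```javascript
// Hypothetical helper (not part of smart-ocr): maps a few 2-letter locale codes
// to Tesseract traineddata identifiers before they are passed as `language`.
const LOCALE_TO_TESSERACT = new Map([
  ["en", "eng"],
  ["es", "spa"],
  ["fr", "fra"],
]);

function toTesseractLanguage(language) {
  const codes = Array.isArray(language) ? language : [language];
  const mapped = codes.map((code) => {
    const tesseractCode = LOCALE_TO_TESSERACT.get(code.toLowerCase());
    if (tesseractCode) return tesseractCode;
    if (code.length === 2) {
      // Unmapped 2-letter code: fail early rather than asking Tesseract for it.
      throw new Error(`"${code}" looks like a locale code; use a Tesseract id such as "eng"`);
    }
    return code; // assume it is already a traineddata id, e.g. "spa" or "chi_sim"
  });
  return Array.isArray(language) ? mapped : mapped[0];
}

console.log(toTesseractLanguage("en")); // eng
console.log(toTesseractLanguage(["en", "spa"])); // [ 'eng', 'spa' ]
```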

structuredOutputOptions shape:

ai.provider: AI provider name. One of "openai", "anthropic", or "gemini"
ai.model: model name to call for structured extraction
ai.apiKey: API key for the chosen provider
ai.prompt: optional custom extraction prompt
schema: JSON schema describing the expected response object. Gemini schemas are automatically normalized from JSON Schema to Gemini's OpenAPI 3.0 subset.
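To make the Gemini normalization concrete, here is an illustrative sketch of that kind of conversion. It is NOT smart-ocr's implementation. It handles only the two rules stated in this README (type arrays containing "null" become nullable: true, and additionalProperties is dropped) and assumes at most one non-null type per field.

```javascript
// Illustrative JSON Schema -> Gemini-style conversion (not smart-ocr's code).
// Handles only `type` arrays with "null" and the `additionalProperties` keyword.
function toGeminiSchema(schema) {
  if (Array.isArray(schema)) return schema.map(toGeminiSchema);
  if (schema === null || typeof schema !== "object") return schema;

  const out = {};
  for (const [key, value] of Object.entries(schema)) {
    if (key === "additionalProperties") continue; // unsupported, so dropped
    if (key === "type" && Array.isArray(value)) {
      // ["string", "null"] -> type: "string", nullable: true
      out.type = value.find((t) => t !== "null");
      if (value.includes("null")) out.nullable = true;
    } else if (key === "properties") {
      // Property names are user data: convert each sub-schema, keep the names.
      out.properties = Object.fromEntries(
        Object.entries(value).map(([name, sub]) => [name, toGeminiSchema(sub)]),
      );
    } else {
      out[key] = toGeminiSchema(value);
    }
  }
  return out;
}

const geminiSchema = toGeminiSchema({
  type: "object",
  properties: { fullName: { type: ["string", "null"] } },
  required: ["fullName"],
  additionalProperties: false,
});
console.log(JSON.stringify(geminiSchema));
// {"type":"object","properties":{"fullName":{"type":"string","nullable":true}},"required":["fullName"]}
```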

`processFile(filePath)`

Routes a supported file to the correct handler based on file extension.

Returns:

extracted text by default
structured JSON when structuredOutputOptions.ai is configured

Supported extensions:

.pdf
.png
.jpg
.jpeg
.tif
.tiff
.bmp
.webp
.gif
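Extension-based routing like this can be sketched as below. The function returns the name of the method a file would be routed to rather than calling smart-ocr. Treating extensions case-insensitively (so .JPG counts as .jpg) is an assumption; the README does not say how the library handles case.

```javascript
// Hypothetical sketch of extension-based routing as processFile() describes it.
// Returns a method name as a label; it does not call into smart-ocr.
const IMAGE_EXTENSIONS = new Set([".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".webp", ".gif"]);

function extensionOf(filePath) {
  const base = filePath.split(/[\\/]/).pop(); // file name without directories
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : ""; // assumed case-insensitive
}

function routeFile(filePath) {
  const ext = extensionOf(filePath);
  if (ext === ".pdf") return "processPDF";
  if (IMAGE_EXTENSIONS.has(ext)) return "processImage";
  throw new Error(`Unsupported file extension "${ext}" for ${filePath}`);
}

console.log(routeFile("./id.pdf")); // processPDF
console.log(routeFile("scans/page-1.webp")); // processImage
```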

`processPDF(pdfPath)`

Extracts text from a PDF. Text-native pages are read directly. Scanned pages are rendered to images and OCRed.

The OCR language only affects scanned/image-only pages. If a PDF page already contains selectable text, Smart OCR returns that embedded text directly instead of re-OCRing it.

Returns:

extracted text by default
structured JSON when structuredOutputOptions.ai is configured

`processImage(imagePath)`

Runs OCR on an image file.

Returns:

extracted text by default
structured JSON when structuredOutputOptions.ai is configured

`init(language?)`

Eagerly initializes the Tesseract worker. This is optional because processing methods initialize on demand.

If you pass a language to init(language), Smart OCR keeps using that language for later OCR calls until you switch it again or create a new instance.

`terminate()`

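Terminates the Tesseract worker(s) and frees their resources. Call it once you are done with the instance, typically in a finally block as the examples above do.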

Notes

Smart OCR is optimized for Node.js workloads, not browser runtimes.
Rendering uses @napi-rs/canvas, which avoids the extra Cairo system setup required by the canvas package.
Scanned PDFs are preprocessed before OCR so sparse content, such as ID cards on large blank pages, is easier to detect.
Structured output is an optional post-processing step on top of OCR, not a replacement for OCR itself.
AI mode supports OpenAI, Anthropic, and Gemini.
OCR quality still depends on the source document quality, scan resolution, and language data.
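One common way to make sparse content easier to detect is to find the bounding box of dark pixels and crop to it before OCR. The sketch below illustrates that general idea only; it is not smart-ocr's preprocessing. It assumes a grayscale image stored row-major as 0-255 values, and the threshold of 200 is arbitrary.

```javascript
// Illustrative sketch (not smart-ocr's preprocessing): locate the bounding box
// of "ink" on a mostly blank grayscale page so it can be cropped before OCR.
// Pixels are 0-255 values in row-major order; threshold 200 is an assumption.
function contentBoundingBox(pixels, width, height, threshold = 200) {
  let minX = width, minY = height, maxX = -1, maxY = -1;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      if (pixels[y * width + x] < threshold) {
        if (x < minX) minX = x;
        if (x > maxX) maxX = x;
        if (y < minY) minY = y;
        if (y > maxY) maxY = y;
      }
    }
  }
  if (maxX < 0) return null; // blank page: nothing to crop to
  return { x: minX, y: minY, width: maxX - minX + 1, height: maxY - minY + 1 };
}

// 6x4 white page with a dark 2x2 block at columns 3-4, rows 1-2.
const W = 6, H = 4;
const page = new Uint8Array(W * H).fill(255);
for (const [x, y] of [[3, 1], [4, 1], [3, 2], [4, 2]]) page[y * W + x] = 0;
console.log(contentBoundingBox(page, W, H)); // { x: 3, y: 1, width: 2, height: 2 }
```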

Development

npm run typecheck
npm run lint
npm test
npm run build
npm run sample

npm run sample builds the library and runs it against the bundled sample files in src/.

License

MIT
