Create a new vision pipeline.
Pipeline configuration. All heavy dependencies are loaded lazily, so construction is synchronous and never imports ML libraries.
const pipeline = new VisionPipeline({
  strategy: 'progressive',
  ocr: 'paddle',
  handwriting: true,
  cloudProvider: 'openai',
});
Process an image through the configured tiers.
Automatically detects content type (printed text, handwritten, diagram, etc.) and routes through the appropriate processing tiers based on the configured VisionStrategy.
Image data as a Buffer or file-path / URL string. Buffers are preprocessed with sharp (if configured). URL strings are passed directly to providers that support them.
Optional options: Overrides for this specific invocation.
Optional forceCategory: Force a specific content category instead of auto-detecting from OCR confidence heuristics.
Optional tiers (VisionTier[]): Run only these specific tiers, ignoring the strategy's normal routing logic.
Aggregated vision result with text, confidence, embeddings, etc.
Throws if all configured tiers fail to produce a result.
Throws if a required dependency (e.g. ppu-paddle-ocr) is missing.
Throws if dispose() was already called.
// Full progressive pipeline
const result = await pipeline.process(imageBuffer);

// Force handwriting mode
const hw = await pipeline.process(scanBuffer, {
  forceCategory: 'handwritten',
});

// Only run OCR and embedding, skip everything else
const partial = await pipeline.process(imageBuffer, {
  tiers: ['ocr', 'embedding'],
});
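The progressive routing described above can be sketched as a simple fallback chain. This is an illustrative model only (the `TierFn` type and `runProgressive` helper are hypothetical, not part of the library's API), but it shows why `process()` throws only when every configured tier fails:

```typescript
// Hypothetical sketch of progressive tier fallback, not the library's
// actual implementation. Each tier is tried in order; the first tier
// that yields a result wins, and an error is thrown only when every
// tier has failed or produced nothing.
type ImageInput = Uint8Array | string; // stand-in for Buffer | string

type TierFn = (image: ImageInput) => Promise<string | null>;

async function runProgressive(
  image: ImageInput,
  tiers: TierFn[],
): Promise<string> {
  const errors: unknown[] = [];
  for (const tier of tiers) {
    try {
      const result = await tier(image);
      if (result !== null) return result;
    } catch (err) {
      errors.push(err); // record the failure and fall through to the next tier
    }
  }
  throw new Error(`all ${tiers.length} tiers failed (${errors.length} threw)`);
}
```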
Extract text only — fastest path using OCR tier exclusively.
Ignores all other tiers (handwriting, document-ai, cloud, embedding). Useful when you just need the text content and don't need confidence scoring, layout analysis, or embeddings.
Image data as a Buffer or file-path / URL string.
Extracted text, or empty string if OCR produces no output.
Throws if the configured OCR engine is missing.
const text = await pipeline.extractText(receiptImage);
console.log(text); // "ACME STORE\n...\nTotal: $42.99"
Generate an image embedding using CLIP — embedding tier only.
Useful for building image similarity search indexes without running the full OCR + vision pipeline.
Image data as a Buffer or file-path / URL string.
CLIP embedding vector (typically 512 or 768 dimensions).
Throws if @huggingface/transformers is not installed.
Throws if CLIP model loading fails.
const embedding = await pipeline.embed(photoBuffer);
await vectorStore.upsert('images', [{
  id: 'photo-1',
  embedding,
  metadata: { source: 'upload' },
}]);
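For the similarity-search use case this method targets, comparing two returned vectors usually comes down to cosine similarity. A minimal, pipeline-independent helper (the function name is ours, not part of the API):

```typescript
// Cosine similarity between two embedding vectors, e.g. CLIP outputs.
// Returns a value in [-1, 1]; higher means more similar.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```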
Analyze document layout using Florence-2 — document-ai tier only.
Returns structured DocumentLayout with semantic blocks (text, tables, figures, headings, lists, code) and their bounding boxes within each page.
Image data as a Buffer or file-path / URL string.
Structured document layout with pages and blocks.
Throws if @huggingface/transformers is not installed.
Throws if Florence-2 model loading fails.
const layout = await pipeline.analyzeLayout(documentScan);
for (const page of layout.pages) {
  for (const block of page.blocks) {
    console.log(`${block.type}: ${block.content.slice(0, 50)}...`);
  }
}
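Beyond iterating every block, a common follow-up is pulling out all blocks of one type, e.g. every table. The interfaces below are an assumed approximation of the DocumentLayout shape described above (the real type may differ in detail), and `blocksOfType` is a hypothetical helper:

```typescript
// Assumed (illustrative) shape of the layout result; field names
// mirror the description in the docs but are not authoritative.
interface LayoutBlock {
  type: 'text' | 'table' | 'figure' | 'heading' | 'list' | 'code';
  content: string;
  bbox: [number, number, number, number]; // x, y, width, height (assumed)
}

interface LayoutPage {
  blocks: LayoutBlock[];
}

interface DocumentLayoutLike {
  pages: LayoutPage[];
}

// Collect every block of a given type across all pages.
function blocksOfType(
  layout: DocumentLayoutLike,
  type: LayoutBlock['type'],
): LayoutBlock[] {
  return layout.pages.flatMap((page) =>
    page.blocks.filter((block) => block.type === type),
  );
}
```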
Shut down the pipeline and release all loaded model resources.
After calling dispose(), any further calls to process(), extractText(), embed(), or analyzeLayout() will throw.
const pipeline = new VisionPipeline({ strategy: 'progressive' });
try {
  const result = await pipeline.process(image);
} finally {
  await pipeline.dispose();
}
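The throw-after-dispose contract can be modelled with a simple guard flag. This sketch is illustrative only (the real class also releases loaded models; `DisposablePipeline` is not part of the library):

```typescript
// Illustrative guard for the dispose() contract: once dispose() has
// run, every later call throws instead of touching released resources.
type ImageInput = Uint8Array | string; // stand-in for Buffer | string

class DisposablePipeline {
  private disposed = false;

  private assertLive(): void {
    if (this.disposed) throw new Error('pipeline already disposed');
  }

  async process(image: ImageInput): Promise<string> {
    this.assertLive();
    return 'result'; // placeholder for the real tier processing
  }

  async dispose(): Promise<void> {
    this.disposed = true; // the real implementation also unloads models here
  }
}
```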
Unified vision pipeline with progressive enhancement.
Processes images through up to three tiers of increasing capability.
All heavy dependencies are loaded lazily on first use. The pipeline never imports ML libraries at module load time, so it's safe to instantiate even when optional peer deps are missing — errors only surface when a tier that needs them actually runs.
See createVisionPipeline for automatic provider detection.
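The lazy-loading behaviour described above follows a common pattern: wrap the expensive import in a cached thunk, so nothing is loaded until first use and repeat calls reuse the same instance. A generic sketch (the `lazy` helper is ours, not part of the library):

```typescript
// Generic lazy loader: `load` runs only on the first call, and its
// promise is cached so the dependency is imported at most once.
function lazy<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => (cached ??= load());
}
```

A tier could then hold, for example, `const getOcr = lazy(() => import('ppu-paddle-ocr'))` (module name taken from the missing-dependency error described above), so a missing optional peer dependency only surfaces as an error when `getOcr()` is first awaited, never at construction time.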