Class VisionPipeline

Unified vision pipeline with progressive enhancement.

Processes images through up to three tiers of increasing capability:

  1. Local OCR (PaddleOCR / Tesseract.js) — fast, free, offline
  2. Local Vision Models (TrOCR / Florence-2 / CLIP) — offline but slower
  3. Cloud Vision LLMs (GPT-4o, Claude, Gemini) — best quality, API cost

All heavy dependencies are loaded lazily on first use. The pipeline never imports ML libraries at module load time, so it's safe to instantiate even when optional peer deps are missing — errors only surface when a tier that needs them actually runs.
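
The progressive strategy amounts to an ordered fallback over tiers: try the cheapest tier first, move to the next on failure, and throw only when every tier has failed. A minimal self-contained sketch of that control flow (`Tier` and `runProgressive` are hypothetical names for illustration, not part of this library):

```typescript
// Hypothetical sketch of progressive fallback: each tier is an async step;
// the first one that succeeds wins, and errors are collected along the way.
type Tier<T> = { name: string; run: () => Promise<T> };

async function runProgressive<T>(tiers: Tier<T>[]): Promise<T> {
  const errors: string[] = [];
  for (const tier of tiers) {
    try {
      return await tier.run(); // first successful tier short-circuits
    } catch (err) {
      errors.push(`${tier.name}: ${(err as Error).message}`);
    }
  }
  throw new Error(`All tiers failed:\n${errors.join('\n')}`);
}
```

This also explains the lazy-loading guarantee above: a tier's dependency error surfaces as one entry in the failure list rather than at construction time.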

See createVisionPipeline for automatic provider detection.

Constructors

  • Create a new vision pipeline.

    Parameters

    • config: VisionPipelineConfig

      Pipeline configuration. All heavy dependencies are loaded lazily, so construction is synchronous and never imports ML libraries.

    Returns VisionPipeline

    Example

    const pipeline = new VisionPipeline({
      strategy: 'progressive',
      ocr: 'paddle',
      handwriting: true,
      cloudProvider: 'openai',
    });

Methods

  • Process an image through the configured tiers.

    Automatically detects content type (printed text, handwritten, diagram, etc.) and routes through the appropriate processing tiers based on the configured VisionStrategy.

    Parameters

    • image: string | Buffer<ArrayBufferLike>

      Image data as a Buffer or file-path / URL string. Buffers are preprocessed with sharp (if configured). URL strings are passed directly to providers that support them.

    • Optional options: {
          forceCategory?: VisionContentCategory;
          tiers?: VisionTier[];
      }

      Optional overrides for this specific invocation.

      • Optional forceCategory?: VisionContentCategory

        Force a specific content category instead of auto-detecting from OCR confidence heuristics.

      • Optional tiers?: VisionTier[]

        Run only these specific tiers, ignoring the strategy's normal routing logic.

    Returns Promise<VisionResult>

    Aggregated vision result with text, confidence, embeddings, etc.

    Throws

    If all configured tiers fail to produce a result.

    Throws

    If a required dependency (e.g. ppu-paddle-ocr) is missing.

    Throws

    If dispose() was already called.

    Example

    // Full progressive pipeline
    const result = await pipeline.process(imageBuffer);

    // Force handwriting mode
    const hw = await pipeline.process(scanBuffer, {
      forceCategory: 'handwritten',
    });

    // Only run OCR and embedding, skip everything else
    const partial = await pipeline.process(imageBuffer, {
      tiers: ['ocr', 'embedding'],
    });
  • Extract text only — fastest path using OCR tier exclusively.

    Ignores all other tiers (handwriting, document-ai, cloud, embedding). Useful when you just need the text content and don't need confidence scoring, layout analysis, or embeddings.

    Parameters

    • image: string | Buffer<ArrayBufferLike>

      Image data as a Buffer or file-path / URL string.

    Returns Promise<string>

    Extracted text, or empty string if OCR produces no output.

    Throws

    If the configured OCR engine is missing.

    Example

    const text = await pipeline.extractText(receiptImage);
    console.log(text); // "ACME STORE\n...\nTotal: $42.99"
  • Generate an image embedding using CLIP — embedding tier only.

    Useful for building image similarity search indexes without running the full OCR + vision pipeline.

    Parameters

    • image: string | Buffer<ArrayBufferLike>

      Image data as a Buffer or file-path / URL string.

    Returns Promise<number[]>

    CLIP embedding vector (typically 512 or 768 dimensions).

    Throws

    If @huggingface/transformers is not installed.

    Throws

    If CLIP model loading fails.

    Example

    const embedding = await pipeline.embed(photoBuffer);
    await vectorStore.upsert('images', [{
      id: 'photo-1',
      embedding,
      metadata: { source: 'upload' },
    }]);
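  CLIP embeddings are typically compared by cosine similarity, so ranking a handful of images does not require a vector store at all. A small helper (`cosineSimilarity` is a hypothetical utility for illustration, not exported by this library):

  ```typescript
  // Cosine similarity between two embedding vectors: ~1 for near-identical
  // direction, ~0 for unrelated. Assumes equal-length vectors, as returned
  // by the same CLIP model.
  function cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
  ```

  For example, `cosineSimilarity(await pipeline.embed(imgA), await pipeline.embed(imgB))` scores how visually similar two images are.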
  • Analyze document layout using Florence-2 — document-ai tier only.

    Returns structured DocumentLayout with semantic blocks (text, tables, figures, headings, lists, code) and their bounding boxes within each page.

    Parameters

    • image: string | Buffer<ArrayBufferLike>

      Image data as a Buffer or file-path / URL string.

    Returns Promise<DocumentLayout>

    Structured document layout with pages and blocks.

    Throws

    If @huggingface/transformers is not installed.

    Throws

    If Florence-2 model loading fails.

    Example

    const layout = await pipeline.analyzeLayout(documentScan);
    for (const page of layout.pages) {
      for (const block of page.blocks) {
        console.log(`${block.type}: ${block.content.slice(0, 50)}...`);
      }
    }
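  A common follow-up is filtering the returned layout by semantic block type, e.g. collecting only the tables. A sketch using simplified stand-in shapes (`Block`, `Page`, and `Layout` here are assumptions mirroring the fields used above, not the library's actual DocumentLayout types):

  ```typescript
  // Simplified stand-ins for the real DocumentLayout types (assumed shapes).
  interface Block { type: string; content: string }
  interface Page { blocks: Block[] }
  interface Layout { pages: Page[] }

  // Collect every block of one semantic type across all pages.
  function blocksOfType(layout: Layout, type: string): Block[] {
    return layout.pages.flatMap((page) =>
      page.blocks.filter((block) => block.type === type),
    );
  }
  ```

  With the real result this would be, for instance, `blocksOfType(layout, 'table')` to feed only tabular blocks into downstream extraction.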
  • Shut down the pipeline and release all loaded model resources.

    After calling dispose(), any further calls to process(), extractText(), embed(), or analyzeLayout() will throw.

    Returns Promise<void>

    Example

    const pipeline = new VisionPipeline({ strategy: 'progressive' });
    try {
      const result = await pipeline.process(image);
    } finally {
      await pipeline.dispose();
    }