Class MultimodalAggregator

Adds auto-generated captions to ExtractedImage objects that lack one, using a caller-supplied vision LLM function.

Images are processed in parallel via Promise.allSettled() so a single failed captioning attempt does not block the rest. Images whose captioning fails retain their original (un-captioned) state rather than propagating the error.
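The failure isolation comes directly from Promise.allSettled, which never rejects as a whole: each task settles independently and reports its own status. A minimal standalone sketch of that behavior (the captions and error message here are illustrative only):

```typescript
// Promise.allSettled never rejects: each task settles on its own,
// so one failed caption attempt cannot abort the batch.
async function demo(): Promise<void> {
  const tasks: Promise<string>[] = [
    Promise.resolve("a sunset over water"),    // captioning succeeded
    Promise.reject(new Error("LLM timeout")),  // captioning failed
  ];
  const settled = await Promise.allSettled(tasks);
  for (const r of settled) {
    if (r.status === "fulfilled") {
      console.log("caption:", r.value);
    } else {
      console.log("kept un-captioned, reason:", (r.reason as Error).message);
    }
  }
}

demo();
```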

Example — with a vision LLM

const aggregator = new MultimodalAggregator({
describeImage: async (buf, mime) => myVisionLLM.describe(buf, mime),
});

const captioned = await aggregator.processImages(doc.images ?? []);

Example — passthrough (no LLM configured)

const aggregator = new MultimodalAggregator();
const unchanged = await aggregator.processImages(doc.images ?? []);

Constructors

  • new MultimodalAggregator(options?) — creates an aggregator. As shown in the examples above, the optional options object may supply a describeImage(buf, mime) function returning a Promise<string>; when it is omitted, processImages acts as a passthrough.

Methods

  • processImages — enrich images with captions via the configured vision LLM.

    Only images that have no existing caption field are processed. Images that already carry a caption are left unchanged to avoid redundant LLM calls.

    When no describeImage function is configured, all images are returned unchanged.

    Parameters

      • images: ExtractedImage[] — the images to enrich (an empty array yields an empty result).

    Returns Promise<ExtractedImage[]>

    A promise resolving to an array of the same length, with captions filled in where possible.
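Putting the documented behavior together, the class could be sketched as below. This is a minimal illustration, not the actual implementation: the ExtractedImage fields data, mimeType, and caption, and the options-object constructor, are assumptions for the sake of a runnable example.

```typescript
// Hypothetical shape of an extracted image; the real interface may differ.
interface ExtractedImage {
  data: Uint8Array;
  mimeType: string;
  caption?: string;
}

type DescribeImage = (data: Uint8Array, mime: string) => Promise<string>;

class MultimodalAggregator {
  private readonly describeImage?: DescribeImage;

  constructor(options: { describeImage?: DescribeImage } = {}) {
    this.describeImage = options.describeImage;
  }

  async processImages(images: ExtractedImage[]): Promise<ExtractedImage[]> {
    const describe = this.describeImage;
    if (!describe) return images; // passthrough: no vision LLM configured

    const results = await Promise.allSettled(
      images.map(async (img) => {
        if (img.caption) return img; // skip: already captioned, avoid redundant LLM call
        const caption = await describe(img.data, img.mimeType);
        return { ...img, caption };
      }),
    );

    // A rejected captioning attempt leaves the original, un-captioned image in place.
    return results.map((r, i) => (r.status === "fulfilled" ? r.value : images[i]));
  }
}
```

Note the same-length guarantee: every input image maps to exactly one output slot, whether its caption was filled, already present, or its captioning failed.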