Create a new multimodal indexer.
Dependency injection container.
Manager for generating text embeddings.
Vector store for document storage and search.
Optional vision LLM for image description.
Optional full vision pipeline with OCR, handwriting,
document understanding, CLIP embeddings, and cloud fallback. When provided,
it is wrapped as an IVisionProvider via PipelineVisionProvider,
overriding any visionProvider passed alongside it.
Optional STT provider for audio transcription.
Optional configuration overrides (MultimodalIndexerConfig).
If embeddingManager or vectorStore is missing.
// With a simple vision LLM provider
const indexer = new MultimodalIndexer({
  embeddingManager,
  vectorStore,
  visionProvider: myVisionLLM,
  sttProvider: myWhisperService,
  config: { defaultCollection: 'knowledge' },
});

// With the full vision pipeline (recommended)
const indexer = new MultimodalIndexer({
  embeddingManager,
  vectorStore,
  visionPipeline: myVisionPipeline,
});
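The required-dependency check described above can be sketched as a simple guard. All names here (`IndexerDeps`, `validateDeps`) are illustrative, not the library's internals:

```typescript
// Hypothetical sketch of the constructor's dependency validation.
interface IndexerDeps {
  embeddingManager?: object;
  vectorStore?: object;
  visionProvider?: object;
  sttProvider?: object;
}

function validateDeps(deps: IndexerDeps): void {
  // embeddingManager and vectorStore are required; the rest are optional.
  if (!deps.embeddingManager || !deps.vectorStore) {
    throw new Error('MultimodalIndexer requires embeddingManager and vectorStore');
  }
}
```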
Attach a HyDE retriever to enable hypothesis-driven multimodal search.
Once set, pass hyde: { enabled: true } in the search() options to
activate HyDE for that query. The retriever generates a hypothetical
answer using an LLM, then embeds that answer instead of the raw query
text, which typically yields better recall for exploratory queries.
A pre-configured HydeRetriever instance.
indexer.setHydeRetriever(new HydeRetriever({
  llmCaller: myLlmCaller,
  embeddingManager: myEmbeddingManager,
  config: { enabled: true },
}));

const results = await indexer.search('cats on a beach', {
  hyde: { enabled: true },
});
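The embed-the-answer-instead-of-the-query idea can be sketched with stand-in LLM and embedding functions; a real HydeRetriever calls an actual LLM via llmCaller and real embeddings via embeddingManager:

```typescript
type Embedding = number[];

// Stand-in: a real retriever would prompt an LLM for an answer-shaped passage.
const generateHypotheticalAnswer = (query: string): string =>
  `Hypothetical answer: documents and images relevant to "${query}".`;

// Stand-in embedding: maps text to a toy numeric vector.
const embed = (text: string): Embedding => [text.length, text.split(' ').length];

function embedForSearch(query: string, hydeEnabled: boolean): Embedding {
  // With HyDE on, the longer answer-shaped text is embedded, not the raw query.
  const text = hydeEnabled ? generateHypotheticalAnswer(query) : query;
  return embed(text);
}
```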
Index an image by generating a text description via vision LLM, then embedding and storing the description.
Image data, metadata, and collection options.
The document ID and generated description.
If no vision provider is configured.
If the vision LLM fails to describe the image.
If embedding generation or vector store upsert fails.
const result = await indexer.indexImage({
  image: 'https://example.com/photo.jpg',
  metadata: { source: 'web-scrape', url: 'https://example.com' },
});
console.log(result.description); // "A golden retriever playing fetch..."
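The describe-embed-store flow can be sketched with stand-in providers; every name here (`describeImage`, `embedText`, `indexImageSketch`, `StoredDoc`) is illustrative, not the library's internals:

```typescript
type StoredDoc = { id: string; text: string; vector: number[]; modality: string };

// Stand-in vision provider: a real one calls a vision LLM.
const describeImage = async (_image: string): Promise<string> =>
  'A golden retriever playing fetch on a beach';

// Stand-in embedding: a real one calls the embedding manager.
const embedText = async (text: string): Promise<number[]> =>
  [text.length, text.split(' ').length];

async function indexImageSketch(
  image: string,
  store: StoredDoc[],
): Promise<{ id: string; description: string }> {
  const description = await describeImage(image); // vision LLM describes the image
  const vector = await embedText(description);    // embed the description text
  const id = `img-${store.length + 1}`;
  store.push({ id, text: description, vector, modality: 'image' }); // upsert
  return { id, description };
}
```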
Index an audio file by transcribing via STT, then embedding and storing the transcript.
Audio data, metadata, collection, and language options.
The document ID and generated transcript.
If no STT provider is configured.
If the STT provider fails to transcribe.
If embedding generation or vector store upsert fails.
const result = await indexer.indexAudio({
  audio: fs.readFileSync('./podcast.mp3'),
  metadata: { source: 'podcast', episode: 42 },
  language: 'en',
});
console.log(result.transcript); // "Welcome to episode 42..."
Search across all modalities (text + image descriptions + audio transcripts).
The query text is embedded, then the vector store is searched with optional modality filtering. Results are returned with their source modality indicated.
Natural language search query.
Optional search parameters (MultimodalSearchOptions): topK, modalities, collection.
Array of search results sorted by relevance score (descending).
If embedding generation fails.
// Search only image descriptions
const imageResults = await indexer.search('cats playing', {
  modalities: ['image'],
  topK: 10,
});

// Search across all modalities
const allResults = await indexer.search('machine learning tutorial');
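The filter-score-sort behavior described above can be sketched in a few lines; this is an illustrative model of modality-filtered vector search, not the store's real implementation:

```typescript
type IndexedDoc = { id: string; vector: number[]; modality: string };

// Cosine similarity between two equal-length vectors.
const cosine = (a: number[], b: number[]): number => {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
};

function searchSketch(
  queryVector: number[],
  docs: IndexedDoc[],
  opts: { topK?: number; modalities?: string[] } = {},
): { id: string; score: number; modality: string }[] {
  return docs
    .filter((d) => !opts.modalities || opts.modalities.includes(d.modality))
    .map((d) => ({ id: d.id, score: cosine(queryVector, d.vector), modality: d.modality }))
    .sort((a, b) => b.score - a.score) // relevance, descending
    .slice(0, opts.topK ?? 5);
}
```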
Create a MultimodalMemoryBridge using this indexer's providers.
The bridge extends this indexer's RAG capabilities with cognitive memory integration, enabling multimodal content to be stored in both the vector store (for search) and long-term memory (for recall during conversation).
Optional cognitive memory manager (ICognitiveMemoryManager) for memory trace creation. When omitted, the bridge still indexes into RAG but creates no memory traces.
Optional bridge configuration overrides (MultimodalBridgeOptions): mood, chunk sizes, etc.
A configured multimodal memory bridge instance.
const bridge = indexer.createMemoryBridge(memoryManager, {
  enableMemory: true,
  defaultChunkSize: 800,
});

await bridge.ingestImage(imageBuffer, { source: 'user-upload' });
See MultimodalMemoryBridge for full documentation.
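The dual-write behavior (always index for search, create memory traces only when a manager is present) can be sketched as follows; the types here are illustrative, not the real ICognitiveMemoryManager interface:

```typescript
type MemoryTrace = { content: string };
interface MemoryManagerSketch { traces: MemoryTrace[] }

function ingestSketch(
  content: string,
  ragStore: string[],
  memoryManager?: MemoryManagerSketch,
): void {
  ragStore.push(content);                  // always indexed into the vector store
  memoryManager?.traces.push({ content }); // memory trace only if a manager is set
}
```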
Indexes non-text content (images, audio) into the vector store by generating text descriptions and embeddings.
Image indexing flow: the image is described via vision LLM, the description is embedded, and the document is stored with modality: 'image' metadata.
Audio indexing flow: the audio is transcribed via STT, the transcript is embedded, and the document is stored with modality: 'audio' metadata.
Cross-modal search: a single text query searches across text, image descriptions, and audio transcripts.