Optional chunkStrategy for splitting a document into indexable chunks.
'fixed' – split at a fixed token/character count.
'semantic' – split at semantic boundaries (paragraphs, sections).
'hierarchical' – build a tree of coarse → fine chunks (good for Q&A).
'layout' – preserve the visual layout of the source (PDF columns etc.).
Default: 'semantic'
Optional chunkTarget token/character count for each chunk.
Default: 512
Optional chunkOverlap between consecutive chunks in tokens/characters. Prevents context loss at chunk boundaries.
Default: 64
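To make the interaction between the target size and the overlap concrete, here is a minimal sketch of fixed-size chunking over an already tokenized input. The function name `chunkFixed` and its parameters are illustrative assumptions, not part of this library's API, and the real 'fixed' strategy may differ in its details.

```typescript
// Sketch only: fixed-size chunking with overlap over a pre-tokenized input.
// `chunkFixed`, `size`, and `overlap` are illustrative names, not the library's API.
function chunkFixed<T>(tokens: T[], size: number, overlap: number): T[][] {
  if (size <= 0) throw new RangeError("size must be positive");
  if (overlap < 0 || overlap >= size) {
    throw new RangeError("overlap must satisfy 0 <= overlap < size");
  }
  const step = size - overlap; // each chunk starts `step` tokens after the previous one
  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // this chunk already reaches the end
  }
  return chunks;
}

// Using the values 512 and 64 shown above, consecutive chunks start 448 tokens apart
// and share 64 tokens at each boundary.
const tokens = Array.from({ length: 1000 }, (_, i) => i);
const chunks = chunkFixed(tokens, 512, 64);
```

A larger overlap repeats more context across boundaries at the cost of more chunks to store and index, so the overlap has to stay smaller than the chunk size.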
Optional extractImages: Whether to extract embedded images from documents (PDF, DOCX, etc.).
Extracted images are stored as ExtractedImage objects.
Default: false
Optional ocr: Whether to run Optical Character Recognition on extracted images.
Requires extractImages: true.
Default: false
Optional docling: Whether to use the Docling library for high-fidelity PDF/DOCX parsing.
When false, a simpler text-extraction path is used.
Default: false
Optional vision: Identifier of the vision-capable LLM used to caption extracted images.
Only consulted when extractImages: true.
Default: 'gpt-4o'
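The image-related options depend on one another: OCR requires extractImages: true, and the vision model is only consulted when extractImages: true. The sketch below shows one way a caller might check this before passing options along. Only `extractImages` is a name taken from the text above; the interface `ImageOptions`, the fields `ocr` and `visionModel`, and the helper `checkImageOptions` are hypothetical placeholders chosen for illustration.

```typescript
// Hypothetical shape: only `extractImages` is named in the text above;
// `ocr` and `visionModel` are placeholder field names for this sketch.
interface ImageOptions {
  extractImages?: boolean;
  ocr?: boolean;
  visionModel?: string;
}

// Returns a list of human-readable problems; an empty list means the
// combination is consistent with the dependencies described above.
function checkImageOptions(opts: ImageOptions): string[] {
  const problems: string[] = [];
  const extract = opts.extractImages ?? false; // the text above lists false for this option
  if (opts.ocr && !extract) {
    problems.push("ocr requires extractImages: true");
  }
  if (opts.visionModel !== undefined && !extract) {
    problems.push("visionModel is ignored unless extractImages: true");
  }
  return problems;
}

const ok = checkImageOptions({ extractImages: true, ocr: true, visionModel: "gpt-4o" });
const bad = checkImageOptions({ ocr: true });
```

Here `ok` is empty, while `bad` contains a single message about the missing extractImages: true.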
Controls how documents are split into chunks before being stored and indexed.