Interface IngestionConfig

Controls how documents are parsed, split into chunks, and optionally enriched (image extraction, OCR, captioning) before being stored and indexed.

interface IngestionConfig {
    chunkStrategy?: "hierarchical" | "fixed" | "semantic" | "layout";
    chunkSize?: number;
    chunkOverlap?: number;
    extractImages?: boolean;
    ocrEnabled?: boolean;
    doclingEnabled?: boolean;
    visionLlm?: string;
}
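A minimal usage sketch under the declaration above. All properties are optional; the object literal below restates the documented defaults explicitly and then overrides two of them (how the config is consumed by the ingestion pipeline is not shown here):

```typescript
// Mirror of the interface declared above.
interface IngestionConfig {
    chunkStrategy?: "hierarchical" | "fixed" | "semantic" | "layout";
    chunkSize?: number;
    chunkOverlap?: number;
    extractImages?: boolean;
    ocrEnabled?: boolean;
    doclingEnabled?: boolean;
    visionLlm?: string;
}

// The documented defaults, spelled out. visionLlm has no default,
// so it is excluded from the Required<> constraint.
const defaults: Required<Omit<IngestionConfig, "visionLlm">> = {
    chunkStrategy: "semantic",
    chunkSize: 512,
    chunkOverlap: 64,
    extractImages: false,
    ocrEnabled: false,
    doclingEnabled: false,
};

// Overrides merge over the defaults via object spread.
const config: IngestionConfig = {
    ...defaults,
    chunkStrategy: "fixed",
    chunkSize: 256,
};
```

Properties not mentioned in the override (here, `chunkOverlap`) keep their default values.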

Properties

chunkStrategy?: "hierarchical" | "fixed" | "semantic" | "layout"

Strategy for splitting a document into indexable chunks.

  • 'fixed' – split at a fixed token/character count.
  • 'semantic' – split at semantic boundaries (paragraphs, sections).
  • 'hierarchical' – build a tree of coarse → fine chunks (good for Q&A).
  • 'layout' – preserve the visual layout of the source (PDF columns etc.).

Default

'semantic'

chunkSize?: number

Target token/character count for each chunk.

Default

512

chunkOverlap?: number

Overlap between consecutive chunks in tokens/characters. Prevents context loss at chunk boundaries.

Default

64
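Because consecutive chunks overlap, each new chunk advances the window by chunkSize − chunkOverlap tokens, so a document yields slightly more chunks than totalTokens / chunkSize. The helper below is illustrative only (not part of the API) and assumes fixed-size chunking:

```typescript
// Approximate chunk count for fixed-size chunking with overlap.
// Each chunk after the first advances by stride = chunkSize - chunkOverlap.
function estimateChunkCount(
    totalTokens: number,
    chunkSize = 512,
    chunkOverlap = 64,
): number {
    if (totalTokens <= 0) return 0;
    if (totalTokens <= chunkSize) return 1; // everything fits in one chunk
    const stride = chunkSize - chunkOverlap;
    return Math.ceil((totalTokens - chunkOverlap) / stride);
}
```

For example, a 1000-token document with the defaults produces three chunks (covering roughly tokens 0–512, 448–960, and 896–1000), not two.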

extractImages?: boolean

Whether to extract embedded images from documents (PDF, DOCX, etc.). Extracted images are stored as ExtractedImage objects.

Default

false

ocrEnabled?: boolean

Whether to run Optical Character Recognition on extracted images. Requires extractImages: true.

Default

false

doclingEnabled?: boolean

Whether to use the Docling library for high-fidelity PDF/DOCX parsing. When false, a simpler text-extraction path is used.

Default

false

visionLlm?: string

Vision-capable LLM model identifier used to caption extracted images. Only consulted when extractImages: true.

Example

'gpt-4o'
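Since ocrEnabled and visionLlm only take effect when extractImages is true, a pre-flight check can catch silently ignored settings. The validateImageOptions function below is a hypothetical sketch of such a check, not part of the library:

```typescript
interface IngestionConfig {
    chunkStrategy?: "hierarchical" | "fixed" | "semantic" | "layout";
    chunkSize?: number;
    chunkOverlap?: number;
    extractImages?: boolean;
    ocrEnabled?: boolean;
    doclingEnabled?: boolean;
    visionLlm?: string;
}

// Hypothetical pre-flight check: flags that only apply to extracted
// images are flagged unless extractImages is turned on.
function validateImageOptions(config: IngestionConfig): string[] {
    const problems: string[] = [];
    if (!config.extractImages) {
        if (config.ocrEnabled) {
            problems.push("ocrEnabled requires extractImages: true");
        }
        if (config.visionLlm) {
            problems.push("visionLlm is only consulted when extractImages: true");
        }
    }
    return problems;
}
```

A config such as `{ extractImages: true, ocrEnabled: true, visionLlm: 'gpt-4o' }` passes cleanly, while `{ ocrEnabled: true }` alone is reported.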