Interface SceneDescription

A single scene detected within a video, with timestamps, description, and optional transcript.

Scenes are contiguous segments of video bounded by visual discontinuities such as hard cuts, dissolves, fades, and wipes. The SceneDetector identifies the boundaries, and a vision LLM describes the content of each scene.

This is a richer version of the base VideoScene type: it adds cut-type classification, boundary confidence, transcript, and key frame data.

interface SceneDescription {
    index: number;
    startSec: number;
    endSec: number;
    durationSec: number;
    cutType: "hard-cut" | "dissolve" | "fade" | "wipe" | "gradual" | "start";
    description: string;
    transcript?: string;
    keyFrame?: string;
    confidence: number;
}
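
As a rough illustration of the shape declared above, the following sketch builds two scene objects by hand. The interface is copied locally so the snippet compiles on its own, and every value (timestamps, descriptions, confidence scores) is made up for the example rather than taken from a real detector run.

```typescript
// Local copy of the interface above so the snippet compiles on its own.
interface SceneDescription {
    index: number;
    startSec: number;
    endSec: number;
    durationSec: number;
    cutType: "hard-cut" | "dissolve" | "fade" | "wipe" | "gradual" | "start";
    description: string;
    transcript?: string;
    keyFrame?: string;
    confidence: number;
}

// Illustrative values only; not output from a real detector run.
const openingScene: SceneDescription = {
    index: 0,
    startSec: 0,
    endSec: 4.5,
    durationSec: 4.5,
    cutType: "start", // first scene, so no preceding transition
    description: "Title card over a dark background.",
    confidence: 1,
};

const secondScene: SceneDescription = {
    index: 1,
    startSec: 4.5,
    endSec: 12,
    durationSec: 7.5,
    cutType: "hard-cut",
    description: "A presenter speaks to the camera.",
    transcript: "Welcome back.", // optional field, present here
    confidence: 0.92,
};

const scenes: SceneDescription[] = [openingScene, secondScene];
```

Both optional fields can be left out entirely, as in the first object, which is why consumers should check them before use.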

Properties

index: number

0-based scene index within the video.

startSec: number

Start time of the scene in seconds from video start.

endSec: number

End time of the scene in seconds from video start.

durationSec: number

Duration of the scene in seconds (endSec - startSec).
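
Because durationSec is derived from the other two timestamps, a consumer may want to sanity-check the relationship. The helper below is a minimal sketch: the SceneTiming type, the function name, and the tolerance default are hypothetical and exist only for this example.

```typescript
// Hypothetical stand-in for the three timing fields of SceneDescription.
type SceneTiming = { startSec: number; endSec: number; durationSec: number };

// Returns true when durationSec matches endSec - startSec within a small
// tolerance and the scene does not end before it starts.
function hasConsistentTiming(scene: SceneTiming, toleranceSec = 1e-6): boolean {
    const expected = scene.endSec - scene.startSec;
    return expected >= 0 && Math.abs(scene.durationSec - expected) <= toleranceSec;
}
```

The tolerance is there because timestamps produced by floating-point arithmetic may not subtract to exactly the stored duration.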

cutType: "hard-cut" | "dissolve" | "fade" | "wipe" | "gradual" | "start"

Type of visual transition that marks the beginning of this scene.

  • 'hard-cut' — Abrupt frame-to-frame change
  • 'dissolve' — Cross-dissolve / superimposition transition
  • 'fade' — Fade from/to black or white
  • 'wipe' — Directional wipe transition
  • 'gradual' — Other gradual transition not fitting the above
  • 'start' — First scene in the video (no preceding transition)
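
Since cutType is a closed union, a switch with a never check lets the compiler flag any value that is not handled. The sketch below groups the values into gradual and non-gradual transitions, following the wording of the list above. The CutType alias and the function name are hypothetical, not part of the documented API.

```typescript
// Hypothetical alias for the cutType union declared on SceneDescription.
type CutType = "hard-cut" | "dissolve" | "fade" | "wipe" | "gradual" | "start";

// Returns true for transitions that unfold over several frames.
function isGradualTransition(cutType: CutType): boolean {
    switch (cutType) {
        case "dissolve":
        case "fade":
        case "wipe":
        case "gradual":
            return true;
        case "hard-cut":
        case "start":
            return false;
        default: {
            // Compile-time exhaustiveness check: this assignment fails to
            // type-check if a new union member is added but not handled.
            const unreachable: never = cutType;
            throw new Error(`Unhandled cut type: ${String(unreachable)}`);
        }
    }
}
```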

description: string

Natural-language description of the scene content, generated by a vision LLM from the key frame.

transcript?: string

Transcript of speech/narration during this scene's time range. Only populated when audio transcription is enabled.
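
Because transcript is optional, code that combines per-scene transcripts has to tolerate missing values. A minimal sketch, assuming scenes are already in playback order; the SceneSpeech type and joinTranscripts function are hypothetical names used only here.

```typescript
// Hypothetical stand-in for the fields of SceneDescription used below.
type SceneSpeech = { index: number; transcript?: string };

// Joins the transcripts that are present, one scene per line, skipping
// scenes whose transcript is missing or blank.
function joinTranscripts(scenes: SceneSpeech[]): string {
    return scenes
        .map((scene) => scene.transcript?.trim() ?? "")
        .filter((text) => text.length > 0)
        .join("\n");
}
```

When audio transcription is disabled, every scene lacks a transcript and the function returns an empty string.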

keyFrame?: string

Base64-encoded key frame image (JPEG) representative of the scene. Typically the frame closest to the scene midpoint.
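
To display a key frame in a browser, the Base64 string can be wrapped in a data URI. This sketch assumes keyFrame holds raw Base64 without a "data:" prefix, which the description above does not state explicitly, so the function passes already-prefixed values through unchanged. The function name is hypothetical.

```typescript
// Builds a JPEG data URI from a Base64 key frame, or returns undefined
// when the scene has no key frame.
function keyFrameToDataUri(keyFrame?: string): string | undefined {
    if (!keyFrame) {
        return undefined;
    }
    // Assumption: raw Base64 is the common case; tolerate a value that is
    // already a data URI by returning it as-is.
    return keyFrame.startsWith("data:")
        ? keyFrame
        : `data:image/jpeg;base64,${keyFrame}`;
}
```

The result can be assigned to the src attribute of an img element.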

confidence: number

Confidence score between 0 and 1 for the scene boundary detection. Higher values indicate a more definitive visual discontinuity.
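
One way to use the score is to flag boundaries that may deserve manual review. The sketch below is illustrative only: the SceneBoundary type, the function name, and the 0.5 default threshold are assumptions, not values recommended by the library. It skips the 'start' scene, which has no preceding transition to evaluate.

```typescript
// Hypothetical stand-in for the fields of SceneDescription used below.
type SceneBoundary = { index: number; cutType: string; confidence: number };

// Returns the scenes whose opening boundary scored below the threshold.
function lowConfidenceBoundaries<T extends SceneBoundary>(
    scenes: T[],
    threshold = 0.5, // arbitrary default chosen for this example
): T[] {
    return scenes.filter(
        (scene) => scene.cutType !== "start" && scene.confidence < threshold,
    );
}
```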