0-based scene index within the video.
Start time of the scene in seconds from video start.
End time of the scene in seconds from video start.
Duration of the scene in seconds (endSec - startSec).
Type of visual transition that marks the beginning of this scene.
'hard-cut': Abrupt frame-to-frame change
'dissolve': Cross-dissolve / superimposition transition
'fade': Fade from/to black or white
'wipe': Directional wipe transition
'gradual': Other gradual transition not fitting the above
'start': First scene in the video (no preceding transition)
Natural-language description of the scene content, generated by a vision LLM from the key frame.
Optional. Transcript of speech/narration during this scene's time range. Only populated when audio transcription is enabled.
Optional. Base64-encoded key frame image (JPEG) representative of the scene. Typically the frame closest to the scene midpoint.
Confidence score (0-1) for the scene boundary detection. Higher values indicate a more definitive visual discontinuity.
A single scene detected within a video, with timestamps, description, and optional transcript.
Scenes are contiguous segments of video bounded by visual discontinuities (hard cuts, dissolves, fades). The SceneDetector identifies boundaries, and a vision LLM describes the content of each scene.
This is a richer version of the base VideoScene type that includes cut-type classification, confidence, transcript, and key frame data.
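As a rough illustration of the fields described above, the following TypeScript sketch models one possible shape for such a scene record. Only startSec and endSec are named in the text; every other identifier here (SceneSketch, cutType, keyFrameBase64, makeScene, validateScene, lowConfidenceScenes, and so on) is a hypothetical name chosen for the example, and the real type may differ.

```typescript
// Hypothetical sketch: only `startSec` and `endSec` are named in the text.
// All other field, type, and function names are assumptions for illustration.
type CutType = "hard-cut" | "dissolve" | "fade" | "wipe" | "gradual" | "start";

interface SceneSketch {
  index: number;            // 0-based scene index within the video
  startSec: number;         // seconds from video start
  endSec: number;           // seconds from video start
  durationSec: number;      // endSec - startSec
  cutType: CutType;         // transition that begins this scene
  description: string;      // vision-LLM description of the key frame
  transcript?: string;      // present only when audio transcription is enabled
  keyFrameBase64?: string;  // Base64-encoded JPEG key frame
  confidence: number;       // boundary confidence in the range 0-1
}

// Build a scene and derive its duration from the two timestamps.
function makeScene(
  index: number,
  startSec: number,
  endSec: number,
  cutType: CutType,
  description: string,
  confidence: number,
): SceneSketch {
  return { index, startSec, endSec, durationSec: endSec - startSec, cutType, description, confidence };
}

// Return a list of human-readable problems; an empty list means the record
// is consistent with the field descriptions.
function validateScene(scene: SceneSketch): string[] {
  const problems: string[] = [];
  if (scene.endSec < scene.startSec) problems.push("endSec precedes startSec");
  if (Math.abs(scene.durationSec - (scene.endSec - scene.startSec)) > 1e-9) {
    problems.push("durationSec does not equal endSec - startSec");
  }
  if (scene.confidence < 0 || scene.confidence > 1) problems.push("confidence outside 0-1");
  if (scene.index === 0 && scene.cutType !== "start") problems.push("first scene should use 'start'");
  return problems;
}

// Scenes whose opening boundary was detected with confidence below `threshold`.
// The 'start' scene is skipped because it has no preceding transition.
function lowConfidenceScenes(scenes: SceneSketch[], threshold: number): SceneSketch[] {
  return scenes.filter((s) => s.cutType !== "start" && s.confidence < threshold);
}

// Made-up sample data; the confidence assigned to the 'start' scene is an assumption.
const scenes: SceneSketch[] = [
  makeScene(0, 0, 4.5, "start", "Title card on a dark background", 1),
  makeScene(1, 4.5, 12, "hard-cut", "Presenter seated at a desk", 0.93),
  makeScene(2, 12, 20.25, "dissolve", "Slide showing a bar chart", 0.41),
];

const uncertain = lowConfidenceScenes(scenes, 0.5);
console.log(uncertain.map((s) => s.index));
```

Modeling the cut type as a union of string literals lets the compiler reject values outside the six documented ones, and marking the transcript and key frame as optional mirrors the fact that both are populated only under certain conditions.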