Class AzureSpeechSTTProvider

Speech-to-text provider that uses the Azure Cognitive Services Speech REST API.

Azure REST Endpoint Format

The endpoint URL follows this pattern:

https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language={lang}
  • {region} — The Azure region from config (e.g. eastus, westeurope).
  • {lang} — BCP-47 language code from options or 'en-US' default.
  • The /conversation/ path segment selects the conversation recognition mode (as opposed to /interactive/ or /dictation/).

Authentication: Ocp-Apim-Subscription-Key

Azure Cognitive Services uses the Ocp-Apim-Subscription-Key HTTP header for authentication, which differs from the typical Authorization: Bearer pattern. The subscription key is sent as a plain-text header value — no "Bearer" or "Token" prefix.

An alternative is to use a short-lived token from the token endpoint, but this provider uses the simpler key-based approach for reliability.

NoMatch Handling

When Azure's recognizer detects audio but cannot identify any speech, it returns RecognitionStatus: 'NoMatch' instead of raising an HTTP error. This provider maps NoMatch to an empty-text result (text: '') with isFinal: true, matching the Azure Speech SDK's behaviour. This prevents the fallback proxy from unnecessarily trying another provider when the audio genuinely contains no speech.

Limitations

  • Audio must be PCM WAV format. The Content-Type is hardcoded to audio/wav regardless of the audio.mimeType value.
  • Streaming is not supported — use the Azure Speech SDK for real-time STT.
  • Speaker diarization is not available via the REST API.

See

Example

const provider = new AzureSpeechSTTProvider({
key: process.env.AZURE_SPEECH_KEY!,
region: 'eastus',
});
const result = await provider.transcribe(
{ data: wavBuffer, mimeType: 'audio/wav' },
{ language: 'de-DE' },
);
console.log(result.text); // '' if no speech detected

Implements

Constructors

Methods

  • Transcribes an audio buffer using the Azure Speech recognition REST endpoint.

    Sends the raw audio as PCM WAV and returns a normalized result. Azure's NoMatch status is treated as an empty transcript (not an error).

    Parameters

    • audio: SpeechAudioInput

      Raw audio data. Azure expects PCM WAV format; the Content-Type header is always set to 'audio/wav' regardless of audio.mimeType.

    • options: SpeechTranscriptionOptions = {}

      Optional transcription settings. Only language is supported by the Azure REST endpoint.

    Returns Promise<SpeechTranscriptionResult>

    A promise resolving to the normalized transcription result.

    Throws

    When the Azure API returns a non-2xx HTTP status code. The error message includes the status and response body text.

    Example

    const result = await provider.transcribe(
    { data: wavBuffer, durationSeconds: 5 },
    { language: 'fr-FR' },
    );
    if (result.text === '') {
    console.log('No speech detected in the audio');
    }

Properties

id: "azure-speech-stt" = 'azure-speech-stt'

Unique provider identifier used for registration and resolution.

displayName: "Azure Speech (STT)" = 'Azure Speech (STT)'

Human-readable display name for UI and logging.

supportsStreaming: false = false

This provider uses synchronous HTTP requests, not WebSocket streaming.