Class AzureSpeechTTSProvider

Text-to-speech provider that uses the Azure Cognitive Services Speech REST API.

SSML Generation

Azure's TTS REST endpoint requires SSML (Speech Synthesis Markup Language) as the request body — it does not accept plain text. This provider generates minimal SSML via buildSsml() that wraps the input text in <speak> and <voice> elements. Special XML characters in the text are escaped via escapeXml() to prevent malformed XML.
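A minimal sketch of what these two helpers might look like (the exact signatures and the language-derivation logic are assumptions, not the provider's actual source):

```typescript
// Illustrative sketch of the SSML helpers described above.
// escapeXml replaces the five XML special characters; buildSsml wraps
// the escaped text in <speak>/<voice> elements.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

function buildSsml(text: string, voice: string): string {
  // Derive the xml:lang value from the voice short-name, e.g.
  // 'en-US-GuyNeural' -> 'en-US' (an assumption about the implementation).
  const lang = voice.split('-').slice(0, 2).join('-');
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="${lang}">` +
    `<voice name="${voice}">${escapeXml(text)}</voice>` +
    `</speak>`
  );
}
```

Escaping ampersands first matters: escaping `<` before `&` would double-escape the `&` inside `&lt;`.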

X-Microsoft-OutputFormat Options

The X-Microsoft-OutputFormat header controls the audio encoding. This provider uses 'audio-24khz-96kbitrate-mono-mp3', which provides:

  • 24 kHz sample rate (high quality for speech)
  • 96 kbps bitrate (good balance of quality and file size)
  • Mono channel (sufficient for speech synthesis)
  • MP3 format (universally supported)

Other available formats include:

  • 'audio-16khz-128kbitrate-mono-mp3' — Lower sample rate, higher bitrate
  • 'audio-24khz-160kbitrate-mono-mp3' — Higher bitrate for better quality
  • 'riff-24khz-16bit-mono-pcm' — Uncompressed WAV
  • 'ogg-24khz-16bit-mono-opus' — Opus codec in OGG container

Voice Listing

The listAvailableVoices method fetches the full list of voices available in the configured Azure region via GET /cognitiveservices/voices/list. Results are mapped to the normalized SpeechVoice shape.
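A sketch of what this normalization might look like. The RawAzureVoice fields mirror the response shape of Azure's voices/list endpoint; the SpeechVoice field names are assumptions based on the example further below (which uses `v.lang`):

```typescript
// Subset of the fields Azure's /cognitiveservices/voices/list returns.
interface RawAzureVoice {
  ShortName: string;   // e.g. 'en-US-GuyNeural'
  DisplayName: string; // e.g. 'Guy'
  Locale: string;      // e.g. 'en-US'
  Gender: string;      // e.g. 'Male'
}

// Assumed shape of the normalized SpeechVoice entries.
interface SpeechVoice {
  id: string;
  name: string;
  lang: string;
  gender: string;
}

// Illustrative mapping from Azure's PascalCase fields to the
// normalized shape.
function toSpeechVoice(raw: RawAzureVoice): SpeechVoice {
  return {
    id: raw.ShortName,
    name: raw.DisplayName,
    lang: raw.Locale,
    gender: raw.Gender.toLowerCase(),
  };
}
```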

Example

const provider = new AzureSpeechTTSProvider({
  key: process.env.AZURE_SPEECH_KEY!,
  region: 'eastus',
  defaultVoice: 'en-US-GuyNeural',
});
const result = await provider.synthesize('Hello world');
// result.audioBuffer contains MP3 bytes
// result.mimeType === 'audio/mpeg'

Implements

Constructors

Methods

  • Synthesizes speech from plain text using the Azure TTS REST endpoint.

    The text is wrapped in SSML, sent to Azure, and the response audio buffer (MP3 format) is returned along with metadata.

    Parameters

    • text: string

      The plain-text utterance to convert to audio. XML special characters are automatically escaped.

    • options: SpeechSynthesisOptions = {}

      Optional synthesis settings. Use options.voice to override the default voice with any valid Azure voice short-name.

    Returns Promise<SpeechSynthesisResult>

    A promise resolving to the MP3 audio buffer and metadata.

    Throws

    When the Azure API returns a non-2xx status code. Common causes: invalid subscription key (401), region mismatch (404), invalid SSML (400), or quota exceeded (429).

    Example

const result = await provider.synthesize('Guten Tag!', {
  voice: 'de-DE-ConradNeural',
});
    fs.writeFileSync('output.mp3', result.audioBuffer);
  • Retrieves the list of available neural voices from the Azure region.

    Fetches from GET /cognitiveservices/voices/list and maps each entry to the normalized SpeechVoice shape. The list includes all neural and standard voices available in the configured region.

    Returns Promise<SpeechVoice[]>

    A promise resolving to an array of normalized voice entries.

    Throws

When the Azure API returns a non-2xx status code (e.g. 401 for an invalid subscription key), or when the request itself fails.

    Example

    const voices = await provider.listAvailableVoices();
    const englishVoices = voices.filter(v => v.lang.startsWith('en-'));
    console.log(`Found ${englishVoices.length} English voices`);
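Both methods throw on non-2xx responses. The common causes listed under Throws could be mapped to diagnostic hints along these lines (statusHint is an illustrative helper, not part of the provider's public API):

```typescript
// Illustrative mapping from the status codes listed in the Throws
// sections above to human-readable hints for error messages.
function statusHint(status: number): string {
  switch (status) {
    case 400: return 'invalid SSML';
    case 401: return 'invalid subscription key';
    case 404: return 'region mismatch';
    case 429: return 'quota exceeded';
    default:  return `unexpected status ${status}`;
  }
}
```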

Properties

id: "azure-speech-tts" = 'azure-speech-tts'

Unique provider identifier used for registration and resolution.

displayName: "Azure Speech (TTS)" = 'Azure Speech (TTS)'

Human-readable display name for UI and logging.

supportsStreaming: true = true

Marked as streaming-capable because the provider can participate in a streaming pipeline, even though the underlying HTTP request is a single call that returns the complete audio buffer.