Voice and Speech AI in Blazor: Whisper, TTS, and Voice Assistants

Typing is no longer the only input modality your customers expect. Drivers want hands-free voice. Field workers want to dictate inspection notes. Meeting attendees want auto-transcripts. Accessibility users want screen content read aloud. Customer service teams want call recordings searchable and summarized. All of this is voice and speech AI, and the 2026 .NET 10 stack ships with everything needed: Whisper-quality transcription, neural text-to-speech, real-time audio streaming over SignalR, and Blazor components that handle the microphone permission dance, recording UX, and streaming transcript display.

This guide walks through five voice patterns we see in production .NET applications: batch transcription, real-time transcription, voice forms, voice assistants, and call-center analytics, plus the text-to-speech side that powers accessibility narration. Every example is buildable on Windows + IIS with the Adaptive Web Hosting (AWH) stack.

STT + TTS: both directions

.NET 10: LTS runtime

Real-time: streaming over SignalR

The voice AI ecosystem in .NET 10

Whisper (hosted or local)

OpenAI Whisper via API or self-hosted via Whisper.net (.NET bindings to whisper.cpp). 99+ languages, robust on noisy audio.

✅ Text-to-speech

Azure Speech, ElevenLabs

Neural TTS with natural prosody. SSML for fine control. Streaming output so the audio starts playing within ~300 ms of request.

✅ Audio capture

Browser MediaRecorder

Native browser API for mic capture. Streams chunks to a Blazor JS interop wrapper, then over SignalR to the server.

🟡 Required reading

Audio formats

Whisper accepts WAV, MP3, WebM, and M4A; 16 kHz mono is the sweet spot. Convert client- or server-side via NAudio or FFmpeg before transcription.

🟡 UX patterns

Voice Activity Detection

Detect when the user stops speaking so you don't transcribe silence. vad.js on the client or Silero VAD via ONNX on the server.

🟡 Optional

Diarization (who-said-what)

For multi-speaker audio (meetings, interviews, calls), pyannote-style models label each speech turn. Use a hosted API; self-hosting needs Python.
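To make the voice activity detection item above concrete, here is a minimal sketch of a much simpler technique than Silero VAD or vad.js: an energy threshold with a hangover counter. The class name, the RMS threshold of 500, and the 25-frame hangover are illustrative assumptions, and a model-based VAD will be far more robust to background noise.

```csharp
using System;

// Hypothetical helper: energy-threshold end-of-speech detector for 16-bit PCM frames.
public sealed class EnergyVad
{
    private readonly double _rmsThreshold;  // frames at or above this RMS count as speech
    private readonly int _hangoverFrames;   // silent frames required before declaring end of speech
    private int _silentRun;
    private bool _speaking;

    public EnergyVad(double rmsThreshold = 500, int hangoverFrames = 25)
    {
        _rmsThreshold = rmsThreshold;
        _hangoverFrames = hangoverFrames;
    }

    // Feed one frame. Returns true once, when speech has started and is then
    // followed by `hangoverFrames` consecutive quiet frames.
    public bool EndOfSpeech(ReadOnlySpan<short> frame)
    {
        double sumSquares = 0;
        foreach (short sample in frame)
            sumSquares += (double)sample * sample;
        double rms = frame.Length == 0 ? 0 : Math.Sqrt(sumSquares / frame.Length);

        if (rms >= _rmsThreshold)
        {
            _speaking = true;
            _silentRun = 0;
            return false;
        }

        if (!_speaking) return false;                       // silence before any speech
        if (++_silentRun < _hangoverFrames) return false;   // pause still too short

        _speaking = false;
        _silentRun = 0;
        return true;
    }
}
```

With 20 ms frames, a hangover of 25 frames means the detector waits for roughly half a second of quiet before it reports that the user has stopped speaking.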

Quick reference: five voice patterns

  • Batch transcription: the foundation

Batch transcription is the simplest voice feature you can ship and the most universally useful — every recorded meeting, podcast, interview, voicemail, and webinar in your business is a candidate. Upload audio, get a transcript back, store both. Build everything else on top of this primitive.

This takes three pieces: a Blazor upload component, a Hangfire background job for processing, and a single API call to a transcription service.

// Blazor: file upload
<InputFile OnChange="OnAudioSelected" accept=".mp3,.wav,.m4a,.webm,.ogg" />

@code {
    async Task OnAudioSelected(InputFileChangeEventArgs e)
    {
        var file = e.File;

        // OpenReadStream defaults to a 500 KB cap, so raise it explicitly (100 MB here).
        await using var stream = file.OpenReadStream(maxAllowedSize: 100 * 1024 * 1024);
        var recordingId = await _recordings.UploadAsync(stream, file.Name, file.ContentType);

        // Enqueue background transcription
        BackgroundJob.Enqueue<TranscriptionJob>(j => j.RunAsync(recordingId));

        Navigation.NavigateTo($"/recordings/{recordingId}");
    }
}

// The background job
public class TranscriptionJob(ITranscriber transcriber, IRecordingStore store)
{
    public async Task RunAsync(int recordingId)
    {
        var recording = await store.GetAsync(recordingId);
        await using var audio = await store.OpenAudioAsync(recordingId);

        var transcript = await transcriber.TranscribeAsync(audio, recording.MimeType);
        await store.SaveTranscriptAsync(recordingId, transcript);
    }
}

The transcriber wraps either a hosted Whisper API or a local Whisper.net instance behind a common interface:

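public interface ITranscriber
{
    // The cancellation token lets callers (such as the voice assistant loop) abort a transcription.
    Task<TranscriptionResult> TranscribeAsync(
        Stream audio,
        string mimeType,
        TranscriptionOptions? options = null,
        CancellationToken ct = default);
}

public record TranscriptionResult(string Text, string Language, List<Segment> Segments);

// Start and End are offsets from the beginning of the recording, in seconds.
public record Segment(double Start, double End, string Text, double Confidence);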
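Because each segment carries start and end times, a stored transcript can be turned into captions without another API call. The sketch below renders segments as WebVTT; `WebVttExporter` is a hypothetical helper, the `Segment` record is repeated only to keep the example self-contained, and it assumes `Start` and `End` are in seconds.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public record Segment(double Start, double End, string Text, double Confidence);

// Hypothetical helper: renders transcript segments as a WebVTT caption file.
public static class WebVttExporter
{
    // Formats an offset in seconds as an HH:MM:SS.mmm WebVTT timestamp.
    private static string Timestamp(double seconds)
    {
        var t = TimeSpan.FromMilliseconds(Math.Round(seconds * 1000));
        return string.Format(CultureInfo.InvariantCulture,
            "{0:00}:{1:00}:{2:00}.{3:000}",
            (int)t.TotalHours, t.Minutes, t.Seconds, t.Milliseconds);
    }

    public static string ToWebVtt(IEnumerable<Segment> segments)
    {
        var sb = new StringBuilder("WEBVTT\n\n");
        foreach (var s in segments)
        {
            // One cue per segment: timing line, text line, blank separator.
            sb.Append(Timestamp(s.Start)).Append(" --> ").Append(Timestamp(s.End)).Append('\n');
            sb.Append(s.Text.Trim()).Append("\n\n");
        }
        return sb.ToString();
    }
}
```

The resulting string can be served as a `.vtt` file and attached to an audio or video element with a `<track>` tag.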

  • Real-time transcription: live captions and meeting notes

Real-time differs from batch in one way: instead of waiting for the recording to finish, you stream audio chunks while the user is still speaking and receive partial transcripts back. The latency target is under one second from speaking a word to seeing it on-screen.

OpenAI's Whisper API is batch-only: you upload a complete file and cannot stream audio into it. For real-time use, pick a service designed for streaming (Azure Speech, Deepgram, AssemblyAI) or self-host Whisper behind a streaming wrapper. Cutting audio into 5-second pieces and calling batch Whisper repeatedly works, but words that straddle a chunk boundary get split or dropped, and the model loses context between chunks.

The Blazor pattern has four steps: the client captures audio with MediaRecorder in 250 ms chunks and sends each chunk over SignalR; the server forwards the chunks to the streaming STT service; the service emits partial transcripts back; and Blazor renders them in a reactive transcript area.

// Server-side hub
// IStreamingStt is an app-defined wrapper around your streaming STT provider (the name is illustrative).
public class TranscriptionHub(IStreamingStt streamingStt) : Hub
{
    public async IAsyncEnumerable<TranscriptUpdate> StreamAudio(
        IAsyncEnumerable<byte[]> audioChunks,
        [EnumeratorCancellation] CancellationToken ct)
    {
        await using var connection = await streamingStt.ConnectAsync(language: "en-US", ct);

        // Forward client audio to STT service in parallel with receiving transcripts
        var sendTask = Task.Run(async () =>
        {
            await foreach (var chunk in audioChunks.WithCancellation(ct))
                await connection.SendAudioAsync(chunk, ct);

            await connection.SendEndOfStreamAsync(ct);
        }, ct);

        await foreach (var update in connection.ReceiveTranscriptsAsync(ct))
            yield return new TranscriptUpdate(update.Text, update.IsFinal, update.SpeakerId);

        // Surface any exception thrown while forwarding audio
        await sendTask;
    }
}

On the Blazor client, JS interop wraps MediaRecorder and pushes chunks into the SignalR stream. The component renders interim (low-confidence, may change) and final transcript segments differently:

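@page "/transcribe"
@using System.Threading.Channels
@using Microsoft.AspNetCore.SignalR.Client
@inject NavigationManager Navigation
@inject IJSRuntime JS

<button @onclick="StartAsync">Start Recording</button>
<button @onclick="StopAsync">Stop</button>

<div class="transcript">
    @foreach (var seg in _final)
    {
        <p>@seg.Text</p>
    }
    @if (!string.IsNullOrEmpty(_interim))
    {
        <p class="text-muted-foreground italic">@_interim</p>
    }
</div>

@code {
    List<TranscriptUpdate> _final = new();
    string _interim = "";
    HubConnection? _hub;
    Channel<byte[]>? _audio;

    async Task StartAsync()
    {
        _hub = new HubConnectionBuilder()
            .WithUrl(Navigation.ToAbsoluteUri("/transcribe-hub"))
            .Build();
        await _hub.StartAsync();

        // Chunks from JS land in this channel; its reader is the upload stream for the hub's StreamAudio method.
        _audio = Channel.CreateUnbounded<byte[]>();
        _ = ReceiveTranscriptsAsync(_hub.StreamAsync<TranscriptUpdate>("StreamAudio", _audio.Reader));

        // JS interop: start MediaRecorder, which calls OnAudioChunk for each chunk
        await JS.InvokeVoidAsync("startMicCapture", DotNetObjectReference.Create(this));
    }

    [JSInvokable]
    public Task OnAudioChunk(byte[] chunk) =>
        _audio!.Writer.WriteAsync(chunk).AsTask();

    async Task StopAsync()
    {
        await JS.InvokeVoidAsync("stopMicCapture"); // assumed counterpart of startMicCapture
        _audio?.Writer.Complete();                  // completing the channel ends the upload stream
    }

    // Receive transcripts from the hub stream
    async Task ReceiveTranscriptsAsync(IAsyncEnumerable<TranscriptUpdate> updates)
    {
        await foreach (var update in updates)
        {
            if (update.IsFinal) { _final.Add(update); _interim = ""; }
            else _interim = update.Text;

            await InvokeAsync(StateHasChanged);
        }
    }
}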

  • Voice forms: dictation with structure

Inspections, service tickets, medical visit notes, vehicle reports — all dominated by typing on small screens in awkward environments. Voice forms let the user dictate naturally, then an LLM extracts the structured fields. Productivity gains are dramatic for field-worker workflows.

The flow combines two existing patterns: transcribe the audio, then extract structured fields via JSON-schema output (see our document intelligence guide).

// Fields are nullable so the model can return null for anything the inspector did not mention.
public record VehicleInspection(
    string? LicensePlate,
    int? Mileage,
    bool? TiresOk,
    bool? BrakesOk,
    bool? LightsOk,
    string? OverallCondition, // "good" | "fair" | "needs_service"
    List<string> Concerns);

public async Task<VehicleInspection> ParseInspectionAsync(string transcript)
{
    var response = await _chatClient.GetResponseAsync<VehicleInspection>(
        $@"Extract vehicle inspection details from the dictated notes below.
Use null for fields not mentioned.

NOTES:
{transcript}");

    return response.Result;
}

// Inspector dictates: "License plate Echo-Bravo-Charlie 1234. Mileage 47,820.
// Tires are good but front brakes need attention soon. All lights working.
// Overall fair condition."
//
// Output:
// {
//   "LicensePlate": "EBC-1234",
//   "Mileage": 47820,
//   "TiresOk": true,
//   "BrakesOk": false,
//   "LightsOk": true,
//   "OverallCondition": "fair",
//   "Concerns": ["front brakes need attention"]
// }

The Blazor UI shows the form with the extracted values pre-filled. The inspector reviews it, corrects any errors, and submits. A 3-minute typed report becomes a 30-second voice note plus a 5-second review.
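The review step is easier if the form highlights fields the model could not fill or filled implausibly. The following is a minimal sketch of such a check, not part of the original flow: `InspectionReview`, its messages, and the 2,000,000 mileage ceiling are hypothetical, and it takes plain values so it works whether or not your record uses nullable fields.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: flags extracted values that need the inspector's attention.
public static class InspectionReview
{
    private static readonly HashSet<string> AllowedConditions =
        new(StringComparer.OrdinalIgnoreCase) { "good", "fair", "needs_service" };

    // Returns human-readable issues for the review screen; an empty list means nothing to flag.
    public static List<string> FindIssues(string? licensePlate, int? mileage, string? overallCondition)
    {
        var issues = new List<string>();

        if (string.IsNullOrWhiteSpace(licensePlate))
            issues.Add("License plate was not mentioned.");
        else if (!licensePlate.Any(char.IsLetterOrDigit))
            issues.Add("License plate contains no letters or digits.");

        if (mileage is null)
            issues.Add("Mileage was not mentioned.");
        else if (mileage < 0 || mileage > 2_000_000) // assumed plausibility ceiling
            issues.Add($"Mileage {mileage} looks implausible.");

        if (overallCondition is null || !AllowedConditions.Contains(overallCondition))
            issues.Add("Overall condition must be good, fair, or needs_service.");

        return issues;
    }
}
```

For the dictated example, `FindIssues("EBC-1234", 47820, "fair")` returns an empty list, so the inspector sees a clean form.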

  • Voice assistants: the conversational loop

A voice assistant joins three pieces: STT (hear the user), LLM (decide what to say), TTS (speak the response). Each step can stream — and for the experience to feel natural, they must. The user shouldn't wait 5 seconds for the assistant to think + render + synthesize the whole response.

public async IAsyncEnumerable<byte[]> ConverseAsync(
    Stream userAudio,
    [EnumeratorCancellation] CancellationToken ct)
{
    // 1. Transcribe user audio (final result; usually fast enough)
    var transcript = await _transcriber.TranscribeAsync(userAudio, "audio/wav", ct: ct);

    // 2. Stream LLM tokens, accumulating into sentences
    var messages = new List<ChatMessage> { new(ChatRole.User, transcript.Text) };
    var sentenceBuffer = new StringBuilder();

    await foreach (var update in _chatClient.GetStreamingResponseAsync(messages, cancellationToken: ct))
    {
        sentenceBuffer.Append(update.Text);

        // Flush to TTS at sentence boundaries for low-latency speech start
        var text = sentenceBuffer.ToString();
        var lastBoundary = LastSentenceBoundary(text);
        if (lastBoundary > 0)
        {
            var sentence = text[..lastBoundary];
            sentenceBuffer.Remove(0, lastBoundary);

            await foreach (var audioChunk in _tts.SynthesizeStreamingAsync(sentence, ct))
                yield return audioChunk;
        }
    }

    // Flush remaining
    if (sentenceBuffer.Length > 0)
        await foreach (var audioChunk in _tts.SynthesizeStreamingAsync(sentenceBuffer.ToString(), ct))
            yield return audioChunk;
}

This pattern produces audible response within ~600 ms of the user stopping speaking: STT (~200 ms for short utterances) → first LLM tokens (~300 ms) → first TTS audio (~100 ms). Each stage overlaps with the next.
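The loop above calls a `LastSentenceBoundary` helper that the article does not show. One possible implementation, offered as a sketch under assumptions rather than the original author's code: treat `.`, `!`, or `?` followed by whitespace as the end of a sentence. Requiring the trailing whitespace avoids flushing in the middle of a number such as "3.14" while tokens are still arriving. The `SentenceSplitter` class name is hypothetical.

```csharp
// Hypothetical helper for the streaming loop: finds where complete sentences end.
public static class SentenceSplitter
{
    // Returns the index just past the last complete sentence in `text`,
    // or 0 when no sentence has ended yet.
    public static int LastSentenceBoundary(string text)
    {
        // Scan backwards; stop at Length - 2 because a terminator only counts
        // when a whitespace character follows it.
        for (int i = text.Length - 2; i >= 0; i--)
        {
            bool isTerminator = text[i] is '.' or '!' or '?';
            if (isTerminator && char.IsWhiteSpace(text[i + 1]))
                return i + 2; // include the terminator and the whitespace after it
        }
        return 0;
    }
}
```

A terminator at the very end of the buffer is deliberately not treated as a boundary, so the final sentence is spoken by the "flush remaining" step. Abbreviations such as "Dr. Smith" will still split early; a production version would need an exception list.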

  • Call-center analytics: transcript + summarize + flag

The pattern for support-call analytics:

1. Transcribe the call with speaker diarization (who said what).

2. Summarize with an LLM call: what was the issue, what was the resolution, was the customer satisfied.

3. Flag with another LLM call: escalation needed, compliance violation, glowing testimonial worth keeping.

4. Index the transcript for semantic search (see our semantic search guide).

5. Display in a Blazor admin dashboard with filters, search, and one-click playback at the relevant moment.

public record CallAnalysis(
    string Summary,
    string Resolution,
    string Sentiment, // "positive" | "neutral" | "negative"
    bool EscalationNeeded,
    bool ComplianceIssue,
    List<string> Topics,
    List<ActionItem> FollowUps);

public async Task<CallAnalysis> AnalyzeCallAsync(string transcript)
{
    return (await _chatClient.GetResponseAsync<CallAnalysis>(
        $@"Analyze this customer support call transcript. Be objective.

{transcript}")).Result;
}

Add a Blazor dashboard with filters by sentiment, escalation flag, agent, date range. Click a row to play back the call at the relevant timestamp (Whisper segment data has start times for each phrase).
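Playback "at the relevant timestamp" comes down to looking up a segment and seeking the audio player to its start time. A minimal sketch, assuming the `Segment` shape from the batch transcription section (repeated here so the example stands alone) with times in seconds; `TranscriptSearch` and the 2-second lead-in are hypothetical choices.

```csharp
using System;
using System.Collections.Generic;

public record Segment(double Start, double End, string Text, double Confidence);

// Hypothetical helper: maps a search phrase to a playback position.
public static class TranscriptSearch
{
    // Returns the position (in seconds) to seek to for the first segment that
    // contains `phrase`, backed off by a short lead-in, or null if nothing matches.
    public static double? FindPlaybackStart(
        IReadOnlyList<Segment> segments, string phrase, double leadInSeconds = 2.0)
    {
        foreach (var segment in segments)
        {
            if (segment.Text.Contains(phrase, StringComparison.OrdinalIgnoreCase))
                return Math.Max(0.0, segment.Start - leadInSeconds);
        }
        return null;
    }
}
```

The dashboard can pass the returned value to the audio element's `currentTime` through JS interop; the lead-in gives the listener a moment of context before the matched phrase.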

The TTS direction

Modern neural TTS sounds genuinely human — natural prosody, emotion, multiple speaker styles. Azure Speech, ElevenLabs, and Cartesia all produce audio your users will mistake for a recorded person. Use SSML to control pacing, emphasis, and pronunciations.

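// Azure Speech example: neural voice with SSML
public async IAsyncEnumerable<byte[]> SynthesizeStreamingAsync(
    string text,
    [EnumeratorCancellation] CancellationToken ct)
{
    // Azure requires the SSML namespace on the <speak> root element.
    var ssml = $@"<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
  <voice name='en-US-AvaNeural'>
    <prosody rate='0.95'>{SecurityElement.Escape(text)}</prosody>
  </voice>
</speak>";

    // A null AudioConfig disables speaker playback; audio is read from the result stream instead.
    using var synth = new SpeechSynthesizer(_config, null as AudioConfig);

    // StartSpeakingSsmlAsync returns once synthesis begins, so data can be read while the rest is generated.
    // Task.WaitAsync (not WithCancellation) is the built-in way to make the await cancellable.
    using var result = await synth.StartSpeakingSsmlAsync(ssml).WaitAsync(ct);

    // Stream audio data out as it arrives
    using var stream = AudioDataStream.FromResult(result);
    var buffer = new byte[4096];
    uint read;
    while ((read = stream.ReadData(buffer)) > 0 && !ct.IsCancellationRequested)
        yield return buffer[..(int)read];
}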

Production patterns

Microphone permission UX

Browsers prompt for mic access the first time a page requests it and remember the answer. Handle rejection gracefully: your Blazor component should show a fallback ("we couldn't access your microphone; you can upload a recording instead") rather than appearing broken. Where the browser supports it, navigator.permissions.query({ name: 'microphone' }) reports the current state (granted, denied, or prompt) before you call getUserMedia.

Audio format conversion

Most browsers record in WebM/Opus by default; Whisper prefers WAV at 16 kHz mono. Convert on the server with NAudio or FFmpeg, or on the client with WebCodecs. The client option avoids a server-side FFmpeg dependency but adds JS complexity.
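Once the audio has been decoded to raw PCM, the conversion itself is a downmix plus a sample-rate reduction. The sketch below illustrates the idea for one specific case, assuming interleaved 16-bit stereo input at 48 kHz; `PcmConverter` is a hypothetical helper, and the plain averaging it uses is only a crude low-pass filter. NAudio or FFmpeg resamplers will give better quality and handle other rates.

```csharp
// Hypothetical helper: naive 48 kHz stereo -> 16 kHz mono conversion for 16-bit PCM.
public static class PcmConverter
{
    public static short[] StereoTo16kMono(short[] interleaved48k)
    {
        const int channels = 2;
        const int factor = 3; // 48 kHz / 16 kHz

        int frames = interleaved48k.Length / channels;
        var output = new short[frames / factor]; // trailing frames that do not fill a group are dropped

        for (int o = 0; o < output.Length; o++)
        {
            // Average three consecutive stereo frames (six samples) into one mono sample.
            int sum = 0;
            for (int f = 0; f < factor; f++)
            {
                int i = (o * factor + f) * channels;
                sum += interleaved48k[i] + interleaved48k[i + 1];
            }
            output[o] = (short)(sum / (factor * channels));
        }
        return output;
    }
}
```

Decoding WebM/Opus into PCM is a separate step that this sketch does not cover.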

Cost

Transcription via OpenAI Whisper API runs ~$0.006/minute. Real-time STT services (Deepgram, Azure) are similar. Neural TTS lands at ~$15/million characters. For a 1-hour recording with 10-minute summary spoken back, you're at well under $1 per session — meaningful at high volume but trivial for occasional use.
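The "well under $1" figure can be checked with a few lines of arithmetic. The sketch below uses the two rates quoted above and assumes a speaking rate of 150 words per minute at about 6 characters per word (spaces included); both assumptions, and the `VoiceCostEstimator` name, are illustrative. It covers STT and TTS only, not the LLM call that produces the summary.

```csharp
// Hypothetical helper: rough per-session cost from the rates quoted in the text.
public static class VoiceCostEstimator
{
    private const decimal SttPerMinute = 0.006m;     // transcription rate quoted above
    private const decimal TtsPerMillionChars = 15m;  // neural TTS rate quoted above
    private const int WordsPerMinute = 150;          // assumed speaking rate
    private const int CharsPerWord = 6;              // assumed average, including the space

    public static decimal EstimateSessionCost(int recordedMinutes, int spokenMinutes)
    {
        decimal stt = recordedMinutes * SttPerMinute;

        // Characters synthesized = minutes * words per minute * characters per word.
        decimal ttsChars = spokenMinutes * WordsPerMinute * CharsPerWord;
        decimal tts = ttsChars / 1_000_000m * TtsPerMillionChars;

        return stt + tts;
    }
}
```

For a 60-minute recording with a 10-minute spoken summary, this gives $0.36 for transcription plus $0.135 for roughly 9,000 synthesized characters, or about $0.50 in total.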

Privacy and recording consent

Recording conversations triggers legal requirements that vary by jurisdiction. For one-on-one product features (voice forms, voice assistants), in-app consent is usually sufficient. For call recording (support calls, sales calls), you need explicit notification at call start in most regions. Build the consent UX before the AI plumbing.

Hosting recommendations

ASP.NET Business — $17.49/mo

Real-time transcription with multiple concurrent users, voice assistants, call analytics dashboards. 2 GB RAM for headroom.

View Business plan →

ASP.NET Professional — $27.49/mo

High-throughput call center platforms, multi-tenant voice products, agency deployments. 4 GB per pool, highest priority scheduling.

View Professional plan →

FAQs

Can I self-host Whisper on Windows?

Yes, via Whisper.net (.NET bindings to whisper.cpp). For the smallest models (tiny, base), CPU inference is feasible. For larger models (small, medium, large), you'll want a GPU. CPU-only self-hosted Whisper on shared hosting isn't a production fit; offload to a GPU server or use a hosted API.

How accurate is real-time transcription?

For clean audio in supported languages, 90-95% word-level accuracy. Background noise, heavy accents, and technical jargon degrade this. Display the live transcript as text the user can edit — most products treat transcripts as starting points, not gospel.

What about WebRTC for two-way calls?

WebRTC handles peer-to-peer audio with sub-100ms latency, ideal for voice assistants and live conversations. Pion (Go) is the most-cited reference; the .NET option is SIPSorcery. Add a TURN server (coturn) for users behind restrictive NATs.

Can I clone someone's voice for TTS?

Voice cloning exists (ElevenLabs, others) but consent is paramount. For your own brand voice (the company "spokesperson"), record 30 seconds with explicit IP ownership and use cloning legitimately. Do not clone customer voices without unambiguous consent.

How do I handle non-English audio?

Whisper handles 99+ languages with one model. Tell it the language (faster) or let it auto-detect (slightly slower). For TTS, pick voices in the target language explicitly — Azure has 400+ neural voices across languages.

What's the latency budget for a natural-feeling voice assistant?

Target under 800 ms from user-stop-speaking to first TTS audio playing. Anything beyond ~1.5 s feels broken. Stream every stage (STT → LLM → TTS) and start TTS at first sentence boundary, not at full LLM completion.

Ship it

Voice is one of the highest-impact AI features you can add — accessibility, hands-free workflows, meeting productivity, call-center analytics. The .NET 10 stack ships everything: Whisper-grade STT via API or self-hosted, neural TTS with streaming output, real-time audio over SignalR, and Blazor components for the recording UX.

Adaptive Web Hosting's ASP.NET hosting plans run all of this on real Windows + IIS with dedicated app pools that handle persistent SignalR connections cleanly, SQL Server 2022 included for storing transcripts and recordings metadata, and free SSL on every plan.
