Observability & Analytics

Call Transcription

By Vadim Kouznetsov, Founder of BubblyPhone · Last updated April 5, 2026

Call transcription is the conversion of a phone conversation into readable text — either in real time as the conversation happens or as a post-processing step after the call ends — using automatic speech recognition (ASR) to turn audio into words. A good transcript is the foundation for everything from call analytics to compliance audits to LLM-driven call handling.

Real-time vs post-call transcription

These look similar from the outside but they solve different problems and have different engineering constraints.

Real-time transcription runs during the call. Audio streams into an ASR engine that emits partial transcripts as words are spoken, usually within 200 to 400 milliseconds of the speaker finishing a phrase. This is what an LLM-driven AI phone agent needs: the model has to see what the caller said before it can respond, so the STT latency is inside the critical path of every turn. Real-time ASR engines are optimised aggressively for speed at the cost of some accuracy.

Post-call transcription runs after the conversation ends, usually on the recorded audio file. It can take seconds or minutes per call without anyone noticing, because nothing is waiting on it. The relaxed latency budget means post-call engines can use larger models, more expensive acoustic processing, and multi-pass techniques that would be impossible in real time. The result is higher accuracy on the same audio.

Most production AI phone agent systems run both. Real-time transcription drives the live conversation; a separate higher-quality post-call transcription generates the archive used for analytics and compliance.
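The dual-pipeline setup can be sketched as two entry points: one on the critical path of each conversational turn, one that just enqueues work for later. This is an illustrative sketch — `realtime_asr` and `batch_asr` are hypothetical stubs standing in for whichever engines you actually use:

```python
import queue

# Hypothetical stubs standing in for real ASR engines.
def realtime_asr(chunk: bytes) -> str:
    return "<partial transcript>"   # fast, lower accuracy

def batch_asr(recording_path: str) -> str:
    return "<full transcript>"      # slower, higher accuracy

post_call_jobs: "queue.Queue[str]" = queue.Queue()

def on_audio_chunk(chunk: bytes) -> str:
    # Critical path: the agent's LLM waits on this result every turn.
    return realtime_asr(chunk)

def on_call_end(recording_path: str) -> None:
    # Off the critical path: queue a higher-quality pass for the archive.
    post_call_jobs.put(recording_path)

def drain_archive_jobs() -> None:
    # Run by a background worker; nothing in the live call waits on this.
    while not post_call_jobs.empty():
        path = post_call_jobs.get()
        transcript = batch_asr(path)  # store for analytics / compliance
        post_call_jobs.task_done()
```

The key design point is that only `on_audio_chunk` sits inside the turn latency budget; everything behind the queue can take as long as it needs.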

The accuracy realities nobody mentions in marketing

ASR accuracy is measured by Word Error Rate (WER) — the percentage of words that are substituted, deleted, or inserted compared to a ground-truth reference. Vendor marketing quotes WER figures like “under 5%”. Production reality is different.

  • Telephone audio is worse than studio audio. Phone calls are bandlimited to 300–3400 Hz on a traditional line and use lossy codecs (G.711, Opus). An engine that hits 3% WER on clean broadcast news will hit 8 to 12% on telephone audio.
  • Accents matter more than vendors admit. English ASR engines trained primarily on US speech perform noticeably worse on Indian English, African English, and even some regional British dialects. The gap can be 2x.
  • Background noise compounds. Office noise, car noise, a TV in the background — each one can double the error rate.
  • Domain-specific words are hard. Product names, drug names, technical jargon, company names, and personal names are the words ASR engines get wrong most often, and they are also the words you most care about.
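WER as defined above is just a word-level edit distance divided by the reference length, which makes it easy to compute yourself when checking a vendor's numbers against your own call audio. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six: WER ≈ 16.7%
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that a single dropped word in a six-word reference already costs 16.7% WER — short phone utterances make the percentage swing quickly.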

The practical implication is that you cannot assume the transcript is the same as what was actually said. If a downstream process depends on a specific term being in the transcript, that process needs a fallback for when the term is misrecognised.
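One common fallback is fuzzy matching against a known domain vocabulary, so a misrecognised product or drug name still resolves to the intended term. A sketch using the standard library's `difflib` — the vocabulary and cutoff here are illustrative, not a recommendation:

```python
import difflib
from typing import Optional

# Hypothetical domain vocabulary — the terms you most care about.
KNOWN_TERMS = ["atorvastatin", "lisinopril", "metformin"]

def match_term(word: str, cutoff: float = 0.75) -> Optional[str]:
    """Map a possibly-misrecognised transcript word to a known domain term,
    or return None if nothing is close enough."""
    hits = difflib.get_close_matches(word.lower(), KNOWN_TERMS, n=1, cutoff=cutoff)
    return hits[0] if hits else None
```

The cutoff trades false positives against missed matches; tune it on real transcripts rather than guessing, and route low-confidence matches to a human or a clarifying question rather than acting on them silently.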

Diarization: who said what

A transcript without speaker labels is a wall of text. Diarization is the process of segmenting audio by speaker so each line in the transcript is attributed to either the caller, the agent, or whoever else is on the call. In a two-party phone call, diarization is relatively easy because the caller and agent audio arrive on separate channels (or can be separated by simple voice activity detection). In a conference call with three or more participants it becomes substantially harder.

For AI phone agents, diarization is almost always trivial because the agent’s own text is known (the model generated it), so the only thing to attribute is whatever the caller said. This is why call transcripts from AI agent systems look cleaner than transcripts from human contact centers: there is no speaker labeling uncertainty.
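Because the agent's turns are already known text, "diarization" for an AI agent reduces to interleaving two labeled streams by timestamp. A minimal sketch, assuming each turn is a `(start_seconds, text)` pair:

```python
def merge_turns(agent_turns, caller_turns):
    """Merge agent and caller turns into one labeled transcript,
    ordered by start time. Each input turn is (start_seconds, text)."""
    labeled = [("agent", start, text) for start, text in agent_turns] \
            + [("caller", start, text) for start, text in caller_turns]
    return sorted(labeled, key=lambda turn: turn[1])

transcript = merge_turns(
    agent_turns=[(0.0, "Hi, how can I help?"), (5.2, "Sure, one moment.")],
    caller_turns=[(2.3, "I'd like to check my order.")],
)
for speaker, start, text in transcript:
    print(f"[{start:5.1f}s] {speaker:>6}: {text}")
```

No clustering or voice embeddings are involved — the labels come from the system architecture, which is exactly why these transcripts are cleaner than contact-center ones.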

Compliance considerations

Transcription is not just a technical choice; it is a legal one. A transcript is a record of what was said, and the same wiretap and consent laws that apply to recording audio apply to storing transcripts. If your jurisdiction requires two-party consent to record calls, the same rule covers transcripts derived from those calls. Announce the transcription at the start of the call if required by local law, and make sure your retention policy is consistent between the audio and the text.

Transcription in BubblyPhone Agents

BubblyPhone Agents produces a full transcript for every call automatically. The endpoint returns structured data with speaker labels, timestamps per turn, and the full text of everything said. You retrieve it with a single API call after the conversation ends. For an end-to-end example of using transcripts for analytics, see the guide on call analysis with AI.
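The structured response described above — speaker labels, per-turn timestamps, full text — might be consumed like this. The JSON field names here are illustrative assumptions, not the documented BubblyPhone schema; consult the API reference for the real shape:

```python
import json

# Hypothetical transcript payload; field names are illustrative only.
payload = json.loads("""
{
  "call_id": "abc123",
  "turns": [
    {"speaker": "agent",  "start_ms": 0,    "text": "Hi, how can I help?"},
    {"speaker": "caller", "start_ms": 2300, "text": "I'd like to check my order."}
  ]
}
""")

for turn in payload["turns"]:
    print(f'[{turn["start_ms"] / 1000:6.1f}s] {turn["speaker"]:>6}: {turn["text"]}')
```

From here, feeding the turn list into an analytics pipeline or an LLM summarisation step is a matter of iterating over `payload["turns"]`.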

Further reading