Backchanneling in Voice AI: Why the Small Sounds Matter


April 6, 2026 · 10 min read

The first time I heard a good AI phone agent on a call, the thing that surprised me was not the voice quality or the response latency. It was a small “mhm” that landed at exactly the right moment, three seconds into a long sentence I was speaking. It was such a small sound, barely a syllable, and it was the thing that made me stop thinking of the agent as a piece of software.

That sound is called a backchannel, and it turns out to be one of the most important things a voice AI system can do. It is also one of the easiest things to get wrong. This article is about what backchanneling is, where it comes from, why it matters so much for AI phone agents, and how to build it into a real system without making it weird.


What backchanneling actually is

Linguists borrowed the term “back channel” from telecommunications. In a two-way transmission, the back channel is the return path — the acknowledgments that flow from the receiver to the sender to confirm that the primary channel is working. The linguist Victor Yngve applied this to human conversation in a 1970 paper called On Getting a Word in Edgewise. His observation was that a listener in a conversation is not silent. They are continuously sending short signals back to the speaker — vocalizations, head nods, facial expressions — and those signals shape how the speaker continues.

The English backchannel inventory is small and consistent. Mhm, uh-huh, yeah, right, okay, I see, sure, got it. In typical production they last under 500 milliseconds, they land during brief pauses in the speaker's output, and they do not attempt to take the turn. They are not answers. They are not follow-up questions. They are signals that mean “I am listening, please continue.”

Non-verbal backchannels exist too — short breath sounds, small laughs, hums — but on a phone call the verbal ones do most of the work. Over a phone line the listener cannot see the speaker's face, so the verbal signals have to carry the entire weight of “I am still here.”

Why silence feels wrong on a phone call

Try this experiment. Call a friend and tell them a three-sentence story about something that happened to you today. Pay attention to what they do while you are telling it. Unless they are actively distracted, they will emit a sound every few seconds. Oh yeah. Mhm. Ha. Right. These sounds are not random — they land at pause points between clauses, and if you were to remove them, the call would immediately feel strange.

Now imagine the same call with ten seconds of complete silence between your sentences. You would start to wonder whether the line was still open. You would probably say “hello?” or “are you there?” or repeat yourself. The silence would break the conversation in a specific way: not by preventing you from speaking, but by destroying your confidence that anyone is listening.

This is exactly the failure mode of first-generation AI phone agents. The caller speaks. The system transcribes. The transcription goes to an LLM. The LLM generates a response. The response goes to text-to-speech. The audio comes back. The total time budget is often over two seconds. During those two seconds, the caller hears nothing. The AI has not failed — it is computing a perfectly good response — but the caller experiences the silence as absence, and by the time the AI finally speaks, the caller is already wondering whether to hang up.

The three places backchanneling matters in an AI pipeline

In a production voice AI system, there are three distinct moments where backchanneling has to happen correctly.

During the caller's turn. When the caller is speaking a long utterance — explaining a problem, telling a story, describing a situation — the agent should emit small acknowledgments at natural pause points. This is the thing most first-generation systems completely fail to do, because their pipelines are turn-based: the agent waits for the caller to finish, then processes, then responds. A good agent instead streams its listening behavior in real time.

While the agent is thinking. When the caller finishes and the LLM is generating a response, there is a gap — often 300 to 800 milliseconds — while the computation happens. A short filler (“okay,” “let me check,” “right”) bridges the silence so the caller knows the agent heard them and is responding. This is not stalling — it is the same thing a human does when asked a question they need a moment to think about.

From the caller, toward the agent. Less obvious but equally important. When the agent is delivering a multi-sentence response and the caller emits their own backchannel (“yeah, yeah, go on”), the agent should recognize it as a continuer signal and keep going, rather than interpreting the sound as a new turn that needs a response. Systems that stop mid-sentence to ask “I'm sorry, did you say something?” every time the caller hums get irritating fast.

Two architectures, two completely different experiences

There are two ways to build an AI phone agent today, and they produce very different backchanneling behavior.

The first is the classic three-stage pipeline: speech-to-text, then an LLM, then text-to-speech. Each stage is a separate API, and the pipeline waits for the caller to finish speaking before the first stage can even start. Backchanneling in this architecture has to be simulated — usually by detecting brief pauses in the audio stream and playing a pre-recorded “mhm” sound, or by streaming a filler phrase while the LLM runs. It works, sort of, but it never sounds quite right. The timing is off. The sounds do not match the caller's cadence. The fillers come out a beat late or a beat early.

The second is a native audio model — OpenAI's GPT Realtime, Google's Gemini Live, and a handful of similar models. These models predict audio tokens directly, trained on natural conversation data. Backchanneling comes out for free, because the model has learned when to produce “mhm” the same way it has learned when to produce any other word. Developers moving from a webhook pipeline to a streaming architecture almost universally report the same thing: the agent suddenly “feels human.” The backchanneling is most of what they are noticing.

This is the single biggest reason streaming-mode speech-to-speech models have taken over voice AI deployments in the past year. Not the raw latency, not the accuracy, not the voice quality — the small acknowledgments that keep the caller from thinking they are talking to a machine.

How to simulate it in a webhook pipeline

If you are stuck with a three-stage pipeline — usually because you need the flexibility of choosing your own LLM or integrating with a specific STT provider — you can still fake decent backchanneling. The techniques are well-known but require careful tuning.

Voice activity detection with filler audio. Use a VAD library (Silero, WebRTC VAD, or the VAD built into most real-time STT providers) to identify when the caller is speaking versus pausing. When a pause longer than about 500 ms is detected mid-utterance, play a short pre-recorded backchannel sound. Keep the library of sounds small (five to ten variants so it does not sound robotic) and match them to the agent's voice personality. Do not play them during silences longer than two seconds — that signals the caller has stopped speaking, not paused mid-thought.
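The pause-detection logic above can be sketched as a small scheduler that consumes per-frame VAD decisions (from Silero, WebRTC VAD, or any other source) and tells the pipeline when to play a filler. This is a minimal illustration; the class name, thresholds, and frame size are assumptions, and a real system would plug in an actual VAD library rather than a boolean flag.

```python
import random

class BackchannelScheduler:
    """Decides when to play a pre-recorded filler during a caller's pause.

    Follows the rules described above: fire on mid-utterance pauses longer
    than ~500 ms, stay quiet once silence stretches past ~2 s (the caller
    has likely finished, not paused), and rotate through a small pool of
    variants. All names and thresholds are illustrative.
    """

    MIN_PAUSE_MS = 500    # shorter gaps are just breathing room
    MAX_PAUSE_MS = 2000   # longer gaps mean the caller has stopped

    def __init__(self, fillers=("mhm", "uh-huh", "right", "okay", "I see")):
        self.fillers = list(fillers)
        self.silence_ms = 0
        self.fired_this_pause = False

    def on_frame(self, is_speech: bool, frame_ms: int = 20):
        """Feed one VAD decision per audio frame; returns a filler to play, or None."""
        if is_speech:
            self.silence_ms = 0
            self.fired_this_pause = False
            return None
        self.silence_ms += frame_ms
        if (not self.fired_this_pause
                and self.MIN_PAUSE_MS <= self.silence_ms < self.MAX_PAUSE_MS):
            self.fired_this_pause = True
            return random.choice(self.fillers)
        return None
```

The single-fire flag matters: without it, the scheduler would emit a filler on every frame of a long pause instead of once per pause point.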

Filler phrases during LLM generation. When the LLM starts generating a response, have the audio pipeline immediately emit a short filler phrase (“okay, let me see” or “one moment”) before the full response is ready. This buys 1–2 seconds of bridge time. The risk is that the filler becomes repetitive, so vary the phrases and keep them short.
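One way to wire this up is a timeout race: start the LLM call, and if it has not produced a response within the latency budget, speak a filler while the generation finishes. The sketch below assumes async `generate_response` and `speak` callables supplied by the surrounding pipeline; both names are hypothetical.

```python
import asyncio
import random

FILLERS = ["okay, let me see", "one moment", "right, checking that now"]

async def respond_with_filler(generate_response, speak, gap_ms=800):
    """Bridge the LLM's thinking time with a short spoken filler.

    If generate_response() finishes within gap_ms, the caller hears only
    the real reply. Otherwise a filler plays first, then the reply.
    """
    task = asyncio.ensure_future(generate_response())
    try:
        # shield() keeps the timeout from cancelling the LLM call itself
        reply = await asyncio.wait_for(asyncio.shield(task), timeout=gap_ms / 1000)
    except asyncio.TimeoutError:
        await speak(random.choice(FILLERS))  # bridge the silence
        reply = await task                   # then wait for the real answer
    await speak(reply)
```

Keeping the filler pool small and the phrases short matches the guidance above: the bridge should sound like thinking, not like a new turn.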

Caller-side backchannel suppression. Instruct the LLM to ignore short continuer utterances in its transcript stream. Include a rule in the system prompt like: “If the caller says only ‘mhm,' ‘yeah,' or ‘right' in the middle of your response, continue your current thought. Do not treat these as new input requiring a reply.” Some pipelines handle this at the transcription layer by filtering short utterances before they reach the model.
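The transcription-layer variant can be as simple as a short-utterance filter that runs before transcripts reach the model. A minimal sketch, with the continuer list and function name as assumptions; the key design point is that suppression only applies while the agent holds the turn, so a standalone "yeah" answering a yes/no question still gets through.

```python
import string

# Short listener signals that mean "keep going", not "my turn"
CONTINUERS = {"mhm", "uh-huh", "yeah", "right", "okay", "ok", "sure", "got it", "i see"}

def is_continuer(utterance: str, agent_is_speaking: bool) -> bool:
    """True when a transcript fragment is a backchannel to drop
    before it reaches the LLM."""
    if not agent_is_speaking:
        return False  # same word as a standalone answer must pass through
    tokens = [w.strip(string.punctuation) for w in utterance.lower().split()]
    if not tokens or len(tokens) > 3:
        return False
    phrase = " ".join(tokens)
    # accept exact matches ("got it") and short repeats ("yeah, yeah")
    return phrase in CONTINUERS or all(t in CONTINUERS for t in tokens)
```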

None of these approaches approximate a native audio model. They are best-effort workarounds. But together they can produce an agent that sounds substantially more human than a naive turn-based system.

Where backchanneling goes wrong

Too little backchanneling is the obvious failure mode, but too much is worse. Over-backchanneling feels performative or mocking. An agent that says “mhm” after every single sentence starts to sound passive-aggressive, the way a spouse who is barely listening says “uh-huh” through a monologue. A good rate is one backchannel every 2–4 pause points during a long caller utterance, not every pause.

Mistimed backchannels are nearly as bad. A “yeah” that lands in the middle of a word rather than at a clause boundary sounds like the agent is interrupting. A “right” that comes 600 ms after the caller finishes sounds delayed and awkward, as if the agent was not actually listening and is remembering to respond. Native audio models get the timing right because they were trained on timing; rule-based systems get the timing right only if you tune them obsessively.

Cultural mismatch is the subtler problem. English backchannels (mhm, yeah, okay) sound strange in languages that use different cues. Japanese conversation relies heavily on hai and sou desu ne, and a Japanese-speaking caller who gets “mhm” from an agent perceives it as either foreign or dismissive. For multilingual AI phone agents, the backchannel inventory has to match the language the caller is speaking — not the language the agent is configured in. Modern speech-to-speech models handle this reasonably well when they detect the language shift, but only when the training data covers the target language.

Finally: turn-stealing. A filler phrase that is too long stops being a backchannel and becomes a new turn. “Let me see what I can find for you today” is a filler phrase. “Mhm” is a backchannel. The difference matters because the first one interrupts the caller, and the second one supports them.

A practical rule: what to measure

If you are building or tuning a voice AI system and want to improve the feel, measure these three things:

Average agent silence during a long caller utterance. Record a set of test calls where the caller speaks for 10+ seconds continuously and measure how long the agent is silent. A good number is under 3 seconds without a vocalization. A bad number is 10+ seconds — which means your agent is behaving like a turn-based system.

LLM response latency gap. Measure the time between the caller finishing speaking and the agent beginning its response. Anything over 800 ms without a bridging sound feels wrong. Anything over 1.5 seconds feels broken. If your LLM is genuinely slower than this, the fix is not faster inference — it is a better filler phrase strategy.

Caller interruption rate during agent speech. Measure how often the caller says something short (“yeah,” “mhm,” “okay”) while the agent is delivering a response, and how the agent handles it. If the agent stops mid-sentence more than 10% of the time in response to these short continuers, you have a turn-detection problem. The caller is supporting you, not interrupting you.

These are simple metrics and they correlate directly with subjective quality scores. If you improve them, the agent will feel better.
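The first two metrics can be computed offline from a time-ordered event log of call audio. The log format below — `(timestamp_ms, label)` tuples with start/stop labels per speaker — is an assumption for illustration; adapt it to whatever your call recorder actually emits.

```python
def call_metrics(events):
    """Derive (longest_agent_silence_ms, response_gap_ms) from an event log.

    longest_agent_silence_ms: longest stretch during the caller's turn in
    which the agent made no sound before vocalizing again.
    response_gap_ms: time from the caller finishing to the agent's first
    response. Simplified sketch; assumes one caller turn per log.
    """
    longest_silence = 0
    last_agent_end = None
    caller_speaking = False
    caller_stop_ts = None
    response_gap = None
    for ts, label in events:
        if label == "caller_start":
            caller_speaking = True
            last_agent_end = ts  # silence clock starts with the caller's turn
            caller_stop_ts = None
        elif label == "caller_stop":
            caller_speaking = False
            caller_stop_ts = ts
        elif label == "agent_start":
            if caller_speaking and last_agent_end is not None:
                longest_silence = max(longest_silence, ts - last_agent_end)
            if caller_stop_ts is not None and response_gap is None:
                response_gap = ts - caller_stop_ts
        elif label == "agent_stop":
            last_agent_end = ts
    return longest_silence, response_gap
```

Run against the thresholds above: a longest silence over 3,000 ms or a response gap over 800 ms without a bridging sound is the signal to tune.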

Why this matters for the business

Backchanneling is the kind of detail that sounds trivial and turns out to be load-bearing. The agents that get it right feel human to callers; the agents that get it wrong feel like phone trees. That perceptual difference compounds into every metric the business cares about: hangup rates, completion rates, conversion rates, repeat-call rates, CSAT. The callers who feel they are being listened to stay on the line; the callers who feel they are not, hang up.

For most AI phone agent deployments, the most impactful change a team can make in the first month of production is tuning the backchanneling behavior. It will feel small. It is not.



Ready to build AI phone agents that sound human? Sign up for BubblyPhone Agents and start with streaming mode — backchanneling is included by default when you use a native audio model.
