Conversation Intelligence

Backchanneling

By Vadim Kouznetsov, Founder of BubblyPhone · Last updated April 5, 2026

Backchanneling is the short verbal signals a listener produces during a conversation — sounds like “mhm”, “uh-huh”, “right”, or “okay” — that tell the speaker they are being heard and should continue. Linguists borrowed the term from telecommunications, where the “back channel” is the reverse path in a two-way transmission. In a voice conversation, backchannels are sent on the listener’s turn without taking over the floor.

Where the term comes from

Linguist Victor Yngve introduced “back channel” in his 1970 paper On Getting a Word in Edgewise, where he argued that a listener in a conversation is not silent. They are continuously transmitting signals back to the speaker — short vocalizations, head nods, and facial expressions — and those signals shape how the speaker continues. The field of conversation analysis has studied these cues for decades, and researchers today still rely on Yngve’s framing when describing turn-taking behavior.

In English, the most common verbal backchannels are mhm, uh-huh, yeah, right, okay, and I see. They typically last under 500 milliseconds, occur during brief pauses in the speaker’s output, and do not attempt to take the turn. Non-verbal backchannels include short breath sounds and laughter. Every language and culture has its own inventory — Japanese relies heavily on hai and sou desu ne, for example — which is one of the reasons multilingual voice agents are genuinely hard to build well.

Why backchanneling matters for AI phone agents

On a phone call, a listener who produces no sound for ten seconds feels either absent or dismissive. This is the core problem with first-generation AI voice agents: the caller speaks, the system goes silent while it transcribes, routes through an LLM, and synthesizes a reply, and then the AI speaks. The silence in the middle is unnatural. Human listeners fill that gap with backchannels. A good AI agent has to do the same — or it sounds broken, even when the response content is perfect.

There are three places where backchanneling shows up in an AI phone agent pipeline:

  1. During the caller’s turn. When the caller is explaining something long, the agent produces short acknowledgments at natural pause points so the caller knows they are being heard.
  2. While the agent is thinking. When the caller finishes and the LLM is still processing, a brief “okay” or “let me check” bridges the silence until the real response is ready.
  3. In the caller’s direction, from the agent. When the agent is delivering a long response and the caller emits a backchannel (“yeah, yeah”), the agent should recognize it as a continue signal and not stop mid-sentence to ask a follow-up question.
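The third case above amounts to a small classification problem: is the caller's utterance a continuer, or a real turn that needs a reply? A minimal sketch, where the word list and the two-word length threshold are illustrative assumptions rather than a fixed standard:

```python
# Hypothetical helper: decide whether a caller utterance is a
# backchannel ("keep going" signal) or a real turn needing a reply.
BACKCHANNELS = {"mhm", "mm-hmm", "uh-huh", "yeah", "right", "okay", "ok"}

def is_backchannel(utterance: str, max_words: int = 2) -> bool:
    """True if the utterance is a short continuer, not new input."""
    words = utterance.lower().strip(" .,!?").split()
    if not words or len(words) > max_words:
        return False  # anything longer is treated as a real turn
    # "yeah, yeah" -> every token must be a known continuer
    return all(w.strip(",.") in BACKCHANNELS for w in words)
```

An agent mid-delivery would call this on each transcribed caller fragment and keep talking when it returns true, instead of stopping to answer “yeah, yeah” as if it were a question.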

How it is implemented in practice

Backchanneling in AI voice agents is implemented in two fundamentally different ways depending on the architecture.

Real-time speech-to-speech models like OpenAI’s GPT Realtime and Google’s Gemini Live handle backchanneling natively. Because these models predict audio tokens directly and are trained on natural conversation data, they produce backchannels as part of their output without any special handling. In practice, this is one of the most important reasons a streaming architecture sounds more natural than a webhook-based one — the model has learned when to say “mhm” from its training data, not from a rule someone wrote.

Webhook-based pipelines, where speech-to-text, LLM, and text-to-speech are separate services, have to simulate backchanneling with explicit engineering. Typical approaches include:

  • Using voice activity detection (VAD) to identify when the caller pauses mid-thought (silences shorter than 800 ms) and playing a pre-recorded backchannel sound.
  • Streaming a filler phrase (“okay, let me look that up”) while the LLM is still generating the real response.
  • Handling caller-side backchannels by instructing the LLM to ignore short continuer utterances (“mhm”, “right”) rather than treating them as new input that needs a reply.
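The first approach above can be sketched as a loop over voice-activity-detection output. This assumes a stream of per-frame speech/silence flags; the frame size, thresholds, and the `play_backchannel` callback are placeholders, not part of any specific VAD library:

```python
FRAME_MS = 20        # typical VAD frame duration
PAUSE_MIN_MS = 300   # ignore very short gaps between words
PAUSE_MAX_MS = 800   # silences this long look like end-of-turn instead

def pause_monitor(vad_frames, play_backchannel):
    """Scan VAD output (True = speech, False = silence) and trigger one
    pre-recorded backchannel per mid-thought pause."""
    silence_ms = 0
    fired = False
    for speaking in vad_frames:
        if speaking:
            silence_ms = 0
            fired = False
        else:
            silence_ms += FRAME_MS
            # fire once when a mid-thought pause is detected, stopping
            # short of PAUSE_MAX_MS, where end-of-turn handling takes over
            if not fired and PAUSE_MIN_MS <= silence_ms < PAUSE_MAX_MS:
                play_backchannel("mhm")
                fired = True
```

The `fired` flag matters: without it, a single pause would trigger a backchannel on every frame until the caller resumed.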

None of these techniques approximates the real thing as well as a model that produces backchannels natively. This is why many AI phone agent developers moving from a webhook pipeline to a streaming pipeline report that their agents suddenly “feel human” — the backchanneling arrives for free.

Where backchanneling goes wrong

Too much backchanneling is as bad as too little. Common failure modes in production AI voice agents include:

  • Over-backchanneling. The agent says “mhm” after every sentence, which starts to feel like mockery or disengagement.
  • Mistimed backchannels. A “yeah” that lands in the middle of a word instead of at a pause sounds aggressive.
  • Turn-stealing. A filler phrase that is too long (“let me see what I can find for you today”) stops being a backchannel and becomes a new turn, interrupting the caller.
  • Cultural mismatch. English backchannels sound rude in languages that expect different cues. An agent serving Japanese callers needs hai, not okay.
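A webhook pipeline can guard against the first failure mode by rate-limiting its acknowledgments. A minimal sketch; the four-second minimum gap is an illustrative assumption, and the injectable clock exists only to make the gate testable:

```python
import time

class BackchannelGate:
    """Suppress acknowledgments that would land too close together,
    a simple guard against over-backchanneling."""

    def __init__(self, min_gap_s: float = 4.0, now=time.monotonic):
        self.min_gap_s = min_gap_s
        self.now = now                # injectable clock for testing
        self.last = float("-inf")     # time of the last backchannel

    def allow(self) -> bool:
        """Return True (and record the time) if a backchannel may play now."""
        t = self.now()
        if t - self.last >= self.min_gap_s:
            self.last = t
            return True
        return False
```

Whatever mechanism decides a backchannel is warranted (the VAD pause detector, the filler-phrase trigger) checks the gate first; candidates that arrive too soon after the previous one are simply dropped.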

How BubblyPhone Agents handles backchanneling

In streaming mode on BubblyPhone Agents, audio flows directly between the phone network and a speech-to-speech model over WebSocket. Backchanneling is produced by the model itself, which means developers do not have to engineer it. Choose a model (GPT Realtime or Gemini Live) and the backchanneling behavior comes with it.

In webhook mode, you are responsible for your own conversation pipeline, including any simulated backchanneling. See the guide on VoIP AI architecture for how the two modes differ, or the API documentation for mode selection.

Further reading