Conversation Intelligence
Dialogue Management
By Vadim Kouznetsov, Founder of BubblyPhone · Last updated April 5, 2026
Dialogue management is the component of a conversational system that decides what to do next at each turn — what to say, what to ask, when to take an action, and when to end the conversation — based on the state of the conversation so far and the goal the system is trying to accomplish. It is the brain that sits between understanding the caller and producing a response.
The classical pipeline
In the classical architecture of a spoken dialogue system, from the 2000s through the early 2020s, dialogue management was a distinct module in a four-stage pipeline:
- Speech recognition (ASR) turned audio into text.
- Natural language understanding (NLU) turned text into structured intents and entities — see intent detection.
- Dialogue manager took the NLU output and the current dialogue state, decided on an action, and produced a semantic response.
- Natural language generation (NLG) and TTS turned the semantic response into spoken audio.
In this architecture the dialogue manager was doing real work. It had to maintain a dialogue state (what has been said, what slots have been filled, what the caller wants), select the next action from a policy, and handle things like clarification, confirmation, and error recovery. Implementations ranged from handwritten finite state machines to statistical models trained on dialogue data to reinforcement learning agents.
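The handwritten finite-state approach can be made concrete with a small sketch. The states, slot names, and thresholds below are illustrative, not drawn from any particular system: the manager fills required slots in order, confirms when everything is collected, and escalates after repeated understanding failures.

```python
# Minimal sketch of a classical finite-state dialogue manager for a
# slot-filling task. Slot names, actions, and thresholds are illustrative.
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    slots: dict = field(default_factory=dict)  # filled slots, e.g. {"date": "Friday"}
    failed_turns: int = 0                      # consecutive turns with no usable NLU result

REQUIRED_SLOTS = ["date", "time", "party_size"]

def next_action(state: DialogueState, nlu: dict) -> str:
    """Policy: fill slots in order, confirm when complete, escalate on repeated failure."""
    if not nlu:  # NLU produced nothing usable this turn
        state.failed_turns += 1
        if state.failed_turns >= 3:
            return "transfer_to_human"
        return "ask_repeat"
    state.failed_turns = 0
    state.slots.update(nlu)
    for slot in REQUIRED_SLOTS:
        if slot not in state.slots:
            return f"request_{slot}"
    return "confirm_booking"

state = DialogueState()
print(next_action(state, {"date": "Friday"}))                 # request_time
print(next_action(state, {"time": "7pm", "party_size": 4}))   # confirm_booking
```

Everything here — the state object, the policy function, the error-recovery counter — is code a team had to design and maintain by hand, which is exactly the work the LLM architecture absorbs.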
What LLMs did to the module
The classical pipeline still exists in places, but it no longer describes how most new voice systems are built. In an LLM-driven AI phone agent, dialogue management is not a separate module. It is a property of the language model itself.
The LLM reads the entire conversation history (or a sliding window of it), reads the system prompt that defines the agent’s goals, and produces the next response. Dialogue state is implicit in the conversation history. The policy is implicit in the prompt. There is no separate state machine to design, no separate policy to train. Whether this is an improvement or a loss of control is a real debate.
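The turn loop this describes is simple enough to sketch. The `call_llm` function below is a stand-in for whatever chat-completion API you use; the system prompt is an invented example. Note that there is no state object at all — the message history is the state.

```python
# Sketch of an LLM-driven turn loop: dialogue state is the message history,
# policy is the system prompt. `call_llm` stands in for a real model call.
SYSTEM_PROMPT = (
    "You are a phone agent for a restaurant. Collect a date, time, and "
    "party size, confirm the booking, then end the call politely."
)

def call_llm(messages):
    # Placeholder for a real chat-completion call (an HTTP request to
    # whichever model provider you use).
    return "What date would you like to book?"

def take_turn(history, caller_utterance):
    history.append({"role": "user", "content": caller_utterance})
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history
    reply = call_llm(messages)
    history.append({"role": "assistant", "content": reply})
    return reply

history = []
take_turn(history, "Hi, I'd like a table.")
```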
What still needs explicit management even with an LLM
Treating dialogue management as “the LLM handles it” gets you surprisingly far, but it is not the whole story. A few things still need explicit handling:
- Turn-taking and interruption. When does the agent start speaking? When does it stop because the caller is speaking? This is handled by voice activity detection and conversation-level logic outside the model, not by the model itself.
- Tool invocation timing. The LLM decides whether to call a tool, but the runtime around it decides when to call it, how to handle failures, and what happens while the tool is running. Filler phrases, retry logic, and timeout handling all live outside the model.
- Guardrails. Some decisions should not be left to the model. Hanging up on abusive callers, transferring to a human after a certain number of failed turns, refusing to discuss specific topics — these are usually enforced by code that wraps the LLM, not by the LLM itself.
- Conversation length control. An LLM left to its own devices will happily have a 15-minute conversation. If your business logic requires calls to wrap up in 3 minutes, you need explicit dialogue management to enforce that.
Hybrid dialogue management
The most robust production systems in 2026 are hybrid: an LLM handles the flexible conversational parts, and a thin layer of explicit dialogue management enforces the things the LLM cannot be trusted with alone. The explicit layer is usually small — a few rules, a handful of tool-invocation helpers, maybe a timer — but it does the heavy lifting of turning a chatty model into a reliable agent.
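A thin explicit layer of this kind might look like the sketch below. The thresholds, rules, and the `generate_reply` stub are all illustrative; the point is that the wrapper's checks run before the model is consulted, so guardrails and the length limit cannot be talked around.

```python
# Sketch of a hybrid layer: the LLM produces the reply, a thin wrapper
# enforces rules the model is not trusted with alone. All thresholds and
# the generate_reply stub are illustrative.
import time

MAX_CALL_SECONDS = 180   # business rule: wrap up within 3 minutes
MAX_FAILED_TURNS = 3     # escalate to a human after repeated failures

def generate_reply(history):
    return "Sure, I can help with that."  # stand-in for the LLM call

def managed_turn(history, utterance, call_start, failed_turns):
    # Explicit rules first: they outrank the model.
    if time.monotonic() - call_start > MAX_CALL_SECONDS:
        return ("end_call", "I'll have to wrap up now, thanks for calling!")
    if failed_turns >= MAX_FAILED_TURNS:
        return ("transfer", "Let me connect you with a colleague.")
    # Otherwise defer to the LLM for the flexible conversational part.
    history.append({"role": "user", "content": utterance})
    reply = generate_reply(history)
    history.append({"role": "assistant", "content": reply})
    return ("speak", reply)
```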
Dialogue management in BubblyPhone Agents
In streaming mode, BubblyPhone Agents relies on the LLM’s own dialogue management plus a small amount of platform-level control for turn-taking and transfers. In webhook mode, you build your own dialogue manager on top of the transcription events and respond with actions. Most teams on webhook mode end up writing a thin explicit layer around whatever LLM they prefer, rather than a full classical dialogue manager — the LLM handles enough of the problem that the explicit code becomes small. See the call flow entry for the related concept of how the whole conversation is structured.
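The thin explicit layer that webhook-mode teams end up writing tends to follow the same shape: explicit rules checked first, the LLM consulted otherwise. The event and action dictionaries below are invented for this sketch and are not BubblyPhone's actual webhook schema; consult the platform documentation for the real shapes.

```python
# Illustration of a thin explicit layer over transcription events.
# NOTE: the event and action shapes here are hypothetical, not
# BubblyPhone's actual webhook schema.
def handle_transcription_event(event, history):
    utterance = event["transcript"]
    history.append({"role": "user", "content": utterance})
    # Explicit guardrail checked before the model sees the turn.
    if "speak to a human" in utterance.lower():
        return {"action": "transfer"}
    # Otherwise defer to whatever LLM you prefer (stubbed here).
    reply = "Happy to help with that."
    history.append({"role": "assistant", "content": reply})
    return {"action": "say", "text": reply}
```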
Further reading
- Jurafsky & Martin, Chatbots & Dialogue Systems — the standard textbook chapter covering classical dialogue management and the shift to LLM-based systems.