Conversation Intelligence

Sentiment Analysis

By Vadim Kouznetsov, Founder of BubblyPhone · Last updated April 5, 2026

Sentiment analysis is the classification of text or speech along an emotional axis — typically positive, neutral, or negative — to measure how a caller feels about the conversation, a topic, or a brand, at the sentence level, the turn level, or the whole-call level. It is the metric that every AI phone agent dashboard displays first, and the metric that most frequently means the wrong thing.

Three levels of sentiment

The same audio can produce three different sentiment scores depending on what you measure.

Sentence-level sentiment scores each individual utterance. It is the easiest to compute and the most misleading. A caller who says “I am really frustrated” early in a call and “thank you so much” at the end has produced one negative sentence and one positive sentence, but the overall experience is positive — the AI resolved their frustration. Averaging the sentences gives you neutral, which is wrong.
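The averaging failure is easy to reproduce. A minimal sketch, with illustrative scores on a -1 to +1 scale:

```python
# Hypothetical sentence-level scores for the call described above:
# "I am really frustrated" -> -1.0, "thank you so much" -> +1.0
sentence_scores = [-1.0, 1.0]

# Averaging collapses a resolved complaint into "neutral".
average = sum(sentence_scores) / len(sentence_scores)
print(average)  # 0.0 — neutral, even though the caller left happy
```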

Turn-level sentiment scores each back-and-forth exchange. This captures more context but still misses trajectory. Two calls can have the same distribution of turn sentiments in very different orders — one starts negative and ends positive (a save), the other starts positive and ends negative (a churn risk). Without looking at the sequence, both look the same.
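The ordering problem can be sketched directly, again with illustrative turn scores:

```python
# Two calls with identical turn-sentiment distributions in different orders.
save_call = [-1, -1, 0, 1, 1]    # starts negative, ends positive
churn_call = [1, 1, 0, -1, -1]   # starts positive, ends negative

# Distribution-based summaries cannot tell them apart...
assert sorted(save_call) == sorted(churn_call)
assert sum(save_call) == sum(churn_call)

# ...but even a crude trajectory (last turn minus first turn) can.
def trajectory(scores):
    return scores[-1] - scores[0]

print(trajectory(save_call))   # 2  -> improved
print(trajectory(churn_call))  # -2 -> deteriorated
```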

Whole-call sentiment with trajectory is what actually works. The LLM reads the entire transcript, outputs an overall score, and separately notes the starting and ending state. This is enough to distinguish a saved call from a soured one, and it is what a human QA reviewer does naturally when they grade a conversation.
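A sketch of what whole-call sentiment with trajectory looks like in practice. The prompt wording, field names, and labels here are illustrative assumptions, not a BubblyPhone or model-specific API; the point is the output shape: one overall score plus explicit start and end states.

```python
import json

PROMPT_TEMPLATE = """Read this call transcript and return JSON with:
- "overall": sentiment of the whole call ("positive" | "neutral" | "negative")
- "start": sentiment in the first few turns
- "end": sentiment in the last few turns

Transcript:
{transcript}
"""

def build_prompt(transcript: str) -> str:
    return PROMPT_TEMPLATE.format(transcript=transcript)

def classify_trajectory(llm_json: str) -> str:
    """Label a call from the model's JSON output."""
    result = json.loads(llm_json)
    if result["start"] == "negative" and result["end"] == "positive":
        return "save"        # the AI turned the call around
    if result["start"] == "positive" and result["end"] == "negative":
        return "churn risk"  # the call soured
    return "stable"

# Example with a hand-written model response:
print(classify_trajectory(
    '{"overall": "positive", "start": "negative", "end": "positive"}'
))  # save
```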

Why sentiment scores are often wrong in ways that matter

Sentiment models are trained on text. Phone calls are speech. The conversion from speech to text loses information that matters for sentiment.

  • Tone is gone. “That’s great” sincere and “that’s great” sarcastic are the same three words in a transcript. A human listening to the audio knows the difference. A text-based sentiment model cannot.
  • Silence is information. A caller who goes quiet for eight seconds has told you something about how they feel. In a transcript, that silence is just a gap. Speech-to-speech models preserve this signal; text-based analysis loses it.
  • Politeness masks frustration. Callers in many cultures stay polite even when they are angry. A sentiment model trained on web text will score a polite complaint as neutral or positive, missing the actual dissatisfaction entirely.
  • Resolution trumps tone. A call that stays negative throughout but ends with the caller’s problem solved is a successful call. Sentiment alone cannot tell you that.
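The silence point can be partly recovered even on the text side. Assuming the transcription carries per-word timestamps, as many speech-to-text APIs provide, long gaps become detectable; a sketch with made-up data and a made-up threshold:

```python
# Each word as (text, start_seconds, end_seconds) — illustrative data.
words = [
    ("That", 0.0, 0.3), ("costs", 0.3, 0.7), ("extra", 0.7, 1.1),
    # ...an eight-second silence from the caller...
    ("Okay", 9.1, 9.5), ("fine", 9.5, 9.9),
]

PAUSE_THRESHOLD = 3.0  # seconds; tune to your domain

def long_pauses(words, threshold=PAUSE_THRESHOLD):
    """Return (gap_start, gap_length) for every long silence."""
    pauses = []
    for (_, _, prev_end), (_, next_start, _) in zip(words, words[1:]):
        gap = next_start - prev_end
        if gap >= threshold:
            pauses.append((prev_end, round(gap, 2)))
    return pauses

print(long_pauses(words))  # [(1.1, 8.0)]
```

A detected long pause does not tell you *what* the caller feels, only that something happened worth flagging for human review.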

The metric you probably actually want

For most AI phone agent teams, “sentiment” is a proxy for a more specific question: did the caller get what they wanted? That question has a cleaner answer than sentiment does. It is a yes or no. You can check it against the transcript directly, without needing to guess at emotional state. And it maps to business metrics (conversion, resolution rate, churn) that sentiment does not.

The practical recommendation: track outcome as your primary metric (did the goal get met?), sentiment as a secondary signal (how did the caller feel about how it was met?), and trajectory as a diagnostic (where in the call did things turn around or go wrong?). Alone, sentiment is a vanity metric. In combination with outcome, it becomes diagnostic information about what to improve.
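The recommendation above can be sketched as a per-call record and a triage rule. Field names and routing labels are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class CallScore:
    goal_met: bool        # primary: did the caller get what they wanted?
    sentiment: str        # secondary: "positive" | "neutral" | "negative"
    start_sentiment: str  # diagnostic: trajectory endpoints
    end_sentiment: str

def triage(call: CallScore) -> str:
    """Route a call for review: outcome first, feeling second."""
    if not call.goal_met:
        return "review: unresolved"           # outcome failures always surface
    if call.end_sentiment == "negative":
        return "review: resolved but soured"  # met the goal, lost the caller
    return "ok"

# A saved call: goal met, started negative, ended positive.
print(triage(CallScore(True, "negative", "negative", "positive")))  # ok
```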

When speech-native sentiment wins

One case where sentiment analysis actually becomes reliable is when it is performed on audio directly rather than on the transcript. Modern speech-to-speech models and audio-native classifiers can detect prosodic features — pitch, intensity, speaking rate, pauses — that text loses. They pick up sarcasm and frustrated politeness that text models cannot.
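To make "prosodic features" concrete, here is a toy example of one of them: frame-level intensity (RMS energy) computed from raw samples. Real audio-native classifiers learn such features end to end; this sketch, on synthetic data, just shows the kind of signal a transcript discards.

```python
import math

def rms_frames(samples, frame_len=160):  # 10 ms frames at 16 kHz
    """Root-mean-square energy per frame of raw audio samples."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples), frame_len)]
    return [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames]

# Synthetic "speech": a loud tone burst followed by silence.
speech = [0.5 * math.sin(0.1 * i) for i in range(320)]
silence = [0.0] * 320
energy = rms_frames(speech + silence)

# The energy contour drops to zero where the caller goes quiet —
# exactly the signal a text transcript throws away.
print([round(e, 2) for e in energy])
```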

This is not free. Audio-native sentiment analysis is more expensive per call than text-based, and the tools are less mature. But for high-stakes applications — a suicide-prevention hotline, a churn-risk flagging system, a fraud-detection pipeline — audio-based sentiment is where the accuracy lives.

Sentiment analysis on BubblyPhone calls

BubblyPhone Agents does not provide a sentiment score directly. It provides the transcript and the recording, and leaves the analysis to whatever tool you prefer. For most teams, the simplest approach is to add sentiment as a field in the LLM-based post-call analysis pipeline, treating it as one output among several rather than as a standalone product. See the call analysis guide for the full pipeline example.
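A sketch of what "one output among several" means in a post-call analysis prompt. The schema and field names below are assumptions for illustration, not a BubblyPhone API:

```python
import json

# Illustrative output schema for an LLM post-call analysis pass —
# sentiment is one field among several, not a standalone product.
ANALYSIS_SCHEMA = {
    "goal_met": "bool — did the caller get what they wanted?",
    "summary": "string — one-sentence summary of the call",
    "sentiment": "positive | neutral | negative",
    "sentiment_start": "positive | neutral | negative",
    "sentiment_end": "positive | neutral | negative",
}

def analysis_prompt(transcript: str) -> str:
    """Build a single prompt that asks for every field at once."""
    return (
        "Analyze this call transcript. Return JSON with exactly these "
        "fields:\n" + json.dumps(ANALYSIS_SCHEMA, indent=2)
        + "\n\nTranscript:\n" + transcript
    )

print(analysis_prompt("Agent: Hello, how can I help?")[:60])
```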

Further reading