Why Clustering Call Transcripts into Topics Fails
NLP

Customer-facing teams benefit tremendously from clustering call transcripts to identify valuable themes, but this finicky approach often produces extremely misleading insights; here's why.
Documents are tricky
To understand why topic modeling fails, we need to dig into the very first step of the process. Clustering analyzes a corpus of documents, so what counts as a document?
Is it the entire transcript?
If we want customer-specific insights, should we split the transcript by speaker before clustering, or filter afterward? What if there are multiple stakeholders on the call?
If we’re looking to pinpoint specific themes, should a document be a sentence, a phrase, or everything said on the call?
Unfortunately, there is no “correct” approach, and each approach will produce entirely different results. So how do customer-facing teams know they’re looking at the most “valuable” or “best” approach to defining a document?
They usually don’t.
Either a team of internal data scientists / analysts built a bespoke NLP pipeline (hopefully with some back-and-forth with stakeholders), or the team is relying on a vendor's Sales, VoC, or Product Management dashboard with a “Topic Analytics” feature, where the vendor's engineering team baked in a “one-size-fits-all” approach with zero ability to adjust the document granularity.
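To make the ambiguity concrete, here's a minimal sketch (the transcript format and speaker labels are made up) of three equally defensible ways to define a document from the same call, each yielding a different corpus:

```python
# Minimal sketch: three ways to slice the same call into "documents".
# The transcript structure and speaker names here are illustrative only.
transcript = [
    ("AE", "Good morning! How was your weekend?"),
    ("Customer", "Great, thanks. So, about the integration."),
    ("Customer", "Manual data entry is our biggest bottleneck."),
]

# Option A: the entire call is one document.
whole_call = [" ".join(text for _, text in transcript)]

# Option B: one document per utterance (speaker turn).
per_utterance = [text for _, text in transcript]

# Option C: split by speaker before clustering, keeping only the customer.
customer_only = [text for speaker, text in transcript if speaker == "Customer"]

print(len(whole_call), len(per_utterance), len(customer_only))  # 1 3 2
```

Same call, three corpora of different sizes and granularities, and three entirely different clustering results downstream.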
That’s problem number 1. Problem number 2 is driven by how clustering actually works.
Math vs reality
The goal of clustering is to group semantically similar documents together. No need to go deep into the math: documents are transformed into vectors (lists of numbers), and the model groups documents whose vectors are close together in distance. Sometimes, this ISN’T what we want, even though we think it is.
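As a toy illustration of "close in distance" (simple bag-of-words counts standing in for real embeddings, which work on the same principle):

```python
from collections import Counter
from math import sqrt

# Toy example: represent documents as bag-of-words count vectors,
# then measure Euclidean distance between them.
docs = [
    "good morning how are you",
    "good morning hope you are well",
    "manual data entry is our biggest bottleneck",
]

vocab = sorted({w for d in docs for w in d.split()})

def vectorize(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

def distance(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vecs = [vectorize(d) for d in docs]

# The two greetings sit closer to each other than either does to the
# substantive statement, so a clustering model groups the greetings.
print(distance(vecs[0], vecs[1]) < distance(vecs[0], vecs[2]))  # True
```

Nothing in that math knows which document matters to the business; it only knows which documents look alike.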
How so? Here’s the anatomy of a typical call:
1. Intros, small talk, catch-up
2. Bridge, agenda
3. Content, the subject of the call
4. A mix of back-and-forth, questions, arguments, etc.; very unpredictable
5. Closing statements
Steps 1, 2, and 5 are extremely common; almost every call has them. So what we find is that our clusters are dominated by semantically similar phrases from these steps: “Good morning”, “What’s up”, “Hello”, “Follow up”, “Goodbye”, “Weather”, “Weekend”. Conversation filler; junk topics. What about steps 3 and 4? This is where our topic quality can suffer from wishful thinking. Here's an example:
AE: "Whether it’s related to our products or not, what’s the biggest bottleneck in your current workflow?"
Customer: "Manual data entry. My team spends ten hours a week just moving numbers between Hubspot and analytics tools. Have you used Zapier? We actually just did a demo with the Zapier team to see if we can connect the two systems, but we’re having trouble getting our team up to speed on creating those workflows - it’s not as easy as you’d think."
AE: "I’ve heard that a lot, I actually use Zapier to send a Slack message Monday morning that summarizes all my emails from the previous week and gives me action items and I had to watch a YouTube tutorial on it. It’s definitely not as easy as it seems. Which LLM did you use?”
Customer: “Yeah I did the same thing! Are you on the paid or free plan for that? And I used OpenAI I think, I don’t remember.”
AE: “I think paid? I want to say our entire team gets Zapier Pro or premium or whatever it's called, nothing enterprise though.”
Customer: “Do you know how much the Pro version costs? And I’m assuming it’s per seat?”
From a mathematical standpoint, this conversation would most likely converge around Zapier, no matter how we set up the documents. But what the AE actually needs to know is that “manual data entry is a problem”, which was only mentioned once! People don’t typically speak the way we want them to. The insights are buried in a few words or sentences that, to a term-frequency or clustering model, aren’t significant, but to a business or team absolutely are. To add complexity, if paired with a sentiment model, it’s very possible for the topic to surface as negative (sentiment measures tone, not intent), which at a glance could suggest that Zapier is a problem or that they don’t like Zapier, when in fact Zapier is an attempted SOLUTION to the problem of manual data entry!
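A quick term count over a condensed, lowercased paraphrase of the exchange above shows why the math converges where it does:

```python
from collections import Counter

# Condensed paraphrase of the AE/customer exchange, lowercased.
# By raw frequency, "zapier" dwarfs the single mention of manual
# data entry, which is the insight the AE actually needs.
exchange = (
    "manual data entry ... have you used zapier ... "
    "demo with the zapier team ... i actually use zapier ... "
    "our entire team gets zapier pro"
)

counts = Counter(exchange.split())
print(counts["zapier"], counts["manual"])  # 4 1
```

A frequency-driven centroid lands on Zapier; the pain point is statistical noise.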
So what ends up happening is that our topic clusters fill up with junk/filler conversation from the beginnings and ends of calls, while the remaining topics converge on centroids that are non-actionable or non-ideal (from a business-reporting perspective), simply due to term frequency. Clustering is great at objectively describing what happens across conversations from a pure semantic perspective, but very bad at grouping the business-relevant statements or actionable terms that drive real value, resulting in misleading or low-value insights. This is where classifying those actionable terms FIRST and then clustering, or even topic seeding, can help.
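Here's a minimal sketch of that classify-then-cluster idea. The seed taxonomy and topic labels below are entirely illustrative; in practice they'd come from stakeholders and evolve over time:

```python
# Hedged sketch: tag utterances against a seeded taxonomy of
# business-relevant terms BEFORE clustering, so downstream grouping
# operates on tagged, actionable content rather than raw frequency.
# The taxonomy and labels here are made up for illustration.
SEED_TOPICS = {
    "pain_point": ["manual data entry", "bottleneck", "time consuming"],
    "tool_mention": ["zapier", "hubspot"],
}

def classify(utterance):
    """Return every seeded topic whose terms appear in the utterance."""
    text = utterance.lower()
    return [topic for topic, terms in SEED_TOPICS.items()
            if any(term in text for term in terms)]

utterance = "Manual data entry. My team spends ten hours a week on it."
print(classify(utterance))  # ['pain_point']
```

Even this naive substring matcher separates "this is a pain point" from "a tool was mentioned four times", which is exactly the distinction raw clustering loses.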
New data
The third problem is a combination of problems 1 and 2: new data. Clustering is great when it's a one-time operation used to explain what's inside data; it condenses a large corpus into a human-readable format. But what happens if we introduce 1 new document? Each document contributes to the “semantic space” of the model, affecting how documents are grouped. So should we:
Re-cluster the entire corpus? We'd lose our original groups.
First see if the document fits into an existing cluster? But what about documents that fit their clusters only weakly (because those clusters were the best available option) and that, had this new doc been included in the first model, would have formed an entire cluster of tightly fitting documents?
Remove those documents from their original clusters and make this new, better one?
Again, unfortunately, there is no “correct” approach to streaming new documents, and customer-facing teams are at the mercy of whatever approach the data science or engineering team chooses.
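One common (but still imperfect) streaming policy is nearest-centroid assignment with a distance threshold: slot the new document into an existing cluster only if it's close enough, otherwise hold it for the next full re-cluster. A toy sketch with made-up 2-D vectors and an arbitrary threshold:

```python
from math import sqrt

def dist(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Illustrative 2-D centroids from a previous clustering run, plus an
# arbitrary cutoff; real systems use high-dimensional embeddings.
centroids = {"greetings": (0.0, 0.0), "pricing": (5.0, 5.0)}
THRESHOLD = 2.0

def assign(vec):
    """Assign to the nearest centroid, or park the doc if nothing fits."""
    label, centroid = min(centroids.items(),
                          key=lambda kv: dist(vec, kv[1]))
    return label if dist(vec, centroid) <= THRESHOLD else "unassigned"

print(assign((0.5, 0.5)))  # greetings
print(assign((9.0, 0.0)))  # unassigned
```

Note what this policy can't do: it never notices when a pile of "unassigned" (or weakly assigned) documents would form a great new cluster, which is exactly the failure mode described above.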
What do we do?
These three challenges (defining documents, business-oriented clustering, and streaming new data) are difficult to overcome, and there is no single best way to manage them. For customer-facing teams, it's important to know that the 10-20 topics they see in a dashboard are downstream of a series of complex and lossy decisions, each involving trade-offs that dictate the insights they see. Where KPIs like “Revenue” can vary slightly based on a team's custom way of defining/filtering/slicing the data (in a way that's documentable and agreed upon), clustering can produce completely different results with no clear way of defining what's best; it's highly subjective and requires a hands-on, feedback-driven approach that most vertical software products don't offer.
So what can we do about this? Learn from those in the structured data world.
The modern data stack is an extremely complex system of moving and transforming data at scale, and because unstructured analytics is fairly new (horizontally across the org that is, we've been doing NLP for a long time), the 30+ years of warehousing, governance, and modeling standards haven't found their way to the unstructured world (yet). The solution is being more intentional with our analyses, assisting engineering orgs by decreasing deployment-feedback time, and providing more tooling for stakeholders to guide the clustering process.
That's a boring answer, but it's true.
The analytics community, and the business community at large, is treating every problem as an intelligence problem: just more tokens and more reasoning, and all our dreams will come true. But clustering for business reporting is not solely an intelligence problem, and treating it as one is a dead end. This is a problem of process, orchestration, and user experience.
Teams must ingest/stream in transcripts at scale, structure documents, cluster the appropriate fields, assess, iterate, and maintain an evolving taxonomy. Managing this end-to-end flow may appear extremely cumbersome, which is why we're building a managed solution at Siftree.



