The Case for a Semantic Layer for Unstructured Data
Business data is generated internally and externally, flowing through various platforms, storage systems, and internal tools before reaching the end data consumer. In the early 2010s, business intelligence teams began implementing "semantic layers": software that sits between business intelligence applications and the data warehouse, ensuring that business metrics are computed the same way across the entire company, no matter which application or warehouse the data team chooses.
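To make the idea concrete, here is a minimal sketch of what a semantic layer does: the metric's logic is defined once, centrally, and every downstream tool asks the layer for it instead of re-implementing the formula. All names here (`Metric`, `compile_query`, the example SQL) are illustrative assumptions, not any specific vendor's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    sql: str  # the single, canonical definition every BI tool reuses

# Central registry: one definition per metric, shared company-wide.
METRICS = {
    "monthly_churn_rate": Metric(
        name="monthly_churn_rate",
        sql="SELECT 1.0 * churned_accounts / total_accounts FROM monthly_accounts",
    ),
}

def compile_query(metric_name: str, warehouse_dialect: str) -> str:
    """Every BI application requests the metric from the semantic layer;
    none of them re-implements the formula themselves."""
    metric = METRICS[metric_name]
    # A real semantic layer would translate to the warehouse dialect here.
    return metric.sql
```

Because every dashboard calls `compile_query` rather than hand-writing SQL, "monthly churn rate" means the same thing everywhere, regardless of the BI tool or warehouse in use.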
Unstructured data (currently) doesn't have a similar intelligence layer. We're fixing that here at Siftree.
To understand how we fit into the business intelligence stack, it's important to understand how unstructured data currently flows through your business today.
Sometimes unstructured data flows directly from the source to the end consumer; sometimes it lives entirely inside a SaaS platform; and sometimes it's extracted from the SaaS product via API and lands in a data warehouse, where the BI team consumes it and produces dashboards for end users. There are many paths by which unstructured data is produced and consumed.
This is true of structured data as well, but structured data has gone through 20+ years of pipeline optimization, governance standards, data cataloging and modeling, and more, which have standardized data operations and mitigated downstream analytics risks. Are companies still behind on this? Of course, but an ideal gold standard does exist.
Unstructured data never received the same level of attention, because it was always extremely difficult to extract value from it. Companies have relied on vertical SaaS for NLP features (like topic modeling and sentiment analysis), with the insights remaining siloed inside the platforms that produced them. LLMs changed this dynamic in an interesting way.
Now, employees are dumping meeting notes, customer feedback, social media threads and comments, and more into ChatGPT or Claude, conducting their own analyses on top of this data, and using the LLM's synthesis to drive decisions. The unstructured data is sourced directly by the user (such as scraping websites with Claude Code), extracted from their application of choice (Notion, Granola, etc.), or retrieved from an internal vector search database, and then stored inside LLM notebooks and workspaces. This is exactly where 20+ years of companies striving to be "data-driven" breaks down.
Different queries, different results, different numbers across the entire business. This is precisely the problem the semantic layer solved for business intelligence: aligning numbers across the entire company. With the new capabilities LLMs provide, we've thrown that data-driven mindset out the window; each employee now has their own perspective on the data, shaped by whatever the LLM happens to return.
For an enterprise to be data-driven in the "fuzzy" embedding space LLMs operate in, we need to pull the data out of that space as best we can; a semantic layer makes this possible.
Here's how this would look for unstructured data:
By pre-processing and aggregating all of the unstructured data before it touches the LLM, and then connecting each LLM service via MCP, the entire business can be aligned on the metrics that live inside this data. The data is still "unlocked" by LLMs, but now they're all operating on the same shared understanding of it, with queries that are 100% traceable to the source. This approach gets businesses back to being data-driven while reaping the powerful benefits of LLMs and AI agents.
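The pre-process-then-aggregate step above can be sketched as follows. This is a hypothetical illustration, not Siftree's actual pipeline: `classify` stands in for a real labeling pass (topic model, LLM classifier, etc.), and the field names are assumptions. The key property it demonstrates is that every count in the shared aggregate remains traceable to its source document IDs, so any LLM client querying it (for example, over MCP) sees the same numbers and the same provenance.

```python
def classify(text: str) -> str:
    # Stand-in for a real classifier (topic model, LLM labeling pass, etc.).
    return "billing" if "invoice" in text.lower() else "other"

def aggregate(documents: list[dict]) -> dict:
    """Turn raw unstructured documents into one shared metric table.
    Each count stays traceable to the IDs of the documents behind it."""
    by_topic: dict[str, list[str]] = {}
    for doc in documents:
        topic = classify(doc["text"])
        by_topic.setdefault(topic, []).append(doc["id"])
    return {
        topic: {"count": len(ids), "source_ids": ids}
        for topic, ids in by_topic.items()
    }

docs = [
    {"id": "fb-1", "text": "The invoice total was wrong."},
    {"id": "fb-2", "text": "Love the new dashboard!"},
]
metrics = aggregate(docs)
# Every LLM client queries this same table, so counts agree across the
# business, and each number can be traced back via source_ids.
```

Because the aggregation happens once, upstream of any LLM, two employees asking different assistants "how many complaints were about billing?" read from the same table and get the same answer, with the source documents attached.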
What pre-processing and aggregation steps will we take to create the structured ontology for the LLMs to reason over? More on that in a different section; we call this "Inductive Intelligence".