The Case for a Semantic Layer for Unstructured Data

Business data is generated internally and externally, and it flows through various platforms, storage systems, and internal tools before reaching the end data consumer. In the early 2010s, business intelligence teams began implementing "semantic layers": software that sits between business intelligence applications and the data warehouse, ensuring that business metrics are computed the same way across the entire company, no matter which application or warehouse the data team decides to use.


Unstructured data (currently) doesn't have a similar intelligence layer. We're fixing that here at Siftree.


To understand how we fit into the business intelligence stack, it's important to understand how unstructured data flows through your business today.

The Lifecycle of Enterprise Unstructured Data

How data travels from source systems to analytics and business teams across the modern enterprise stack.

[Interactive diagram. Columns, left to right:]
Data sources: Communications, Documents, Collaboration, Call center & voice, Surveillance & ops, HR & legal, Code repos, Social media, Web content, Customer feedback (I+E), Regulatory filings, Market intelligence, Multimedia (I+E), 3P data providers
SaaS categories: Comms & collaboration, Productivity suites, Knowledge & documentation, Contract & agreement mgmt, CRM & revenue enablement, Customer support & experience, ITSM & service mgmt, Contact center / CCaaS, Physical security & IoT, HCM & talent mgmt, Legal tech & eDiscovery, Dev platforms & SCM, Work & project mgmt, Social & media intelligence, Market & financial intel, Design & creative tools, Observability & incident mgmt, Email security & archiving, AI media analysis, External data feeds
Storage (analytical and operational tiers): Cloud data warehouse, Lakehouse, Object / blob storage, Enterprise search index, Log & observability store, File sync & share
Data & BI: Data / analytics team, BI & dashboarding, Embedded analytics
Business teams: Engineering, Product, Sales, Marketing, Customer success, Legal & compliance, HR / people ops, Finance, Security / IT ops, Corporate Strategy / BizOps

Node key: Source (Internal, External, Both); SaaS (Extract, Store, Both); Storage (Analytical, Operational)
Flow key: Direct > storage > team; SaaS > storage; SaaS > team (bypass); Terminal to team; D&A ingestion; BI delivery

Unstructured data is produced and consumed in many different ways. Sometimes it flows directly from the source to the end consumer; sometimes it lives entirely inside a SaaS platform; and sometimes it is extracted from the SaaS product via API into a data warehouse, where the BI team turns it into dashboards for end users.


This is true of structured data as well, but structured data has gone through 20+ years of pipeline optimization, governance standards, data cataloging and modeling, and more, which have standardized data operations and mitigated downstream analytics risks. Are companies still behind on this? Of course, but an ideal "gold standard" does exist.


Unstructured data never received the same level of attention, because it was always extremely difficult to extract value from it. Companies have relied on vertical SaaS for NLP features (like topic modeling and sentiment analysis), with the insights remaining siloed in the platforms each organization purchased. LLMs changed this dynamic in an interesting way.


Now, employees are dumping meeting notes, customer feedback, social media threads and comments, and more into ChatGPT or Claude, running their own analyses on top of this data, and using the LLM's synthesis to drive decisions. The unstructured data is sourced directly by the user (such as scraping websites with Claude Code), extracted from their application of choice (Notion, Granola, etc.), or pulled from an internal vector search database, and then stored inside LLM notebooks and workspaces. This is exactly where 20+ years of companies working to be "data-driven" breaks down.

What happens when every team builds their own AI analyst

The same unstructured data. Four LLMs. Different conclusions. No governance.

[Diagram. Columns, left to right:]
Unstructured sources: Communications, Documents, Customer feedback, Call center & voice, Social media, Market intelligence
Unstructured stores: Object / blob storage, Enterprise search index, File sync & share
LLM access points: ChatGPT, Microsoft Copilot, Claude API, Google Gemini
Conflicting conclusions:
What's driving customer complaints? Pricing & billing issues
What's the sentiment on the new product? Generally positive
What risks exist in our contracts? IP indemnification gaps
What are competitors doing? Aggressively entering our segment
What themes appear in employee feedback? Burnout & workload

Different queries, different results, different numbers across the entire business. This is exactly why we built the semantic layer for business intelligence: to avoid these issues and align numbers across the entire business. With the new capabilities LLMs unlock, we've thrown our data-driven mindset out the window; each employee now has their own view of the data, shaped by whatever the LLM spits out.


For an enterprise to be data-driven in the "fuzzy" embedding space LLMs operate in, we need to take them out of that space as best we can; a semantic layer makes this possible.

Here's how this would look for unstructured data:

What happens when AI gets a semantic layer

Same unstructured data. Same four LLMs. One governed semantic layer. Aligned, quantified conclusions.

[Diagram. Columns, left to right:]
Unstructured sources: Communications, Documents, Customer feedback, Call center & voice, Social media, Market intelligence
Unstructured stores: Object / blob storage, Enterprise search index, File sync & share
Semantic layer (MCP): Siftree
LLM access points: ChatGPT, Microsoft Copilot, Claude API, Google Gemini
Aligned conclusions:
What's driving customer complaints? Billing & pricing issues (43% of 12,847 tickets)
What's the sentiment on the new product? Net positive (62% favorable across 3 sources)
What risks exist in our contracts? IP indemnification gaps (3 of 214 active contracts flagged)
What are competitors doing? Entering our segment (2 new entrants, 14 signals detected)
What themes appear in employee feedback? Workload & burnout (38% mention rate across 1,204 responses)

By pre-processing and aggregating the unstructured data before it touches the LLM, and then connecting to each LLM service via the Model Context Protocol (MCP), we can align the entire business on the metrics that live inside this data. The data is still "unlocked" by LLMs, but now they all operate on the same shared understanding of it, with queries that are 100% traceable to the source. This approach gets the business back to being "data-driven" while reaping the benefits of LLMs and AI agents.
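To make the idea concrete, here is a minimal Python sketch of what "aggregate first, then hand the LLM a number" could look like. It is an illustration under assumptions, not a description of Siftree's implementation: the `Ticket` and `MetricResult` types, the `top_theme` function, and the sample data are all hypothetical, and the theme labels are assumed to have been assigned once during pre-processing. In practice a function like this would be exposed to each LLM as a tool over MCP, which is not shown here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    ticket_id: str  # pointer back to the source record (hypothetical field)
    theme: str      # label assumed to be assigned once, during pre-processing


@dataclass(frozen=True)
class MetricResult:
    theme: str
    count: int
    total: int
    share: float
    source_ids: tuple[str, ...]  # lets a reader trace the number to its sources


def top_theme(tickets: list[Ticket]) -> MetricResult:
    """Return the most frequent theme with its count, share, and source IDs."""
    if not tickets:
        raise ValueError("no tickets to aggregate")
    counts = Counter(t.theme for t in tickets)
    # Order by count (descending), then by name, so ties resolve deterministically.
    theme, count = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ids = tuple(t.ticket_id for t in tickets if t.theme == theme)
    return MetricResult(theme, count, len(tickets), count / len(tickets), ids)


# Hypothetical sample data for illustration only.
tickets = [
    Ticket("T-1", "billing"),
    Ticket("T-2", "billing"),
    Ticket("T-3", "login"),
    Ticket("T-4", "shipping"),
]
result = top_theme(tickets)
print(f"{result.theme}: {result.count} of {result.total} tickets ({result.share:.0%})")
# prints "billing: 2 of 4 tickets (50%)"
```

Because the count is computed deterministically outside the model, every LLM that calls the same function gets the same answer, and `source_ids` points back to the records behind it.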


What pre-processing and aggregation steps will we take to create the structured ontology for the LLMs to reason over? More on that in a different section; we call this "Inductive Intelligence".