What Is Unstructured Data in Law? A Guide
May 3, 2026

Ask any associate where the key facts from last quarter's case are, and you will get the same answer: somewhere in the emails, probably in a PDF, maybe in the notes from the deposition. That is unstructured data in law. It is not a niche technical problem. It is the default state of how legal work gets done.
Unstructured data in law refers to legal information that exists outside predefined databases or organised models: contracts, correspondence, discovery documents, deposition transcripts, regulatory filings, scanned exhibits, and the thousands of emails generated over the life of a matter. None of it arrives pre-sorted. None of it is machine-readable without additional processing. And almost all of it is where the actual legal substance lives.
Approximately 80 to 90% of all enterprise data is unstructured, and unstructured content is projected to account for 90% of all new data created globally (typedef.ai, 2026). Law firms sit at the extreme end of that curve. The question is no longer whether firms have an unstructured data problem. It is what they plan to do about it.
#01 The difference between structured and unstructured legal data
Structured data lives in rows and columns. A case management system that records matter number, client name, filing date, and billing code is structured. You can query it. You can sort it. A database knows exactly where every field is.
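To make "you can query it" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and every row are invented for illustration; they simply mirror the four fields mentioned above and do not describe any particular case management system.

```python
import sqlite3

# In-memory table mirroring the four fields named above; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matters (matter_number TEXT, client_name TEXT, "
    "filing_date TEXT, billing_code TEXT)"
)
conn.executemany(
    "INSERT INTO matters VALUES (?, ?, ?, ?)",
    [
        ("M-1001", "Acme Ltd", "2025-03-14", "LIT"),
        ("M-1002", "Birch LLP", "2024-11-02", "CORP"),
        ("M-1003", "Acme Ltd", "2025-09-30", "LIT"),
    ],
)

# Because every field has a known position, filtering and sorting take one query.
rows = conn.execute(
    "SELECT matter_number, filing_date FROM matters "
    "WHERE billing_code = ? ORDER BY filing_date DESC",
    ("LIT",),
).fetchall()
print(rows)
```

Nothing comparable can be run against a deposition transcript or an email chain until the facts inside them have been extracted into fields like these.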
Unstructured data has no such shape. A 200-page deposition transcript is a wall of text. A contract is a document whose clauses, obligations, and defined terms exist only as prose. An email chain threading across six months contains facts, decisions, and commitments buried inside informal language with no consistent format.
The critical point: the structured data in a typical law firm is mostly administrative. The unstructured data is mostly substantive. Billing codes do not win cases. The clause on page 47 of the NDA might.
Semi-structured data sits between the two. A court filing with consistent section headings, or an email with a predictable header block, has some organisation without being fully machine-readable. Semi-structured formats are easier to process than raw scanned images but still require extraction work before they become queryable. See "What Is Legal Data Structuring? A Plain-Language Guide" for a fuller breakdown of how the conversion process works.
#02 What unstructured legal data actually looks like
Legal teams deal with the same categories of unstructured data repeatedly, just at different scales depending on matter type.
Contracts and transactional documents. NDAs, service agreements, shareholder deeds, and lease agreements are stored as PDFs or Word files. The obligations, termination clauses, and indemnities inside them are not tagged or indexed anywhere. Finding every contract with a specific limitation of liability cap requires reading each one, unless you have extraction tooling.
Discovery materials. In a typical commercial litigation matter, discovery can run to tens of thousands of documents. Emails, internal memos, spreadsheets, and scanned correspondence arrive in bulk. Most of it is irrelevant. Some of it is critical. Identifying which is which through manual review is where legal teams spend a disproportionate share of their time.
Deposition transcripts and witness statements. Transcripts are text-heavy and long. Key admissions, contradictions, and factual anchors are scattered across hundreds of pages with no tagging or cross-referencing.
Regulatory filings and correspondence. Submissions to regulators, responses from agencies, and internal compliance memos accumulate over months or years. Tracking what was said, when, and by whom becomes a research task in itself.
Internal emails and file notes. Legal decisions happen in inboxes. Instructions, client updates, strategy discussions, and fact confirmations live in email threads that are almost never indexed or connected to the relevant matter in any useful way.
All of these formats share the same problem: the information exists, but it is not findable at speed without reading the source document in full.
#03 Why unstructured data is a genuine operational problem, not a filing issue
The instinct is to frame unstructured data as a storage or organisation problem. File things properly, use a good DMS, and you will be fine. That framing is wrong.
The real cost is retrieval and reuse. A document management system tells you a file exists and where it is stored. It does not tell you what the file says, how it relates to the claim on the current matter, or whether a case from three years ago has a directly relevant precedent buried in it. Keyword search helps at the margins but fails on concept-level queries. Searching for "force majeure" will not surface a contract that uses "act of God" as the operative term.
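The "force majeure" example is easy to reproduce. The sketch below assumes the simplest form of keyword search, a case-insensitive substring match, and runs it over two invented clause snippets. The file names and the keyword_search helper are hypothetical.

```python
# Two invented clause snippets; neither is taken from a real contract.
clauses = {
    "contract_a.pdf": "Neither party is liable for delay caused by a force majeure event.",
    "contract_b.pdf": "Performance is excused where prevented by an act of God.",
}


def keyword_search(query: str, documents: dict[str, str]) -> list[str]:
    """Return the names of documents containing the query as a literal substring."""
    needle = query.lower()
    return [name for name, text in documents.items() if needle in text.lower()]


hits = keyword_search("force majeure", clauses)
print(hits)  # contract_b.pdf is absent even though it covers the same concept
```

Real keyword engines add stemming and ranking, but the core limitation is the same: a match depends on shared words, not shared meaning.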
The unstructured data management market is projected to reach $109.1 billion by 2033, growing at a 15.5% CAGR (typedef.ai, 2026). That growth is driven by organisations discovering they are sitting on enormous volumes of high-value content they cannot actually use.
For law firms specifically, the cost shows up in three places. First, duplicated research: associates rebuild analysis that already exists somewhere in the firm because they cannot find it. Second, missed risk: contract review misses an obligation because manual reading is imperfect at volume. Third, institutional knowledge loss: when a senior partner leaves, the accumulated insight from their matters goes with them unless it has been extracted and made accessible. See "Law Firm Institutional Knowledge Loss: The Fix" for a detailed look at that specific failure mode.
#04 How AI converts unstructured legal data into structured knowledge
The mechanism matters here. Vague claims about "AI processing documents" explain nothing.
Current AI-driven approaches to unstructured legal data use several named techniques in combination.
Entity extraction. A model reads a document and identifies the people, organisations, dates, events, and obligations referenced in the text. These entities become nodes that can be queried and connected. A deposition transcript stops being a flat text file and becomes a set of facts about specific actors at specific times.
Semantic indexing and vector embeddings. Rather than matching exact keywords, vector embeddings represent the meaning of a passage as a numerical vector. A query for "early termination rights" will surface clauses about exit provisions even if those words do not appear verbatim. This is the mechanism behind concept-level search.
Relationship mapping. Entity extraction alone gives you a list of named things. Relationship mapping tells you how those things connect: this person was present at this event, this obligation was triggered by this date, this clause modifies this other clause. The output is a graph, not a list.
Multi-pass verification. Higher-quality extraction tools run multiple passes over the same document, cross-checking extracted facts against context to reduce errors before surfacing results (discoverlex.com, 2026).
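The semantic indexing step described above comes down to comparing vectors. The sketch below shows only that comparison, using cosine similarity. The three-dimensional vectors are hand-assigned stand-ins, not the output of any embedding model, and the axis meanings are an assumption made for readability. Real embeddings have hundreds of dimensions and come from a trained model.

```python
import math

# Hand-assigned 3-dimensional vectors standing in for real embeddings.
# Hypothetical axes: [ending the agreement, payment, confidentiality].
passages = {
    "Either party may exit this agreement on 30 days' notice.": [0.9, 0.1, 0.0],
    "Invoices are payable within 14 days of receipt.": [0.1, 0.9, 0.0],
    "The recipient shall keep all disclosed information secret.": [0.0, 0.1, 0.9],
}

query_text = "early termination rights"
query_vector = [0.8, 0.2, 0.1]  # also hand-assigned, not model output


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Rank passages by closeness of meaning to the query, most similar first.
ranked = sorted(
    passages,
    key=lambda text: cosine(query_vector, passages[text]),
    reverse=True,
)
print(ranked[0])
```

The top-ranked passage shares no words with the query. It ranks first only because its vector points in nearly the same direction, which is the property concept-level search relies on.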
Casero uses this approach to build what it calls a knowledge graph at the case level. Every entity extracted from a document or email is mapped to its source passage, so a lawyer querying the graph can trace any fact back to the exact line it came from. The knowledge graph updates automatically as new documents arrive, so it reflects the current state of a matter without requiring manual uploads or re-indexing.
The result is that unstructured data from contracts, emails, transcripts, and filings becomes queryable in plain English. "Which matters involve a director called James Chen and a breach of warranty claim?" becomes a question the system can answer in seconds rather than a research project.
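As a rough illustration of why a graph of extracted facts can answer that kind of question, the sketch below stores facts as subject, relation, object triples, each carrying the source location it came from. This is a toy model under stated assumptions, not a description of how Casero or any other product is implemented. The Fact class, the relation names, the matter numbers, and the file names are all invented, and the step that turns a plain-English question into this structured lookup is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str
    source: str  # document and line the fact was extracted from


# Invented facts, shaped as an extraction step might emit them.
facts = [
    Fact("James Chen", "director_of", "Matter 2024-017", "board_minutes.pdf:12"),
    Fact("Matter 2024-017", "claim_type", "breach of warranty", "claim_2024_017.pdf:3"),
    Fact("James Chen", "director_of", "Matter 2023-102", "email_0457.eml:8"),
    Fact("Matter 2023-102", "claim_type", "unpaid invoices", "claim_2023_102.pdf:2"),
]


def matters_with(person: str, claim: str, facts: list[Fact]) -> dict[str, list[str]]:
    """Return matters linked to both the person and the claim, with their sources."""
    # Matters where the person appears as a director, keyed to the supporting source.
    by_person = {
        f.obj: f.source
        for f in facts
        if f.subject == person and f.relation == "director_of"
    }
    results = {}
    for f in facts:
        if f.relation == "claim_type" and f.obj == claim and f.subject in by_person:
            results[f.subject] = [by_person[f.subject], f.source]
    return results


answer = matters_with("James Chen", "breach of warranty", facts)
print(answer)
```

Each result carries the two source locations that support it, so a lawyer can open the cited lines and check the answer rather than take it on trust.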
#05 What structured legal knowledge actually enables
Once unstructured data has been extracted, structured, and connected, the capabilities it unlocks are specific and concrete.
Semantic search across all matters. Instead of keyword search within a single document, lawyers can query across the entire firm's history using natural language. Casero's semantic search covers emails, documents, prior cases, and legislation simultaneously, returning results ranked by contextual relevance rather than keyword frequency.
Similar case matching. When unstructured data from past matters has been structured into a knowledge graph, a new matter can be automatically matched against prior cases based on legislation, factual pattern, and case classification. Casero surfaces these matches with multi-dimensional scoring that shows why each past case matched, not just that it did.
Contract and obligation tracking. Structured extraction enables firms to identify and track obligations, deadlines, and defined terms that would otherwise remain buried in a PDF.
Reusable institutional knowledge. When prior work is structured and searchable, it does not disappear when a lawyer moves on. The analysis stays in the firm. Casero's Legal Library provides a centralised, searchable repository for internal precedents, templates, and case studies, connected to the same knowledge graph that governs live matters.
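To show what a score that explains itself might look like for the similar case matching described above, here is a minimal sketch. It is an assumption-laden illustration, not Casero's scoring method: the three dimensions follow the ones named in the text, but the weights, the use of Jaccard overlap, the field names, and the sample matters are all hypothetical.

```python
# Hypothetical weights; a real system would tune or learn these.
WEIGHTS = {"legislation": 0.5, "fact_pattern": 0.3, "classification": 0.2}


def jaccard(a: set, b: set) -> float:
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_match(new: dict, past: dict) -> dict:
    """Score a past matter against a new one and keep the per-dimension breakdown."""
    breakdown = {
        "legislation": jaccard(new["legislation"], past["legislation"]),
        "fact_pattern": jaccard(new["facts"], past["facts"]),
        "classification": 1.0 if new["classification"] == past["classification"] else 0.0,
    }
    total = sum(WEIGHTS[name] * value for name, value in breakdown.items())
    return {"total": round(total, 3), "breakdown": breakdown}


# Invented matters with placeholder statute names.
new_matter = {
    "legislation": {"Act A s.14", "Act B s.2"},
    "facts": {"defective goods", "late delivery"},
    "classification": "commercial dispute",
}
past_matter = {
    "legislation": {"Act A s.14"},
    "facts": {"defective goods", "late delivery", "unpaid invoice"},
    "classification": "commercial dispute",
}

result = score_match(new_matter, past_matter)
print(result)
```

Returning the breakdown alongside the total is the design choice that matters here: it lets a reviewer see that a match rests mostly on, say, shared facts rather than shared legislation.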
For a deeper look at how this plays out in practice, see "Structured Case Knowledge: What Attorneys Gain".
#06 Red flags when evaluating legal AI tools for unstructured data
Not every platform that claims to handle unstructured legal data actually does. Several failure modes show up repeatedly in 2026.
Keyword-only search dressed as AI. If a tool's search cannot find relevant documents when the query uses different words from the document text, it is not doing semantic search. It is doing a fancier keyword match. Test this directly with a synonym query before buying.
Black-box extraction. If a tool extracts facts from documents but cannot tell you which passage each fact came from, you cannot verify its output. In legal work, unverifiable AI output is a liability risk. Ask any vendor you evaluate to demonstrate that every node in the knowledge graph traces back to the exact passage it came from.
Training on client data. Some AI tools improve their models by training on user inputs. For law firms, that means client data feeds a model that may also be used by other firms. Confirm in writing whether a vendor uses client data to train AI models before connecting sensitive matter files.
Stale intelligence. A tool that requires batch uploads or manual re-indexing will always lag behind the current state of a matter. Live synchronisation, where changes in connected systems are mirrored immediately, is the standard that matters. Anything less means lawyers are searching outdated data.
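The black-box test above can be partly automated if a vendor exports extracted facts together with their citations. The sketch below assumes a hypothetical export format in which each fact carries a document name, a character offset, and the quoted passage. It checks only that the quote really appears at the cited position, not that the quote supports the statement, which still needs a human reader.

```python
from dataclasses import dataclass


@dataclass
class ExtractedFact:
    statement: str
    document: str
    start: int   # character offset where the supporting quote should begin
    quote: str   # text the tool claims supports the statement


# Invented source text standing in for a firm's own document.
sources = {
    "nda.txt": "The Recipient shall not disclose Confidential Information for five years.",
}


def is_traceable(fact: ExtractedFact, sources: dict[str, str]) -> bool:
    """True only if the quoted passage appears verbatim at the cited offset."""
    if not fact.quote or fact.document not in sources:
        return False
    text = sources[fact.document]
    return text[fact.start : fact.start + len(fact.quote)] == fact.quote


good = ExtractedFact(
    "Confidentiality lasts five years",
    "nda.txt",
    sources["nda.txt"].index("for five years"),
    "for five years",
)
bad = ExtractedFact("Confidentiality lasts ten years", "nda.txt", 0, "for ten years")

checks = {fact.statement: is_traceable(fact, sources) for fact in (good, bad)}
print(checks)
```

A fact that fails this check is either mis-cited or invented, and in both cases it should not reach a lawyer without review.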
See "Legal AI Vendor Evaluation Checklist: Law Firms" for a full framework to apply when assessing tools in this space.
Unstructured data in law is not going to organise itself. The volume is growing, the formats are multiplying, and manual review at scale is not a strategy. The firms that treat their unstructured data as a liability to be managed will keep losing hours to document searches and knowledge gaps. The firms that treat it as an asset to be structured will get compounding returns: faster research, reusable prior work, and institutional knowledge that survives personnel changes.
If your firm's case knowledge currently lives in PDFs, email threads, and shared drives with no semantic search or entity mapping connecting them, Casero is built for exactly that problem. It ingests your documents and emails, builds a living knowledge graph at the case level, and makes everything searchable in plain English from day one, with no client data used to train any models and no black boxes in the output. Start a pilot with full Professional-tier access and no commitment required, and see what your unstructured data actually contains.