Mid-Size Law Firm AI Data Structuring Guide
June 18, 2026

Most mid-size law firms already have AI. The problem is that AI sitting on top of unstructured, siloed, inconsistently tagged data does very little. It hallucinates. It misses things. It gives lawyers answers they can't trace back to a source.
The gap between adopting AI and achieving digital maturity is not an adoption problem. It is a data structure problem. The AI is there. The foundation underneath it is not.
Mid-size law firm AI data structuring is the discipline of making that foundation solid: clean matter taxonomies, consistent metadata, governed access, and a knowledge layer that connects emails, documents, and prior cases into something lawyers can actually search and reuse. This guide covers what that looks like in practice, where firms go wrong, and how to build a stack that holds.
#01 Why 95% of AI Pilots Fail at Mid-Size Firms
The failure rate is not a secret. Ninety-five percent of AI pilots at law firms fail to meet expectations, and the causes are consistent: disconnected systems, dirty data, and no governance (Wolters Kluwer, 2026). Firms run a pilot, get inconsistent results, and conclude the AI is not ready. The AI is usually fine. The data it was given was not.
Mid-size firms tend to accumulate what researchers call fragmented technology stacks, often using several tools for a single matter. Emails live in Outlook or Gmail. Documents live in SharePoint, Clio, or a DMS. Case notes live in someone's head or on a shared drive no one has curated since 2021. When AI tries to answer a question that requires context from all three, it gets partial answers at best.
The deeper issue is metadata discipline. A matter type tagged "employment" under one partner's convention and "emp-lit" under another's looks like two unrelated categories to any AI doing classification or retrieval. Inconsistent tagging is not a minor inconvenience. It is the direct reason AI accuracy varies so widely: the effectiveness of tools grounded in firm data depends entirely on how disciplined the firm's matter-tagging has been.
Fix the taxonomy first. Everything else follows from that.
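As a minimal sketch of what fixing the taxonomy can mean at the data level, the snippet below maps variant tags such as "employment" and "emp-lit" onto one canonical label and sets aside anything it does not recognize for human review. The alias table, the canonical tag name, and the function names are all hypothetical; a real firm would derive them from its own taxonomy.

```python
from __future__ import annotations

# Hypothetical alias table: each variant an attorney might type maps to one canonical tag.
CANONICAL_TAGS = {
    "employment": "employment-litigation",
    "emp-lit": "employment-litigation",
    "employment litigation": "employment-litigation",
}


def normalize_tag(raw_tag: str) -> str | None:
    """Return the canonical tag for a raw label, or None if the label is unmapped."""
    return CANONICAL_TAGS.get(raw_tag.strip().lower())


def normalize_matters(matters: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split matters into (normalized, needs_review) using the alias table."""
    normalized, needs_review = [], []
    for matter in matters:
        tag = normalize_tag(matter["tag"])
        if tag is None:
            # Unknown labels are never guessed at; they go to a person.
            needs_review.append(matter)
        else:
            normalized.append({**matter, "tag": tag})
    return normalized, needs_review
```

Routing unmapped labels to a review queue rather than guessing keeps the cleanup auditable, which matters once AI tools start relying on the tags.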
#02 The Taxonomy-First Approach That Actually Works
Before any AI vendor demo, before any procurement decision, mid-size firms need a working matter taxonomy. Not a perfect one. A working one.
The approach that holds up in practice: start from an open industry standard like LMSS (the SALI Legal Matter Standard Specification), then layer firm-specific tags on top. LMSS gives you a defensible baseline that outside tools can read. Firm-specific tags capture the nuance that generic standards miss, such as your particular sub-types of employment litigation or your client industry classifications.
The mistake most firms make is building taxonomy from scratch in isolation, typically by asking one KM director to create a spreadsheet that never gets enforced. Taxonomy only works when it is embedded in the intake process. If attorneys can file a new matter without assigning a practice area and matter type, they will, and the AI downstream gets garbage.
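One way to embed taxonomy in intake is to refuse to open a matter until both fields carry a value from a controlled list. The sketch below assumes hypothetical vocabularies and a hypothetical `MatterIntake` record; it illustrates the enforcement idea and is not modeled on any particular practice-management system.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical controlled vocabularies; a real firm would load these from its taxonomy.
PRACTICE_AREAS = {"employment", "corporate", "real-estate"}
MATTER_TYPES = {"litigation", "advisory", "transaction"}


@dataclass(frozen=True)
class MatterIntake:
    client: str
    practice_area: str = ""
    matter_type: str = ""


def intake_errors(intake: MatterIntake) -> list[str]:
    """Return the reasons a matter cannot be opened; an empty list means it can."""
    errors = []
    if intake.practice_area not in PRACTICE_AREAS:
        errors.append(f"practice_area must be one of {sorted(PRACTICE_AREAS)}")
    if intake.matter_type not in MATTER_TYPES:
        errors.append(f"matter_type must be one of {sorted(MATTER_TYPES)}")
    return errors
```

Because blank and free-text values fail the same check, an attorney cannot skip the fields or invent a new tag at intake.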
Only 57% of mid-size firms are fully cloud-based (Thomson Reuters, 2026), which means many are running hybrid on-premise and cloud environments where metadata discipline is even harder to enforce. In those environments, taxonomy governance requires a named owner. A dedicated AI lead helps drive tool adoption and minimizes unauthorized shadow-AI activity. Appoint someone. Give them authority over taxonomy changes, not just advisory influence.
For firms that have done this work well, the returns are clear. High-adoption firms see greater capacity and higher client satisfaction than their peers. That is not because their AI is smarter. It is because their data is cleaner.
#03 The 2026 Mid-Size Firm AI Stack, Honestly Assessed
The best current stack for mid-size law firm AI data structuring is modular. No single tool does everything. Here is what the realistic picture looks like.
General AI foundation: Claude Enterprise runs at $60 to $100 per user per month and gives you enterprise security controls for general drafting and analysis. It has no native Westlaw integration, which limits its value for citation-grounded research.
Legal research and drafting: Thomson Reuters CoCounsel is citation-grounded and typically bundled with Westlaw at roughly $150 to $428 per month per user, depending on the bundle. For firms that need defensible research output, this is the current benchmark.
Contract and transactional structuring: Spellbook runs at approximately $500 per user per month and learns from your precedent library. It is typically deployed for 15 to 30 transactional attorneys rather than firm-wide, which keeps costs manageable.
eDiscovery: Everlaw handles TAR and review workflows and is often positioned around $250 per month in initial conversations, though per-matter costs vary by volume.
Firm-wide knowledge structuring: This is where the stack gets complicated. Microsoft Copilot grounded in Microsoft Graph can work well, but only if your matter-tagging discipline is already in place. If it is not, Copilot surfaces noise as readily as signal.
None of these tools alone solves the underlying problem of connecting case-level knowledge across emails, documents, and prior matters. That is a separate architectural need, and it is where platforms like Casero sit. Casero connects emails, documents, and case management systems into a living knowledge graph, with entity extraction that identifies people, organizations, dates, events, and obligations, then maps the relationships between them. Every fact traces back to the source passage. No black boxes.
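To make "every fact traces back to the source passage" concrete, here is a minimal sketch of a fact store in which a relationship cannot be recorded without its supporting passage. It illustrates the general idea only and is not a description of how Casero or any other product is implemented; the class names, field names, and relation labels are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str     # a person or organization
    relation: str    # hypothetical labels such as "employed_by" or "obligation_due"
    obj: str         # the related entity, date, or obligation
    source_doc: str  # identifier of the document the fact came from
    passage: str     # the exact passage that supports the fact


class KnowledgeGraph:
    def __init__(self) -> None:
        self._facts: list[Fact] = []

    def add(self, fact: Fact) -> None:
        """Store a fact, rejecting any that lacks a supporting passage."""
        if not fact.passage.strip():
            raise ValueError("every fact needs a supporting passage")
        self._facts.append(fact)

    def about(self, entity: str) -> list[Fact]:
        """All facts in which the entity appears as subject or object."""
        return [f for f in self._facts if entity in (f.subject, f.obj)]
```

Making provenance a required field, rather than optional metadata, is what lets a lawyer check an answer against the document it came from.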
#04 What Structured Case Knowledge Actually Looks Like
Abstract descriptions of "structured data" do not help attorneys understand what changes in practice. A concrete before-and-after is more useful.
Before structuring: An associate researching a new employment discrimination matter searches the DMS, gets 400 results sorted by date, spends two hours reading documents that are tangentially relevant, and never learns that a partner handled a nearly identical case three years ago. The prior work product is unreachable because it was never tagged in a way that makes it findable.
After structuring: The same associate runs a plain-English search. The system surfaces the prior case based on matching legislation, factual circumstances, and case classification, with a multi-dimensional score showing exactly why each result matched. The associate reads the prior strategy memo in 15 minutes instead of two hours.
That second scenario requires three things: consistent matter metadata on ingestion, entity extraction that maps the relationships between documents, and a search layer that understands intent rather than keywords. Semantic search that distinguishes between documents that merely mention a statute and documents where that statute is the central issue is a different capability than keyword search. Most firms have keyword search. Very few have the semantic layer.
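The multi-dimensional score in the "after" scenario can be pictured as one number per dimension rather than a single opaque relevance value. The toy sketch below uses plain set overlap (Jaccard similarity) as a stand-in. A real semantic layer would rely on embeddings or a trained model, so treat the scoring method, field names, and sample values as assumptions made for illustration.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of items the two sets have in common; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_breakdown(query: dict, prior: dict) -> dict[str, float]:
    """Score a prior matter against a query on each dimension separately."""
    return {
        "legislation": jaccard(set(query["legislation"]), set(prior["legislation"])),
        "facts": jaccard(set(query["facts"]), set(prior["facts"])),
        "classification": 1.0 if query["classification"] == prior["classification"] else 0.0,
    }


new_matter = {
    "legislation": ["Statute A"],
    "facts": ["termination", "retaliation"],
    "classification": "employment-discrimination",
}
prior_matter = {
    "legislation": ["Statute A", "Statute B"],
    "facts": ["retaliation", "promotion"],
    "classification": "employment-discrimination",
}
breakdown = match_breakdown(new_matter, prior_matter)
```

Keeping the dimensions separate is what allows the interface to show why a prior case surfaced, for example a strong legislation match alongside a weak factual one.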
Moving data into shared, governed assets that produce measurable business outcomes remains a significant challenge for mid-market firms. The gap between having AI tools and having structured, reusable knowledge is where most mid-size firms are currently stuck. See our guide on structured case knowledge for attorneys for a detailed walkthrough of what that transition involves.
#05 Data Privacy Is Not Optional, and Most Firms Are Exposed
Mid-size firms adopting AI data structuring tools face a specific privacy risk that larger firms have learned about the hard way: client data leaking into general AI model training.
Several general-purpose AI tools used in legal contexts have default settings that send user inputs to model training pipelines unless explicitly disabled. Most attorneys do not know this. Most IT directors know but have not yet rolled out firm-wide controls. This is how shadow AI spreads: an attorney pastes a client memo into ChatGPT because the approved tool is slow, and now that confidential information is in a training dataset.
When evaluating any AI tool for mid-size law firm data structuring, ask three specific questions: Does the vendor train on client data by default? Is tenant data isolated, meaning your data cannot be queried by or influence outputs for another firm? What encryption standards apply at rest and in transit?
Casero is explicit on all three: client and firm data is never used to train AI models, each firm's data is held in strict isolation with no cross-firm data sharing, and enterprise-grade encryption applies at rest and in transit. For firms that need to demonstrate compliance to clients or regulators, Casero also maintains an audit trail of every access event, including who queried what and which document produced the answer.
Casero is on a roadmap toward SOC 2 and ISO compliance, with a security whitepaper available on request during pilot onboarding. Certifications are not yet achieved, which is worth knowing before signing a contract. For a broader checklist of what to verify, see the legal AI security checklist for law firms.
#06 Where Governance Makes or Breaks the Investment
Technology without governance degrades. A taxonomy built in 2025 with no enforcement mechanism will look like a mess by 2027, because attorneys will shortcut it, systems will drift, and no one will own the remediation.
Mid-size firms that sustain AI data structuring investments share a few structural traits. They have a named AI lead with real authority, not just a title. They have a documented taxonomy change process that requires approval before new tags are added. They run quarterly audits of matter-tagging compliance, not annual ones. And they have a centralized knowledge library with metadata standards, automated curation, and PII scrubbing before any document enters the shared repository.
The last point deserves emphasis. Firms that dump raw files into a knowledge base without redaction or verification create legal and ethical exposure. A client communication containing privileged material should not be retrievable by every attorney at the firm. Ethical wall adherence has to be built into the structuring layer, not bolted on after the fact.
This requires enforcing the security parameters already set in the firm's existing document management system. If a lawyer cannot access a document in the DMS, they should not be able to query it through the AI layer either. That is the right architecture: permission logic lives once, enforced everywhere, rather than maintained separately in each tool.
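A minimal sketch of "permission logic lives once, enforced everywhere": the AI layer filters its retrieved candidates through the same access check the DMS uses, and denies by default when the DMS has no record of a document. The in-memory `DMS_ACL` table and the function names are hypothetical stand-ins for whatever permission interface the firm's DMS actually exposes.

```python
from __future__ import annotations

# Hypothetical stand-in for the DMS access-control list: document id -> users allowed to read it.
DMS_ACL = {
    "memo-001": {"alice", "bob"},
    "memo-002": {"alice"},
}


def can_read(user: str, doc_id: str) -> bool:
    """Single source of truth for access; unknown documents are denied by default."""
    return user in DMS_ACL.get(doc_id, set())


def retrieve_for_user(user: str, candidate_doc_ids: list[str]) -> list[str]:
    """Drop any retrieved document the user could not open in the DMS."""
    return [doc_id for doc_id in candidate_doc_ids if can_read(user, doc_id)]
```

Filtering before anything reaches the model means a document behind an ethical wall cannot be quoted, summarized, or hinted at in an answer.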
For firms building governance frameworks from scratch, the law firm AI governance framework guide covers the policy, training, and oversight structures that make AI investments durable.
#07 The ROI Math Mid-Size Firms Need to Run
Mid-size firm partners are not wrong to demand a business case before committing to AI data structuring infrastructure. The question is which numbers to use.
The most defensible ROI calculation for mid-size law firm AI data structuring focuses on attorney time recovered from administrative work: time spent searching for prior documents, reconstructing case history, re-researching questions that have already been answered on prior matters, and manually tagging or organizing files. A firm billing at $400 per hour that recovers two hours per attorney per week across 30 attorneys is looking at roughly $1.25 million in recovered capacity per year, assuming that time goes to billable work rather than being absorbed by overhead.
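The arithmetic behind that figure is easy to rerun with your own inputs. The sketch below assumes 52 weeks of recovered time per year, which is the assumption that reproduces the roughly $1.25 million figure; the function name and parameters are illustrative.

```python
def recovered_capacity(
    rate_per_hour: float,
    hours_per_week: float,
    attorneys: int,
    weeks_per_year: int = 52,  # assumption: no allowance for holidays or leave
) -> float:
    """Annual value of recovered attorney time if all of it becomes billable."""
    return rate_per_hour * hours_per_week * attorneys * weeks_per_year


example = recovered_capacity(400, 2, 30)  # 1,248,000
conservative = recovered_capacity(400, 2, 30, weeks_per_year=48)  # 1,152,000
```

Lowering `weeks_per_year` or `hours_per_week` shows quickly how sensitive the business case is to the assumptions behind it.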
Casero's own ROI illustration on their site models a cost of approximately £10,620 per year for 15 lawyers, yielding an estimated net value of £745,380. That is illustrative, not a published price, and your firm's numbers will vary. Run your own model using your billing rates, your attorney count, and a conservative estimate of weekly hours lost to knowledge retrieval.
High-adoption firms already report 65% more capacity than low-adoption peers (Thomson Reuters, 2026). The capacity is there. The question is whether your data infrastructure is structured well enough for AI to find it. For a detailed ROI framework, see law firm AI ROI: making the business case.
Mid-size law firm AI data structuring is not a technology purchase. It is a decision about whether your firm's institutional knowledge is going to be an asset or a liability over the next five years. Firms that build clean taxonomy, enforce consistent metadata, and connect their case data into a structured, searchable layer will extract real capacity from AI tools. Firms that keep adding tools on top of unstructured data will keep running failed pilots.
If your firm's prior work product is currently locked inside untagged documents and unsearchable email threads, the starting point is a living knowledge graph that extracts entities, maps relationships, and links every insight back to the source document. That is precisely what Casero is built to do. Book a demo and ask them to run a pilot on a single practice group's matter history. See how much institutional knowledge is already there, waiting to be retrieved.