Why Recursive Character Splitting Fails for Legal Clauses
And how clause-aware semantic chunking increased retrieval accuracy by 47%.
The Default Approach: Recursive Character Splitting
Most AI orchestration frameworks ship with a simple default chunking strategy: recursively split the text on a hierarchy of separators (paragraph breaks, line breaks, sentences, and finally raw characters) until every chunk fits a fixed size budget (e.g., 500 tokens), keeping some overlap (e.g., 50 tokens) between consecutive chunks. This approach works reasonably well for blog posts, knowledge base articles, and general-purpose documents where the relevant context is self-contained within a few paragraphs.
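As a point of reference, here is a minimal sketch of that default strategy. It is only an approximation of what splitters such as LangChain's RecursiveCharacterTextSplitter do: lengths are measured in characters rather than tokens, and the separator hierarchy, chunk size, and overlap values are illustrative.

```python
# Minimal sketch of recursive character splitting (character-based for simplicity).
SEPARATORS = ["\n\n", "\n", ". ", " "]  # coarsest to finest; single characters are the fallback

def recursive_split(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters, with some overlap."""
    if len(text) <= chunk_size:
        return [text]
    # Pick the coarsest separator that actually occurs in the text.
    sep = next((s for s in SEPARATORS if s in text), "")
    pieces = text.split(sep) if sep else list(text)

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # A single piece is still too big: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, overlap))
            current = ""
        else:
            # Start the next chunk with a tail of the previous one as overlap.
            tail = chunks[-1][-overlap:] + sep if chunks else ""
            current = tail + piece if len(tail + piece) <= chunk_size else piece
    if current:
        chunks.append(current)
    return chunks
```

The key property to notice is that the splitter knows nothing about the document: it cuts wherever the size budget runs out, regardless of what the surrounding clause means.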
But legal documents are different. Fundamentally different.
Why Legal Documents Break Naive Chunking
A typical commercial contract has several structural properties that make character-based splitting catastrophic:
- Cross-referencing definitions: A clause on page 12 might reference a definition on page 2 ("as defined in Section 1.3(b)"). If your chunk doesn't include that definition, the AI has no way to resolve the reference (see the diagnostic sketch after this list).
- Multi-page clauses: A single indemnification clause can span 3-4 pages. Splitting it at 500 tokens means the AI sees fragments of the obligation without understanding the full scope.
- Nested conditions: Legal language uses deep nesting ("Subject to Section 4.2, except where the conditions of Section 7.1(a)(ii) are met..."). Breaking these mid-sentence destroys the logical chain.
- Schedules and exhibits: Key information lives in appendices that are semantically linked to the main body but structurally separate.
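The cross-reference problem in particular is easy to measure. The sketch below is a hypothetical diagnostic, not part of our pipeline: it scans each chunk for citations like "Section 1.3(b)" and flags the ones whose target section header is missing from that same chunk. The regex and the assumption that chunks are plain strings are simplifications.

```python
import re

# Illustrative pattern: captures the numeric part of citations like "Section 1.3(b)".
REFERENCE_RE = re.compile(r"Section\s+(\d+(?:\.\d+)*)")

def dangling_references(chunks: list[str]) -> list[tuple[int, str]]:
    """Return (chunk_index, section_id) pairs where a chunk cites a section whose
    own header (a line beginning with that number) is not present in the chunk."""
    dangling = []
    for i, chunk in enumerate(chunks):
        for ref in set(REFERENCE_RE.findall(chunk)):
            header_re = re.compile(rf"^\s*{re.escape(ref)}\b", re.MULTILINE)
            if not header_re.search(chunk):
                dangling.append((i, ref))
    return dangling

# Usage: dangling_references(chunks), where `chunks` comes from any naive splitter.
```

A check like this makes the dangling-reference problem visible before you ever measure retrieval quality.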
Our Clause-Aware Algorithm
Instead of splitting by character count, we built a chunking strategy that understands legal document structure (a simplified code sketch follows this list):
- Clause boundary detection: We parse section headers, numbered paragraphs, and defined term markers to identify natural clause boundaries.
- Definition linking: When a chunk references a defined term, we automatically prepend the definition to the chunk's metadata, giving the AI full context.
- Cross-reference resolution: Section references ("as per Section 3.2") are resolved and the referenced text is included as supplementary context.
- Hierarchical embedding: We embed at multiple granularity levels — full clause, sub-clause, and paragraph — so retrieval can match at the appropriate level of specificity.
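A simplified sketch of the first three steps appears below, followed by a helper that shows how hierarchical embedding units could be derived. Every regex, the Clause record, and the function names are illustrative assumptions rather than our production code, which also handles schedules, exhibits, sub-clause markers, and the embedding calls themselves.

```python
import re
from dataclasses import dataclass, field

# Illustrative patterns; real contracts need far more robust header and
# defined-term detection (curly quotes, roman numerals, schedules, etc.).
SECTION_HEADER_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\s+([A-Z][^\n]*)$", re.MULTILINE)
DEFINED_TERM_RE = re.compile(r'"([A-Z][A-Za-z ]+)"\s+means')
CROSS_REF_RE = re.compile(r"Section\s+(\d+(?:\.\d+)*)")

@dataclass
class Clause:
    section_id: str      # e.g. "7.1"
    title: str           # e.g. "Indemnification"
    text: str
    definitions: dict[str, str] = field(default_factory=dict)          # linked defined terms
    referenced_sections: dict[str, str] = field(default_factory=dict)  # resolved cross-refs

def split_into_clauses(contract: str) -> list[Clause]:
    """Clause boundary detection: cut the document at numbered section headers."""
    headers = list(SECTION_HEADER_RE.finditer(contract))
    clauses = []
    for i, match in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(contract)
        clauses.append(Clause(match.group(1), match.group(2).strip(),
                              contract[match.start():end].strip()))
    return clauses

def link_context(clauses: list[Clause]) -> list[Clause]:
    """Definition linking and cross-reference resolution."""
    definitions = {}  # defined term -> full text of its defining clause
    for clause in clauses:
        for term in DEFINED_TERM_RE.findall(clause.text):
            definitions[term] = clause.text
    by_id = {clause.section_id: clause for clause in clauses}

    for clause in clauses:
        # Attach the definition of every defined term this clause uses.
        for term, defining_text in definitions.items():
            if term in clause.text and defining_text != clause.text:
                clause.definitions[term] = defining_text
        # Attach the text of every section this clause cites.
        for ref in CROSS_REF_RE.findall(clause.text):
            if ref in by_id and ref != clause.section_id:
                clause.referenced_sections[ref] = by_id[ref].text
    return clauses

def embedding_units(clause: Clause) -> list[tuple[str, str]]:
    """Hierarchical embedding: emit (granularity, text) pairs so retrieval can match
    at the clause or paragraph level; sub-clause markers ("(a)", "(b)") and the
    actual embedding calls are omitted for brevity."""
    units = [("clause", clause.text)]
    units += [("paragraph", p.strip()) for p in clause.text.split("\n\n") if p.strip()]
    return units
```

Because each Clause carries its own definitions and referenced sections, a retrieval hit on a single sub-clause can still hand the model the full obligation, which is exactly the context that naive splitting throws away.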
Results: 47% Accuracy Improvement
In a deployment for a 150-attorney law firm, we compared our clause-aware chunking against the default recursive character splitter across 500 test queries based on real associate research questions.
The improvement was most dramatic for queries involving cross-referenced obligations and multi-condition clauses — exactly the kind of questions that junior associates spend hours researching manually.
Key Takeaways
If you're building an AI system for legal documents, the chunking strategy matters more than the model. A smaller model with great chunking will outperform a frontier model with naive splitting every time. The "Last Mile" in legal AI isn't about intelligence — it's about context preservation.