The Data Most AI Can't Read: Inside Our Latest Research on Structured Legal Evidence

Our mission at Altumatim is to help our clients find the most important information in any dataset so they can get the best possible result for their clients. Sometimes, the most decisive evidence isn't in the documents that read like documents. It's in the spreadsheets, the logs, and the exports: the structured data that sits quietly in the corner of a production set until someone realizes the answer was in there the whole time.

A general ledger, for example, can easily have several hundred thousand rows. Sales pipeline information may occupy tens of thousands of rows. Even a cap table can sprawl across many rows and columns. In any of those, the meaningful pattern only emerges when you connect a name in column A to a date in column F to an amount in column J. This is the kind of data that sometimes wins or loses cases. And it's the data most AI systems can't read.

This is a foundational gap. And it's the subject of new research from our team, Structure-Aware Chunking for Tabular Data in Retrieval-Augmented Generation, led by Pooja Guttal with academic guidance from Dr. Manas Gaur at the University of Maryland, Baltimore County, and supported by the Altumatim research team.

Why Conventional AI Fails on Structured Data

Most modern legal AI systems are built on a technique called retrieval-augmented generation, or RAG. The idea is simple. Instead of asking a language model to answer from memory, you give it relevant excerpts of your data first, and it generates an answer grounded in what you've shown it. RAG is the quiet engine behind most AI in legal work today.

But RAG depends on a step most people never think about: chunking. Before any retrieval can happen, your data has to be broken into pieces small enough for the system to work with. And here is where the trouble starts. Conventional chunking was designed for prose. It treats your data as a long ribbon of text and slices it into segments based on character count. For a contract or a deposition transcript, that works reasonably well. For a spreadsheet, it does not.

Rows get fractured. A single ledger row might describe the who, what, why, and when of a transaction, and the links and quantitative relationships among those entries only emerge when the row is read as a whole. Conventional chunking will happily split that row across two or three pieces, severing the relationships between fields. Headers suffer the same fate: the information that explains what a value means gets stranded in one chunk while the actual values end up in another. The data becomes ambiguous at exactly the moment it needs to be precise.
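To make the problem concrete, here is a minimal sketch of character-count chunking applied to a tiny ledger export. The ledger rows, the 60-character window, and the function name `chunk_by_characters` are all assumptions invented for illustration; production chunkers are more elaborate, but the failure mode is the same.

```python
# Hypothetical ledger export, invented purely for illustration.
LEDGER_CSV = (
    "date,vendor,account,amount\n"
    "2023-01-05,Acme Supply,Consulting,12500.00\n"
    "2023-01-09,Northwind LLC,Equipment,48000.00\n"
    "2023-02-14,Acme Supply,Consulting,12500.00\n"
)


def chunk_by_characters(text, size):
    """Slice text into fixed-size pieces, ignoring row and field boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


chunks = chunk_by_characters(LEDGER_CSV, 60)
header, *rows = LEDGER_CSV.splitlines()

# A row survives only if its full text lands inside a single chunk.
fractured = [row for row in rows if not any(row in chunk for chunk in chunks)]

# Chunks that carry values but not the header line that names them.
chunks_missing_header = [chunk for chunk in chunks if header not in chunk]

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {chunk!r}")
print(f"{len(fractured)} of {len(rows)} rows were split across chunks")
print(f"{len(chunks_missing_header)} of {len(chunks)} chunks lost the header")
```

Even on this toy input, most rows end up cut in two and most chunks carry values with no column names attached. That is the ambiguity described above, in miniature.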

The AI may look like it is working. It returns answers. It cites passages. But it is missing the connections that actually matter, and the user has no easy way to see what has been lost.

The relationships between fields are what make structured data valuable, yet conventional RAG approaches strip those relationships out. Once they disappear, otherwise useful data turns into noise.

A Different Approach: Reading Data the Way It's Structured

Our research proposes a different approach. Instead of treating tabular data as text to be sliced, we treat each row as a structured unit and preserve its relationships from the start.

Each row is converted into a key-value representation, where every value stays paired with its column header. Rows are then organized into a hierarchical tree that reflects the structure of the underlying file: sheets, tables, rows, fields. When chunks need to be created, the system splits along structural boundaries, not arbitrary character counts. Related fields stay together. Headers stay attached to values. Rows are not fractured.

We call this Structure-Aware Tabular Chunking, or STC.
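The idea described above can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not the implementation from the paper: the class names (`Row`, `Table`, `Sheet`), the `key: value` text format, the bracketed context prefix, and the `max_chars` budget are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    fields: dict  # column header -> value, so every value keeps its label

    def to_text(self):
        return "; ".join(f"{key}: {value}" for key, value in self.fields.items())


@dataclass
class Table:
    name: str
    rows: list = field(default_factory=list)


@dataclass
class Sheet:
    name: str
    tables: list = field(default_factory=list)


def build_table(name, header, records):
    """Pair each value with its column header, one Row per record."""
    return Table(name, [Row(dict(zip(header, record))) for record in records])


def chunk_table(sheet_name, table, max_chars):
    """Group whole rows into chunks; a row is never split.

    A single row longer than max_chars still becomes its own chunk.
    """
    prefix = f"[sheet={sheet_name} table={table.name}]"
    chunks, current, current_len = [], [prefix], len(prefix)
    for row in table.rows:
        text = row.to_text()
        if len(current) > 1 and current_len + len(text) + 1 > max_chars:
            chunks.append("\n".join(current))
            current, current_len = [prefix], len(prefix)
        current.append(text)
        current_len += len(text) + 1
    if len(current) > 1:
        chunks.append("\n".join(current))
    return chunks


# Hypothetical ledger data, invented purely for illustration.
header = ["date", "vendor", "account", "amount"]
records = [
    ["2023-01-05", "Acme Supply", "Consulting", "12500.00"],
    ["2023-01-09", "Northwind LLC", "Equipment", "48000.00"],
    ["2023-02-14", "Acme Supply", "Consulting", "12500.00"],
]
sheet = Sheet("GL 2023", [build_table("ledger", header, records)])
chunks = [c for t in sheet.tables for c in chunk_table(sheet.name, t, 200)]

for chunk in chunks:
    print(chunk)
    print("---")
```

The design choice worth noticing is that the size budget decides only how many rows share a chunk, never where a row is cut. Each chunk also carries its sheet and table context, so a retrieved piece can be read on its own.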

To validate the approach, we tested it on the Merger Agreement Understanding Dataset, a public benchmark of real legal agreements drawn from SEC filings. The results were significant on every dimension we measured:

More than double the accuracy in retrieving the correct answer on the first try, with significant gains across other retrieval approaches as well

Substantially sharper ranking of the most relevant results

A leaner index with 40 to 56 percent less noise

Roughly five times faster processing

These are not merely incremental improvements. They reflect the difference between an AI that can read your structured data and one that has been guessing at it.
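For readers who want a feel for what "correct on the first try" and "sharper ranking" mean in code, here is a small sketch of two common retrieval metrics: top-1 hit rate and mean reciprocal rank. Treat this as an assumption about how such quantities are typically computed; the paper itself defines the exact metrics used. The query and chunk identifiers below are made up and are not results from the research.

```python
def hit_rate_at_1(ranked_results, relevant):
    """Share of queries whose top-ranked chunk is a relevant one."""
    hits = sum(
        1
        for query, ranked in ranked_results.items()
        if ranked and ranked[0] in relevant[query]
    )
    return hits / len(ranked_results)


def mean_reciprocal_rank(ranked_results, relevant):
    """Average of 1/position of the first relevant chunk (0 if none found)."""
    total = 0.0
    for query, ranked in ranked_results.items():
        for position, chunk_id in enumerate(ranked, start=1):
            if chunk_id in relevant[query]:
                total += 1.0 / position
                break
    return total / len(ranked_results)


# Made-up retrieval output for two queries, for illustration only.
ranked_results = {
    "q1": ["c3", "c1"],  # relevant chunk is ranked first
    "q2": ["c2", "c7"],  # relevant chunk is ranked second
}
relevant = {"q1": {"c3"}, "q2": {"c7"}}

print("top-1 hit rate:", hit_rate_at_1(ranked_results, relevant))
print("mean reciprocal rank:", mean_reciprocal_rank(ranked_results, relevant))
```

Top-1 hit rate rewards only a correct first result, while mean reciprocal rank gives partial credit when the right chunk appears a little further down the list.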

What This Means in Practice

Research is only as valuable as what it enables. Here is where this work shows up in real legal matters.

In an investigation, the answer is almost never in one document. It is in the connection between a code or description, a date, and a transaction. Structure-aware chunking is what lets altumatimOS hold those connections together when other systems lose them.

In eDiscovery, structured data productions are often the source of the most consequential analyses and decisions. Reading them precisely, at scale, without fragmenting the relationships between fields, is the difference between defensible review and reviewable defense.

In litigation, the patterns that reframe a witness's testimony or shift a strategy frequently live in structured data: call detail records, financial logs, inventory exports. Surfacing those patterns reliably requires AI that respects the structure, not one that flattens it.

This is the engineering layer underneath what altumatimOS does. It is not a feature you will see on a screen. It is the reason the answers you get from the platform are answers you can act on.

What's Next

This research validates the approach on legal agreements, but the method generalizes. We are already extending it to other tabular sources our customers work with daily: production manifests, communications metadata, financial exports, ledgers and inventories.

Through our research, we make sure the platform our customers depend on is built on something durable.

If you would like to read the full paper, you can find it on arXiv.
