Kush Madlani ・ CTO

Wexler at scale: Reasoning at 1 million pages

Litigation relies on documents and, in the digital age, on an ever-increasing number of them.

A commercial dispute routinely involves hundreds of thousands of files: emails, contracts, board minutes, deposition transcripts, spreadsheets, WhatsApp messages. Much of the work of running the case is reading them, understanding what happened, and assembling a chronology that reflects this accurately. That work used to be done by a team of paralegals, trainees, and associates. Now AI supports it.

While most AI tools handle a hundred pages well, and 5,000 acceptably, they handle 50,000 badly. Past that, they tend to stop being useful.

This post explains why that happens, and describes the technology that lets Wexler handle more than 1,000,000 pages.

The document volume problem(s)

There are five problems every AI tool working at this scale has to solve. None of them are fixed when a new model is released: they're properties of the problem.

1. What the model can see at once

Models hold a finite number of tokens at a time (a token being roughly three-quarters of a word). Context windows have grown significantly, and the most capable models today can hold around one million tokens, or approximately 750,000 words; call it 1,500 pages of evidence. That sounds substantial until you set it against the scale of a serious commercial dispute, which can run to tens of millions of pages. Even the largest context window available today covers a fraction of a percent of a typical large disclosure corpus, and that gap is not closing in any meaningful practical sense: document volumes grow with the disputes, and the architectural constraints on context size are real.
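The arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch only: the 500 words per page is an assumption implied by the figures in the paragraph (750,000 words is about 1,500 pages), and the 10,000,000-page corpus is an illustrative size, not a real matter.

```python
TOKENS_PER_WINDOW = 1_000_000  # context size quoted in the post
WORDS_PER_TOKEN = 0.75         # "roughly three-quarters of a word"
WORDS_PER_PAGE = 500           # assumption implied by 750,000 words ~ 1,500 pages


def pages_per_window(tokens: int = TOKENS_PER_WINDOW) -> float:
    """Approximate number of pages that fit in one context window."""
    words = tokens * WORDS_PER_TOKEN
    return words / WORDS_PER_PAGE


def share_of_corpus(corpus_pages: int, tokens: int = TOKENS_PER_WINDOW) -> float:
    """Fraction of a corpus that a single context window can hold."""
    return pages_per_window(tokens) / corpus_pages


print(pages_per_window())                      # 1500.0
print(f"{share_of_corpus(10_000_000):.4%}")    # 0.0150%
```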

This means every time a model is called, a decision must be made about what to put in front of it and what to omit. Get that decision wrong and the model reasons from the wrong evidence. The work of deciding what the model sees, and ensuring those decisions are right, is most of what determines whether the output is accurate and comprehensive.
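One simple way to picture that decision is as packing under a token budget. The sketch below is an illustration, not Wexler's method: `Chunk`, its `relevance` score, and the greedy strategy are all hypothetical, and the token estimate just reuses the three-quarters-of-a-word rule of thumb.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    relevance: float  # hypothetical score from some upstream retrieval step


def estimate_tokens(text: str) -> int:
    # Rule of thumb from the post: one token is about 3/4 of a word.
    return max(1, round(len(text.split()) / 0.75))


def select_for_context(chunks, token_budget):
    """Greedily keep the most relevant chunks that still fit the budget."""
    chosen, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost <= token_budget:
            chosen.append(chunk)
            used += cost
    return chosen, used
```

Everything not chosen is evidence the model never sees, which is why the quality of the relevance score matters more than the packing loop itself.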

2. The cost of reading

Most disclosure material doesn't arrive as readable text. Scanned PDFs, image files, and photographs of documents have to be converted into machine-readable text before any model can reason over them, a process that runs as a paid call to an external service. Beyond that, every document in the corpus has to be processed, chunked, and indexed so that the right material can be retrieved when needed. None of this is expensive per document. At the scale of a large disclosure exercise (hundreds of thousands of documents, often tens of millions of pages) it adds up to a material line in the matter budget, and that's before a single model inference has been run.

3. The cost of querying

The most capable foundation models are the most expensive to run. A single call to the best model costs several times as much as a call to a mid-tier model, and on most tasks the gap in quality between the two is much smaller than the gap in price.

Routing every document through the best model for every task would spend the matter budget on extraction alone. An efficient pipeline runs an appropriate share of the work on cheaper models, and a meaningful share with no model at all (using traditional machine learning). The discipline of asking, for every call, whether it's the right model for the task is the difference between a pipeline that runs at scale and one that runs out of money halfway through.
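The routing idea can be sketched as a lookup from task to the cheapest tier that is good enough. Everything here is hypothetical: the task names, the tier assignments, and the relative costs are placeholders chosen for illustration, not real prices or Wexler's actual routing table.

```python
from enum import Enum


class Tier(Enum):
    NO_MODEL = "no_model"  # rules or traditional machine learning
    MID_TIER = "mid_tier"
    BEST = "best"


# Illustrative relative cost per call. These are NOT real prices.
RELATIVE_COST = {Tier.NO_MODEL: 0.0, Tier.MID_TIER: 1.0, Tier.BEST: 5.0}

# Hypothetical tasks mapped to the cheapest tier assumed to be adequate.
ROUTES = {
    "detect_file_type": Tier.NO_MODEL,
    "extract_dates": Tier.MID_TIER,
    "reconcile_conflicting_accounts": Tier.BEST,
}


def route(task: str) -> Tier:
    # Unknown tasks fall back to the most capable tier.
    return ROUTES.get(task, Tier.BEST)


def estimated_cost(task_counts: dict) -> float:
    """Total relative cost of a workload given {task_name: number_of_calls}."""
    return sum(RELATIVE_COST[route(task)] * n for task, n in task_counts.items())


workload = {
    "detect_file_type": 1000,
    "extract_dates": 1000,
    "reconcile_conflicting_accounts": 10,
}
print(estimated_cost(workload))                            # 1050.0
print(sum(workload.values()) * RELATIVE_COST[Tier.BEST])   # 10050.0 if everything used the best tier
```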

4. Every fact has to be sourced to the evidence

Given that legal AI output is used to draft pleadings, witness statements, expert briefings, and prepare for cross-examination, the acceptable error rate isn't "almost right". A fact that's wrong by a single date, or a person, or a document attribution can change the outcome of the case.

Two things follow. The first is that every fact has to be sourced back to the specific sentence in the document that supports it, and the traceability has to survive every step of the pipeline. If a fact is extracted at stage one, summarised at stage three, and quoted in a chronology at stage six, the link back to the original sentence has to be intact at every stage. The second is that the upstream engineering has to be designed so that errors don't compound. A small mistake in document classification at ingest can cascade into a much larger mistake in the chronology if the architecture doesn't catch it.
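A minimal sketch of what "the link has to be intact at every stage" can look like in data terms, assuming character offsets into the source document are used as the pointer. The class and field names are hypothetical; the point is only that a later stage may change the wording of a fact but never its source.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SourceSpan:
    doc_id: str
    start: int  # character offset where the supporting sentence begins
    end: int    # character offset just past where it ends


@dataclass(frozen=True)
class Fact:
    text: str
    source: SourceSpan
    stage: str


def quote(fact: Fact, documents: dict) -> str:
    """Return the exact sentence in the original document that supports the fact."""
    span = fact.source
    return documents[span.doc_id][span.start:span.end]


def summarise(fact: Fact, summary: str) -> Fact:
    # A later stage rewrites the wording but carries the same source forward.
    return replace(fact, text=summary, stage="summarised")
```

Because `Fact` is frozen, a stage cannot silently overwrite the source; it has to produce a new fact, and `replace` copies the span across unless told otherwise.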

5. Reconciling references to the same event

The hardest of the five. Imagine a witness statement that says the meeting was on Tuesday the 14th, an email dated the 21st referring to "our discussion last week", and a board minute about the kick-off. Three references, possibly to one event, or possibly to three separate events. Working out which is the difference between a usable chronology and a list of fragments. A person can do it, and on a small set, most legal AI tools can do it. On a real disclosure set, with a hundred thousand factual references to reconcile, neither can do it accurately by comparing every single fact to every other.

Comparing every fact to every other fact is what computer scientists call an n-by-n, or quadratic, problem. A thousand facts means a million comparisons. A hundred thousand means ten billion. No amount of compute handles ten billion comparisons for a single case at a workable price. Yet in high-stakes litigation, missing a contradiction between two accounts of the same event can mean the difference between winning and losing.
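The numbers are easy to reproduce. The n-by-n figure counts every ordered comparison; counting each unordered pair once roughly halves it, which does not change the shape of the problem.

```python
def all_vs_all(n: int) -> int:
    """Every fact compared with every fact: n times n."""
    return n * n


def unique_pairs(n: int) -> int:
    """Each unordered pair of distinct facts counted once: n(n-1)/2."""
    return n * (n - 1) // 2


for n in (1_000, 100_000):
    print(f"{n:,} facts: {all_vs_all(n):,} all-vs-all, {unique_pairs(n):,} unique pairs")
# 1,000 facts: 1,000,000 all-vs-all, 499,500 unique pairs
# 100,000 facts: 10,000,000,000 all-vs-all, 4,999,950,000 unique pairs
```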

Wexler’s Fact Ontology

Wexler is the product. AI is the engine in Wexler that reads natural language. Most of what makes Wexler work at scale is the work that happens around the AI, and most of that happens before the AI sees a document.

When a file arrives, Wexler reads it before the AI does. The file extension describes the file. The metadata adds to it. After OCR, the layout adds further information: where the headers are, where the footnotes sit, where the signature block lives. By the time the AI sees the document, Wexler already knows it's a transcript, an MSG email file with 17 attachments, a Teams thread where the relevant exchange runs across a dozen messages over three days, or a spreadsheet. Each one is handled differently, so transcripts are read with awareness of who's speaking, emails are reconstructed in the order they were sent, WhatsApp messages are reassembled across the timeline, spreadsheets are read cell by cell, and so on.
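The first step of that, routing on file extension, can be sketched as a dispatch table. The handler names and the extension list are hypothetical, and this sketch deliberately stops at the extension; the metadata and post-OCR layout signals described above would refine the choice further.

```python
from pathlib import Path


# Hypothetical handlers: each stands in for a type-specific reading strategy.
def handle_email(name: str) -> str:
    return f"email handler (thread order, attachments): {name}"


def handle_spreadsheet(name: str) -> str:
    return f"spreadsheet handler (cell by cell): {name}"


def handle_generic_text(name: str) -> str:
    return f"generic text handler: {name}"


# Assumed mapping from file extension to handler, for illustration only.
HANDLERS = {
    ".msg": handle_email,
    ".eml": handle_email,
    ".xlsx": handle_spreadsheet,
    ".csv": handle_spreadsheet,
}


def choose_handler(filename: str):
    """Pick a handler from the extension, falling back to plain text."""
    ext = Path(filename).suffix.lower()
    return HANDLERS.get(ext, handle_generic_text)


print(choose_handler("Board Pack.MSG")("Board Pack.MSG"))
# email handler (thread order, attachments): Board Pack.MSG
```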

This sounds obvious, but is rarely done by legal AI tools. A general-purpose AI tool treats every document as text, because building 15 different document handlers is a lot of work that doesn't pay off unless documents are what you do.

As much as possible happens without AI at all. There's a family of older techniques (semantic splitting, finding the dense parts of a document, picking out where defined terms cluster in a contract) that work better and cost less than asking the AI to do the same thing.

This all runs in parallel. A million pages process across distributed compute that scales up when the job is big and shuts down when it's done. Asynchronous queues manage the dependency between services so a slow OCR call doesn't block everything else.
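On a single machine, the same idea can be shown with Python's asyncio: several workers pull from one queue, so a slow OCR call occupies one worker while the others keep going. This is a toy sketch under stated assumptions, not a distributed system; `ocr` is a stand-in that only simulates a slow external call, and the file names are made up.

```python
import asyncio


async def ocr(doc: str) -> str:
    # Stand-in for a slow external OCR call; scans are simulated as slower.
    await asyncio.sleep(0.05 if doc.startswith("scan") else 0)
    return f"text of {doc}"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    while True:
        doc = await queue.get()
        try:
            results[doc] = await ocr(doc)
        finally:
            queue.task_done()  # mark the item finished even if OCR raised


async def process(docs, n_workers: int = 4) -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    results: dict = {}
    for doc in docs:
        queue.put_nowait(doc)
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    await queue.join()  # wait until every queued document has been handled
    for w in workers:
        w.cancel()      # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(asyncio.run(process(["scan-1.pdf", "email-1.msg", "email-2.msg"])))
```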

Back to the meeting on the 14th.

Comparing every fact to every other doesn't scale, so we use a technique borrowed from how search engines indexed the early web. Each fact gets reduced to a short signature designed so that similar facts end up with similar signatures. Instead of comparing every fact to every other, you only compare facts whose signatures are close. The comparison space collapses from quadratic to near-linear: ten billion comparisons become something closer to a hundred thousand.
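The post does not name the technique. One well-known member of this family is MinHash with locality-sensitive hashing, which was used for near-duplicate detection on web pages, so the sketch below uses it as an assumption rather than as a description of Wexler's implementation. The parameters (32 hashes, 16 bands, 4-character shingles) are arbitrary illustrative choices.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 32  # length of each signature
BANDS = 16       # signature is split into 16 bands of 2 values each


def shingles(text: str, k: int = 4) -> set:
    """Overlapping k-character pieces of the normalised text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def minhash(text: str) -> tuple:
    """Signature: for each seeded hash, the minimum value over all shingles."""
    pieces = shingles(text)
    signature = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(p.encode(), digest_size=8, salt=salt).digest(), "little")
            for p in pieces))
    return tuple(signature)


def candidate_pairs(facts: list) -> set:
    """Pairs of fact indices that share at least one identical signature band."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for idx, fact in enumerate(facts):
        sig = minhash(fact)
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))  # only compare within a bucket
    return pairs
```

Only the pairs returned by `candidate_pairs` go on to a detailed comparison. Similar facts are likely, not guaranteed, to share a bucket, so the band settings trade missed matches against wasted comparisons.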

Each fact keeps its attribution through this process, so "Witness A says the meeting was on the 14th" and "the email confirms the meeting was on the 14th" are held as two assertions about the same event, not merged into one.

The output is a chronology where the meeting on the 14th, the discussion last week and the kick-off resolve to a single event with three sources attached. If they're actually three different events, that shows up too. Inconsistencies surface rather than remain hidden, and new production extends the chronology rather than breaking it.
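In data terms, "held as two assertions, not merged into one" might look like the sketch below. The class names are hypothetical, and the third assertion with a different date is invented purely to show a conflict surfacing; it is not from the example above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    source: str                       # who or what makes the claim
    claim: str                        # what is asserted, in the source's terms
    claimed_date: Optional[str] = None


@dataclass
class Event:
    label: str
    assertions: list = field(default_factory=list)

    def add(self, assertion: Assertion) -> None:
        # Assertions are appended, never merged, so attribution is preserved.
        self.assertions.append(assertion)

    def sources(self) -> list:
        return [a.source for a in self.assertions]

    def has_date_conflict(self) -> bool:
        dates = {a.claimed_date for a in self.assertions if a.claimed_date}
        return len(dates) > 1


meeting = Event("meeting")
meeting.add(Assertion("Witness A", "the meeting was on the 14th", "the 14th"))
meeting.add(Assertion("Email", "confirms the meeting was on the 14th", "the 14th"))
print(meeting.sources(), meeting.has_date_conflict())  # ['Witness A', 'Email'] False

# Hypothetical third source with a different date, to show a conflict surfacing.
meeting.add(Assertion("Board minute", "records the kick-off on the 15th", "the 15th"))
print(meeting.has_date_conflict())  # True
```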

Specialism matters

A general-purpose AI tool tries to be useful for everything. That's one approach, but it makes the per-document-type engineering hard to justify. If you're building for everyone, you can't spend a year getting transcript handling right.

We can, because documents and facts are what we do. The depth is possible because litigation is the only thing for which the tool exists.

As a result, Wexler is being used in disputes worth hundreds of millions to billions, by teams of more than a hundred lawyers on a single matter. The factual record is only as good as the work that goes into joining it up.
