Wexler at scale: Reasoning at 1 million pages
Litigation relies on documents and, in the digital age, on an ever-increasing number of them.
A commercial dispute routinely involves hundreds of thousands of files: emails, contracts, board minutes, deposition transcripts, spreadsheets, WhatsApp messages. Much of the work of running the case is reading them, understanding what happened, and assembling a chronology that reflects this accurately. That work used to be done by a team of paralegals, trainees, and associates. Now AI supports it.
While most AI tools handle a hundred pages well, and 5,000 acceptably, they handle 50,000 badly. Past that, they tend to stop being useful.
This post explains why that happens, and describes the technology that lets Wexler handle more than 1,000,000 pages.
The document volume problem(s)
There are five problems every AI tool working at this scale has to solve. None of them are fixed when a new model is released: they're properties of the problem.
- What the model can see at once
Models that read text hold a few hundred thousand tokens at a time. A token is roughly three-quarters of a word, so a model with a 200,000-token context window can hold around 150,000 words, or about 300 pages of evidence. A large commercial dispute can run to tens of millions of pages, which means tens of billions of tokens.
This means the ordinary LLM can't see the entire document corpus at once. Every time the model is called, a decision has to be made regarding what to put in front of it, and what to omit. Get that decision wrong and the model is reasoning from the wrong evidence. The work of deciding what the model sees, and making sure the decisions are right, is most of what determines whether the output is accurate and comprehensive.
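The arithmetic above can be checked with a short sketch. The constants are the rough averages quoted in the text (three-quarters of a word per token, a 200,000-token window, and the 500 words per page implied by "150,000 words, or about 300 pages"); the function names are illustrative only.

```python
# Back-of-envelope figures taken from the text above; all are rough averages.
WORDS_PER_TOKEN = 0.75      # "roughly three-quarters of a word"
CONTEXT_TOKENS = 200_000    # example context window
WORDS_PER_PAGE = 500        # implied by 150,000 words ~ 300 pages


def pages_per_window(context_tokens: int = CONTEXT_TOKENS) -> float:
    """Approximate number of pages that fit in one context window."""
    words = context_tokens * WORDS_PER_TOKEN
    return words / WORDS_PER_PAGE


def tokens_for_pages(pages: int) -> float:
    """Approximate token count for a corpus of the given page count."""
    return pages * WORDS_PER_PAGE / WORDS_PER_TOKEN


def windows_needed(pages: int) -> float:
    """How many full context windows the corpus would fill."""
    return pages / pages_per_window()


if __name__ == "__main__":
    print(pages_per_window())
    print(round(tokens_for_pages(1_000_000)))
    print(round(windows_needed(1_000_000)))
```

On these assumptions a million pages is roughly 667 million tokens, or more than 3,000 full context windows, which is why the choice of what goes into each window matters so much.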
- The cost of reading
Most disclosure material doesn't arrive as readable text. It arrives as scanned PDFs, image files, and photographs of documents, and the model can't read any of it directly. Optical character recognition (OCR) turns those images into machine-readable text, and it runs as a paid call to an external service.
The unit cost of OCR is small, but the volume isn't. At one call per page, a corpus of 200,000 documents averaging 50 pages each is 10 million OCR calls. Ten million calls at a few cents apiece is a meaningful share of any matter budget, and that's before any model has been invoked.
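To make the volume concrete, here is a minimal sketch of the same calculation. It assumes one OCR call per page, and the price of 2 cents per page is a hypothetical stand-in for "a few cents", not a quoted rate.

```python
def ocr_budget(documents: int, avg_pages: int, cents_per_page: int) -> tuple[int, int]:
    """Return (OCR calls, total cost in cents), assuming one OCR call per page."""
    calls = documents * avg_pages
    return calls, calls * cents_per_page


if __name__ == "__main__":
    # 2 cents per page is a hypothetical price, used only for illustration.
    calls, cents = ocr_budget(200_000, 50, cents_per_page=2)
    print(f"{calls:,} calls, {cents / 100:,.0f} currency units")
```

At that assumed price the 10 million calls come to 200,000 currency units before a single model call is made.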
- The cost of querying
The most capable foundation models are the most expensive to run. A single call to the best model costs several times as much as a call to a mid-tier model, and on most tasks the gap in quality between the two is much smaller than the gap in price.
Routing every document through the best model for every task would spend the matter budget on extraction alone. An efficient pipeline runs an appropriate share of the work on cheaper models, and a meaningful share with no model at all (using traditional machine learning). The discipline of asking, for every call, whether it's the right model for the task is the difference between a pipeline that runs at scale and one that runs out of money halfway through.
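One way to picture that discipline is a small routing function that picks the cheapest tier expected to handle a task. This is a sketch under stated assumptions, not Wexler's actual router: the task labels, the difficulty score, and the threshold are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    NO_MODEL = "no_model"   # rules or traditional machine learning
    MID = "mid_tier_model"
    BEST = "best_model"


@dataclass(frozen=True)
class Task:
    kind: str          # hypothetical task label
    difficulty: float  # hypothetical score in [0, 1] from a cheap heuristic


# Hypothetical set of task kinds that never need a language model.
NO_MODEL_KINDS = {"detect_file_type", "split_sections", "dedupe_exact"}


def route(task: Task, hard_threshold: float = 0.8) -> Tier:
    """Pick the cheapest tier expected to handle the task."""
    if task.kind in NO_MODEL_KINDS:
        return Tier.NO_MODEL
    if task.difficulty >= hard_threshold:
        return Tier.BEST
    return Tier.MID


if __name__ == "__main__":
    print(route(Task("split_sections", 0.9)))
    print(route(Task("extract_facts", 0.3)))
    print(route(Task("reconcile_events", 0.95)))
```

The point of the design is that the expensive tier is an exception that has to be justified per call, not the default.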
- Every fact has to be sourced to the evidence
Legal AI output is used to draft pleadings, witness statements, and expert briefings, and to prepare for cross-examination, so "almost right" isn't an acceptable standard. A fact that gets a single date, person, or document attribution wrong can change the outcome of the case.
Two things follow. The first is that every fact has to be sourced back to the specific sentence in the document that supports it, and the traceability has to survive every step of the pipeline. If a fact is extracted at stage one, summarised at stage three, and quoted in a chronology at stage six, the link back to the original sentence has to be intact at every stage. The second is that the upstream engineering has to be designed so that errors don't compound. A small mistake in document classification at ingest can cascade into a much larger mistake in the chronology if the architecture doesn't catch it.
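The traceability requirement can be sketched as a data structure in which every fact carries an immutable pointer to its source, and later stages may change the wording but not the pointer. The class and field names below are illustrative assumptions, not Wexler's schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SourceSpan:
    """Location of the supporting sentence (hypothetical fields)."""
    document_id: str
    page: int
    start_char: int
    end_char: int


@dataclass(frozen=True)
class Fact:
    text: str
    source: SourceSpan  # carried unchanged through every stage
    stage: str


def summarise(fact: Fact, summary: str) -> Fact:
    """A later stage rewrites the wording but keeps the original source link."""
    return replace(fact, text=summary, stage="summarised")


if __name__ == "__main__":
    span = SourceSpan("DOC-001", page=12, start_char=340, end_char=412)
    extracted = Fact("The meeting took place on the 14th.", span, "extracted")
    summarised = summarise(extracted, "Meeting held on the 14th.")
    print(summarised.source == extracted.source)
```

Because both classes are frozen, a stage cannot silently edit the source span; it can only create a new fact that points at the same one.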
- Reconciling references to the same event
The hardest of the five. Imagine a witness statement that says the meeting was on Tuesday the 14th, an email dated the 21st referring to "our discussion last week", and a board minute about the kick-off. Three references, possibly to one event, or possibly to three separate events. Working out which is the difference between a usable chronology and a list of fragments. A person can do it, and on a small set, most legal AI tools can do it. On a real disclosure set, with a hundred thousand factual references to reconcile, neither can do it accurately by comparing every single fact to every other.
Comparing every fact to every other fact is what computer scientists call quadratic, or n-by-n. A thousand facts means a million comparisons. A hundred thousand means ten billion. When each comparison needs a model's judgement, no amount of compute handles ten billion of them for a single case at a workable price. In high-stakes litigation, missing these contradictions can mean the difference between winning and losing.
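The growth is easy to verify. The sketch below counts comparisons both ways: every fact against every fact, as in the text, and each unordered pair once, which halves the number but leaves the order of magnitude unchanged. The function names are illustrative.

```python
def all_against_all(n: int) -> int:
    """Comparisons if every fact is checked against every fact (n-by-n)."""
    return n * n


def distinct_pairs(n: int) -> int:
    """Comparisons if each unordered pair of facts is checked once."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (1_000, 100_000):
        print(f"{n:,} facts: {all_against_all(n):,} / {distinct_pairs(n):,}")
```

Even counting each pair only once, a hundred thousand facts still produce about five billion comparisons.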
Wexler’s Fact Ontology
Wexler is the product; AI is the engine inside it that reads natural language. Most of what makes Wexler work at scale happens around the AI, and most of that happens before the AI sees a document.
When a file arrives, Wexler reads it before the AI does. The file extension describes the file. The metadata adds to it. After OCR, the layout adds further information: where the headers are, where the footnotes sit, where the signature block lives. By the time the AI sees the document, Wexler already knows it's a transcript, an MSG email file with 17 attachments, a Teams thread where the relevant exchange runs across a dozen messages over three days, or a spreadsheet. Each one is handled differently, so transcripts are read with awareness of who's speaking, emails are reconstructed in the order they were sent, WhatsApp messages are reassembled across the timeline, spreadsheets are read cell by cell, and so on.
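A minimal sketch of type-specific handling is a dispatch table keyed on file extension. This covers only the first signal mentioned above; metadata and post-OCR layout would refine the choice in a real system. The handler names and the extension mapping are hypothetical.

```python
from pathlib import Path
from typing import Callable


def read_email(path: Path) -> str:
    return f"email: rebuild send order and attachments for {path.name}"


def read_spreadsheet(path: Path) -> str:
    return f"spreadsheet: read {path.name} cell by cell"


def read_scanned(path: Path) -> str:
    return f"scan: send {path.name} to OCR, then inspect layout"


def read_generic(path: Path) -> str:
    return f"generic: treat {path.name} as plain text"


# Hypothetical mapping from extension to handler.
HANDLERS: dict[str, Callable[[Path], str]] = {
    ".msg": read_email,
    ".eml": read_email,
    ".xlsx": read_spreadsheet,
    ".csv": read_spreadsheet,
    ".pdf": read_scanned,
    ".png": read_scanned,
}


def dispatch(path: Path) -> str:
    """Choose a handler from the file extension, falling back to plain text."""
    handler = HANDLERS.get(path.suffix.lower(), read_generic)
    return handler(path)


if __name__ == "__main__":
    for name in ("Board.MSG", "forecast.xlsx", "notes.txt"):
        print(dispatch(Path(name)))
```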
This sounds obvious, but is rarely done by legal AI tools. A general-purpose AI tool treats every document as text, because building 15 different document handlers is a lot of work that doesn't pay off unless documents are what you do.
As much as possible happens without AI at all. There's a family of older techniques (semantic splitting, finding the dense parts of a document, picking out where defined terms cluster in a contract) that work better and cost less than asking the AI to do the same thing.
This all runs in parallel. A million pages process across distributed compute that scales up when the job is big and shuts down when it's done. Asynchronous queues manage the dependency between services so a slow OCR call doesn't block everything else.
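The queueing idea can be illustrated on a single machine with Python's `asyncio`. This is only a sketch: the `ocr` coroutine is a hypothetical stand-in for a slow external call, and a production system would use distributed queues and workers, not one process.

```python
import asyncio


async def ocr(page: str) -> str:
    """Stand-in for a slow external OCR call (hypothetical)."""
    await asyncio.sleep(0.01)
    return page.upper()


async def worker(queue: asyncio.Queue, results: list[str]) -> None:
    """Pull pages off the queue until cancelled."""
    while True:
        page = await queue.get()
        try:
            results.append(await ocr(page))
        finally:
            queue.task_done()


async def run(pages: list[str], concurrency: int = 4) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    results: list[str] = []
    for page in pages:
        queue.put_nowait(page)
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    await queue.join()  # wait until every page has been processed
    for task in workers:
        task.cancel()   # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(sorted(asyncio.run(run(["page a", "page b", "page c"]))))
```

While one worker waits on a slow call, the others keep draining the queue, so a single slow page delays only itself.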
Back to the meeting on the 14th.
Comparing every fact to every other doesn't scale, so we use a technique borrowed from how search engines indexed the early web. Each fact gets reduced to a short signature, designed so that similar facts end up with similar signatures. Instead of comparing every fact to every other, you only compare facts whose signatures are close. The comparison space collapses from quadratic to near-linear: ten billion comparisons become something closer to a hundred thousand.
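The post does not name the technique. One well-known method that fits the description is MinHash with locality-sensitive hashing, and the sketch below uses it purely as an assumed illustration. It compares surface wording only, so it would not on its own link "the meeting on the 14th" to "our discussion last week"; a real system would need richer features than word shingles. All function names and parameters are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[str]:
    """Word k-grams of the lowercased text (the whole text if it is short)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, item: str) -> int:
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash(items: set[str], num_hashes: int = 32) -> tuple[int, ...]:
    """Signature: for each seeded hash function, the minimum hash over the set."""
    return tuple(min(_hash(seed, it) for it in items) for seed in range(num_hashes))


def candidate_pairs(facts: list[str], bands: int = 16,
                    num_hashes: int = 32) -> set[tuple[int, int]]:
    """Return index pairs whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets: dict[tuple, list[int]] = defaultdict(list)
    for idx, fact in enumerate(facts):
        sig = minhash(shingles(fact), num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    pairs: set[tuple[int, int]] = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))  # only facts sharing a bucket
    return pairs


if __name__ == "__main__":
    facts = [
        "the meeting was held on tuesday the 14th",
        "the meeting was held on tuesday the 14th",
        "quarterly revenue figures for the asia region",
    ]
    print(candidate_pairs(facts))
```

Only facts that land in the same bucket are ever compared, so the expensive pairwise check runs on a short candidate list instead of on every possible pair.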
Each fact keeps its attribution through this process, so "Witness A says the meeting was on the 14th" and "the email confirms the meeting was on the 14th" are held as two assertions about the same event, not merged into one.
The output is a chronology where the meeting on the 14th, the discussion last week and the kick-off resolve to a single event with three sources attached. If they're actually three different events, that shows up too. Inconsistencies surface rather than remain hidden, and new production extends the chronology rather than breaking it.
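A simple way to hold several attributed assertions under one event, without merging them, is sketched here. The class names, fields, and the date-conflict check are illustrative assumptions, not Wexler's data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    source: str          # who or what makes the claim
    claim: str
    date: Optional[str]  # ISO date if the source states one


@dataclass
class Event:
    label: str
    assertions: list[Assertion] = field(default_factory=list)

    def add(self, assertion: Assertion) -> None:
        """Attach another source's claim without overwriting earlier ones."""
        self.assertions.append(assertion)

    def conflicting_dates(self) -> set[str]:
        """Return the stated dates if the sources disagree, else an empty set."""
        dates = {a.date for a in self.assertions if a.date}
        return dates if len(dates) > 1 else set()


if __name__ == "__main__":
    event = Event("Kick-off meeting")
    event.add(Assertion("Witness A", "The meeting was on the 14th", "2024-05-14"))
    event.add(Assertion("Email", "Confirms the meeting was on the 14th", "2024-05-14"))
    event.add(Assertion("Board minute", "Refers to the kick-off", None))
    print(len(event.assertions), event.conflicting_dates())
```

The dates in the example are invented for illustration. Keeping each assertion separate means a disagreement between sources shows up as a set of conflicting dates instead of being lost in a merge.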
Specialism matters
A general-purpose AI tool tries to be useful for everything. That's one approach, but it makes the per-document-type engineering hard to justify. If you're building for everyone, you can't spend a year getting transcript handling right.
We can, because documents and facts are what we do. The depth is possible because litigation is the only thing for which the tool exists.
As a result, Wexler is being used in disputes worth hundreds of millions to billions, by teams of more than a hundred lawyers on a single matter. The factual record is only as good as the work that goes into joining it up.
