Sam Altman said GPT-5 is the first model that feels like “talking to a PhD-level expert,” claiming it’s “significantly better” with fewer hallucinations and more honest reasoning. So when it was released, expectations were high. But for all the anticipation, it feels flat.
As Michael Rovatsos usefully puts it here: Altman’s claims don’t hold up. The improvements look more like incremental tuning (slightly better coding, fewer hallucinations, and a router that delegates tasks across existing models) rather than a breakthrough. GPT-5 still lacks real reasoning or memory, and its benchmark results suggest limits to what scaling can achieve. Rovatsos suggests this may mark not a leap toward AGI but a plateau, where progress depends on more controlled, engineered systems rather than ever-larger language models.
A key idea here is capability overhang: even if models stopped improving today, there’s still enormous untapped potential inside them. This plateau doesn’t mean progress has stopped. Instead, it means the next leap won’t come from waiting for GPT-6.
At Wexler, we’re not going to waste time waiting for the next model that may or may not be more powerful. Instead, we’re focused on building structured systems that make the most of the models we have now. That means chaining models together and using each where it is strongest to produce outputs that are testable, verifiable and defensible, while still evaluating each new release and switching a model in where it delivers a step change for a given task. This is progress.
Chaining models, not chasing the next release
No single model does it all. I find GPT-5 is useful for breadth, Gemini 2.5 Pro for reasoning, and Claude 4 for clear drafting. Each excels at something different, and litigation requires all three in concert.
Here’s how I look at it:
GPT – Best at breadth. Great for scanning large volumes of text and generating varied outputs.
Claude – Best at conciseness. Strong in clear, fluent drafting and summarisation.
Gemini – Best at logical reasoning. Excels in complex tasks that require step-by-step analysis.
This aligns with research from MIT Sloan released in August, which found that only half of the performance gains from switching to a more advanced model come from the model itself. The rest come from how users adapt and design around it.
Shopify CEO Tobi Lütke took this idea further when he said that context engineering beats prompt engineering when working with large language models.

How does this apply to litigation? You can’t rely on one leap in capability. You have to design the environment to get the best outputs. With Wexler, you get context engineering at scale: it handles millions of documents while ensuring each query has the right context.
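To make “the right context” concrete, here is a minimal sketch of the underlying idea rather than Wexler’s actual retrieval: score passages against the query and pack only the most relevant ones into the prompt, within a fixed budget. The scoring function and budget here are deliberately crude placeholders.

```python
def relevance(query: str, passage: str) -> int:
    """Crude relevance score: count of query words in the passage
    (illustrative only; a real system would use proper retrieval)."""
    query_words = set(query.lower().split())
    return sum(1 for word in passage.lower().split() if word in query_words)

def build_context(query: str, passages: list[str], budget_chars: int = 2000) -> str:
    """Rank passages by relevance and keep as many as fit within the budget."""
    ranked = sorted(passages, key=lambda p: relevance(query, p), reverse=True)
    selected, used = [], 0
    for passage in ranked:
        if used + len(passage) > budget_chars:
            break
        selected.append(passage)
        used += len(passage)
    return "\n---\n".join(selected)

passages = [
    "Email from A to B, 3 March 2021, regarding the supply contract.",
    "Board minutes, 12 June 2020, unrelated to the dispute.",
    "Letter from B to A, 9 March 2021, responding on the supply contract.",
]
print(build_context("correspondence about the supply contract", passages, 150))
```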
Determinism vs stochasticity
Large language models are stochastic. That means their outputs are generated by probability distributions rather than fixed rules. Ask the same question twice and you may well get two different answers. That unpredictability isn’t always a weakness. In creative drafting, brainstorming, or exploring alternative phrasings, stochasticity is what makes models useful: it gives you range, variation, and new angles you might not have thought of.
But litigation demands determinism. A lawyer can’t rely on an answer that might change on the next run. Every fact has to be sourced, every step retraceable. You have to be able to explain not only what the model produced but why.
So, how do you get determinism from stochastic models? You build it into the workflow (there’s a minimal code sketch after the list):
- Constrain the task. Narrow prompts minimise randomness. Instead of “summarise this case,” you ask the model to “extract all dates of correspondence between parties.”
- Chain and cross-check. One model proposes candidate facts, another verifies them, and a third tests for consistency. The redundancy reduces error.
- Link to sources. Every output must cite the passage it came from. Without evidence, there is no determinism.
- Mix models with rules. Some steps, like date parsing or entity recognition, are better done with deterministic algorithms. The hybrid approach stabilises the pipeline.
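To make those steps concrete, here is a minimal sketch in Python. The `call_model` function is a placeholder for whichever LLM you use (here it returns canned output so the example runs end to end), and the prompt and data are invented for illustration; the point is the shape of the workflow, not any particular implementation.

```python
from datetime import datetime

def call_model(prompt: str) -> list[dict]:
    """Placeholder for an LLM call (any provider). Returns canned candidates
    here so the sketch runs end to end; a real system would call the model's API."""
    return [
        {"date": "3 March 2021", "source": "Email from A to B dated 3 March 2021."},
        {"date": "sometime in spring", "source": "We spoke again sometime in spring."},
        {"date": "9 March 2021", "source": ""},  # no quoted passage
    ]

def extract_correspondence_dates(document: str) -> list[dict]:
    # Constrain the task: a narrow prompt, not "summarise this case".
    prompt = ("Extract all dates of correspondence between the parties from the "
              "text below. For each date, quote the sentence it appears in.\n\n"
              + document)
    candidates = call_model(prompt)

    verified = []
    for candidate in candidates:
        # Link to sources: drop anything without a quoted passage.
        if not candidate.get("source"):
            continue
        # Mix models with rules: a deterministic parser validates each date.
        try:
            candidate["date"] = datetime.strptime(
                candidate["date"], "%d %B %Y").date().isoformat()
        except ValueError:
            continue
        verified.append(candidate)
    return verified

print(extract_correspondence_dates("...correspondence bundle goes here..."))
```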
Take disclosure. If you ask a single model to flag contradictions in a set of 50,000 emails, its results will drift with each run. Chain the task instead: one model identifies candidate contradictions, a second verifies them, a third tests them in context. The result is stable, auditable, and, most importantly, defensible.
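Sketched as code (the role names and types below are illustrative assumptions, not Wexler’s implementation), that chain is a composition of bounded checks, and only candidates that survive every stage, citations intact, reach the lawyer:

```python
from typing import Callable

# Three hypothetical model roles, each standing in for a separate LLM call.
Proposer = Callable[[list[str]], list[dict]]        # surfaces candidate contradictions
Verifier = Callable[[dict], bool]                   # checks the quoted passages
ContextChecker = Callable[[dict, list[str]], bool]  # re-tests against the full thread

def find_contradictions(emails: list[str],
                        propose: Proposer,
                        verify: Verifier,
                        holds_in_context: ContextChecker) -> list[dict]:
    """Keep only candidates that survive every stage of the chain."""
    survivors = []
    for candidate in propose(emails):
        if not candidate.get("sources"):
            continue  # no citation, no determinism
        if not verify(candidate):
            continue  # the second model rejects the claimed contradiction
        if not holds_in_context(candidate, emails):
            continue  # the third model rejects it in the wider thread
        survivors.append(candidate)  # citations travel with the surviving candidate
    return survivors
```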
Litigation needs determinism
This need for determinism is amplified because litigation is contested in ways other domains are not. Software development tolerates failure because broken code can be fixed and redeployed. Customer service scripts can be rewritten without consequence. Even contract drafting follows playbooks where automation is relatively safe, and in e-discovery, some margin of error is acceptable because results can be checked in bulk and rerun at scale.
Litigation has no such margin. One missed contradiction in a witness statement or one overlooked email can shift the case itself. A contradiction exposed in cross-examination can change the direction of a trial. Errors here can’t be patched later.
Chaining models is one way to reduce those risks. If one model slips, another can pick up the gap. Each step is checked, each output is traceable, and final judgment rests with the lawyer.
Litigation also demands auditability. Every fact must trace back to a source, every conclusion to a line of reasoning. Without that traceability, neither clients nor courts will accept the result.
Chaining builds auditability into the process itself. Because each model handles a bounded task, the workflow produces outputs that are naturally broken into discrete, reviewable steps. Candidate facts can be proposed, verified, and cross-checked, with citations carried forward at every stage. Hybrid rules-based components add stability, and the lawyer can step in at any point to inspect reasoning (which is how we think it should work).
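One way to picture citations being carried forward at every stage is a record that travels with each fact through the chain; the field names below are illustrative, not Wexler’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class AuditedFact:
    """A fact as it moves through the chain, with its paper trail attached."""
    statement: str                                    # the extracted fact
    source_doc: str                                   # document it was drawn from
    quoted_passage: str                               # the exact passage cited
    checks: list[str] = field(default_factory=list)   # stages it has passed

fact = AuditedFact(
    statement="B acknowledged receipt of the variation notice on 9 March 2021.",
    source_doc="DISCL-004512",
    quoted_passage="We confirm receipt of your notice dated 9 March 2021.",
)
fact.checks += ["proposed by model 1", "verified by model 2", "date parsed by rule"]
# At any point a lawyer can see what was claimed, where it came from,
# and which checks it has passed.
```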
A scalpel, not a Swiss army knife
Our philosophy at Wexler is to be a scalpel, not a Swiss army knife. We don’t try to cover every practice area and instead focus only on disputes. Our platform processes documents, extracts and links facts, builds chronologies, and flags contradictions. Our assistant, KiM, carries out multi-step tasks, but always with citations and always inside a controlled workflow. KiM can suggest a timeline or surface red flags, but it does not decide which matters most. That remains the lawyer’s job.
Going back to capability overhang, the idea that there’s enormous untapped potential left in the models we have now: at Wexler, we draw the most out of every model, at every stage.
This is where chaining comes in. Different models are used for different stages: one to surface potential facts, another to test consistency, another to draft outputs. Each step is linked in sequence so errors are caught and reasoning is strengthened. What matters is not a single “all-purpose” model, but a chain designed around the needs of litigation, one that turns a raw model output into something a lawyer can trust and defend in court.
There's a danger of treating models as if they can do everything. When performance feels flat, the temptation is to push the model harder, ask it to stretch further, and trust it on tasks it isn’t suited for. That’s risky. A misplaced assumption, an unchecked contradiction, or an overlooked source can ripple through a case theory.
Chaining models lowers that risk. Instead of relying on a single leap of reasoning, you build a sequence where each model’s output is checked, structured, and constrained by the next. One model surfaces facts, another tests contradictions, a third organises the chronology. Each link in the chain is visible and open to inspection.
GPT-6 or GPT-7 aren’t going to save us
GPT-6, or GPT-7, or... whatever... aren’t going to transform litigation. The truth is that the ceiling on raw model performance is less important than the structure you build around it. Progress will come from workflows that turn stochastic outputs into deterministic results, from chaining models so that weaknesses are caught before they matter, and from embedding verification at every step.
Litigation does not reward creativity for its own sake. It rewards precision, traceability, and judgment. That means the winning strategies in legal AI will not be about betting on the next release. They will be about design: designing chains of models that complement one another, designing processes where evidence is always cited, designing systems that give lawyers confidence in what they are putting before a court.
The plateau in models isn’t a dead end. It’s a signal that the next breakthroughs will be in engineering and workflow, not in the hype cycle of new releases.
What do you think? Drop me an email, or if you’d like to see how Wexler works, book a call.