
Smarter Contract Review with GPT-5.4

March 18, 2026
Gabor Melli, VP of Artificial Intelligence

At LegalOn, our data scientists continuously evaluate every major model release and deploy the best models for the right use cases. We benchmark each one against our Contract Review Benchmark, so your team doesn't have to. Not every model improvement translates to better contract review performance, and knowing the difference is what we do.

Our benchmark is built around the real guidelines legal teams actually apply. Today we're sharing results for contract review performance — how accurately each model identifies whether a contract meets or fails your playbook guidelines — across 494 decisions covering NDAs, MSAs, BAAs, clinical trial agreements, and commercial leases.

Here's what we found when we ran GPT-5.4 against its predecessor, GPT-5.2.

GPT-5.4 vs. GPT-5.2: Contract Review Performance Results 

  • Overall accuracy: 79.4% vs. 73.9%, a meaningful +5.5pp improvement
  • Total errors cut by 21%, from 129 down to 102. The gain is broad-based: every contract type improved, and 16 of 26 guidelines improved.
  • Both precision and recall improved, meaning fewer false alarms and fewer missed violations simultaneously:
    • False alarms: 41 vs. 53; 12 fewer unnecessary flags
    • Missed violations: 61 vs. 76; 15 fewer missed issues
  • Improvement is consistent across all five contract types, with the largest gains on NDAs (+10pp) and MSAs (+8pp)
  • 3 guidelines regressed slightly, all in clinical trial agreements
  • Speed: 3.1s vs. 2.8s per contract; a negligible difference
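The headline numbers above all follow from the raw error counts. As a sanity check, here is a minimal sketch (variable names are ours, not LegalOn's) that treats a "false alarm" as a false positive and a "missed violation" as a false negative across the 494 benchmark decisions, and recomputes accuracy and the error reduction:

```python
# Reproduce the headline numbers from the reported error counts.
# A "false alarm" is a false positive; a "missed violation" is a
# false negative. All other decisions are counted as correct.
TOTAL = 494  # benchmark decisions

models = {
    "GPT-5.2": {"false_alarms": 53, "missed": 76},
    "GPT-5.4": {"false_alarms": 41, "missed": 61},
}

def summarize(counts, total=TOTAL):
    """Return (total errors, accuracy) for one model's counts."""
    errors = counts["false_alarms"] + counts["missed"]
    accuracy = (total - errors) / total
    return errors, accuracy

for name, counts in models.items():
    errors, acc = summarize(counts)
    print(f"{name}: {errors} errors, accuracy {acc:.1%}")

old_errors = summarize(models["GPT-5.2"])[0]  # 129
new_errors = summarize(models["GPT-5.4"])[0]  # 102
print(f"Error reduction: {(old_errors - new_errors) / old_errors:.0%}")
```

Running this yields 102 vs. 129 errors, 79.4% vs. 73.9% accuracy, and a 21% error reduction, matching the figures above. (Precision and recall themselves cannot be recomputed from these counts alone, since the true-positive split is not reported.)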

In short: GPT-5.4 is a genuine upgrade for contract review. The improvement isn't driven by one contract type or one clause category; it's consistent across the benchmark, with the biggest gains in agreement structure, obligation detection, and clause scope. The exception is clinical trial agreements, where a few guidelines declined slightly and two others remain at near-coin-flip accuracy for any current AI model. Both models were tested under identical out-of-the-box conditions: no custom prompts, no fine-tuning.

LegalOn significantly outperforms general-purpose AI on contract review. And while today's general-purpose models still have room to grow, the trajectory is clear: specialized legal AI is already accurate enough to meaningfully change how in-house teams work.

What’s next?

We'll compare GPT-5.4 against our full model leaderboard, including Claude and Gemini, and share how it stacks up on LegalOn's issue-spotting accuracy.* We'll also publish more about how the Contract Review Benchmark is built and why we think task-specific benchmarks are the only honest way to evaluate AI for legal work.

Follow LegalOn for updates.

*Results here reflect contract review accuracy on guideline compliance. Redlining and AI Assistant results coming separately.

Credits: Gabor Melli, Deddy Jobson, Sonny Chee, and Petrie Wong

