
GPT-5.1: Clear Gains in Contract Redlining Performance

November 13, 2025
Daniel Lewis, US CEO

We’re excited to share early results from LegalOn’s evaluation of GPT-5.1. Our team assessed GPT-5.1’s performance against our Contract Review Benchmark — an evaluation framework we’ve built for measuring model performance in core contracting tasks such as issue spotting, redlining, and answering contract-related questions.

GPT-5.1 Delivers Significant Gains in Contract Redlining Accuracy 

We tested approximately 300 prompts across English and Japanese to see how well our AI can propose contract revisions across six common issue types: liability, indemnity, security, IP, termination, and confidentiality. In practice, this means evaluating whether the AI’s suggested new language is something a lawyer would accept or flag for revision.

In this evaluation on our English-language Contract Review Benchmark, GPT-5.1 showed substantial improvements in AI-driven contract revisions across a series of controlled tests. The model's win rate reflects how often its suggested revisions aligned more closely with answers from our experienced in-house attorneys than the competing model's output did.

  • Against GPT-5: GPT-5.1 achieved a 67% win rate, making it more than twice as likely to produce superior revisions.
  • Against GPT-4.1: GPT-5.1 achieved a 57% win rate in direct redlining comparisons.
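A win rate like the ones above is the fraction of head-to-head comparisons in which reviewers preferred one model's revision over the other's. The sketch below illustrates that calculation; the function name, labels, and data shape are hypothetical and not LegalOn's actual benchmark schema.

```python
def win_rate(preferences: list[str], model: str) -> float:
    """Fraction of pairwise comparisons in which `model` was preferred.

    Each entry in `preferences` names the model whose suggested revision
    the reviewers judged closer to the reference answer.
    """
    if not preferences:
        raise ValueError("no comparisons recorded")
    wins = sum(1 for winner in preferences if winner == model)
    return wins / len(preferences)

# Example: 2 of 3 comparisons favor the newer model.
judgments = ["gpt-5.1", "gpt-5.1", "gpt-5"]
rate = win_rate(judgments, "gpt-5.1")
print(f"{rate:.0%}")  # 67%
```

A 67% win rate means the preferred model wins roughly two comparisons for every one it loses (67:33), which is why it can be described as more than twice as likely to produce the better revision.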

What this means: GPT-5.1’s gains demonstrate real progress toward more precise, context-aware contract edits that align better with the revisions a lawyer would provide. 

Why this matters: Redlining is one of the most time-consuming legal tasks. Accurate, lawyer-like revisions from AI mean faster negotiations, fewer rounds with counterparties, and more time for strategic work.

Beyond Redlines: GPT-5.1’s Broader Contracting Performance 

For other contracting use cases, GPT-5.1's performance was roughly flat on issue spotting compared to previous models, and it scored slightly worse on conversational legal-assistance tasks such as answering questions and summarizing changes.

However, processing speed improved by approximately 30% across all contracting tasks, with the most noticeable gains in generative tasks like redlining and legal AI assistance, creating a faster and smoother user experience. 

What this means: GPT-5.1 improves in some areas but not others, a common pattern in new model releases. We expect its conversational capabilities to continue improving with future iterations. 

Why this matters: Legal teams shouldn't assume the newest model is always better. Evaluate models based on how well they fit your specific use cases and workflows; benchmarks like these help identify the right model for the right task.

What’s Next?

While these findings reflect base model evaluations, they highlight continued momentum in the foundational technology driving AI contract review. The results also serve as a strong reminder that model iteration matters.

We’ll share additional insights as GPT-5.1 becomes generally available and as our broader benchmarking continues across other contract-related tasks. 

