

We’re excited to share early results from LegalOn’s evaluation of GPT-5.1. Our team assessed GPT-5.1 against our Contract Review Benchmark, an evaluation framework we’ve built to measure model performance on core contracting tasks such as issue spotting, redlining, and answering contract-related questions.
We tested approximately 300 prompts across English and Japanese to see how well our AI can propose contract revisions spanning six common issue types: liability, indemnity, security, IP, termination, and confidentiality. In practice, this means evaluating whether the AI’s suggested new language is something a lawyer would accept or flag for revision.
On the English-language portion of our Contract Review Benchmark, GPT-5.1 showed substantial improvements in AI-driven contract revisions across a series of controlled tests. The model’s win rate reflects how often its suggested revisions aligned more closely with answers from our experienced in-house attorneys than the competing model’s output did.
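To make the win-rate idea concrete, here is a minimal sketch of how a head-to-head comparison like this could be scored. It is not our benchmark implementation: the item fields, model functions, and the `attorney_prefers_a` judgment callback are hypothetical placeholders standing in for attorney review.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical benchmark item: a contract clause plus the issue type it tests.
# Field names are illustrative, not an actual schema.
@dataclass
class BenchmarkItem:
    prompt: str      # clause and revision instructions shown to the model
    issue_type: str  # e.g. "liability", "indemnity", "termination"
    reference: str   # the revision an in-house attorney would make

def win_rate(
    items: Iterable[BenchmarkItem],
    revise_a: Callable[[str], str],  # candidate model, e.g. GPT-5.1
    revise_b: Callable[[str], str],  # comparison model
    attorney_prefers_a: Callable[[BenchmarkItem, str, str], bool],
) -> float:
    """Fraction of items where model A's revision is judged closer to the
    attorney reference than model B's. Ties count as losses here; a real
    benchmark might split or exclude them."""
    wins = 0
    total = 0
    for item in items:
        a = revise_a(item.prompt)
        b = revise_b(item.prompt)
        if attorney_prefers_a(item, a, b):
            wins += 1
        total += 1
    return wins / total if total else 0.0
```

In practice the preference judgment comes from attorney review rather than an automated function; the sketch is only meant to show what a “win” means in this pairwise comparison.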
What this means: GPT-5.1’s gains demonstrate real progress toward more precise, context-aware contract edits that align better with the revisions a lawyer would provide.
Why this matters: Redlining is one of the most time-consuming legal tasks. Accurate, lawyer-like revisions from AI mean faster negotiations, fewer rounds with counterparties, and more time for strategic work.
For other contracting use cases, GPT-5.1 was roughly flat on issue spotting compared to previous models and scored slightly worse on a range of conversational legal-assistance tasks, such as answering questions and summarizing changes, among others.
However, processing speed improved by approximately 30% across all contracting tasks, with the most noticeable gains in generative tasks like redlining and legal AI assistance, creating a faster and smoother user experience.
What this means: GPT-5.1 improves in some areas but not others, a common pattern in new model releases. We expect its conversational capabilities to continue improving with future iterations.
Why this matters: Legal teams shouldn’t assume the newest model is always better; evaluate models based on how well they fit your specific use cases and workflows. Benchmarks like these help identify the right model for the right task.
While these findings reflect base model evaluations, they highlight continued momentum in the foundational technology driving AI contract review. The results also serve as a strong reminder that model iteration matters.
We’ll share additional insights as GPT-5.1 becomes generally available and as our broader benchmarking continues across other contract-related tasks.