
LegalOn regularly evaluates new AI model releases against our Contract Review Benchmark—an evaluation framework we’ve built to measure model performance across core contracting tasks such as issue spotting, redlining, and answering contract-related questions.
Our latest evaluation focused on Gemini 3, comparing it against GPT-5.1 and Claude Sonnet 4.5 across both English and Japanese tasks. Gemini 3 proved highly competitive, winning more head-to-head comparisons than any other model, but its performance varied by task type and came with notably higher latency.
Ultimately, no model emerged as universally dominant, as each model excelled in different parts of the contracting workflow.
Across the English language evaluations, GPT-5.1 and Gemini 3 each lead in different areas of contract work. Gemini 3 tends to perform better on tasks that rely on structured reasoning and consistent rule application.
Meanwhile, GPT-5.1 performs better on tasks driven by identifying risks or revising risk-related language in third-party paper.
Beyond accuracy, GPT-5.1 also holds a significant speed advantage, running 2–4× faster than Gemini 3 across all tasks.
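To put a gap like that in context, relative latency can be estimated with a simple wall-clock harness run over a shared prompt set. The sketch below is illustrative only, not our benchmark code: `call_model` is a hypothetical stand-in for whatever provider SDK you use, and the model names are placeholders.

```python
import time
from statistics import median

def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real provider SDK call."""
    raise NotImplementedError("wire up your model client here")

def median_latency(model: str, prompts: list[str]) -> float:
    """Median wall-clock seconds per response for one model."""
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        call_model(model, prompt)
        timings.append(time.perf_counter() - start)
    return median(timings)

# Run both models over the same contract-review prompts; a ratio
# between 2.0 and 4.0 corresponds to the "2-4x faster" gap above.
# ratio = median_latency("gemini-3", prompts) / median_latency("gpt-5.1", prompts)
```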
Taken together, these results suggest that the two models lean toward different strengths: both perform strongly overall but excel in different parts of the contract workflow.
In the Japanese language evaluation, model performance shifts noticeably compared to English.
Claude Sonnet 4.5 is the strongest performer in contract issue spotting and playbook evaluations, while Gemini 3 outperforms the other models across revision categories. By contrast, GPT-5.1, despite its strong English language performance, does not secure any category wins in Japanese.
In short, Claude is the top performer for accuracy-sensitive evaluations, while Gemini 3 is better suited for stylistic clarity and contract-ready revisions.
For legal teams, the results reinforce that there is no single “best” model—only the best model for each specific task. Different parts of the contracting workflow demand different strengths, and model performance varies accordingly.
Gemini 3 stands out in areas that depend on structured editing, consistency, and orchestration, making it well-suited for tasks with multi-step instructions. GPT-5.1 is the strongest option for high-precision review and risk identification in English, while Claude Sonnet 4.5 performs best on Japanese legal work.
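One practical way to act on that finding is to route each task to whichever model benchmarks best for it. The sketch below is a minimal illustration of that idea, not LegalOn's implementation; the task labels and model identifiers are placeholders.

```python
# Map (task, language) pairs to the category leaders described above.
# Task labels and model identifiers are illustrative placeholders.
BEST_MODEL = {
    ("issue_spotting", "en"): "gpt-5.1",       # risk identification
    ("redlining", "en"): "gpt-5.1",            # revising risk-related language
    ("structured_editing", "en"): "gemini-3",  # consistent rule application
    ("issue_spotting", "ja"): "claude-sonnet-4.5",
    ("playbook_eval", "ja"): "claude-sonnet-4.5",
    ("redlining", "ja"): "gemini-3",           # revision categories
}

def pick_model(task: str, language: str, default: str = "gpt-5.1") -> str:
    """Route a contracting task to the model that benchmarked best for it."""
    return BEST_MODEL.get((task, language), default)

# pick_model("redlining", "ja")  ->  "gemini-3"
```

A router like this also makes it easy to swap in a new category leader whenever the next round of benchmarking changes the picture.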
Finally, latency remains an important practical factor. While Gemini 3 performs well in several categories, it is 2–4× slower in response time than the other models tested.
How We Test
Our benchmark measures how accurately AI models perform contract work. Lawyers create the “gold standard” answers for each task, and we score each model by how often it matches those expert responses. The evaluation covers four areas: issue spotting, playbook evaluation, redlining, and contract Q&A.
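As a rough illustration of that scoring setup, the sketch below computes a per-task match rate against lawyer-written gold answers. It is a deliberate simplification: the record fields are hypothetical, and whether an output “matches” is judged by lawyer review rather than string comparison.

```python
from collections import defaultdict

def score_against_gold(results: list[dict]) -> dict[str, float]:
    """Fraction of outputs matching the lawyer-written gold answer, by task.

    Each record is assumed to look like:
      {"task": "issue_spotting", "matches": True}
    where "matches" reflects a lawyer's judgment, not string equality.
    """
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for record in results:
        totals[record["task"]] += 1
        hits[record["task"]] += int(record["matches"])
    return {task: hits[task] / totals[task] for task in totals}

# score_against_gold(run_results) might return {"issue_spotting": 0.87, ...}
```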
Each test run uses real contracts and curated prompts across more than 300 agreements (1.1M+ words in English and Japanese). Every output is reviewed by multiple lawyers to ensure accuracy.
As new AI models are released, we will continue benchmarking them against real contracting tasks to understand where they excel and where they fall short. The pace of model improvement continues to accelerate, and these results highlight just how much model-to-model variation exists across markets, languages, and workflows.