
Gemini 3 Raises the Bar on Quality, But Not on Speed

December 8, 2025
Eileen Policarpio, Communications Manager

LegalOn regularly evaluates new AI model releases against our Contract Review Benchmark—an evaluation framework we’ve built to measure model performance across core contracting tasks such as issue spotting, redlining, and answering contract-related questions.

Our latest evaluation focused on Gemini 3, comparing it against GPT-5.1 and Claude Sonnet 4.5 across both English and Japanese tasks. While Gemini 3 delivered strong, competitive quality, winning more head-to-head comparisons than any other model, its performance varied by task type and came with notably higher latency.

Ultimately, no model emerged as universally dominant, as each model excelled in different parts of the contracting workflow.

English: Gemini 3 Leads More Tasks, GPT-5.1 Best for Precision and Speed

Across the English-language evaluations, GPT-5.1 and Gemini 3 each lead in different areas of contract work. Gemini 3 tends to perform better on tasks that rely on structured reasoning and consistent rule application:

  • More accurate across general legal AI skills, such as summarization, extraction, and translation, outperforming GPT-5.1 by three to six percentage points from skill selection through final output quality.
  • In playbook rule enforcement, Gemini 3 performs better on first-party contracts (about five points higher) and is effectively tied with GPT-5.1 on third-party contracts.
  • In revision tasks, Gemini 3 is stronger on first-party contract revision, winning roughly 70% of comparisons versus 30% for GPT-5.1.

Meanwhile, GPT-5.1 performs better on tasks driven by identifying risks or revising risk-related language in third-party paper:

  • In contract issue spotting, GPT-5.1 scores slightly higher than Gemini 3, by less than a percentage point.
  • In revision tasks, GPT-5.1 leads in issue-driven revision and in third-party contract revision under playbook rules, winning each roughly 60% to 40% against Gemini 3.

In addition to accuracy, GPT-5.1 is 2–4× faster than Gemini 3 across all tasks, providing a significant speed advantage.

Together, these results suggest that the two models lean toward different strengths:

  • Gemini 3 performs particularly well on structured, rule-based workflows and revisions to a party’s own templates.
  • GPT-5.1 tends to perform better on tasks that rely on identifying legal risks or making revisions informed by those risks, and, importantly, runs much faster across all tasks.

Both models show strong performance overall, but excel in different parts of the contract workflow.

Japanese: Claude and Gemini Lead the Way

In the Japanese-language evaluation, model performance shifts noticeably compared to English.

Claude Sonnet 4.5 is the strongest performer in contract issue spotting and playbook evaluations, while Gemini 3 outperforms all models across revision categories. By contrast, GPT-5.1—despite its strong English language performance—does not secure any category wins in Japanese.

In short:

  • Claude leads in the highest-precision legal tasks.
  • Gemini 3 leads in rewriting and revision-focused tasks.

Overall, Claude is the top performer for accuracy-sensitive evaluations, while Gemini 3 is better suited for stylistic clarity and contract-ready revisions. 

What This Means for Legal Teams

For legal teams, the results reinforce that there is no single “best” model—only the best model for each specific task. Different parts of the contracting workflow demand different strengths, and model performance varies accordingly. 

Gemini 3 stands out in areas that depend on structured editing, consistency, and orchestration, making it well-suited for tasks with multi-step instructions. GPT-5.1 is the strongest option for high-precision review and risk identification in English, while Claude Sonnet 4.5 performs best on Japanese legal work. 

Finally, latency remains an important practical factor. While Gemini 3 performs well in several categories, it is 2–4× slower in response time than the other models tested.

How We Test

Our benchmark measures how accurately AI models perform contract work. Lawyers create the “gold standard” answers for each task, and we score each model by how often it matches those expert responses. The evaluation covers four areas:

  1. General legal AI skills: choosing the correct legal action (such as summarizing, classifying, or extracting) and generating a high-quality answer.

  2. Contract issue spotting: identifying missing clauses, risks, or incorrect language.

  3. Playbook rule enforcement: applying detailed, client-specific rules to both first-party and third-party contracts.

  4. Contract revision tasks: rewriting contract language to correct issues or enforce playbook positions.

Each test run uses real contracts and curated prompts across more than 300 agreements (1.1M+ words in English and Japanese). Every output is reviewed by multiple lawyers to ensure accuracy.
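
To make the scoring approach concrete, here is a minimal Python sketch of how gold-standard matching and head-to-head win rates could be tallied. It is an illustration only, with hypothetical function names and toy data, not LegalOn's actual evaluation pipeline, where matching is judged by lawyers rather than by string comparison.

from collections import Counter

def accuracy(model_outputs: dict[str, str], gold_answers: dict[str, str]) -> float:
    """Share of tasks where the model's output matches the lawyer-written answer.
    In the real benchmark, a 'match' is a lawyer's judgment, not string equality."""
    matches = [model_outputs[task] == gold for task, gold in gold_answers.items()]
    return sum(matches) / len(matches)

def head_to_head(preferences: list[str]) -> dict[str, float]:
    """Turn per-task reviewer preferences (e.g. 'gemini-3' or 'gpt-5.1') into win shares."""
    counts = Counter(preferences)
    return {model: n / len(preferences) for model, n in counts.items()}

# Toy usage: three tasks, two of three outputs match the gold answers.
gold = {"t1": "A", "t2": "B", "t3": "C"}
outputs = {"t1": "A", "t2": "B", "t3": "X"}
print(f"accuracy: {accuracy(outputs, gold):.0%}")  # accuracy: 67%

# Toy head-to-head tally across five revision comparisons.
votes = ["gemini-3", "gemini-3", "gpt-5.1", "gemini-3", "gpt-5.1"]
print(head_to_head(votes))  # {'gemini-3': 0.6, 'gpt-5.1': 0.4}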

What’s Next?

As new AI models are released, we will continue benchmarking them against real contracting tasks to understand where they excel and where they fall short. The pace of model improvement continues to accelerate, and these results highlight just how much model-to-model variation exists across markets, languages, and workflows.

