GPT-5 Speed vs. Smarts in Contract Review
Better Answers and Redlines, but at a Slower Pace
We’re pleased to introduce our Contract Review Benchmark and to share our initial evaluation of GPT-5’s performance compared to GPT-4.1. The benchmark evaluates model performance across a variety of core contracting tasks, including spotting issues, making redlines, answering questions, and more. These initial results may not reflect final performance, because some model improvements become visible only with deeper prompt tuning.
Model performance varies by language, so we evaluate models in both English and Japanese. The results we’re sharing here are for our English-language Contract Review Benchmark.
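To make the task mix concrete, here is a minimal sketch of how a harness could score a model per task. This is illustrative only: the task names, prompts, and grading functions below are hypothetical stand-ins, not our actual benchmark code.

```python
from dataclasses import dataclass
from typing import Callable

# A (prompt, grader) pair: the grader decides whether the model's output passes.
Case = tuple[str, Callable[[str], bool]]

@dataclass
class TaskResult:
    task: str
    score: float  # fraction of cases passed, 0.0-1.0

def run_benchmark(generate: Callable[[str], str],
                  cases: dict[str, list[Case]]) -> list[TaskResult]:
    """Score a model (a prompt -> completion function) on each task's cases."""
    results = []
    for task, examples in cases.items():
        passed = sum(grader(generate(prompt)) for prompt, grader in examples)
        results.append(TaskResult(task, passed / len(examples)))
    return results

# Hypothetical usage: a single trivial issue-spotting case.
cases = {
    "issue_spotting": [
        ("Does this clause cap liability?",
         lambda output: "cap" in output.lower()),
    ],
}
print(run_benchmark(lambda prompt: "Yes, the clause caps liability at fees paid.", cases))
```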
Initial Evaluation Results of GPT-5’s Performance vs. GPT-4.1
- Legal AI Assistant (a range of conversational tasks such as answering questions, summarizing changes, and more): 90% vs. 78% → a noteworthy improvement of 12 percentage points
- Contract Redlines: a net improvement of ~6% → a modest gain
- Contract Issue Spotting: Roughly flat to slightly worse performance
- Response Times: Longer across the board, sometimes 10x slower than GPT-4.1
In short: GPT-5 improves in some areas but not others, and today the model is significantly slower than GPT-4.1. We expect response times to improve in the coming months and, as noted, we expect some model improvements to become visible with more prompt tuning. We’ll be exploring low-latency configurations to recover speed and tuning our prompts to drive better model performance.
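For example, one lever we plan to test is trading reasoning depth for speed. The sketch below uses the OpenAI Responses API’s reasoning-effort and verbosity controls; the parameter values and prompt are illustrative, not the settings behind the results above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative low-latency configuration: minimal reasoning effort and
# low verbosity both shorten time-to-answer, at a possible cost in quality.
response = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "minimal"},  # spend fewer reasoning tokens
    text={"verbosity": "low"},        # shorter outputs return sooner
    input="List the indemnification issues in the clause below.\n\n<clause text>",
)
print(response.output_text)
```

Whether a configuration like this preserves redline and issue-spotting quality is exactly what we’ll be measuring as we tune.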
What's Next?
In future updates, we’ll share Japanese performance results, an evaluation of GPT-5 against our broader leaderboard of models, and more information about our Contract Review Benchmark.