.webp)
.jpeg)
This article first appeared in Today's General Counsel.
Every few weeks, a new AI model launches with a press release full of benchmark numbers. Legal media reports that it scores higher on reasoning tests, handles longer context windows, and generates output faster than others in the market. And somewhere in your organization, someone forwards the article with a question: Should we switch?
Most of those teams don’t have the wrong model. They have the wrong evaluation framework. They selected legal AI based on model hype rather than team need, building procurement decisions on general capability benchmarks that have almost no bearing on how in-house legal actually works.
Legal work, particularly contract review, is not a general task. It requires applying nuanced legal standards—your standards, your positions, your risk tolerances—to complex documents, clause by clause. A model that generates plausible legal language is not the same thing as a system that reliably identifies whether your indemnification clause meets your approved fallback position.
These are different things. Missing them has a real legal and financial impact.
Measure what matters, or don’t bother measuring.
Benchmarks aren’t the problem—bad benchmarks are. A test that measures how well AI thinks or writes tells you exactly that, and nothing more. It doesn’t tell you whether the AI can actually do legal work.
For legal AI specifically, the only benchmark that matters is one built around real contracts, reviewed against the standards a practicing lawyer would actually apply, and common sense. If it’s not testing that, it’s not testing anything useful.
General-purpose AI benchmarks measure general-purpose capability. They test reasoning, math, coding, and language comprehension. They are designed to be broadly comparable across many types of tasks.
They are not built for predicting how an AI tool will perform on specific legal contracts. That’s why model improvements of general-purpose models do not reliably translate to better performance on clause-level legal analysis.
Some of the provisions where AI struggles most are the provisions that carry the most legal risk, like assignment rights, protected health information (PHI) ownership, or limitation of liability carve-outs. These require understanding conditional logic across multiple contract sections, not just pattern recognition on a single clause. A model that scores well on a reading comprehension benchmark can still fail systematically on these tasks.
If you are evaluating legal AI tools based on which model they run, you are not evaluating what matters.
The right starting point for any legal AI evaluation is a clear-eyed assessment of your own team’s needs.
For some, the constraint is volume. There are simply more contracts coming in than the team can review at its current pace. For others, it’s consistency. The team has standards, but those standards are not being applied consistently across all reviewers and contract types. Or maybe it’s coverage. The team can quickly and easily review routine contracts, but escalates everything else to outside counsel.
Each of these scenarios calls for a different evaluation. A team with a volume problem needs to test AI against realistic contract loads and measure how much attorney time is saved and how much is required to validate outputs. A team with a consistency problem needs to test how well the AI enforces specific playbook standards, especially when it comes to nuanced positions. A team with a coverage problem needs to assess jurisdiction-specific accuracy, not just English-language common-law performance.
None of these evaluations begins with asking which model a tool uses. That comes after you understand what you’re really trying to measure.
There is a particular risk for legal teams that have developed a habit of chasing the newest model: it introduces instability into workflows that depend on reliability.
In legal review, frequent tool changes are costly. The team develops judgment about how to interpret AI output, playbooks are tuned to a specific system’s behavior, and attorneys know where the tool is reliable and where it requires closer review. When you switch tools or models frequently, you reset that accumulated knowledge and introduce new uncertainty into the review process.
The cost of switching legal AI is primarily operational. Constantly re-evaluating tools in response to model release cycles means a team hasn’t committed to building AI into its actual workflows, leaving it in evaluation mode indefinitely.
To get meaningful results from legal AI, first define what is needed, then find a tool that meets that definition to build into standard workflows.
If you are beginning or re-evaluating your legal AI procurement, here’s a practical framework for evaluating legal AI tools based on your team’s needs.
The right tool doesn’t just process contracts faster. It enforces your standards, every time, across every reviewer on your team.
Don’t ask about the newest model—newer isn’t always better. The question your team should be asking is which system is most reliable for the specific work your team does, and which provider is most committed to maintaining that reliability over time.
The AI landscape will keep moving. New models will keep launching, and benchmarks will keep climbing. None of that changes the underlying evaluation framework for in-house legal teams.
Your team needs AI that applies your standards accurately, validates efficiently, integrates with your workflows, and protects your data. The model powering that AI is a means to those ends. Get the evaluation criteria right, then the model question largely takes care of itself.
Stop chasing recency and start building for fit.