
GPT-5.5 Takes a Step Forward, But the Picture Is More Complicated Than It Looks

May 5, 2026
Gabor Melli, VP of Artificial Intelligence

With every major model release, the same question lands in our inbox: Is this one better for legal? 

For GPT-5.5, the answer depends entirely on what you are asking it to do. We ran it against our contract revision and review benchmarks, and the results split in a way that is worth understanding before your team makes any deployment decisions.

Contract Revision: GPT-5.5 vs. GPT-5.4

Our AI Revise benchmark covers 200 English contract revision examples spanning NDAs, MSAs, and other commercial agreements.  

Each revision is independently scored on three criteria: 

  1. Instruction adherence
  2. Semantic preservation
  3. Grammatical correctness

A revision passes only if all three criteria are met.
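The pass rule is a strict conjunction: failing any one criterion fails the whole revision. A minimal sketch of that scoring logic (illustrative only; the actual benchmark uses independent human scoring, and the dictionary keys here are hypothetical stand-ins):

```python
def passes(revision):
    # A revision passes only if ALL THREE criteria hold.
    return (
        revision["instruction_adherence"]
        and revision["semantic_preservation"]
        and revision["grammatical_correctness"]
    )

# Toy scored examples (hypothetical), not real benchmark data.
examples = [
    {"instruction_adherence": True, "semantic_preservation": True, "grammatical_correctness": True},
    {"instruction_adherence": True, "semantic_preservation": False, "grammatical_correctness": True},
]

# Overall accuracy is simply the fraction of examples that pass.
accuracy = sum(passes(e) for e in examples) / len(examples)
print(accuracy)  # 0.5
```

Note that this conjunctive rule means overall accuracy can never exceed the weakest individual criterion.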

GPT-5.5 vs. GPT-5.4: Contract Revision Performance Results

  • Overall accuracy: 87.5% vs. 85.0%; a meaningful +2.5pp improvement
  • Instruction adherence improved: 91.0% vs. 88.0%; +3.0pp. GPT-5.5 more reliably follows the revision instruction as given.
  • Semantic preservation was effectively flat: 87.5% vs. 88.0%; a slight -0.5pp change
  • Grammar remained near-perfect for both models: 99.5% vs. 100.0%
  • Speed: 2.65s vs. 3.04s per revision; 13% faster despite better output
  • Zero failures on both models across all 200 examples
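As a quick sanity check, the headline deltas above follow directly from the reported figures (back-of-the-envelope arithmetic, not benchmark code):

```python
# Reported overall accuracy, in percent.
acc_55, acc_54 = 87.5, 85.0
delta_pp = acc_55 - acc_54  # +2.5 percentage points

# Reported mean latency per revision, in seconds.
sec_55, sec_54 = 2.65, 3.04
speedup = (sec_54 - sec_55) / sec_54  # fraction of time saved per revision

print(f"{delta_pp:+.1f}pp, {speedup:.0%} faster")
```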

Where GPT-5.5 is better: term revisions and liability edits

The improvement shows up most clearly in how each model handles common legal revision tasks. On term revisions (converting indefinite obligations into bounded durations), GPT-5.5 more reliably uses standard contract phrasing.

A representative example: when revising "shall continue indefinitely" to a fixed term, GPT-5.4 produced "until [X] years from the Effective Date" while GPT-5.5 produced "for [X] years from the Effective Date." That distinction matters. "For" correctly pairs with a duration and follows standard legal drafting conventions, while "until" typically refers to a point in time and reads less naturally.

TERM REVISION EXAMPLE
Original:  shall continue indefinitely
GPT-5.4:   until [X] years from the Effective Date
GPT-5.5:   for [X] years from the Effective Date

On liability edits, GPT-5.5 more often makes the requested change cleanly without introducing formatting noise or unnecessary rewrites, which makes the output easier for legal teams to accept with minimal cleanup.

Where GPT-5.4 still holds its own: contextual awareness

Where GPT-5.4 occasionally produces the better revision (roughly 1 in 16 examples), it tends to show slightly better contextual awareness. 

On a confidentiality clause revision, for instance, GPT-5.4 integrated the fix within the existing sentence structure while maintaining focus on the section's specific obligations. GPT-5.5, by contrast, replaced the sentence entirely with generic agreement-term language, losing the contextual connection. That tradeoff, cleaner instruction-following at the occasional cost of broader document awareness, is where GPT-5.5 still has room to grow.

In short: GPT-5.5 is a meaningful upgrade for English contract revision. The improvement is driven primarily by stronger instruction adherence, while semantic preservation and grammar remain broadly stable. Both models were tested under identical low-reasoning settings with no custom prompts.

Contract Review: GPT-5.5 vs. GPT-5.4

Our contract review benchmark covers 494 guideline-contract decisions across 26 precision-critical guidelines, spanning BAAs, MSAs, NDAs, clinical trial agreements, and commercial leases. 

The task is binary: for each guideline-contract pair, the model determines whether the contract meets or fails the requirement.
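Because the task is binary, the precision and recall figures below treat a flagged (UNMET) guideline as the positive class. A minimal sketch of how those metrics are computed for this setup (illustrative; the benchmark's actual scoring harness is not shown here):

```python
def precision_recall(preds, truth):
    """Precision/recall where 'positive' means the model flags the
    guideline as UNMET. preds/truth are lists of booleans."""
    tp = sum(p and t for p, t in zip(preds, truth))        # flagged, truly unmet
    fp = sum(p and not t for p, t in zip(preds, truth))    # flagged, but compliant (false alarm)
    fn = sum(not p and t for p, t in zip(preds, truth))    # missed a real violation
    precision = tp / (tp + fp)  # of everything flagged, how much was right
    recall = tp / (tp + fn)     # of real violations, how many were caught
    return precision, recall

# Toy example: 4 guideline-contract decisions (hypothetical data).
preds = [True, True, False, False]   # model flags the first two as UNMET
truth = [True, False, False, True]   # only #1 and #4 are actually UNMET
print(precision_recall(preds, truth))  # (0.5, 0.5)
```

Under this framing, a precision drop means more compliant contracts incorrectly flagged, which is exactly the failure mode described below.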

GPT-5.5 vs. GPT-5.4: Contract Review Performance Results

  • Overall accuracy: 77.1% vs. 80.2%; a -3.0pp regression
  • Precision declined: 80.5% vs. 84.8%; GPT-5.5 generates 34% more false positives than GPT-5.4
  • Recall was effectively flat: 77.2% vs. 77.9%
  • Total errors increased: 113 vs. 98; 15 additional errors
  • The regression is concentrated in BAA and MSA contract types; NDA and NNN performance held broadly stable
  • Enabling reasoning (reasoning_effort=low) does not recover accuracy. It stays at 77.1% while latency increases 4.6x.

The accuracy drop is driven primarily by an increase in false positives, cases where GPT-5.5 flags a compliant contract as non-compliant. Of 494 predictions, GPT-5.5 introduced 39 regressions against 24 improvements, for a net of 15 additional errors. 
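The net-error arithmetic checks out against the reported accuracy figures (a back-of-the-envelope recomputation from the numbers above, not benchmark code):

```python
total = 494        # guideline-contract pairs
regressions = 39   # correct under GPT-5.4, wrong under GPT-5.5
improvements = 24  # wrong under GPT-5.4, correct under GPT-5.5

net_new_errors = regressions - improvements  # 15
errors_54 = 98
errors_55 = errors_54 + net_new_errors       # 113

accuracy_55 = (total - errors_55) / total
print(f"{accuracy_55:.1%}")  # 77.1%
```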

Here is where those errors are concentrated:

Top regressions

BAA: "Business Associate workforce" language (4 regressions, all false positives)

"The contract must include the following, or substantially similar, language: 'Business Associate, along with its Workforce members...'"
GPT-5.5 becomes overly strict on "substantially similar" matching, flagging contracts that GPT-5.4 correctly recognized as compliant. This is the clearest example of the false positive pattern driving the overall regression.

NDA: No obligation to enter further agreements (3 regressions, all false negatives)

"Make sure there is language clarifying that the Agreement doesn't obligate the parties to enter into further agreements."
GPT-5.5 incorrectly marks these contracts as meeting the requirement when the clarifying language is actually absent. This is an absence-detection failure: the model reads past a gap that GPT-5.4 caught.

Top improvements

BAA: Defined capitalized terms (3 improvements)

"All capitalized terms that are not proper nouns are defined in the agreement."
GPT-5.5 correctly identifies missing term definitions that GPT-5.4 overlooked, catching UNMET contracts that GPT-5.4 incorrectly passed.

NDA: Bilateral structure or disclosing party identified (3 improvements)

"The NDA must be bilateral or the disclosing party must be identified."
GPT-5.5 better detects whether the NDA structure is bilateral, both catching missing bilateral language and correctly recognizing it when present. This is one area where its stronger clause-detection ability comes through.

The improvement pattern shows GPT-5.5 is better at detecting certain absent clauses, but the gains are not enough to offset the increased false alarm rate, particularly in MSA and BAA contract types.

This release continues a pattern of newer OpenAI models not consistently improving on structured legal compliance tasks. On this benchmark, GPT-5.5 sits between GPT-5.2 and GPT-5.4 in accuracy: above the earlier baseline but below the current best.

In short: GPT-5.5 is not an upgrade for contract review under naive conditions. It regresses 3pp from GPT-5.4, driven by increased false positives in BAA and MSA contract types. Enabling reasoning does not recover the gap and is not a useful trade-off in production. Both models were tested under identical conditions with no custom prompts or fine-tuning.

What's next?

We will compare GPT-5.5 against our full model leaderboard, including higher-reasoning configurations and the strongest current non-OpenAI models. 

We will also publish more about how each benchmark is built, and why we believe task-specific evaluation is the only honest way to measure AI for legal work.

Follow LegalOn on LinkedIn for updates.

*Revision results reflect the AI Revise benchmark (n=200, low-reasoning settings, no custom prompts). Review results reflect a 26-guideline precision-critical subset (n=494 guideline-contract pairs, naive one-shot prompting). Both benchmarks use human-annotated ground truth.

Credits: Gabor Melli, Deddy Jobson, Sonny Chee, and Petrie Wong

