The AI legal research startup Descrybe today launched a “legal reasoning” product, DescrybeLM, that it says outperforms leading general-purpose AI models on a standardized legal reasoning benchmark — and it is publishing the methodology and scoring data to invite scrutiny.
The company also launched an all-new website that features the new product while retaining all the functionality of its prior “Legal Research Toolkit,” which includes tools for conducting legal research by concept, keyword, case name, citation, and legal issue.
As the company says, DescrybeLM and the Legal Research Toolkit are “built to work together,” with the latter used to find the relevant law that bears on a question and the former then enabling users to reason through it against the specific facts of the matter at hand.
Benchmarking Against General AI
The company tested its new system against ChatGPT 5.2, Claude Opus 4.5, and Gemini 3 Pro on 200 questions from the National Conference of Bar Examiners’ MBE Complete Practice Exam. DescrybeLM answered all 200 correctly. The general-purpose models each missed between 13 and 23 questions, achieving accuracy rates ranging from 88.5% to 93.5%.
Rubric-scored reasoning quality — a separate measure evaluating whether systems correctly identified governing legal rules and applied them to the facts — followed a similar pattern. DescrybeLM scored 99.70% on that dimension. ChatGPT 5.2 scored 93.41%, Gemini 3 Pro scored 91.45%, and Claude Opus 4.5 scored 89.03%.
A central focus of the study was not just whether AI systems get legal questions wrong, but how they get them wrong. Among the 52 incorrect outputs produced by the three general-purpose models, 49 were flagged as “confidently wrong” — assertive, fluent, well-structured responses that gave no signal of uncertainty. The dominant failure patterns were applying the wrong legal standard to the facts, or applying the correct standard incorrectly.
“When these systems were wrong, they were confidently wrong,” the benchmarking study said. “Among the 52 total incorrect outputs, the dominant failure patterns applied the wrong legal standard or misapplied the correct one, while presenting the analysis in fluent, well-structured prose. These are precisely the errors that impose the highest verification burden on practitioners.”
The study also found that cross-checking between general-purpose models is an unreliable safeguard. Across the 200 questions, 40 were missed by at least one of the three models, but only one question was missed by all three. Because errors were largely non-overlapping and unpredictable, model disagreement does not reliably identify which output is correct; it only signals that verification is needed.
The scoring log showed that two general-purpose models — Claude Opus 4.5 and Gemini 3 Pro — were flagged for overconfidence on correct outputs as well as incorrect ones. Claude Opus 4.5 received three overconfidence flags total, one on a wrong answer and two on correct ones. Gemini 3 Pro received one overconfidence flag on a correct answer. ChatGPT 5.2 and DescrybeLM received zero overconfidence flags across all 200 outputs.
The study interprets this as a model-level stylistic tendency, not simply a byproduct of being wrong. A system that applies the same assertive tone regardless of whether its answer is correct, the white paper argues, gives legal practitioners a less reliable signal from output confidence alone.
“We had a thesis that purpose-built legal AI produces meaningfully different results for legal reasoning tasks,” said Kara Peterson, Descrybe’s cofounder and CEO. “Legal professionals deserve to make tool decisions based on real evidence, which can be hard to find. So, we tested ourselves.”
Peterson said she understands that vendor-produced benchmarks invite scrutiny. “That’s why we published our methodology and invite anyone to replicate it,” she said.
What’s Behind DescrybeLM
Descrybe describes DescrybeLM, the focus of this study, as a legal reasoning engine and drafting workspace. It enables users to receive authority-grounded analysis of complex legal questions and then to refine that analysis through clarifying follow-ups.
(Descrybe invited me to test the new product in advance of today’s launch. Because of my own tight schedule, I was unable to do so, but I still plan to at a later point and will report back when I do.)
It is built, the company says, on a curated primary law corpus of more than 100 million structured records, processed at a scale requiring more than 100 billion tokens of preparation. The system is designed to produce verification-friendly outputs that include clear rule statements, application to key facts, and structured reasoning.
“Most AI tools are built for general use and adapted for law,” said Richard DiBona, cofounder and CTO. “DescrybeLM was built differently: from the foundation up, specifically for legal reasoning, on more than 100 million structured records individually cleaned and organized for that purpose.”
In the study, the company chose to benchmark only against general-purpose models rather than other AI platforms specifically built for legal research. I asked Peterson why that was.
She explained that the company’s central question was what distinguishes foundation models from purpose-built tools, and that the best way to test that thesis was to compare its own system directly against those models. Peterson emphasized that she strongly encourages other legal AI vendors to run the same benchmark using the same methodology.
Caveats the Company Itself Raises
Descrybe is transparent about the possible limitations of its own study, which it spells out in the report.
Most notably, the company says that it cannot rule out that some or all of the 200 questions appeared in the training data of the evaluated systems, including its own DescrybeLM. The MBE Complete Practice Exam is a commercially available product. That caveat applies equally to all systems tested, but a perfect score on a published question set will invite more scrutiny than a score of, say, 93%.
The study also discloses that fine-tuning performed on DescrybeLM before the benchmark was conducted on a separate NCBE product, not the question set used in the evaluation, and that the NCBE answer key was not provided to the system during testing.
Among other limitations: the benchmark covers only multiple-choice legal reasoning and does not test drafting, jurisdiction-specific research, citation accuracy or other real-world legal workflows. The evaluation was conducted entirely by the team that built DescrybeLM. Scoring was executed by an AI judge model — specifically, GPT-5.2 extra high reasoning, which is in the same model family as one of the evaluated systems. And each question was run only once per system, meaning the results do not come with confidence intervals or variance data.
On the rubric design, the company acknowledges the possibility of unconscious bias favoring DescrybeLM’s output style, but says it mitigated that risk through three measures: the rubric was authored by a human subject-matter expert and pre-committed before scoring began; all outputs were anonymized before the judge model applied the rubric; and the same rubric, judge prompt, and settings were applied identically across all 800 outputs.
DescrybeLM’s perfect accuracy score did not imply perfect reasoning, however. Twenty-seven of its outputs received rubric scores below 100, mostly for incomplete distractor discussion or alternative doctrinal framing — that is, reaching the correct answer via a different but defensible legal standard than the one emphasized by the reference answer. No DescrybeLM output was flagged for a wrong rule, misapplied rule, misread key fact, or internal contradiction.
Inviting Replication
As noted, the company has published its full methodology, scoring rubric, standardized prompt and a per-output scoring log covering all 800 model outputs across the four systems. The NCBE question set is commercially available, and Descrybe says any researcher or vendor can purchase it and independently replicate the benchmark.
Because of model non-determinism and ongoing provider updates, exact numerical replication is unlikely, the company says, but directional replication, including rank ordering and approximate accuracy ranges, is the expected standard.
“I’ve worked in legal technology for a long time,” said Ken Friedman, founder of the legal tech practice group at law firm L&F Brown and a strategic advisor to the company. “It’s rare to see something that genuinely stops you in your tracks. When I saw DescrybeLM answer all 200 multistate bar exam questions correctly while ChatGPT, Claude and Gemini each missed double digits, that’s exactly what happened.”
Here is the full white paper: Beyond Confidently Wrong: How Purpose-Built AI Mitigates Legal Reasoning’s Hidden Risk.
Robert Ambrogi Blog
