A new benchmark study released by Vals AI suggests that both legal-specific and general large language models are now capable of performing legal research tasks with a level of accuracy equaling or exceeding that of human lawyers.
The report, VLAIR – Legal Research, extends the earlier Vals Legal AI Report (VLAIR) from February 2025 to include an in-depth examination of how various AI products handle traditional legal research questions.
That earlier report evaluated AI tools from four vendors — Harvey, Thomson Reuters (CoCounsel), vLex (Vincent AI), and Vecflow (Oliver) — on tasks including document extraction, document Q&A, summarization, redlining, transcript analysis, chronology generation, and EDGAR research.
This follow-up study compared three legal AI systems – Alexi, Counsel Stack and Midpage – and one foundation model, ChatGPT, against a lawyer baseline representing traditional manual research.
All four AI products, ChatGPT included, scored within four points of each other; the legal AI products performed better overall than the generalist product, and all of them outperformed the lawyer baseline.
The highest performer across all criteria was Counsel Stack.
Leading Vendors Did Not Participate
Unfortunately, the benchmarking did not include the three largest AI legal research platforms: Thomson Reuters, LexisNexis and vLex.
According to spokespeople for Thomson Reuters and LexisNexis, neither company opted to participate in the study. Neither said why.
vLex, however, originally agreed to have its Vincent AI participate in the study, but then withdrew before the final results were published.
A spokesperson for vLex, which was acquired by Clio in June, said the company chose not to participate because the benchmark was not designed for enterprise AI tools. The spokesperson said vLex would be open to joining future studies that fit its focus.
Overview of the Study
Vals AI designed the Legal AI Report to assess AI tools on a lawyer-comparable benchmark, evaluating performance across three weighted criteria (sketched in code after this list):
- Accuracy (50% weight) – whether the AI produced a substantively correct answer.
- Authoritativeness (40% weight) – whether the response cited reliable, relevant, and authoritative sources.
- Appropriateness (10% weight) – whether the answer was well-structured and could be readily shared with a client or colleague.
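For readers who want the math made concrete, here is a minimal sketch in Python of how a composite score could be derived from the three criteria. The 50/40/10 weights come from the report; treating the combination as a simple weighted average, and the example values themselves, are assumptions for illustration, not the report's published methodology.

```python
# A minimal sketch (not the report's actual scoring code) of how the three
# weighted criteria could combine into one composite score per response.
# The 50/40/10 weights come from the report; the weighted-average formula
# is an assumption for illustration.

WEIGHTS = {
    "accuracy": 0.50,           # substantively correct answer
    "authoritativeness": 0.40,  # reliable, relevant, authoritative sources
    "appropriateness": 0.10,    # well-structured, client-ready answer
}

def composite_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (on a 0-100 scale) into one weighted score."""
    return sum(WEIGHTS[criterion] * value for criterion, value in scores.items())

# Hypothetical values, not figures from the report:
print(composite_score({"accuracy": 80, "authoritativeness": 76, "appropriateness": 90}))
# -> roughly 79.4
```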
Each AI product and the lawyer baseline answered 210 questions spanning nine legal research types, from confirming statutory definitions to producing 50-state surveys.
Key Findings
- AI Now Matches or Beats Lawyers in Accuracy
Across all questions, the AI systems scored within four percentage points of one another and an average of nine points above the lawyer baseline.
- Lawyers (baseline): 71%
- Alexi: 80%
- Counsel Stack: 81%
- Midpage: 79%
- ChatGPT: 80%
When grouped, both legal-specific and generalist AIs achieved the same overall accuracy of 80%, outperforming lawyers by nine points.
Notably, for five of the question types, the generalist AI product on average provided a more accurate response than the legal AI products, and for one question type the two groups scored the same.
“Both legal AI and generalist AI can produce highly accurate answers to legal research questions,” the report concludes.
Even so, the report found multiple instances where the legal AI products were unable to produce a response, due either to technical issues or to a lack of available source data.
“Pure technical issues only arose with Counsel Stack (4) and Midpage (3), where no response was provided at all. In other cases, the AI products acknowledged they were unable to locate the right documents to provide a response but still provided some form of response or explanation as to why the available sources did not support their ability to provide an answer.”
- Legal AI Leads in Authoritativeness
While ChatGPT matched its legal-AI rivals on accuracy, it lagged in authority — scoring 70% to the legal AIs’ 76% average. The difference, Vals AI said, reflects access to proprietary legal databases and curated citation sources, which remain differentiators for legal-domain systems.
“The study outcomes support a common assumption that access to proprietary databases, even if composed mainly of publicly available data, does result in differentiated products.”
- Jurisdictional Complexity Remains Hard for All
All systems struggled with multi-jurisdictional questions, which required synthesizing laws from multiple states. Performance dropped by 11 points on average compared to single-state questions.
Counsel Stack and Alexi tied for the best performance on these questions, with ChatGPT close behind.
- AI Excels at Certain Tasks Beyond Human Speed
The AI products outperformed the lawyer baseline on 15 of 21 question types — often by wide margins when tasks required summarizing holdings, identifying relevant statutes, or sourcing recent caselaw.
For example, AI responses were completed in seconds or minutes, while lawyers took an average of 1,400 seconds (about 23 minutes) to answer.
And where the AI products outperformed the humans on individual questions, they did so by a wide margin – an average of 31 percentage points.
- Human Judgment Still Matters
Lawyers outperformed AI in roughly one-third of question categories, particularly those requiring deep interpretive analysis or nuanced reasoning, such as distinguishing similar precedents or reconciling conflicting authorities.
These areas underscore, as the report put it, “the enduring edge of human judgment in complex, multi-jurisdictional reasoning.”
Methodology
The study was conducted blind and independently evaluated by a consortium of law firms and academics.
Each participant answered identical research questions crafted to mirror real-world lawyer tasks. Evaluators graded every response using a detailed rubric (which the report includes).
The AI products evaluated were:
- Alexi – legal research automation startup (founded 2017).
- Counsel Stack – open-source legal knowledge platform.
- Midpage – AI research and brief-generation tool.
- ChatGPT – generalist large language model (GPT-4).
Vals AI cautioned that the benchmark covers general legal research only, not tasks such as drafting pleadings or generating formatted citations.
And, as the report notes, “Legal research encompasses a wide range of activities … but there is not always a single correct answer prepared in advance.”
Bottom Line
The VLAIR – Legal Research study reinforces what many in the legal tech industry have already observed: AI systems – both generalist and domain-trained – are rapidly closing the quality gap with human legal researchers, particularly in accuracy and efficiency.
Yet legal-specific AIs retain the edge in authoritativeness and source citation, suggesting that proprietary data access is the next competitive frontier.
For law firms, corporate legal departments, and AI vendors alike, the study serves as a transparent benchmark – a rare apples-to-apples comparison – for understanding where today's models shine and where human expertise remains indispensable.
Even so, the study is weakened by the failure of the three biggest AI legal research platforms to participate. This is not the fault of Vals AI, but it leaves one wondering why the big three all opted out.