Legal professionals are struggling to evaluate the rapidly evolving landscape of generative artificial intelligence tools for legal research, according to experts who spoke at a panel discussion during the American Association of Law Libraries annual conference in Portland, Ore., this week.
The panel, titled “AI in Legal Research: Measuring What Matters with Benchmarks and Rubrics,” brought together three professionals working on the front lines of AI evaluation: Sean Harrington, director of technology innovation at the University of Oklahoma College of Law; Cindy Guyer, practice innovation attorney at O’Melveny & Myers; and Nick Hafen, head of legal technology education at BYU Law School.
The panel was moderated by Debbie Ginsburg, faculty services manager at Harvard Law School’s library.
The Benchmarking Challenge
Benchmarking — which involves comparing software tools against a standard for consistency — faces unique obstacles in the legal field, the panelists said. Unlike hardware benchmarks that can measure battery life or processing speed with clear metrics, legal AI tools present complex challenges.
“Legal questions are not always easily broken down to correct answers,” Hafen explained. “We are always arguing about what the correct answer is.”
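For readers outside the benchmarking world, the basic mechanics are simple even if the legal substance is not. The sketch below is not from the panel; the questions, acceptable answers, and tool names are hypothetical placeholders. It shows the exact-match style of scoring that works for hardware-style benchmarks and that, as Hafen noted, legal research questions often resist.

```python
# A minimal, hypothetical sketch of exact-match benchmarking: score each
# tool's answers against a hand-built answer key and report an accuracy rate.
# The questions, acceptable answers, and tool outputs are invented placeholders.

# Answer key: question -> set of acceptable short answers (lowercased).
ANSWER_KEY = {
    "What is the statute of limitations for breach of a written contract in California?":
        {"four years", "4 years"},
    "Which court decided Marbury v. Madison?":
        {"u.s. supreme court", "supreme court of the united states"},
}

# Outputs collected (by hand or by API) from each tool, in question order.
TOOL_OUTPUTS = {
    "Tool A": ["Four years", "U.S. Supreme Court"],
    "Tool B": ["Two years", "U.S. Supreme Court"],
}

def accuracy(answers, key=ANSWER_KEY):
    """Fraction of questions whose answer exactly matches an accepted response."""
    questions = list(key)
    correct = sum(
        ans.strip().lower() in key[q]
        for q, ans in zip(questions, answers)
    )
    return correct / len(questions)

for tool, answers in TOOL_OUTPUTS.items():
    print(f"{tool}: {accuracy(answers):.0%} accurate")
```

The panel's point is precisely that legal research answers rarely reduce to an answer key like this, which is why rubric-based human review figures so heavily in the evaluations discussed below.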
The challenges extend beyond methodology. Harrington noted that vendors typically don’t provide backend access to their systems, because they view it as their “secret sauce.” Modern legal AI platforms often use combinations of models rather than relying on a single large language model, making direct comparisons difficult.
“Frequently they’re using combinations of models,” Harrington said. “When these things first launched, it would be that they would call the GPT-4 API and that was the engine behind it. Now they’re all using mixtures of models.”
Recent Studies and Their Limitations
The panel examined several recent attempts to benchmark legal AI tools, each with distinct approaches and limitations:
The Stanford Study: Published as “Hallucination Free? Assessing the Reliability of Leading AI Legal Research Tools,” this study tested gen AI tools from LexisNexis and Thomson Reuters. It found hallucination rates of 17 to 33 percent, but faced criticism for its methodology, particularly its use of Westlaw Practical Law for tasks it wasn’t designed to handle. In a revised version of the study issued in response to that criticism, Lexis+ AI correctly answered 65 percent of the queries, while Westlaw’s AI-Assisted Research was accurate only 42 percent of the time and hallucinated at nearly twice the rate of the LexisNexis product (33 percent versus 17 percent).
Academic Research on Legal Hallucinations: A study titled “Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models” tested GPT-4, GPT-3.5, PaLM 2, and Llama 2 across 1,000 randomly selected federal cases. It reported hallucination rates of 58 percent overall, rising to 88 percent for Llama 2, but was also criticized for methodology issues.
Vendor Self-Studies: Both Harvey AI and Paxton AI produced their own studies with significantly more favorable results. Harvey claimed 74 percent human-level output on legal tasks, while Paxton reported a 94.7 percent accuracy rate. However, the panelists noted these studies were essentially “scorecards” created by the vendors themselves.
The VALS Study: The most rigorous effort to date, according to the panelists, came from VALS (Verification and Assessment of Legal Solutions), which examined seven different legal tasks using actual law firm workflows. This study worked with multiple law firms to obtain real-world tasks and included human baselines using contract attorneys.
The “AI Smackdown” Experiment
Guyer described her participation in an “AI Smackdown” conducted for the Southern California Association of Law Libraries (which I covered here). Three law firm librarians tested three popular AI research platforms – Lexis Protégé, Westlaw Precision, and vLex – using identical prompts across federal and California state law questions.
The experiment evaluated the tools on six factors: accuracy, depth of analysis, citation of primary sources, citation of secondary sources, format readability, and iterative capabilities. Guyer said the results revealed significant variations in performance and highlighted trust issues when tools returned irrelevant results.
“When you’re looking at an answer, it really needs to be relevant or at least relevant adjacent,” Guyer said, adding that some partners at her firm have abandoned tools after receiving poor results.
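A rubric like the one Guyer described can be tallied quite simply. The sketch below is not the SCALL group’s actual scoring method; the 1-to-5 scale, the weights, and the example scores are assumptions added purely for illustration.

```python
# Hypothetical tallying of a six-factor rubric. The scale, weights, and scores
# are invented for illustration; they are not the Smackdown's actual numbers.

FACTORS = [
    "accuracy",
    "depth_of_analysis",
    "primary_source_citations",
    "secondary_source_citations",
    "format_readability",
    "iterative_capability",
]

# One evaluator's scores for a single research question (1 = poor, 5 = excellent).
SCORES = {
    "Tool A": dict(zip(FACTORS, [4, 3, 4, 3, 5, 4])),
    "Tool B": dict(zip(FACTORS, [5, 4, 5, 2, 3, 2])),
}

# Weight accuracy more heavily than presentation-oriented factors.
WEIGHTS = {factor: 1.0 for factor in FACTORS}
WEIGHTS["accuracy"] = 2.0

def weighted_average(tool_scores):
    """Weighted mean of the six factor scores, on the same 1-5 scale."""
    total = sum(WEIGHTS[f] * tool_scores[f] for f in FACTORS)
    return total / sum(WEIGHTS.values())

for tool, scores in SCORES.items():
    print(f"{tool}: {weighted_average(scores):.2f} out of 5")
```

Averaging scores like these across evaluators and questions, and keeping notes on where a tool went off the rails, is what turns scattered impressions into something that can be compared across platforms.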
Implementation Challenges
The panel said that adoption rates remain low even at firms with expensive AI subscriptions. Harrington cited a colleague at an Am Law 10 firm who reported that only 10 percent of attorneys use their AI tool, with just 2 percent being power users.
“The rest of the people just feel like they can’t rely on it,” Harrington explained, questioning whether firms can continue to justify the expense if confidence remains low.
Guyer noted that her firm created a practice innovation billing code to incentivize attorneys to participate in the evaluation of AI tools, in the belief that attorneys need dedicated time and resources to properly test these systems.
Practical Recommendations
For organizations looking to conduct their own evaluations, the panelists offered several recommendations (a rough record-keeping sketch follows the list):
- Focus on specific use cases rather than attempting comprehensive testing.
- Start simple with basic comparisons and iterate over time.
- Accept that any benchmarking is a snapshot in time given the rapid pace of change.
- Test with realistic prompts (such as those a first-year associate might use) rather than “gold standard” questions that may not reflect actual usage.
- Involve actual users in the evaluation process despite time constraints.
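Several of those recommendations amount to record-keeping discipline. As a rough illustration only, not anything the panel prescribed, a lightweight evaluation log might look like the following; the field names, prompts, and file name are hypothetical.

```python
# A hypothetical, lightweight evaluation log: realistic, use-case-tagged
# prompts, human review results, and a date stamp so each round is understood
# as a snapshot in time. All names and fields are invented for illustration.

import csv
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class TestCase:
    use_case: str        # e.g. "case law synthesis", "statutory research"
    prompt: str          # phrased the way a first-year associate would ask it
    tool: str
    reviewer: str
    passed: bool         # did the human reviewer accept the output?
    notes: str
    run_date: str = date.today().isoformat()

# Start simple: a handful of cases per use case, added to over time.
results = [
    TestCase("case law synthesis",
             "Summarize the leading California cases on trade secret misappropriation.",
             "Tool A", "reviewer1", True, "Cited real cases; missed one recent decision."),
    TestCase("statutory research",
             "What notice must a California landlord give before entering a unit?",
             "Tool A", "reviewer2", False, "Cited the wrong code section."),
]

# Persist each round so later snapshots can be compared as the tools change.
with open("ai_eval_snapshot.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=asdict(results[0]).keys())
    writer.writeheader()
    writer.writerows(asdict(r) for r in results)
```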
Looking Forward
The panelists emphasized the need for greater vendor transparency and potentially industry-standard benchmarks similar to SOC 2 reports in cybersecurity. However, they acknowledged that such changes would likely require pressure from large law firms rather than individual institutions.
“It’s going to be the Kirkland & Ellises of the world,” Harrington predicted, referring to the large firms that might have sufficient leverage to demand better transparency and reliability standards.
While current benchmarking efforts have limitations, the panel agreed, they represent important first steps in bringing rigor to AI tool evaluation. As the technology continues to rapidly evolve, the panel concluded, the legal profession must develop better frameworks for assessment while managing the inherent challenges of evaluating “black box” systems that vendors are reluctant to fully disclose.