Harvey, the legal AI company whose valuation recently hit $11 billion, has released what it is calling the Legal Agent Benchmark, or LAB, an open-source evaluation framework designed to measure how well AI agents can perform extended, real-world legal work rather than the discrete reasoning tasks that have dominated legal AI benchmarks to date.
Announced May 6 in a post by Harvey researchers Niko Grupen, Gabe Pereyra (Harvey’s cofounder), and Julio Pereyra, the first version of LAB contains more than 1,200 tasks spanning 24 legal practice areas, graded against more than 75,000 expert-written rubric criteria. The code and a portion of the dataset are available on GitHub.
“The goal of LAB is to provide a clear picture of how agents can be deployed to support legal work in the real world,” the researchers write. “By articulating where agents can do all, some, or none of a task, LAB helps law firms measure the ROI of AI investments and where such investments can augment their teams’ work.”
Notably, Harvey is launching LAB without a leaderboard. The company says it will work with research partners over the coming weeks to produce baseline results and publish standards for normalizing submissions before any rankings appear.
“We’re intentionally launching LAB without a leaderboard because we expect the dataset to evolve over time and we want to work with the community to ensure results are clear and intuitive in how they convey agent performance,” Harvey says.
What LAB Tests
Harvey says that existing legal AI benchmarks, including LegalBench, CUAD, LEXam, and Harvey’s own earlier BigLaw Bench, measure short-horizon reasoning, such as the ability to read a contract, answer a question, compare cases, or analyze an argument. LAB is meant to measure something closer to the unit of work that actually gets delegated inside a law firm.
Each LAB task is structured around four elements that mirror an associate’s assignment:
- An instruction written as a partner-to-associate request — short (averaging 50 words) and framed as what’s needed rather than how to produce it.
- An environment built as a client matter, with a closed universe of documents that the agent must sort through. Materials include both relevant files and peripheral ones the agent has to learn to ignore.
- An output that has to be reviewable legal work product, not just an answer.
- Verification through expert rubrics that break the deliverable into atomic pass/fail criteria covering facts, conclusions, citations, severity ratings, recommendations, deadlines, dollar amounts, and formatting.
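For readers who think in data structures, the four elements can be pictured as a single record. The sketch below is illustrative only: the class names, field names, and file names are assumptions made for this example and are not taken from Harvey's published code or dataset schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    """One atomic pass/fail rubric check (hypothetical shape)."""
    description: str
    category: str  # e.g. "facts", "citations", "deadlines"


@dataclass
class AgentTask:
    """Hypothetical container for the four elements of a LAB-style task."""
    instruction: str       # partner-to-associate request: what is needed
    documents: list[str]   # closed universe of matter files, relevant or not
    deliverable: str       # the reviewable work product expected back
    rubric: list[Criterion] = field(default_factory=list)

    def instruction_word_count(self) -> int:
        # Whitespace-delimited word count of the instruction.
        return len(self.instruction.split())


# Invented sample values, loosely modeled on the M&A scenario in the article.
task = AgentTask(
    instruction="Review the data room and draft a memo on change-of-control risk.",
    documents=["contract_01.pdf", "form_10-K.pdf", "deferred_comp_plan.pdf"],
    deliverable="draft memorandum",
    rubric=[
        Criterion("Identifies each change-of-control provision", "facts"),
        Criterion("Recommends next steps for the deal team", "recommendations"),
    ],
)
print(task.instruction_word_count(), len(task.rubric))
```

The point of the shape is that the instruction stays short while the documents and rubric carry most of the weight, which matches how Harvey describes the tasks.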
To illustrate the structure, Harvey uses a fictional corporate M&A example. It involves a $458 million all-equity acquisition of Crestview Software Solutions in which the agent must review a virtual data room containing eight material contracts plus adjacent documents such as a 10-K and a deferred compensation plan, identify change-of-control provisions across the matter, assess deal risk, recommend next steps, and produce a draft memorandum for the deal team and board. The rubric for that single task contains 57 criteria covering nine legal issues planted across the materials.
LAB uses what Harvey calls “all-pass” grading, meaning that a task is marked complete only if every rubric criterion passes. There is no partial credit. The rationale is that a deal memo that catches eight of 10 material risks is not 80% useful. One missed issue could blow up the transaction or surface as a problem post-closing.
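The difference between all-pass grading and partial credit is easy to see in a few lines of Python. This is a minimal sketch under stated assumptions, not Harvey's grading code: the function names are hypothetical, and each task is reduced to a list of pass/fail results, one per rubric criterion.

```python
def all_pass(results: list[bool]) -> bool:
    """A task is complete only if every criterion passed (and there is at least one)."""
    return len(results) > 0 and all(results)


def partial_credit(results: list[bool]) -> float:
    """Fraction of criteria passed. Shown for contrast; LAB does not award this."""
    return sum(results) / len(results) if results else 0.0


def completion_rate(tasks: list[list[bool]]) -> float:
    """Share of tasks that pass under all-pass grading."""
    if not tasks:
        return 0.0
    return sum(all_pass(t) for t in tasks) / len(tasks)


# A memo that catches eight of 10 material risks.
memo = [True] * 8 + [False] * 2
print(partial_credit(memo))  # 0.8
print(all_pass(memo))        # False

# Two tasks: one fully correct, one with a single miss.
print(completion_rate([[True, True, True], [True, False, True]]))  # 0.5
```

Under partial credit the memo would look 80% done; under all-pass grading it simply fails, which is the behavior the article describes.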
The 24 practice areas in the initial release span transactional, advisory, regulatory and litigation work. Harvey says future versions will expand within those areas, add new practices, and eventually move beyond law firms to in-house legal work and adjacent professional services like asset management and banking.
Why a Benchmark?
Harvey’s thesis is that benchmarks have served as leading indicators of capability inflection points in other agentic domains — most visibly in software engineering, where benchmarks such as SWE-Bench Verified and Terminal-Bench 2.0 tracked the shift that AI researcher Andrej Karpathy summarized by saying coding agents “basically didn’t work before December and basically work since.”
Harvey argues that similar benchmarks (GDPval, OSWorld-Verified, BrowseComp, FinanceAgent, and others) are now extending legibility to knowledge work, web research, financial analysis and professional services.
Harvey positions LAB as the legibility layer for legal agents. The use case Harvey describes for law firms is straightforward: identify the workflows where agents perform well enough to be delegated under a “review pattern,” identify the workflows where they don’t and need to stay heavily human-in-the-loop, and make deployment and ROI decisions accordingly.
For most firms, that may matter more than technical details. The legal industry has spent two years cycling through vendor demos and pilot programs without a shared way to answer the question every managing partner and innovation lead is being asked: Where, specifically, can we put these things to work?
A credible, public benchmark, particularly one structured around actual deliverables rather than multiple-choice questions, could change that conversation. Of course, it could also complicate it, by revealing how far agents still are from autonomous practice in many areas.
Practical Applications of LAB
To my mind, a few practical applications of LAB jump out:
- For law firms, LAB offers a reference point for vendor evaluation. A firm evaluating competing products could, in theory, ask each vendor to report performance on specific LAB practice areas and compare results, rather than rely on vendor demos and case studies.
- For vendors, LAB offers a public yardstick for claims about agent capability. Harvey has acknowledged contributions from a substantial list of labs and companies (including Anthropic, OpenAI, Nvidia, Google DeepMind, Mistral, LangChain, Fireworks, Snorkel, Mercor, and Stanford LIFTLab), which suggests the major frontier labs see value in a shared evaluation context for legal agents.
- For researchers, LAB provides a longer-horizon, domain-specific task set that they can use for evaluation, fine-tuning and post-training work.
- For legal journalists and analysts, LAB could provide something more useful than vendors’ own claims about their products: a way of actually putting those claims to the test.
The Bottom Line
It is worth noting that LAB is a benchmark built by a market participant. Harvey is a dominant and well-funded legal AI vendor, and the company has not been shy about its commercial positioning.
The tasks and definitions of “legal work product” within LAB reflect choices about what good legal work looks like, and those choices were made by Harvey’s team in consultation with its research partners. None of that makes the benchmark unreliable, but it is something the legal community needs to keep in mind going forward.
There is also the question of what “open source” actually means in this context. In a post at Alt-Counsel, Houfu Ang argues that legal open source is not really a community but rather “a federation of solo-author archipelagos.”
He points specifically to projects that come from well-funded vendors such as Harvey, whose repositories are maintained almost exclusively by in-house staff in what the Open Source Initiative calls “Open Source theatre.” Virtually none of these, Ang argues, graduate from individual showcase to sustained codebase with outside contributors.
Even so, LAB is the most ambitious public attempt yet to measure what legal AI agents can do on the kind of work law firms actually delegate. Whether it becomes the shared yardstick Harvey wants it to be will depend on how the leaderboard rolls out, how transparently submissions are normalized, and how much room the project leaves for outside contributors to shape what gets measured.
Robert Ambrogi Blog