Launching today is the capstone to a massive project executed over the last three years to digitize all U.S. case law, some 6.4 million cases dating all the way back to 1658, a span of 360 years. The Caselaw Access Project site launching today makes all published U.S. court decisions freely available to the public in a consistent digitized format.

The site is the product of a partnership started in 2015 between Harvard Law School’s Library Innovation Lab and legal research service Ravel Law to digitize Harvard’s entire collection of U.S. case law, which Harvard says is the most comprehensive and authoritative database of American law and cases available anywhere outside the Library of Congress.

With financial support from Ravel — which continued after LexisNexis acquired it in 2017 — Harvard scanned 38.6 million pages from 39,796 books and converted it all into machine-readable text files. The collection includes 6.4 million published cases covering 627 reporter series, starting with the 1658 Maryland case William Stone against William Boreman and continuing through June 30, 2018.

The collection includes all federal and state courts, and all territorial courts for American Samoa, Dakota Territory, Guam, Native American Courts, Navajo Nation, and the Northern Mariana Islands. For now, the collection is text only, although Harvard plans to add images at a later time.

On Friday, I visited the Library Innovation Lab at Harvard Law School and met with Adam Ziegler, director, and Jack Cushman, senior developer, who gave me a preview of the site.

Available in Two Forms

With today’s launch, Harvard is making this data available in two forms:

  • Via an API (application programming interface) called CAP API, which it says is the best option for anybody interested in programmatically accessing the metadata, full-text search, or individual cases.
  • Via bulk downloads, which are available to anyone for cases from Illinois and Arkansas and to non-commercial researchers for broader data collections.

The API is designed to make the data usable by programmers, so neither the front-end interface nor the display of the data is as graphically friendly as a typical case law research site. Data is displayed in a plain text format that includes metadata mark-up. Still, anyone can use the API to browse or search the collection. The site includes helpful descriptions of its data formats and even a beginner’s guide to APIs.
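For readers curious what programmatic access might look like, here is a minimal Python sketch. It is not taken from Harvard's documentation: the base URL, the `search`, `jurisdiction` and `page_size` parameter names, and the `results`, `name_abbreviation` and `decision_date` response fields are all assumptions made for illustration, and the sample payload is invented. The sketch only builds a search URL and summarizes an already-decoded response, so it makes no network calls. Check the site's own API documentation for the real details.

```python
from urllib.parse import urlencode

# Assumed base URL for the CAP API case endpoint (not confirmed by this article).
BASE_URL = "https://api.case.law/v1/cases/"


def build_search_url(query, jurisdiction=None, page_size=10):
    """Build a full-text search URL. Parameter names are assumptions."""
    params = {"search": query, "page_size": page_size}
    if jurisdiction:
        params["jurisdiction"] = jurisdiction
    return BASE_URL + "?" + urlencode(params)


def summarize(response):
    """Return (case name, decision date) pairs from a decoded JSON response.

    Assumes a paginated shape with a 'results' list; missing keys yield None.
    """
    return [
        (case.get("name_abbreviation"), case.get("decision_date"))
        for case in response.get("results", [])
    ]


# Invented sample payload, used only to show the assumed response shape.
sample = {
    "count": 1,
    "results": [
        {"name_abbreviation": "Example v. Sample", "decision_date": "1900-01-01"}
    ],
}

print(build_search_url("habeas corpus", jurisdiction="ill"))
print(summarize(sample))
```

In practice, the URL would be fetched with any HTTP client and the JSON body passed to `summarize`. Keeping the request-building and response-parsing steps separate makes it easy to adjust either one if the real parameter or field names differ.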

Wordcloud derived from California cases in 1856.

Bulk access to this data is limited by virtue of Harvard’s agreement with Ravel (and now LexisNexis). Under that agreement, LexisNexis retains control over the commercial use of this data through March 2024. Any company wishing to use it for commercial purposes would have to license it from LexisNexis. The agreement requires LexisNexis to offer the data commercially on reasonable terms, Ziegler said.

That commercial restraint does not apply to any jurisdiction that makes all of its law available online in a fully authoritative and citeable format. So far, only two states qualify, Illinois and Arkansas, which is why only their data is available for bulk download. But as soon as any other jurisdiction makes its law available in this way, the Harvard data for that jurisdiction becomes available to anyone, without restriction.

That said, anyone can download up to 500 cases per day, and once a case is downloaded, it can be used for any purpose.

In addition, research scholars can qualify for bulk data access by agreeing to certain use and redistribution restrictions.

For now, the collection does not include the scanned images of the cases, although Harvard plans to add them eventually. Because of the sheer volume of scanned images (200 TB), Harvard requires time and resources to compress them into a format suitable for the web.

All of these cases are also available for free through the Ravel site. As part of the agreement between Harvard and Ravel, Ravel committed that it would provide free access to the cases. Its site also includes images of the cases.

Usage and Future Plans

Both Ziegler and Cushman said that they look forward to seeing how researchers use this data. For now, they’ve posted some examples in a gallery on the site:

  • H2O. H2O is a Library Innovation Lab project to enable law faculty to create open-licensed digital textbooks for free. The project uses the CAP API to source free case law.
  • Wordclouds. The project has created a set of wordclouds to show the most-used words in California cases from 1853 to 2015.
  • Limerick generator. A fun example that generates limericks from case law, with each line of the limerick a complete phrase from a different case, but pulled together to match the limerick rhyming scheme.

This capstone site of the Caselaw Access Project also has a GitHub page where anyone who is interested can follow the project’s development. Included there is a list of projects that are in progress to enhance the site. These show that next steps are to add page images for the scanned cases, an ngram API and frontend viewer, and a graphical user interface for viewing cases.

Now that this historical case law collection is complete, a top priority for Ziegler and his team is to encourage courts to move away from print-first publishing of their cases. As noted above, only two states have adopted digital-first publishing. But Ziegler hopes he doesn’t have to do any more scanning and reiterates that the commercial restrictions on this data end once a jurisdiction goes digital.

“A project like this should be unnecessary,” he said. “But many states are still putting stuff in books first.”

Bob Ambrogi

Bob is a lawyer, veteran legal journalist, and award-winning blogger and podcaster. In 2011, he was named to the inaugural Fastcase 50, honoring “the law’s smartest, most courageous innovators, techies, visionaries and leaders.” Earlier in his career, he was editor-in-chief of several legal publications, including The National Law Journal, and editorial director of ALM’s Litigation Services Division.