Want to be as smart as Google’s T5 or Facebook’s LLaMA? Well then, you should keep reading this blog, as it was used to help train them.

Despite all the attention being paid to the current generation of AI built on large language models, such as ChatGPT, most of us know little about the text used to train them.

Now, The Washington Post has lifted the cover off this black box. Working with the Allen Institute for AI, it analyzed Google’s C4 data set, “a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs,” including Google’s T5 and Facebook’s LLaMA.

It then categorized all of those websites (journalism, entertainment, etc.) and ranked them based on how many “tokens” each site contributed to the data set, with tokens being the small bits of text used to process the disorganized information.

In addition to analyzing all these sites, it created a searchable database of all the websites in Google’s dataset. As it turns out, this blog is one of them.

The LawSites blog ranked 63,769th among all sites in the dataset, providing 290,000 tokens, or 0.0002% of all tokens in the dataset.
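For a sense of scale, those two figures can be turned around to estimate how big the whole dataset is. The short Python sketch below is only a back-of-the-envelope check: the function name is my own, and because the 0.0002% figure is already rounded, the result is a rough estimate rather than an official count.

```python
def implied_total_tokens(site_tokens: int, share_percent: float) -> float:
    """Back out the dataset's total token count from one site's
    token count and its percentage share of the whole."""
    return site_tokens / (share_percent / 100)


# Both figures come from the post; the percentage is already rounded.
lawsites_tokens = 290_000
lawsites_share_percent = 0.0002

total = implied_total_tokens(lawsites_tokens, lawsites_share_percent)
print(f"Implied total: about {total / 1e9:.0f} billion tokens")
```

Run as written, this suggests a dataset on the order of 145 billion tokens, though the rounding in the percentage means the true total could sit well above or below that.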

Of course, LawSites was hardly the only law-related site included in the data. Based on searches for words such as law, legal, court and case, I found some of the other legal sites that were used. Here is a sampling, listed by their ranks:

(After publishing this post, it was pointed out to me that the data is broken down by subdomain. So, for example, at least three of the entries came from the same source, Justia. I added Justia’s patents and Supreme Court subdomains above. That would mean that, cumulatively, Justia contributed 92 million tokens, which would appear to make it the fifth largest source of data, just after the New York Times.)
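If you want to roll subdomains up into a single source yourself, the idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the hostnames and token counts are placeholders, not figures from the database, and the rule of keeping the last two parts of the hostname is a naive shortcut that would mishandle addresses ending in something like .co.uk.

```python
from collections import defaultdict


def parent_domain(host: str) -> str:
    """Naive rule: keep the last two labels of the hostname,
    so 'a.example.com' becomes 'example.com'."""
    return ".".join(host.lower().split(".")[-2:])


def tokens_by_parent(counts: dict[str, int]) -> dict[str, int]:
    """Sum token counts across all subdomains of the same parent domain."""
    totals: defaultdict[str, int] = defaultdict(int)
    for host, tokens in counts.items():
        totals[parent_domain(host)] += tokens
    return dict(totals)


# Placeholder hostnames and counts, NOT figures from the database.
sample = {
    "patents.example.com": 300,
    "supreme.example.com": 200,
    "law.example.com": 100,
    "blog.example.org": 50,
}
print(tokens_by_parent(sample))
```

With real numbers pulled from the searchable database, the same grouping would show how a source that looks modest subdomain by subdomain can climb the rankings once its pieces are added together.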

You can go in and search for your favorite legal sites and see where they rank. But, clearly, the bottom line is that you should keep reading this blog.

Bob Ambrogi

Bob is a lawyer, veteran legal journalist, and award-winning blogger and podcaster. In 2011, he was named to the inaugural Fastcase 50, honoring “the law’s smartest, most courageous innovators, techies, visionaries and leaders.” Earlier in his career, he was editor-in-chief of several legal publications, including The National Law Journal, and editorial director of ALM’s Litigation Services Division.