How "available" is the public data that feeds AIs?


Apr 12, 2024

Artificial Intelligence

You've probably wondered where Artificial Intelligence (AI) companies get all the information they use to train their models. The answer they give is: "It's publicly available on the internet." But what does that really mean? Let's get to the bottom of the "copyright vs. public AI data" conundrum.

A confusing euphemism?

When AI companies say they use publicly available data, it can sound as if they have permission to use that information. In many cases, however, it's closer to saying "finders keepers." According to some experts, the phrase is designed to confuse people.

Ed Newton-Rex (former Stability AI developer) explains it this way: "'Publicly available' doesn't mean that someone has given permission to use it to train an AI system. Essentially, all they're saying is: We haven't illegally hacked into a system."

The gray area of copyright

At first glance, "publicly available" sounds similar to "public domain," which refers to material that is no longer protected by copyright or has been made freely available. In reality, much publicly available content is still subject to various protections, including copyright.

In fact, there have been cases in which AI companies used "pirated" content from websites known to distribute material without the creators' permission. One copyright lawyer warns: "The receipt and subsequent improper commercial use of stolen property will not look good in front of a jury."

The hunt for quality data

As AI companies look to improve their models, the need for high-quality training data becomes crucial. Some companies are exploring the use of YouTube transcripts or even synthetic data generated by the AI itself.

However, this hunt for quality data also raises privacy questions. Information that is publicly available but buried in obscure corners of the internet could circulate far more widely through an AI chatbot trained on that data.

The defense of AI companies

AI companies have two main legal arguments. First, they claim that their use of copyrighted material is covered by the "fair use" doctrine. Second, they argue that copyright is not an issue in AI training, since the systems do not copy the material, but "learn" from it, just as a human would.

However, these companies are often reluctant to disclose exactly what publicly available data they are using, describing it as a competitive trade secret.

So, copyright vs. public AI data

In short, the phrase "publicly available" can be a confusing euphemism that masks questionable practices in collecting data to train AI models. As technology companies seek quality sources of information, it is crucial that they do so with respect for copyright and privacy. The future of AI will largely depend on how this issue is addressed in the coming years.

Published in Artificial Intelligence

Sergio Silva
