But there is a problem. AI companies have scoured the internet for training data, and many websites and dataset owners have begun to limit scraping of their sites. We’ve also seen backlash against the AI industry’s practice of indiscriminately collecting online data, in the form of users opting their data out of training and lawsuits from artists, writers, and the New York Times claiming that AI companies took their intellectual property without consent or compensation.
Last week, three major record labels – Sony Music, Warner Music Group and Universal Music Group – announced that they were suing the AI music companies Suno and Udio for alleged copyright infringement. The record labels claim that the companies used copyrighted music in their training data “on an almost unimaginable scale”, allowing AI models to generate songs that “mimic the qualities of genuine human recordings”. My colleague James O’Donnell breaks down the lawsuits in his story and points out that they could define the future of AI music. Read it here.
But this moment also sets an interesting precedent for all of generative AI development. Because high-quality training data is scarce, and because the pressure and demand to build ever bigger and better models is enormous, we’re in a rare moment where data owners actually have some leverage. The music industry’s lawsuit sends the loudest message yet: high-quality training data is not free.
It will likely take at least a few years before we get legal clarity on copyright law, fair use, and AI training data. But the cases are already prompting changes. OpenAI has entered into agreements with news publishers such as Politico, The Atlantic, Time, the Financial Times, and more, trading money and citations for access to the publishers’ news content. And YouTube announced in late June that it would offer licensing deals to major record labels in exchange for music to train its AI models.
These changes are a mixed bag. On the one hand, I worry that news publishers are making a Faustian bargain with AI. For example, most of the outlets that have struck deals with OpenAI say the agreements require OpenAI to cite its sources. But language models are not built to be factual and are very good at making things up. Reports have shown that ChatGPT and the AI-powered search engine Perplexity frequently hallucinate citations, which casts doubt on OpenAI’s ability to deliver on that promise.
It’s a tough moment for AI companies, too. This shift could push them toward developing smaller, more efficient models, which are far less polluting. Or they might spend a fortune to access data at the scale they need to build the next big thing. Only companies that are flush with cash and/or hold large existing datasets of their own (like Meta, with two decades of social-media data) can afford to do that. So the latest developments risk concentrating power even further in the hands of the biggest players.
On the other hand, the idea of introducing consent into this process is a good one — not just for rights holders, who can benefit from the AI boom, but for all of us. We should all have the agency to decide how our data is used, and a fairer data economy would mean we could all benefit.
Deeper Learning
How AI video games can help unlock the mysteries of the human mind