A class action lawsuit was filed Tuesday against Google, its parent company Alphabet, and its artificial intelligence branch Google DeepMind for “secretly stealing everything ever created and shared on the internet by hundreds of millions of Americans,” according to the complaint.
The class action lawsuit was filed in the US District Court for the Northern District of California by the Clarkson Law Firm on behalf of eight anonymous plaintiffs from across the United States. One is a New York Times bestselling author whose work was used to train Google’s AI-powered chatbot Bard; another is an actor who posts educational material online and believes her work was used to train Google products that will one day make her obsolete. Two of the plaintiffs are minors, ages 6 and 13, whose guardians are concerned that their online activity is being tracked and harvested by Google, also for training purposes.
The lawsuit was in part triggered by a quiet update Google made to its privacy policy last week to make explicit that the company would be harvesting publicly available data to “build products and features” like Bard. That would include upcoming AI models that Google is developing, like Imagen, a text-to-image generative AI (similar to Midjourney); MusicLM, a text-to-music AI (Midjourney but for music); and Duet AI, an AI program to be embedded in Google Workspace apps to “aid” in drafting emails, preparing Slides presentations, and organizing meetings.
The plaintiffs, according to the complaint, took this privacy update as tacit admission that Google had been using this data all along for AI training purposes.
“All of the stolen information belonged to real people who shared it online for specific purposes, not one of which was to train large language models to profit Google while putting the world at peril with untested and volatile AI products,” Timothy K. Giordano, a partner at Clarkson Law, said in a statement to ARTnews. “‘Publicly available’ has never meant free to use for any purpose.”
The complaint further notes that this is all happening against a backdrop of Google employees, both former and current, repeatedly sounding the alarm on the dangers of AI technology and the speed at which it is being developed. The Federal Trade Commission has also begun warning companies about their web-scraping, which, according to the complaint, is what triggered Google’s new privacy policy in the first place.
“We’ve been clear for years that we use data from public sources—like information published to the open web and public datasets—to train the AI models behind services like Google Translate, responsibly and in line with our AI Principles,” Halimah DeLaine Prado, Google’s general counsel, said in an emailed statement. “American law supports using public information to create new beneficial uses, and we look forward to refuting these baseless claims.”
Meanwhile, companies like Twitter reacted to the news of Google’s new privacy policy by shifting their own standards of what is “publicly available” by limiting how many posts Twitter users can read a day in an effort to stymie web-scraping, Reuters reported earlier this month. It’s possible other websites will follow suit to protect the data and content of their users—whose information they may want to use for their own product development anyway.
This class action lawsuit differs from the many other lawsuits brought against companies like Google, OpenAI, and Meta, which have tended to focus on copyright violations. Brought by artists, coders, or authors like actress and memoirist Sarah Silverman, those class action cases have concentrated on IP theft of protected materials like original creative and scientific work. This case, however, has taken a different course, using a variety of charges to argue that web-scraping “everything,” from user activity data to original artwork to paywalled content, shouldn’t be possible.
The complaint alleges violation of California’s Unfair Competition Law, negligence, invasion of privacy under the California Constitution, unjust enrichment, direct and indirect copyright violations, and other charges.
The charges do not directly invoke laws around web-scraping, as those are virtually nonexistent in the US. Similarly, there is almost no regulation on what kind of data companies are allowed to mine when developing research or products, even after scandals like Cambridge Analytica, in which a political consulting firm gained access to 87 million Facebook users’ data under the guise of conducting research. Instead, Cambridge Analytica used that data to influence the 2016 US presidential election and other elections worldwide.
States like California have some “data minimization” regulation on the books to discourage the collection of personal data, but the line between what is private and what is public on the internet has long been murky, allowing companies to act boldly in their web-scraping activities. Unlike Europe and the UK, the US has not yet produced any specific regulations on what kind of data can be used in AI research.
Some scholars believe that focusing on copyright when tackling the twin phenomena of web-scraping and AI development is the wrong strategy, arguing that these issues should be viewed from a data governance perspective.
“The norm has been that data scraping is acceptable and that there should be a presumption for fair use when it concerns TDM [text and data mining] because not allowing that would hinder innovation,” said Mehtab Khan, a resident fellow at the Information Society Project at Yale Law School.
Khan is referring to the fair use doctrine in copyright law, which allows individuals (and by extension companies) to use protected, original material in specific cases, such as learning from pre-existing works. While fair use tends to protect teachers, students, and artists, Khan believes that, in the absence of clear regulations on web-scraping, companies assume that as long as they’re researching and developing technologies, they have more or less carte blanche to use “public” data, or anything and everything posted online.
Update July 12, 2023: This article has been updated to include a statement from Google’s general counsel.