
Anthropic faces £600 billion lawsuit as AI industry's copyright foundations crumble

A landmark class action reveals how artificial intelligence companies built their empires on legally questionable ground, threatening to reshape an entire industry

The artificial intelligence industry is confronting an existential threat disguised as a copyright lawsuit. Last month, a federal judge certified the largest intellectual property class action in AI history against Anthropic, maker of the Claude chatbot. The potential damages exceed £600 billion—enough to obliterate the company and send shockwaves through Silicon Valley.

This isn't corporate hyperbole. Court documents reveal Anthropic systematically downloaded seven million copyrighted books from notorious piracy sites to train its AI models. Under US law, each violation carries penalties up to $150,000. The arithmetic is brutal: millions of works multiplied by maximum statutory damages equals an industry-ending sum.

But Anthropic's predicament illuminates a deeper crisis. The entire AI boom has been built on data acquisition practices that courts are now rejecting as wholesale copyright theft.

How piracy became AI's foundation

The Anthropic case exposes the uncomfortable reality of how modern AI systems are trained. To build Claude, the company didn't just purchase books or negotiate licensing deals. Instead, it systematically harvested five million books from Library Genesis and two million from Pirate Library Mirror—shadow libraries that distribute copyrighted works without permission.

Anthropic's internal communications, revealed during litigation, show the company sought to create a permanent "research library" containing "all the books in the world." This wasn't casual infringement but systematic appropriation on an unprecedented scale.

Senior District Judge William Alsup's split ruling crystallises the legal danger facing the industry. In June, he found that training AI on legitimately purchased books constitutes fair use—a transformative application akin to human learning. But downloading millions of pirated works to build a corporate library? That crossed into straightforward copyright theft.

The distinction matters enormously. Whilst the fair use ruling offers some protection for companies using legitimate data sources, it simultaneously exposes how current AI systems depend on legally indefensible practices.

This legal vulnerability extends far beyond Anthropic. OpenAI reportedly transcribed over a million hours of YouTube videos without permission. Meta faces accusations of training LLaMA models on pirated content. Google's data practices have triggered lawsuits from publishers worldwide.

The economics of digital piracy at scale

The industry's reliance on questionable data sources isn't accidental—it's economically rational under current incentives. Training a cutting-edge AI model now costs between $30 million and $191 million before accounting for research staff, according to Epoch AI. These costs are growing roughly 2.4-fold each year, putting frontier development beyond the reach of all but the largest companies.

Legitimate data licensing adds another crushing burden. Shutterstock commands $25-50 million for its image libraries. Reddit extracts hundreds of millions from tech giants desperate for user-generated content. Academic publishers charge tens of millions for research paper access.

The arithmetic is stark: pirated data costs nothing, whilst legitimate licensing can consume hundreds of millions in additional expenses. For companies already spending fortunes on computing infrastructure, the temptation to cut corners on data acquisition has proven irresistible.

This creates perverse market dynamics. As AI researcher Sarah Lo observes, "Smaller players won't be able to afford these data licenses, and therefore won't be able to develop or study AI models." Only tech giants with vast resources can afford both legitimate licensing and computational requirements for frontier development.

Academic institutions, startups, and open-source initiatives find themselves systematically excluded from advanced AI research—not through lack of talent, but through inability to pay licensing fees that can exceed university research budgets.

Fair use meets industrial-scale copying

AI companies defend their practices by invoking fair use doctrine, pointing to precedents like Google Books. In that landmark case, courts ruled that digitising millions of books for search purposes constituted transformative use because it enabled new capabilities rather than simply reproducing content.

But generative AI presents fundamentally different challenges. Unlike search engines that organise and reference existing material, AI models produce outputs that can directly compete with training data. When Claude writes poetry or OpenAI's models generate news articles, they're not transforming source material—they're potentially displacing it.

The US Copyright Office's recent analysis sharpens this distinction. When AI models generate content that "shares the purpose of appealing to a particular audience" with the original works, the Office concludes, the use is "at best, modestly transformative."

Courts are beginning to reflect this nuanced understanding. Whilst Judge Alsup protected AI training on legitimately acquired content, he explicitly rejected Anthropic's argument that pirated materials should receive the same protection. The company's systematic downloading from illegal sources, he ruled, cannot be excused by subsequent transformative use.

International fragmentation enables regulatory arbitrage

Global attempts to address AI copyright have produced a confusing patchwork of conflicting rules, one that companies can exploit rather than a framework that meaningfully protects creators.

The European Union's AI Act takes the most aggressive stance, requiring companies to respect copyright holders' "opt-out" rights regardless of where AI training occurs. Rights holders can use machine-readable signals like robots.txt files to prohibit their content from AI training datasets.
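What such a machine-readable signal looks like in practice is still unsettled. Below is a minimal sketch, using Python's standard library, of how a data-collection pipeline might check robots.txt-style directives before fetching a page; the AI crawler user-agent names and the URLs are illustrative assumptions, not requirements of the AI Act.

```python
from urllib import robotparser

# User-agent tokens that some AI crawlers publish; treated here as
# illustrative, since no standard list of agents exists.
AI_CRAWLER_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]

def training_opt_out_status(site: str, page: str) -> dict[str, bool]:
    """Return, per AI crawler agent, whether robots.txt allows fetching the page.

    A False value is the closest robots.txt comes to an opt-out signal.
    """
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{site}/robots.txt")
    parser.read()  # download and parse the site's robots.txt
    return {agent: parser.can_fetch(agent, page) for agent in AI_CRAWLER_AGENTS}

if __name__ == "__main__":
    # Hypothetical publisher URL, purely for illustration.
    print(training_opt_out_status("https://example.com",
                                  "https://example.com/articles/some-story"))
```

Even this simple check exposes the gap the rules leave open: robots.txt governs crawling, not training, and nothing in the file itself prevents a developer from using material that has already been collected.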

In theory, this extraterritorial approach prevents companies from simply relocating training operations to more permissive jurisdictions. In practice, implementation remains chaotic. A recent German court case highlighted the confusion: whilst judges suggested website terms and conditions might constitute valid opt-outs, no standardised approach exists for signalling preferences or detecting violations.

The UK initially proposed broader exemptions for AI training but reversed course last December, announcing consultations on an EU-style system. However, British officials acknowledge that practical opt-out mechanisms remain underdeveloped and that "uncertainty about how it works in practice" persists.

This regulatory fragmentation enables sophisticated arbitrage. Companies can potentially train models in jurisdictions with weak copyright protections, then distribute results globally. The technical impossibility of "untraining" models on specific datasets means enforcement relies on after-the-fact detection—challenging given AI systems' opacity.

More fundamentally, opt-out mechanisms place enforcement burdens on individual creators rather than creating systemic incentives for legitimate data acquisition.

The reckoning arrives

The Anthropic class action represents more than legal jeopardy for one company—it's a stress test for the entire industry's business model. If courts consistently rule against AI companies using pirated content, and if damage awards prove substantial enough to threaten viability, the current boom could face systematic restructuring.

Trade groups are panicking accordingly. The Chamber of Progress and Computer & Communications Industry Association filed emergency appeals warning that class certification could "financially ruin" the sector. Their language reveals the stakes: "the technology industry cannot withstand such devastating litigation" as "generative AI begins to shape the trajectory of the global economy."

This panic reflects uncomfortable awareness that most major AI companies have followed similar practices. The Anthropic precedent could unleash a cascade of class actions targeting every major player.

Legal scholar Ed Lee calculates that even modest scenarios—100,000 works at standard damage ranges—could generate $1-3 billion in liability. With evidence suggesting millions of infringed works, upper bounds become astronomical.
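The scaling behind those figures is simple multiplication. A back-of-the-envelope sketch follows, assuming per-work awards of $10,000 to $30,000 for the "modest" scenario (an assumption consistent with Lee's $1-3 billion range) and using the $150,000 statutory maximum cited earlier for the upper bound.

```python
def statutory_exposure(works: int, per_work_award: int) -> int:
    """Total statutory damages: infringed works multiplied by the award per work."""
    return works * per_work_award

# "Modest" scenario: 100,000 works at assumed awards of $10,000-$30,000 each.
modest_low = statutory_exposure(100_000, 10_000)      # $1.0 billion
modest_high = statutory_exposure(100_000, 30_000)     # $3.0 billion

# Upper bound: seven million pirated books at the $150,000 statutory maximum.
worst_case = statutory_exposure(7_000_000, 150_000)   # $1.05 trillion

print(f"Modest scenario: ${modest_low:,} to ${modest_high:,}")
print(f"Upper bound:     ${worst_case:,}")
```

Even these rough numbers show why the upper bound comfortably exceeds the £600 billion cited at the top of this piece.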

Forced evolution or managed decline

Several potential paths forward are emerging, though none offer easy solutions for an industry built on questionable foundations.

Collective licensing schemes could provide scalable alternatives to individual negotiations. The Authors Guild is exploring collective bargaining approaches that would standardise terms and distribute proceeds among creators, reducing transaction costs whilst ensuring compensation.

Compulsory licensing represents another option, requiring predetermined royalties for copyrighted works in AI training. However, the Copyright Office notes concerns about market distortion and rate-setting difficulties.

Technical solutions like "machine unlearning"—removing specific works' influence from trained models—remain experimental. Even if perfected, such approaches might not satisfy rights holders demanding preventative rather than corrective measures.

The most immediate requirement is transparency. The EU AI Act mandates detailed training data documentation, enabling rights holders to identify violations and negotiate licensing. Similar requirements under consideration elsewhere could force systematic disclosure of current practices.

Without such reforms, the industry faces increasingly stark choices. If courts continue ruling against the use of pirated content and damage awards prove substantial, companies could be forced into dramatic reductions in training data or comprehensive licensing programmes that only tech giants could afford.

The end of AI exceptionalism

The Anthropic trial, scheduled for December, will test whether AI companies can continue treating copyright as an optional constraint. Industry observers describe potential outcomes as "business-ending," but implications extend far beyond one firm's fate.

Early signs suggest the era of AI exceptionalism is ending. Companies like Adobe demonstrate that models can be trained exclusively on licensed content, though at smaller scales than industry leaders. OpenAI has struck licensing deals with The Financial Times and The Atlantic, proving negotiated arrangements are possible.

But these remain exceptions. For an industry that has grown accustomed to treating the internet as its private training ground, adapting to copyright constraints represents a fundamental shift.

The transformation is already forcing difficult questions. Can AI capabilities advance without systematic copyright infringement? Will legitimate licensing costs create innovation oligopolies? How do societies balance creator rights with technological development?

The Anthropic case won't answer these questions definitively, but it will establish whether current AI business models can survive legal scrutiny. For an industry built on moving fast and breaking things, that reckoning may prove more disruptive than any algorithm.
