AI companies extract billions whilst website operators bear crushing infrastructure costs
Silicon Valley's newest gold rush mines the work of the content creators who built the web, then leaves them the infrastructure bill
Every six hours, an invisible army descends upon the internet. These aren't hackers or criminals, but the crawling bots of artificial intelligence companies, systematically harvesting decades of human creativity and knowledge. They arrive in waves so intense they can generate 39,000 requests per minute against a single website—a digital blitzkrieg that has transformed the web from a collaborative commons into an extraction zone.
The numbers reveal an industry-scale heist hidden in plain sight. Meta's AI division alone accounts for more than half of all automated web traffic. OpenAI dominates real-time data fetching with 98% market share. Together, these companies and their competitors have reversed the internet's fundamental economic compact, extracting enormous value whilst imposing crippling infrastructure costs on the very communities they're systematically pillaging.
For website operators, the financial impact arrives as suddenly as a server crash. The Wikimedia Foundation watches helplessly as bandwidth costs surge 50% annually, driven not by human readers but by bots harvesting openly licensed images for AI training. Read the Docs, which provides free documentation hosting for software projects, faced monthly bills jumping from £1,200 to £4,800 before desperately blocking AI crawlers to avoid bankruptcy. Hosting provider Pantheon now charges £32 for every additional 10,000 visits above allocated limits—a tax on popularity that many smaller sites simply cannot afford.
The scale approaches economic warfare. Digital marketing firm DesignRush calculates that businesses wasted £190 billion in 2024 serving bot traffic that generates zero meaningful engagement. This represents history's largest wealth transfer from content creators to technology corporations, achieved not through negotiation or compensation, but through brute-force technical extraction that treats the web's creators as involuntary servants to artificial intelligence.
Industrial-scale digital colonialism
Traditional search engines operated like considerate houseguests—they knocked politely, took what they needed, and brought valuable visitors in return. AI crawlers behave like locusts. Dennis Schubert, who maintains servers for the Diaspora social network, describes their relentless appetite: "They don't just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not."
This systematic extraction has fundamentally altered internet traffic patterns. Bots now represent 80% of all web visits, meaning only one in five clicks comes from an actual human being. The Wikimedia Foundation discovered that whilst bots generate just 35% of page views, they consume 65% of computational resources—a disparity that reveals how AI companies have weaponised efficiency against the communities they exploit.
The technical sophistication makes resistance nearly futile. Modern AI crawlers render JavaScript, parse complex applications, and adapt to various content management systems with the precision of military reconnaissance. They access not just public websites but probe developer infrastructure, code repositories, and private documentation systems. Some rotate through different IP addresses like digital shapeshifters. Others impersonate human browsers so convincingly they slip past standard defences.
Perplexity AI exemplifies this technological deception. When Cloudflare investigated customer complaints about unauthorised access, they discovered Perplexity deploying crawlers disguised as ordinary Chrome browsers on Apple computers. These stealth bots systematically violated robots.txt files—the internet's equivalent of "No Trespassing" signs—using undeclared IP addresses to circumvent blocks specifically designed to keep them out. When confronted with evidence, Perplexity claimed their behaviour represented legitimate user-driven research rather than systematic harvesting.
Such technical sophistication serves a singular purpose: maximum extraction with minimum accountability. The robots.txt protocol, which has governed automated web access for three decades, has become meaningless as AI companies either ignore its directives entirely or deploy increasingly sophisticated evasion techniques. This represents more than rule-breaking—it constitutes a fundamental rejection of the collaborative norms that enabled the internet's growth.
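The protocol being ignored is, itself, trivially simple, which is precisely what makes its wholesale disregard so telling. A robots.txt file like the sketch below blocks the named crawlers from an entire site. The user-agent tokens shown are ones the vendors have published (GPTBot for OpenAI, ClaudeBot for Anthropic, CCBot for Common Crawl), though any such list dates quickly and should be checked against current vendor documentation:

```
# Request that known AI training crawlers stay out of the entire site.
# Tokens per vendor documentation at time of writing; the list goes
# stale quickly, and compliance is entirely voluntary.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```

The weakness is structural: robots.txt is a request, not an access control. Nothing enforces it, which is why operators have begun turning to barriers that impose real costs instead.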
The resistance: digital barriers and proof-of-work prophecies
Faced with this digital colonisation, website operators have begun deploying technologies that would have seemed dystopian just years ago. The most striking is Anubis, a system that forces every visitor to solve cryptographic puzzles before accessing content. Created by software engineer Xe Iaso after Amazon's crawler overwhelmed their Git server, Anubis treats every web request as potentially hostile until proven otherwise.
The system works with elegant brutality. Human visitors see a brief loading screen featuring an anime-styled jackal girl whilst their browser performs invisible computational work. For AI companies running industrial crawling operations, this translates into server farms spinning fans at maximum velocity, burning electricity and computational resources on a massive scale. The economic calculus shifts dramatically when each page view requires meaningful computational investment rather than trivial bandwidth consumption.
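The underlying mechanism is a classic client puzzle: the server issues a random challenge, the client must find a nonce whose hash clears a difficulty threshold, and the server verifies the answer with a single hash. Anubis's actual implementation runs SHA-256 proof-of-work in the visitor's browser and differs in detail; what follows is a minimal Python sketch of the idea, with the difficulty value chosen arbitrarily for illustration:

```python
import hashlib
import os

DIFFICULTY = 12  # leading zero bits required; real systems tune this dynamically


def meets_difficulty(digest: bytes, bits: int) -> bool:
    """Check that the digest has at least `bits` leading zero bits."""
    value = int.from_bytes(digest, "big")
    return value >> (len(digest) * 8 - bits) == 0


def solve(challenge: bytes, bits: int = DIFFICULTY) -> int:
    """Brute-force a nonce: the work the visitor's machine performs."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if meets_difficulty(digest, bits):
            return nonce
        nonce += 1


def verify(challenge: bytes, nonce: int, bits: int = DIFFICULTY) -> bool:
    """Cheap server-side check: one hash, regardless of client effort."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return meets_difficulty(digest, bits)


challenge = os.urandom(16)  # fresh random challenge per visitor
nonce = solve(challenge)
assert verify(challenge, nonce)
```

The asymmetry is the point: finding the nonce costs the client thousands of hash attempts on average, whilst verification costs the server exactly one, so the burden scales with request volume and lands hardest on whoever is making millions of requests.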
Anubis has spread through the open-source community like wildfire, adopted by major projects including UNESCO, the WINE compatibility layer, and GNOME desktop environment. SourceHut, a code hosting platform, initially deployed what its operator called "the nuclear option" before switching to alternative measures. The rapid adoption reveals widespread recognition that polite requests for restraint have failed comprehensively.
These defensive measures represent an architectural revolution. For the first time since the web's creation, significant portions of human knowledge are being placed behind computational barriers specifically designed to exclude automated access. This reverses decades of progress toward universal information accessibility, creating a digital arms race where access to knowledge requires constant proof of humanity.
The irony cuts deep. The same AI companies promoting artificial intelligence as democratising technology have forced the internet's creators to implement explicitly anti-democratic barriers that make information harder to access. Their industrial-scale harvesting has transformed openness from a virtue into a vulnerability, forcing communities to choose between accessibility and survival.
Economic asymmetry and the breakdown of reciprocity
The traditional web operated on elegant reciprocity. Search engines crawled content to build indexes, then directed users back to original sources, generating traffic and revenue for content creators. This created sustainable cycles where both crawlers and websites prospered from mutual exchange. AI companies have obliterated this compact, extracting content for proprietary commercial use whilst providing virtually nothing in return.
The mathematical brutality is undeniable. TollBit's research documents that Perplexity AI generates 369 scraping requests for every human visitor it refers back to source websites. Anthropic's Claude produces a staggering 8,692 crawling operations per human referral. OpenAI generates 179 scrapes for each person it sends back. These ratios reveal extraction economies so predatory they make historical colonial trading relationships appear generous by comparison.
Traditional search engines maintained roughly balanced exchanges—crawling content whilst driving substantial traffic back to source sites. AI companies have transformed this symbiosis into parasitism, consuming vast computational resources whilst contributing minimal value to the communities they systematically harvest. They extract the cumulative labour of millions of individuals and smaller organisations to build proprietary products that generate billions in venture capital investment and commercial revenue.
The impact cascades through the internet's economic foundation. Cloudflare reports that over one million websites chose to block AI crawlers entirely within months of the option becoming available. This mass defensive deployment signals widespread recognition that current extraction models are unsustainable for content creators whilst remaining extraordinarily profitable for AI companies that monetise harvested data without compensation or meaningful reciprocity.
Website operators now face impossible choices: permit unrestricted access and risk infrastructure collapse from overwhelming bot traffic, or implement blocking measures that reduce accessibility for legitimate human users. This represents complete market failure, where rational individual decisions by AI companies create collectively destructive outcomes for the internet's foundational communities.
The sophistication paradox
The most damning aspect of this extraction economy lies in its technical sophistication. AI companies possess extraordinary engineering capabilities that could, in principle, enable respectful and sustainable automated access. Modern crawlers represent significant computational achievements, capable of rendering complex applications and adapting to diverse content management systems with remarkable efficiency.
Instead of deploying these capabilities within collaborative frameworks, AI companies have chosen to operate as digital conquistadors, using technical excellence to maximise extraction whilst ignoring community preferences and sustainability concerns. They possess the resources to develop efficient, respectful crawling systems that could operate within established protocols whilst providing meaningful value to content creators.
The deliberate choice to operate extractively rather than collaboratively reveals the ideology driving AI development. These companies view the internet not as a shared commons requiring stewardship, but as an undeveloped resource awaiting industrial exploitation. Their technical prowess serves acquisition rather than cooperation, efficiency rather than reciprocity.
This represents a profound departure from the internet's founding philosophy of collaborative knowledge sharing. The web's creators anticipated automated access but assumed it would operate within frameworks of mutual benefit and respect for community preferences. AI companies have proven these assumptions catastrophically naive, using advanced technology to circumvent rather than honour the social contracts that enabled the internet's growth.
Regulatory awakening and commercial evolution
The fundamental asymmetries driving current conflicts make regulatory intervention increasingly inevitable. Self-regulation has failed comprehensively, with AI companies expanding extraction operations despite widespread complaints from infrastructure providers and content creators. The European Union's AI Act provides initial frameworks for addressing these issues, but enforcement remains limited and technical countermeasures often prove more effective than legal ones.
Xe Iaso, creator of the Anubis defence system, argues that "governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming." This sentiment reflects growing recognition that current extraction models represent market failure requiring direct intervention rather than purely technological solutions.
Several infrastructure companies are pioneering commercial alternatives to the current free-for-all. Cloudflare's "pay-per-crawl" system would allow website operators to charge AI companies for access to their content. TollBit operates as an intermediary helping publishers monetise content for AI training purposes. These initiatives acknowledge that free extraction models are unsustainable and must evolve toward explicit economic relationships with transparent compensation mechanisms.
However, technical solutions alone cannot address the power imbalances enabling current extraction economies. AI companies possess vast computational resources and engineering expertise that allow them to circumvent most defensive measures given sufficient motivation. Only regulatory frameworks with meaningful enforcement mechanisms can establish sustainable boundaries for automated content extraction whilst preserving the internet's collaborative character.
Digital futures and the commons under siege
This conflict represents a critical juncture for internet development. The collaborative model that enabled the web's extraordinary growth—where individuals and organisations freely shared knowledge in exchange for broad accessibility and mutual benefit—faces existential threat from corporations that extract value whilst contributing little meaningful reciprocity.
The deployment of defensive technologies like proof-of-work challenges signals a retreat from openness toward a more fortified internet where accessing information requires constantly proving one's humanity. This represents a profound philosophical shift from the web's founding principles of universal accessibility and free information exchange, driven not by security concerns but by economic exploitation.
Yet sustainable alternatives remain achievable. Several AI companies have established formal partnerships with content publishers, recognising that long-term extraction requires explicit agreements rather than unilateral harvesting. These arrangements could serve as templates for broader industry transformation if implemented with appropriate transparency, compensation mechanisms, and respect for community preferences.
The ultimate resolution will likely require combining technical standards, commercial frameworks, and regulatory oversight that restore balance between content creators and AI companies. The current trajectory—toward an increasingly militarised internet where access requires constant authentication and human knowledge becomes commoditised for artificial intelligence training—benefits only the largest technology corporations whilst impoverishing the communities that created the web's extraordinary wealth of information.
The next few years will determine whether the internet remains a collaborative commons or transforms permanently into a commercial extraction zone. The communities deploying defensive technologies today fight not merely for their own resources, but for the fundamental character of digital civilisation. Their success or failure will determine whether future generations inherit an open internet built on collaboration and mutual benefit, or one permanently divided between those who extract value and those who bear the costs of its creation.