When playing defense against AI doesn't work
With AI scraping content and ignoring paywalls, publishers seem to face a losing game no matter what they choose.
Now that we’re three years into the generative AI era, you’d think that there would be an emerging consensus between the media industry and AI companies, but, in some ways, the gap between the two has never been wider. Recent reporting suggests the two sides are talking past each other with regard to how AI systems deal with publisher paywalls and archives. How can you create a content strategy when the key players can’t even agree on the rules?
I dive into that question in a minute, but first I’m excited to announce the launch of the very first Media Copilot audience survey. As I alluded to in my two-year anniversary post, we have some big plans coming with respect to content, and we’d like to better understand what you, dear reader, would like to see from The Media Copilot.
It would mean a lot if you could take two minutes to answer the survey. Sharing your email is optional, but if you do you’ll be entitled to a complimentary 30-day paid subscription to the newsletter, granting access to the entire archive.
Finally, if you’re interested in sponsoring the newsletter or podcast, we still have inventory for Q1 available. To get your brand in front of our 12,000-strong audience of media and communications professionals, reach out at team@mediacopilot.ai and our team will respond quickly.
A MESSAGE FROM TOLLBIT
AI is already scraping your content, often without permission or payment.
As AI systems increasingly rely on website content to power their products, the publishing industry faces a challenge: how to control and capture value from this growing wave of autonomous traffic.
TollBit is the platform built for publisher sites to monitor, manage, and monetize AI traffic — turning automated scraping into a potential new revenue stream.
Monitor with TollBit Analytics: Gain comprehensive insights into AI bot and agent traffic. Identify AI bot interactions and understand exactly how your content is being accessed.
Manage with Bot Paywall: Enhance access control and prevent unwanted scraping through active enforcement. Redirect unauthorized traffic to a clear, machine-readable paywall.
Monetize: Set custom rates for your content to ensure fair compensation from AI traffic while maintaining full control over your online assets.
TollBit works seamlessly with your existing CDN and tech stack and is free for publisher sites to use.
Join a growing network of 5,000+ publisher sites — including TIME, AP, ADWEEK, TNL Mediagene, and more — already shaping the future of AI-content access.
👉 Take control of your AI traffic — visit TollBit.com to learn more.

Can publishers make AI’s hunger for content work for them?
If you’re in the business of publishing content online, figuring out how to respond to AI has been a real challenge. Obviously, you can’t ignore it; large language models (LLMs) and AI search engines are here, and they ingest your content and summarize it for their users, killing valuable traffic to your site. Plenty of data supports this.
Creating a content strategy that accounts for this changing reality is complex to begin with. You need to decide what content to expose to AI systems, what to block from them, and how both of those activities can serve your business.
That would be hard even if there were clear rules that everyone was operating under. But that is far from a given in the AI world. A topic I’ve revisited more than once is how tech and media view some aspects of the ecosystem differently (most notably, user agents), leading to new industry alliances, myriad lawsuits, and several angry blog posts. But even accounting for that, a pair of recent reports suggest the two sides are even further apart than you might think.
Common Crawl and the tension around scraped content
Common Crawl is a vast trove of internet data that many AI systems use for training. It was a fundamental part of GPT-3.5, the model that powered ChatGPT when it was released to the world back in 2022, and many other LLMs are also based on it. Over the past three years, however, the issue of copyright and training data has become a major source of controversy, and several publishers have requested that Common Crawl delete their content from its archive to prevent AI models from training on it.
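For publishers who want to stop future crawls (as opposed to removing content already in the archive), the standard first step is a robots.txt directive aimed at Common Crawl's crawler, which identifies itself as CCBot. A minimal sketch — the GPTBot and Google-Extended entries for OpenAI's and Google's AI crawlers are shown as common companions, and any real deployment should reflect your own access policy:

```
# Opt out of future Common Crawl crawls
User-agent: CCBot
Disallow: /

# Often blocked alongside it: dedicated AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Note that robots.txt only governs future visits by well-behaved crawlers; it does nothing about content that has already been archived, which is exactly why publishers have resorted to the deletion requests described above.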
A report from The Atlantic suggests that Common Crawl hasn’t complied, keeping the content in the archive while making it invisible to its online search tool—meaning any spot checks would come up empty. Common Crawl’s executive director, Rich Skrenta, told the publication that it complies with removal requests, and he later responded to the report with a detailed explanation that pointed out the complicated reality of deleting content from what is supposed to be an “immutable” archive. That said, he also appears to support the point of view that anything online should be fair game for training LLMs, telling The Atlantic’s reporter, “You shouldn’t have put your content on the internet if you didn’t want it to be on the internet.”




