JAKARTA - Some artificial intelligence (AI) companies are circumventing a common web standard that publishers use to block their content from being scraped for use in generative AI systems, content licensing startup TollBit has revealed.
In a letter to publishers on Friday, which did not name the AI companies or the publishers affected, TollBit said the issue comes amid a public dispute between AI search startup Perplexity and media outlet Forbes over the same web standard, and a broader debate between technology and media companies over the value of content in the generative AI era.
The business media publisher has publicly accused Perplexity of plagiarizing its investigative stories in AI-generated summaries without citing Forbes or asking its permission.
A Wired investigation published this week found that Perplexity likely bypassed efforts to block its web crawler via the Robots Exclusion Protocol, or "robots.txt," a widely accepted standard that determines which parts of a site may be crawled.
The News Media Alliance, a trade group representing more than 2,200 US-based publishers, voiced concern about the impact that ignoring "do not crawl" signals could have on its members. "Without the ability to opt out of massive scraping, we cannot monetize our valuable content and pay journalists. This could seriously damage our industry," said Danielle Coffey, the group's president.
TollBit, an early-stage startup, positions itself as an intermediary between AI companies hungry for content and publishers willing to strike licensing deals with them. The company tracks AI traffic to publishers' sites and uses analytics to help both sides set prices for the use of different types of content.
According to TollBit's letter, Perplexity is not the only offender that appears to be ignoring robots.txt. TollBit said its analytics show "many" AI agents bypassing the protocol.
The robots.txt protocol was created in the mid-1990s as a way to avoid overloading websites with web crawlers. While there is no clear legal enforcement mechanism, compliance has historically been broad across the web, and some groups - including the News Media Alliance - say publishers may still have legal remedies.
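Compliance with the protocol is voluntary and sits in the crawler's own code. A minimal sketch of how a well-behaved crawler is expected to honor robots.txt, using Python's standard-library parser (the rules string and bot names below are hypothetical examples, not taken from any real site):

```python
# Sketch of voluntary robots.txt compliance using Python's stdlib parser.
# "HypotheticalAIBot" and "SomeOtherBot" are illustrative placeholder names.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: HypotheticalAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The AI crawler named in the rules is blocked from the whole site...
print(parser.can_fetch("HypotheticalAIBot", "https://example.com/article"))  # False
# ...while other agents remain free to crawl it.
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

Nothing stops a crawler from simply skipping this check, which is exactly the behavior TollBit says its analytics detect.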
More recently, robots.txt has become a key tool that publishers use to block technology companies from taking their content for free for use in generative AI systems that can mimic human creativity and instantly summarize articles.
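In practice, that opt-out is just a plain-text file served at the site root. A hypothetical robots.txt that blocks an AI scraper while leaving ordinary crawlers alone might look like this (the bot name is an illustrative placeholder, not a real user-agent string):

```
# Hypothetical robots.txt served at https://example.com/robots.txt

# Block an AI training crawler from the entire site
User-agent: ExampleAIBot
Disallow: /

# Allow all other crawlers everywhere
User-agent: *
Disallow:
```

An empty Disallow line means "nothing is disallowed," so the wildcard group leaves the site open to everyone else.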
Several publishers, including the New York Times, have sued AI companies for copyright infringement related to such use. Others have signed licensing agreements with AI companies willing to pay for content, although the two sides often disagree on the value of the material. Many AI developers argue that they break no law by accessing content for free.
Thomson Reuters, owner of Reuters News, is one that has struck a deal to license news content for use by AI models.
Publishers have been raising concerns about news summaries since Google launched a product last year that uses AI to create summaries in response to some search queries. If publishers want to prevent their content from being used by Google's AI to help generate those summaries, they must use the same tool that also prevents their content from appearing in Google search results, rendering them virtually invisible on the web.