OpenAI Introduces GPTBot Web Crawler to Index Websites
OpenAI has released a web crawler named GPTBot that can index website content, though with an opt-out option for website owners. GPTBot was released without any announcement, just new documentation about how to use and identify it.
Like other web crawlers, GPTBot gathers data from the internet, looking for information that might be useful for OpenAI’s large language models. OpenAI presumably uses that information to keep GPT-4 and its other LLMs, such as the rumored GPT-5 or open-source G3PO, up to date and to train them to perform better over time. It’s a way for OpenAI to maintain more control over its training data than it would have relying on databases compiled by third parties. Supposedly the crawler will filter out any sources that sit behind a paywall or that violate OpenAI’s policies and privacy rules. Those who allow GPTBot access to their site could help improve how well generative AI performs, but OpenAI also gives web administrators the option to forbid GPTBot from accessing their site, or to limit it to certain directories, by adjusting their robots.txt file.
“Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies,” OpenAI’s documentation explains. “Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.”
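In practice, opting out works through standard robots.txt directives. The following is a minimal sketch consistent with OpenAI’s documentation; the directory paths are illustrative placeholders, not paths OpenAI prescribes:

```
# Block GPTBot from crawling the entire site
User-agent: GPTBot
Disallow: /

# Alternatively, permit only certain directories
# (example paths are placeholders)
User-agent: GPTBot
Allow: /public-content/
Disallow: /private-content/
```

Because these rules target the GPTBot user agent specifically, they leave crawling by search engines such as Googlebot unaffected.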
Of course, any discussion of AI model training raises ethical and legal questions, which the relatively transparent nature of GPTBot might mitigate. Whether there’s any active benefit for website owners is debatable, however. While allowing search engine crawlers helps websites by making them easier to find, LLMs don’t necessarily cite their sources or provide links. So a website might make ChatGPT better at answering questions without garnering any traffic as a result. As Voicebot and Synthedia founder Bret Kinsella points out, the options make OpenAI seem much more like a “good digital citizen,” which is becoming more important as regulators around the world scrutinize the company and the generative AI industry as a whole.
“The new option could also blunt OpenAI’s risk from complaints about using proprietary information without consent. Granted, legal disputes around intellectual property issues and generative AI models are in their infancy,” Kinsella said. “This move will not offer OpenAI blanket protection against claims of unauthorized use. It will show a standard of care based on an implicit opt-in and optional opt-out. It also suggests that OpenAI may defend its use of data published on the web as similar to a search engine’s use which is settled law.”
And while ChatGPT doesn’t send traffic to websites, it has a huge reach, and website publishers might want their information in the AI model regardless. Plus, as Perplexity and other LLM-powered tools show, links and citations are a possible feature for conversational generative AI chatbots, and they could come to ChatGPT in the future.