Reddit will start charging AI models learning from its extremely human archives
If you’re a business training a large language model (LLM) AI and want it to learn from the r/420NarutoConspiracy subreddit, you’ll soon have to pay for that.
Steve Huffman, co-founder and CEO of social news and discussion aggregator Reddit, told The New York Times recently that the site planned to charge businesses that access its API to pull its 18 years’ worth of content, generated mostly by humans. Details on the new terms are available in a subsequent announcement post on Reddit.
The API would still be free to developers working on bots and other Reddit tools, and to researchers working on academic or non-commercial projects. But simply mainlining Reddit’s conversations for AI training purposes will come with a price, and exact pricing should arrive in the coming weeks.
“The Reddit corpus of data is really valuable,” Huffman told the Times. “But we don’t need to give all of that value to some of the largest companies in the world for free.
“Crawling Reddit, generating value and not returning any of that value to our users is something we have a problem with. It’s a good time for us to tighten things up.”
Reddit’s comments and conversations have been a rich resource for training LLM AIs. ChatGPT and Google’s Bard cite Reddit data as one of their sources. In their analysis of just one subset (12 million images) of Stable Diffusion’s image generation dataset (2.3 billion images), Andy Baio and Simon Willison noted that “user-generated content platforms were a huge source for the image data.” An investigation into common data sources for many AIs published today by The Washington Post noted that “a compilation of text from links highly rated by Reddit users” is included in GPT-3.
While it plans to limit AI companies’ access to its data, Reddit said it intends to give developers and moderators better tools for working within their communities. Reddit’s iOS and Android apps will offer ways to quickly view a user’s history, update community rules, and better handle multiple mod queues.
Reddit’s shift on API access comes as the company is looking to go public in the second half of 2023, according to The Information. The company confidentially filed for an initial public offering in December 2021. It had hoped for a $15 billion valuation, according to Reuters, but has held off on its filing until market conditions, especially around tech companies, improve.
Reddit is partially owned by Advance Publications, which also owns Ars Technica parent Condé Nast.