Join top executives in San Francisco on July 11-12 to hear how leaders are integrating and optimizing AI investments for success.
Databricks says Dolly 2.0 is the first open-source, instruction-following LLM fine-tuned on a transparent and freely available dataset that is itself open-sourced for commercial use. That means Dolly 2.0 can be used in commercial applications without paying for API access or sharing data with third parties.
According to Databricks CEO Ali Ghodsi, while there are other LLMs out there that can be used for commercial purposes, “They won’t talk to you like Dolly 2.0.” And, he explained, users can modify and improve the training data because it is made freely available under an open-source license. “So you can make your own version of Dolly,” he said.
Databricks released the dataset Dolly 2.0 was trained on
In addition, Databricks said that as part of its ongoing commitment to open source, it is also releasing the dataset on which Dolly 2.0 was trained, called databricks-dolly-15k. This is a corpus of more than 15,000 records generated by thousands of Databricks employees, and Databricks says it is the “first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT.”
There has been a wave of instruction-following, ChatGPT-like LLM releases over the past two months that are considered open-source by many definitions (or offer some level of openness or gated access). One was Meta’s LLaMA, which in turn inspired others like Alpaca, Koala, Vicuna and Databricks’ Dolly 1.0.
Databricks, however, says it has found a way around the limits on commercial use that constrain these models: Dolly 2.0 is a 12-billion-parameter language model based on the open-source EleutherAI Pythia model family and fine-tuned exclusively on a small, open-source corpus of instruction records (databricks-dolly-15k) generated by Databricks employees. This dataset’s licensing terms allow it to be used, modified and extended for any purpose, including academic or commercial applications.
Models trained on ChatGPT output have, up until now, been in a legal gray area. “The whole community has been tiptoeing around this and everybody’s releasing these models, but none of them could be used commercially,” said Ghodsi. “So that’s why we’re super excited.”
Dolly 2.0 is small but mighty
A Databricks blog post emphasized that, like the original Dolly, the 2.0 version is not state-of-the-art, but “exhibits a surprisingly capable level of instruction-following behavior given the size of the training corpus.” The post adds that the level of effort and expense necessary to build powerful AI technologies is “orders of magnitude less than previously imagined.”
“Everyone else wants to go bigger, but we’re actually interested in smaller,” Ghodsi said of Dolly’s diminutive size. “Second, it’s high-quality. We looked over all the answers.”
Ghodsi added that he believes Dolly 2.0 will start a “snowball” effect, in which others in the AI community join in and come up with other alternatives. The limit on commercial use, he explained, was a big obstacle to overcome: “We’re excited now that we finally found a way around it. I promise you’re going to see people applying the 15,000 questions to every model that exists out there, and they’re going to see how many of these models suddenly become kind of magical, where you can interact with them.”
VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact.