Databricks has released the Dolly 2.0 instruction-following large language model (LLM) that is fine-tuned on a human-generated instruction dataset licensed for commercial use.
Dolly 2.0 is the second iteration of the Dolly model introduced in March. It is a 12B-parameter language model based on the EleutherAI Pythia model family and fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset crowdsourced among Databricks employees.
All of Dolly 2.0 is open-sourced, including the training code, the dataset and the model weights, and all of it is suitable for commercial use. This allows any organisation to create, own and customise powerful LLMs that can talk to people, without paying for API access or sharing data with third parties.
“Dolly 2.0 is a game changer as it enables all organisations around the world to build their own bespoke models for their particular use cases to automate things and make processes much more productive in the field they’re in,” said Ali Ghodsi, CEO of Databricks.
databricks-dolly-15k contains 15,000 high-quality, human-generated prompt/response pairs specifically designed for instruction tuning of large language models. Under the licensing terms for databricks-dolly-15k (the Creative Commons Attribution-ShareAlike 3.0 Unported License), anyone can use, modify or extend this dataset for any purpose, including commercial applications.
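To make the shape of these prompt/response pairs concrete, here is a minimal sketch of one such record and how it might be flattened into a training string. The field names (`instruction`, `context`, `response`, `category`) are assumed from the dataset's published card, and the flattening template is purely illustrative, not Dolly's actual prompt format:

```python
import json

# A hypothetical record in the style of databricks-dolly-15k
# (field names assumed from the dataset card).
record = {
    "instruction": "Summarise the paragraph below in one sentence.",
    "context": "Databricks released Dolly 2.0, a 12B-parameter "
               "instruction-following model fine-tuned on a "
               "human-generated dataset licensed for commercial use.",
    "response": "Dolly 2.0 is an open, commercially usable "
                "instruction-following LLM from Databricks.",
    "category": "summarization",
}

def to_training_text(rec: dict) -> str:
    """Flatten a record into a single prompt/response training string.
    Illustrative format only, not Dolly's exact template."""
    parts = [f"Instruction: {rec['instruction']}"]
    if rec.get("context"):
        parts.append(f"Context: {rec['context']}")
    parts.append(f"Response: {rec['response']}")
    return "\n".join(parts)

# The dataset itself is distributed as JSON lines; one record per line.
line = json.dumps(record)
print(to_training_text(record))
```

Records without a `context` field (e.g. open-ended brainstorming prompts) would simply omit that section of the flattened string.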
This dataset was created to address the limitations of existing well-known instruction-following models that prohibit commercial use due to their training data. It is the world’s first open-source, human-generated instruction dataset specifically designed to make large language models exhibit the magical interactivity of ChatGPT.
databricks-dolly-15k was authored by more than 5,000 Databricks employees in March and April 2023. These training records are natural, expressive and designed to represent a wide range of behaviours, from brainstorming and content generation to information extraction and summarisation.
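Assuming each record carries a behaviour label in a `category` field (as suggested by the dataset card), the range of behaviours mentioned above can be separated with a simple filter. The records below are toy stand-ins, not actual dataset entries:

```python
from collections import Counter

# Toy stand-ins for databricks-dolly-15k records; the "category"
# field name is an assumption based on the dataset card.
records = [
    {"instruction": "List gift ideas for a gardener.", "category": "brainstorming"},
    {"instruction": "Extract the dates from this memo.", "category": "information_extraction"},
    {"instruction": "Summarise this article.", "category": "summarization"},
    {"instruction": "Summarise the meeting notes.", "category": "summarization"},
]

def by_category(recs, category):
    """Return only the records labelled with the given behaviour category."""
    return [r for r in recs if r["category"] == category]

# Count how the toy records are distributed across behaviours.
print(Counter(r["category"] for r in records))
print(by_category(records, "summarization"))
```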