Cleanlab Raises $25M to Wipe Out Generative AI Hallucinations
Enterprise AI data startup Cleanlab has raised $25 million in a Series A funding round led by Menlo Ventures and TQ Ventures. Cleanlab uses automated data curation to spot and fix issues in datasets, including the data behind the large language models used for generative AI functions like creating text and images.
Cleaning AI Models
Cleanlab offers clients a platform for combing through data used for analytics, large language models, and other AI applications. Cleanlab’s technology is tuned to identify and fix issues in disorganized real-world data, boosting the reliability and utility of any applications derived from that data with far less need for expensive manual data cleansing. Recent additions to its features include checking unreliable generative AI model outputs with its Trustworthy Language Model, which scores results based on reliability. You can see in the image above how the model spots a picture labeled as a rooster that should be relabeled as a house finch instead. The tech extends to combing through text data for problems that may lead to inaccurate responses from a generative AI chatbot or bizarre and impossible details in an image generated through text-to-image engines like DALL-E or Stable Diffusion.
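To make the rooster and house finch example concrete, here is a minimal sketch of one common way to flag likely label errors from a classifier's predicted probabilities. It is a simplified, threshold-based heuristic and not Cleanlab's actual implementation; the function name `flag_label_issues`, the class numbering, and the sample probabilities are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def flag_label_issues(labels, pred_probs):
    """Return (index, suggested_label) pairs, least trusted labels first.

    Illustrative heuristic only: an example is flagged when the model's
    probability for its given label falls below that class's average
    self-confidence while another class clears its own average.
    """
    # Per-class threshold: mean predicted probability of class k
    # over the examples that are currently labeled k.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    thresholds = {k: sums[k] / counts[k] for k in counts}

    flagged = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        # Other classes the model is confident about for this example.
        confident = [k for k in thresholds if k != y and probs[k] >= thresholds[k]]
        if confident and probs[y] < thresholds[y]:
            suggestion = max(confident, key=lambda k: probs[k])
            flagged.append((probs[y], i, suggestion))

    # Sort by the probability of the given label, lowest first.
    flagged.sort()
    return [(i, suggestion) for _, i, suggestion in flagged]


# Hypothetical data: class 0 = "rooster", class 1 = "house finch".
labels = [0, 0, 0, 1, 1, 0]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.85, 0.15],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.05, 0.95],  # labeled "rooster", but the model strongly prefers "house finch"
]
print(flag_label_issues(labels, pred_probs))
```

With this made-up data, only the last example is flagged, with "house finch" as the suggested replacement label. A production system would typically use out-of-sample predicted probabilities and a more careful statistical treatment than this sketch.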
“After working with companies like Microsoft and Tesla to get their AI-driven products to function better and helping MIT and Harvard detect cheating, it became clear that mislabeled and poorly curated data was the core issue behind these challenges,” Cleanlab CEO Curtis Northcutt explained. “It’s the culmination of over a decade of work to introduce Cleanlab Studio, which reimagines what AI and analytics can do for people and enterprises now that we can automate data curation and reliability.”
The new round of funding also brings in Databricks Ventures as a participant. The involvement of Databricks' investment arm comes after the data infrastructure giant fine-tuned an OpenAI Davinci LLM with Cleanlab and saw errors fall 37% while accuracy rose from 65% to 78% overall. Databricks recently raised $500 million at a $43 billion valuation, and the company is keen to widen its generative AI footprint after paying $1.3 billion to acquire MosaicML, a maker of LLM training and generative AI tools. Databricks claims that merging MosaicML’s models with Databricks’ existing technology will drop the price for training and deploying LLMs from millions to thousands of dollars. Cleanlab may make that data a lot more useful.
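The two Databricks figures are consistent with each other if the 37% is read as a relative reduction in error rate, which is an assumption, since the article does not say how the figure was computed. A quick check, using a hypothetical helper named `relative_error_reduction`:

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative drop in error rate implied by two accuracy figures."""
    err_before = 1.0 - acc_before  # 35% error at 65% accuracy
    err_after = 1.0 - acc_after    # 22% error at 78% accuracy
    return (err_before - err_after) / err_before


reduction = relative_error_reduction(0.65, 0.78)
print(f"{reduction:.0%}")  # prints "37%"
```

Under that reading, the error rate falls from 35% to 22%, a drop of 13 percentage points, or roughly 37% of the original errors.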
Cleanlab has assisted in many major AI projects, claiming more than 10% of Fortune 500 companies as clients, with AWS, ByteDance, HuggingFace, Oracle, and Walmart among them. Google has also leveraged Cleanlab to scan and fix label errors in millions of speech samples in multiple languages, then used the improved data to train speech models.
“Cleanlab is well-designed, scalable and theoretically grounded: it accurately finds data errors, even on well-known and established datasets. After using it for a successful pilot project at Google, Cleanlab is now one of my go-to libraries for dataset cleanup.”