OpenAI is Looking for Partners in Building Generative AI Training Datasets

OpenAI has unveiled a new data collaboration program that will see the ChatGPT creator work with various organizations on curating specialized datasets. The program, OpenAI Data Partnerships, aims to build public and private training sets so that models can serve specific domains better than generalized large language models (LLMs) do.

OpenAI Data Partnerships

OpenAI is looking to create systems that deeply understand diverse subjects and industries. Its answer is that building broad, representative datasets is key to developing safe and beneficial generative AI. The partnership allows organizations to harness the potential of their unique data. Two partnership tracks are currently offered: an open-source archive and private datasets. The open-source archive seeks content to create a broadly usable public training corpus. Private datasets remain restricted while still being used to improve model performance. Partners can choose their preferred level of access and data sensitivity controls. OpenAI said it won’t accept personal information or third-party-owned content, and that any sensitive data can be removed in collaboration with the partner. The company also said it can help process information at scale using internal AI systems for transcription, digitization, and cleaning.

“Modern AI technology learns skills and aspects of our world — of people, our motivations, interactions, and the way we communicate — by making sense of the data on which it’s trained. To ultimately make AGI that is safe and beneficial to all of humanity, we’d like AI models to deeply understand all subject matters, industries, cultures, and languages, which requires as broad a training dataset as possible,” OpenAI explained in a blog post. “Overall, we are seeking partners who want to help us teach AI to understand our world in order to be maximally helpful to everyone. Together, we can move towards AGI that benefits all of humanity.”

By supplying relevant content, partners can increase AI capabilities in their niche while also supporting overall progress. The company is seeking large volumes of text, audio, images, and video reflecting diverse aspects of human culture and intent, including underrepresented languages and subject matter. The focus on language led OpenAI to start working with Iceland’s government to improve its models’ Icelandic language proficiency. Teaching Icelandic to AI has been a focus for the country’s government and developers for many years.

By expanding data diversity, OpenAI aims to create maximally helpful AI for all of humanity. However, careful curation and attention to bias will be crucial to keep models from absorbing societal prejudices. Whether sufficient data can be sourced ethically and responsibly to achieve artificial general intelligence remains an open question. Still, the partnership program demonstrates OpenAI’s continued push to scale its capabilities, as evidenced at its first Dev Day this week.

