Data is the fuel of artificial intelligence. It is also a bottleneck for big businesses, because they are reluctant to fully embrace the technology without knowing more about the data used to build A.I. programs.
Now, a consortium of companies has developed standards for describing the origin, history and legal rights to data. The standards are essentially a labeling system for where, when and how data was collected and generated, as well as its intended use and restrictions.
The data provenance standards, announced on Thursday, have been developed by the Data & Trust Alliance, a nonprofit group made up of two dozen mainly large companies and organizations, including American Express, Humana, IBM, Pfizer, UPS and Walmart, as well as a few start-ups.
The alliance members believe the data-labeling system will be similar to the fundamental standards for food safety that require basic information like where food came from, who produced and grew it and who handled the food on its way to a grocery shelf.
Greater clarity and more information about the data used in A.I. models, executives say, will bolster corporate confidence in the technology. How widely the proposed standards will be used is uncertain, and much will depend on how easy the standards are to apply and automate. But standards have accelerated the use of every significant technology, from electricity to the internet.
“This is a step toward managing data as an asset, which is what everyone in industry is trying to do today,” said Ken Finnerty, president for information technology and data analytics at UPS. “To do that, you have to know where the data was created, under what circumstances, its intended purpose and where it’s legal to use or not.”
Surveys point to the need for greater confidence in data and for improved efficiency in data handling. In one poll of corporate chief executives, a majority cited “concerns about data lineage or provenance” as a key barrier to A.I. adoption. And a survey of data scientists found that they spent nearly 40 percent of their time on data preparation tasks.
The data initiative is mainly intended for business data that companies use to make their own A.I. programs or data they may selectively feed into A.I. systems from companies like Google, OpenAI, Microsoft and Anthropic. The more accurate and trustworthy the data, the more reliable the A.I.-generated answers.
For years, companies have been using A.I. in applications that range from tailoring product recommendations to predicting when jet engines will need maintenance.
But the rise in the past year of the so-called generative A.I. that powers chatbots like OpenAI’s ChatGPT has heightened concerns about the use and misuse of data. These systems can generate text and computer code with humanlike fluency, yet they often make things up — “hallucinate,” as researchers put it — depending on the data they access and assemble.
Companies do not typically allow their workers to freely use the consumer versions of the chatbots. But they are using their own data in pilot projects that draw on the generative capabilities of the A.I. systems to help write business reports, presentations and computer code. And that corporate data can come from many sources, including customers and suppliers, as well as weather and location data.
“The secret sauce is not the model,” said Rob Thomas, IBM’s senior vice president of software. “It’s the data.”
In the new system, there are eight basic standards, including lineage, source, legal rights, data type and generation method. Most of the standards also have more detailed descriptions, such as noting that the data came from social media or industrial sensors.
The data documentation can be done in a variety of widely used technical formats. Companies in the data consortium have been testing the standards to improve and refine them, and the plan is to make them available to the public early next year.
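To make the idea concrete, here is a rough sketch of what such a label might look like in practice. It is an illustration only, not the alliance's actual schema: the class name, the field names and the sample values are all hypothetical, and it covers only the five standards named above, since the remaining three are not listed here. JSON is used as one example of a widely used format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ProvenanceLabel:
    """Hypothetical label for one dataset (not the official schema).

    The five fields loosely mirror the standards named in the article;
    the field names themselves are assumptions made for illustration.
    """
    lineage: str            # where this dataset was derived from
    source: str             # e.g. "social media" or "industrial sensors"
    legal_rights: str       # where and how the data may legally be used
    data_type: str          # e.g. "text", "tabular"
    generation_method: str  # how the data was collected or generated


def to_json(label: ProvenanceLabel) -> str:
    """Serialize the label to JSON so it can travel with the data."""
    return json.dumps(asdict(label), indent=2)


def missing_fields(label: ProvenanceLabel) -> list[str]:
    """Return the names of fields left blank, in declaration order."""
    return [name for name, value in asdict(label).items() if not value.strip()]


if __name__ == "__main__":
    # All values below are made up for the example.
    label = ProvenanceLabel(
        lineage="derived from a raw sensor export",
        source="industrial sensors",
        legal_rights="",  # left blank on purpose to show the check
        data_type="tabular",
        generation_method="collected automatically",
    )
    print(to_json(label))
    print("Incomplete fields:", missing_fields(label))
```

A check like `missing_fields` hints at why automation matters for adoption: a label that software can read can also be validated by software before the data reaches an A.I. model.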
Labeling data by type, date and source has been done by individual companies and industries. But the consortium says these are the first detailed standards meant to be used across all industries.
“My whole life I’ve spent drowning in data and trying to figure out what I can use and what is accurate,” said Thi Montalvo, a data scientist and vice president of reporting and analytics at Transcarent.
Transcarent, a member of the data consortium, is a start-up that relies on data analysis and machine-learning models to personalize health care and speed payment to providers.
The benefit of the data standards, Ms. Montalvo said, comes from greater transparency for everyone in the data supply chain. That work flow often begins with negotiating contracts with insurers for access to claims data and continues with the start-up’s data scientists, statisticians and health economists who build predictive models to guide treatment for patients.
At each stage, knowing more about the data sooner should increase efficiency and eliminate repetitive work, potentially reducing the time spent on data projects by 15 to 20 percent, Ms. Montalvo estimates.
The data consortium says the A.I. market today needs the clarity the group’s data-labeling standards can provide. “This can help solve some of the problems in A.I. that everyone is talking about,” said Chris Hazard, a co-founder and the chief technology officer of Howso, a start-up that makes data-analysis tools and A.I. software.