If AI is the 'gas guzzler' of data, how do we get better mileage?

While data quality has been top of mind for years, identifying data essential for AI and training models is another challenge.
Written by Joe McKendrick, Contributing Writer

Can we tame the glut of inadequate or questionable data moving through artificial intelligence systems? AI is hampered by hallucinations, bias, polluted training data, and -- ultimately -- organizational uncertainty. Industry leaders and thinkers have some ideas for getting data in order.

If data is the new oil, then AI, "which needs lots and lots of it, is the 'gas guzzler' of data," Andy Thurai, principal analyst with Constellation Research, told ZDNET. However, consuming large volumes of data risks a loss of quality in the process -- creating trust issues with AI.

A Salesforce survey of 6,000 employees found that three-quarters don't trust the data used to train the AI they work with. A Fivetran survey of 550 executives at large organizations estimates that those organizations lose an average of 6% of annual revenue, or $406 million, to underperforming AI models built on inaccurate or low-quality data, which lead to misinformed business decisions. Organizations using large language models (LLMs) report encountering data inaccuracies and hallucinations 50% of the time.

Fixing these deficiencies also requires data curation and quality assurance, which consume time that people should be spending on business problems. "Most data scientists spend time curating or wrangling data vs. creating and testing actual models," Thurai added.

Yet a lot of data is still needed to fuel the AI engine. The challenge is that "when you feed AI and ML models with partial data, you only get a partial view of the enterprise," Thurai explained. "Though enterprises are producing more than enough data, it's still very fragmented between business units, domains, platforms, and implementations such as cloud versus private data centers."

The problem is that organizations are charging head-first into AI. "Many businesses are overly eager to throw technologies at the loudest problem that exists without putting in the hard work, such as addressing underlying data quality issues," Michael Heath, lead technical solutions engineer at SHI International, told ZDNET. "This demands accurate, consistent, and complete data. Without robust data governance and data management practices, organizations risk amplifying errors and generating unreliable insights."
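To make "consistent and complete" a little more concrete, here is a minimal Python sketch of two basic checks a team might run before feeding records to a model. It is an illustration under stated assumptions, not a method described by anyone quoted here: the field names (`customer_id`, `region`, `revenue`), the sample records, and the helper functions `completeness` and `inconsistent_keys` are all hypothetical.

```python
# Hypothetical required fields, chosen for illustration only.
REQUIRED_FIELDS = ("customer_id", "region", "revenue")


def completeness(records, required=REQUIRED_FIELDS):
    """Return the fraction of records where every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(f) not in (None, "") for f in required) for r in records
    )
    return complete / len(records)


def inconsistent_keys(records, key="customer_id", field="region"):
    """Return keys whose `field` takes more than one distinct value across records."""
    seen = {}
    for r in records:
        k, v = r.get(key), r.get(field)
        if k is None or v in (None, ""):
            continue  # missing values are a completeness problem, not a conflict
        seen.setdefault(k, set()).add(v)
    return sorted(k for k, vals in seen.items() if len(vals) > 1)


# Made-up sample data: one record is missing revenue, and customer 1
# appears under two different regions.
records = [
    {"customer_id": 1, "region": "EMEA", "revenue": 120.0},
    {"customer_id": 1, "region": "APAC", "revenue": 80.0},
    {"customer_id": 2, "region": "EMEA", "revenue": None},
    {"customer_id": 3, "region": "AMER", "revenue": 45.5},
]

print(completeness(records))       # share of fully populated records
print(inconsistent_keys(records))  # customers with conflicting regions
```

Checks like these only measure completeness and internal consistency. Accuracy, meaning whether a value matches reality, generally requires comparison against a trusted reference source, which this sketch does not attempt.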

Data governance calls for an all-hands-on-deck effort to ensure that the right data reaches the right people and applications, and that the data is timely, relevant, secure, and valuable.

While data quality has been top of mind for years, identifying data that is essential for AI and training models is another challenge. This "quintessential data" -- as defined by Neda Nia, chief product officer for Stibo Systems -- consists of data "that is well governed and truly represents what delivers the most optimal result to train machine learning models," she told ZDNET. 

Quality matters -- and concerted governance is needed at both the data and AI levels. This creates "the transformative force reshaping data management and delivery in the GenAI era," Alation CTO Junaid Saiyed said. "The rapid pace, vast scale, and intricate complexity of data processing in GenAI demands robust AI governance frameworks. Organizations can overcome the garbage in, garbage out dilemma with effective AI governance."

Of course, high-quality data doesn't appear out of nowhere. "The main challenge in maintaining high-quality data lies in the unpredictable nature of requirements," Nia said. "Questions include 'What constitutes AI-ready data?' 'Which future models will need specific data?' and 'How far back should data be retained for optimal processing in models?'"

People working with AI need to consider "the established requirements set by compliance and regulation, while also anticipating future data science needs, including those yet to be defined," Nia elaborated. "This poses a significant challenge -- how can we anticipate future requirements in a constantly changing environment?"

Well-governed, quality data needs to be ready and available for all scenarios, she continued. "Invest and focus on such data. While data volume is important, quality outweighs volume in the modern world."

AI and data governance "ensures that AI models operate on clean, relevant, and reliable data," Saiyed said. "This enhances the accuracy and fairness of AI decisions, promotes effective collaboration through metadata management, and ensures compliance with increasing regulatory demands."

Data governance also helps "establish a culture of data integrity, so organizations can drive innovation, operational efficiency, and growth," Saiyed said.
