In search of the missing piece of generative AI: Unstructured data

Enterprises have long wrestled with unstructured data. Now, they have another reason to pursue it -- to support and be supported by AI.
Written by Joe McKendrick, Contributing Writer

In recent years, the spotlight has been on unstructured data -- text, graphics, documents, IoT streams -- all of which hold tremendous, untapped value. The database industry underwent a continent-size shift to better accommodate and hopefully surface these assets.

Often, these efforts were frustrated by a lack of awareness of hidden unstructured data sources or assets. While it is estimated that 90% of the information across enterprises is unstructured data, only 46% of organizations have made efforts to extract its value, according to an IDC survey.

Now, technology and business leaders have another reason for pursuing and surfacing unstructured data: The rise of generative artificial intelligence.

The companies and IT professionals that pushed themselves forward with unstructured data in recent years may find themselves in a better position to take advantage of generative AI -- and, conversely, employ AI to dig deeper into data stores. 

It's time for enterprises to step up "management of unstructured data from sources such as IoT, as well as knowledge documents -- PowerPoints, text, Excel spreadsheets," says Matt Labovich, US data, analytics, and AI leader at PwC. "They all contain valuable institutional knowledge about business operations and hold insights that can be harnessed using gen AI."

While structured data strategies have traditionally received most of the attention, it's time to focus on "the significant role of unstructured data in the advancement of gen AI," Labovich urges.

Previous AI initiatives had to focus on use cases where structured data was ready and abundant, because "the complexity of collecting, annotating, and synthesizing heterogeneous datasets made wider AI initiatives unviable," according to a recent global survey published by MIT Technology Review Insights and underwritten by Databricks.

"By contrast, generative AI's new ability to surface and utilize once-hidden data will power extraordinary new advances across the organization," writes the report's author, Adam Green.

The ability to capture and pull value from such data is considered more critical than ever. Almost 70% of the survey's participating technology executives agree that data problems are the most likely factor to jeopardize their AI and machine learning goals. "Text-generating AI systems, such as the popular ChatGPT, are built on large language models," Green says. "LLMs train on a vast corpus of data to answer questions or perform tasks based on statistical likelihoods."

AI applications "rely on a solid data infrastructure that makes possible the collection, storage, and analysis of its vast data-verse," Green adds. "Even before the business applications of generative AI became apparent in late 2022, a unified data platform for analytics and AI was viewed as crucial by nearly 70% of our survey respondents."

The generative AI era requires a data infrastructure that is flexible, scalable, and efficient. The key, according to the report, is to "democratize access to data and analytics, enhance security, and combine low-cost storage with high-performance querying."

Pulling together unstructured data for today's AI is no overnight task. "Mergers and acquisitions have resulted in fragmented IT architectures. Important documents, from research and development intelligence to design instructions for plants, have been lost to view, locked in offline proprietary file types," Green points out in the MIT report. 

"Could we interrogate these documents using LLMs? Can we train models to give us insights we're not seeing in this vast world of documentation?" asks Andrew Blyton, vice president and chief information officer of Incyte, and former VP of DuPont Water & Protection. "We think that's an obvious use case. Language models promise to make such unstructured data much more valuable."

Bringing data owners, analysts, and users into the process from across the business is also key to data success with gen AI. "It's not solely the responsibility of the CIO," says Labovich. "Business leaders must take charge, while the CIO enables and supports the process. Operational readiness and change management are key, which involves having executives across the business actively participating in the identification of critical data, embedding into workflows, and assuming the role of change champions to foster widespread adoption." 
