K is for Knowledge: Application and data integration for better business, using metadata and knowledge graphs
Being disrupted by Big Tech is one of the greatest concerns for any business. Good news: There may be a path to accelerate digital transformation and out-compete Big Tech, by leveraging domain knowledge.
First, you get the software: Operating systems, search engines, browsers, and social networks. Then, you get the hardware: Mobile phones, data centers, cloud. Then, you get a gradually expanding foothold in just about anything from advertising and media to healthcare and from autonomous vehicles to banking.
In this process, Big Tech has managed to amass money and power, building its ruthless efficiency on a data-driven culture and data-driven products. The awe this has instilled in businesses has been captured in a pop-culture reference to the Game of Thrones episode known as the Red Wedding.
Today, every business is a technology business, in the sense that it runs on technology. Unlike Big Tech, however, most businesses have a surplus of legacy systems and a deficit of tech talent. This makes modernization risky and costly. Most businesses can't afford to rip and replace systems built over the years. The architecture may be outdated, but the business logic is tried and true.
How (not) to out-tech Big Tech
So, what are businesses to do? Sit and wait to be disrupted, invest huge amounts in modernization efforts, try to out-tech Big Tech? None of these sounds like a very good solution. But there may be another option to win this battle.
First and foremost, every business needs to become the best possible version of itself by leveraging its competitive advantage: Domain business knowledge. This was the most important takeaway from one of the most forward-thinking events in Europe, the Big Things conference.
We have referred in the past to the path from Big Data to AI, again based on observations made at the conference. This year, the event itself evolved along this path, rebranding as Big Things, and giving the stage to an array of speakers from organizations big and small alike.
Google was among those, as Cassie Kozyrkov, chief decision scientist at Google, keynoted the event. Kozyrkov offered an excellent blueprint on how to use machine learning for data-driven decision making. One of the many points made was that without trusted data, this is a non-starter. No trusted data means no data-driven decision making, which means no efficiency.
In other words: If your data is a mess, it's going to kill your business. This was the starting point for Oscar Mendez's keynote. Mendez, who is the CEO and co-founder of Stratio, defined trusted data as data that is clean, secure, accurate, organized, and has well-defined origins and clear access guidelines.
As Mendez puts it, Big Tech monitors interactions, collects data, and learns something all the time. Most other businesses don't. But this goes beyond the cold start problem. Many businesses have started collecting data, and legacy systems are huge data troves, too. But how do you get from zero to trusted data?
Data governance is one part of the answer. Things such as data lineage, access control, and metadata enrichment fall under data governance. In that respect, businesses that listened to the GDPR wake-up call and put data governance processes and systems in place should already be better positioned to deal with these issues.
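To make the governance pieces mentioned above concrete, here is a minimal sketch of the kind of metadata record such a system might keep per dataset: lineage, ownership, and simple role-based access rules. All names (the `DatasetMetadata` class, the `legacy_crm` source, the roles) are hypothetical, for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical metadata record illustrating what data governance tracks:
# origin (lineage), accountable ownership, and access control.
@dataclass
class DatasetMetadata:
    name: str
    owner: str                  # accountable data steward
    source_system: str          # system of record the data came from
    lineage: list = field(default_factory=list)        # upstream steps
    allowed_roles: list = field(default_factory=list)  # access control

    def can_access(self, role: str) -> bool:
        # Minimal role-based access check
        return role in self.allowed_roles

# Example: a customer table ingested from a legacy CRM
customers = DatasetMetadata(
    name="customers",
    owner="jane.doe@example.com",
    source_system="legacy_crm",
    lineage=["crm_export.csv", "dedupe_job_v2"],
    allowed_roles=["analyst", "data_steward"],
)

print(customers.can_access("analyst"))  # True
print(customers.can_access("intern"))   # False
```

Even a record this simple answers the GDPR-era questions: where did the data come from, who owns it, and who may see it.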
Virtualization, meaning, semantics, ontologies
Another part of the answer, Mendez argued, is virtualization. With an array of systems in place, each generating data in its own format and storing it in its own silo, how can businesses ever hope to have a holistic, integrated picture?
Mendez's proposed solution to this combines data catalogs and virtualization to create what is called a trusted data fabric. What this means is that data stays where it is, and accessing it happens via the fabric layer, utilizing the data catalog to point to the underlying systems of record.
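The catalog-plus-virtualization idea can be sketched in a few lines: the catalog maps logical dataset names to the systems of record that hold them, and the fabric layer resolves requests against those systems in place, without copying the data. The connectors and dataset names below are hypothetical stand-ins, not any particular vendor's API.

```python
# Stand-ins for real connectors (JDBC, REST, etc.) to systems of record.
def warehouse_connector(dataset):
    return f"rows of '{dataset}' fetched from the warehouse"

def crm_connector(dataset):
    return f"rows of '{dataset}' fetched from the CRM"

# The data catalog: logical name -> (connector, physical dataset name).
CATALOG = {
    "sales.orders": (warehouse_connector, "orders_v3"),
    "sales.customers": (crm_connector, "crm_contacts"),
}

def fabric_read(logical_name):
    # Data stays where it is; the fabric only routes the request
    # to the underlying system of record via the catalog.
    connector, physical_name = CATALOG[logical_name]
    return connector(physical_name)

print(fabric_read("sales.orders"))
print(fabric_read("sales.customers"))
```

The point of the design is that consumers address data by logical name; if a physical system is replaced, only the catalog entry changes.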
This conceptual architecture does not exclude actual data movement when necessary. There is, however, something missing: Meaning, or semantics, for the data. Often, the meaning of the underlying data is ill-defined or entirely missing. Until recently, application development was the main concern for most businesses, and data was not a first-class citizen.
Combined with the typical churn and project-delivery environment in businesses, this results in cutting corners when documenting data. This, in turn, results in not knowing where your data is, what it means, and how it maps to other data and to business concepts. Mendez had something to propose to remedy this, too: ontologies.
His argument is a compelling one. CDOs, data stewards, business users, they all have to put lots of effort into cataloging data. Doing this manually is error-prone and does not scale. In addition, oftentimes, by the time the effort is complete, it has to start all over again, because the data landscape has changed. Why not put in most of the effort once, and reuse it?
The most reusable and sophisticated way to do this, as per Mendez, is by using business terms and building an ontology that captures the domain and the expertise of the business. Thus, formal definitions of business terms can be created, which can then be used for matching, machine learning, and other purposes. Adding semantics to data can go a long way.
Ontologies are digital artifacts that capture data meaning and relationships in a reusable way. You can think of them as data schemas on steroids, bringing an array of advanced capabilities with them. For an example of a relatively simple, but powerful and widely used ontology, you can look at schema.org, which is used to classify content on the web and beyond.
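As a taste of what using schema.org looks like in practice, here is a small annotation expressed as JSON-LD, schema.org's most common serialization, built with nothing but Python's standard library. The organization and person are made up for illustration; the `@context` and `@type` keys are standard JSON-LD, and the property names come from the schema.org vocabulary.

```python
import json

# A schema.org annotation as JSON-LD. Marking data up this way attaches
# shared, well-defined meaning: any consumer that knows the schema.org
# vocabulary can interpret "Organization", "name", and "founder".
org = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Acme Analytics",       # hypothetical company
    "url": "https://example.com",
    "founder": {
        "@type": "Person",
        "name": "Jane Doe",
    },
}

print(json.dumps(org, indent=2))
```

Because the terms are drawn from a shared ontology rather than an ad hoc schema, the same definitions can be reused across systems, which is exactly the reuse argument made above.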
K is for Knowledge: ontologies and knowledge graphs
Mendez shared demos and use cases of how Stratio is using this in production to automate data mapping, as well as feature selection for machine learning. He referred to a client in finance, for which Stratio used this approach to develop a product released worldwide in 25% of the time and at 20% of the cost originally estimated.
Having extensively used, and written about, these technologies ourselves, we feel there is a final point to be made here. At Big Things, as pretty much everywhere for the last couple of years, the term Knowledge Graph was used extensively. To name just a couple of examples, CaixaBank and Intel presented related work.
Originally, Knowledge Graphs and ontologies were near-synonyms. But like all hyped terms, you can expect Knowledge Graph to be used ad nauseam, to the point where it becomes meaningless. Lately, the term has also been expanded to cover non-ontological approaches known as property graphs. Property graphs are related to, but distinct from, ontology-based knowledge graphs, and they do have benefits of their own.