K is for Knowledge: Application and data integration for better business, using metadata and knowledge graphs

Being disrupted by Big Tech is one of the greatest concerns for any business. Good news: There may be a path to accelerate digital transformation and out-compete Big Tech, by leveraging domain knowledge.


First, you get the software: Operating systems, search engines, browsers, and social networks. Then, you get the hardware: Mobile phones, data centers, cloud. Then, you get a gradually expanding foothold in just about anything from advertising and media to healthcare and from autonomous vehicles to banking.

In this process, Big Tech has managed to amass money and power, building its ruthless efficiency on data-driven culture and products. The awe this has instilled in businesses has been captured by a pop culture reference to the Red Wedding from the Game of Thrones series.

In the series, Red Wedding refers to a massacre. The metaphor has been used to describe the effect AWS announcements have on software businesses that see AWS enter their turf. The software business has been the first to feel Big Tech's effect, but it does not look like it will be the last.

Today, every business is a technology business, in the sense that it runs on technology. Unlike Big Tech, however, most businesses have a surplus of legacy systems and a deficit of tech talent. This makes modernization risky and costly. Most businesses can't afford to rip and replace systems built over the years. The architecture may be outdated, but the business logic is tried and true.

How (not) to out-tech Big Tech

So, what are businesses to do? Sit and wait to be disrupted, invest huge amounts in modernization efforts, try to out-tech Big Tech? None of these sounds like a very good solution. But there may be another option to win this battle.

First and foremost, every business needs to become the best possible version of itself by leveraging its competitive advantage: Domain business knowledge. This was the most important takeaway from one of the most forward-thinking events in Europe, the Big Things conference.

We have referred in the past to the path from Big Data to AI, again based on observations made at the conference. This year, the event itself evolved along this path, rebranding as Big Things, and giving the stage to an array of speakers from organizations big and small alike.

Google was among those, as Cassie Kozyrkov, chief decision scientist at Google, keynoted the event. Kozyrkov offered an excellent blueprint on how to use machine learning for data-driven decision making. One of the many points made was that without trusted data, this is a non-starter. No trusted data means no data-driven decision making, which means no efficiency. 


Machine learning is all about the data. Garbage data in means garbage insights out. (Image: Cassie Kozyrkov/Google)

In other words: If your data is a mess, it's going to kill your business. This was the starting point for Oscar Mendez's keynote. Mendez, who is the CEO and co-founder of Stratio, defined trusted data as data that is clean, secure, accurate, and organized, and that has well-defined origins and clear access guidelines.

As Mendez puts it, Big Tech monitors interactions, collects data, and learns something all the time. Most other businesses don't. But this goes beyond the cold start problem. Many businesses have started collecting data, and legacy systems are huge data troves, too. But how do you get from zero to trusted data?

Data governance is one part of the answer. Things such as data lineage, access control, and metadata enrichment fall under data governance. In that respect, businesses that listened to the GDPR wake-up call and put data governance processes and systems in place should already be better positioned to deal with these issues.

Virtualization, meaning, semantics, ontologies

Another part of the answer, Mendez argued, is virtualization. With an array of systems in place, each generating data in its own format and storing it in its own silo, how can businesses ever hope to have a holistic, integrated picture?

Mendez's proposed solution to this combines data catalogs and virtualization to create what is called a trusted data fabric. What this means is that data stays where it is, and accessing it happens via the fabric layer, utilizing the data catalog to point to the underlying systems of record. 
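
To make the idea concrete, here is a minimal sketch of a catalog-backed fabric layer in Python. It is an illustration under stated assumptions, not Stratio's implementation: the names CatalogEntry and DataFabric, the two in-memory "silos," and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class CatalogEntry:
    """Metadata about one dataset: where it lives and how to read it."""
    system: str                      # name of the system of record
    owner: str                       # who is accountable for the data
    reader: Callable[[], List[Row]]  # fetches rows from the source on demand


class DataFabric:
    """Routes reads through the catalog; data is not copied up front."""

    def __init__(self) -> None:
        self.catalog: Dict[str, CatalogEntry] = {}

    def register(self, name: str, entry: CatalogEntry) -> None:
        self.catalog[name] = entry

    def read(self, name: str) -> List[Row]:
        entry = self.catalog[name]  # look up the system of record
        return entry.reader()       # fetch from the source at query time


# Two stand-in silos (hypothetical; real ones would be databases or APIs).
crm_rows: List[Row] = [{"cust_id": 1, "name": "Acme"}]
billing_rows: List[Row] = [{"customer": 1, "amount_eur": 250.0}]

fabric = DataFabric()
fabric.register("customers", CatalogEntry("crm", "sales-ops", lambda: crm_rows))
fabric.register("invoices", CatalogEntry("billing", "finance", lambda: billing_rows))

# The source changes; the next read through the fabric reflects it,
# because the data stayed where it was.
crm_rows.append({"cust_id": 2, "name": "Globex"})
customers = fabric.read("customers")
```

Note that the two silos still disagree on naming (cust_id versus customer), which is exactly the gap that semantics is meant to close.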

This conceptual architecture does not exclude actual data movement when necessary. There is, however, something missing: Meaning, or semantics, for the data. Often, the meaning of the underlying data is ill-defined or entirely missing. Until recently, application development was the main concern for business, and data was not a first-class citizen.

Combined with the typical churn and project delivery environment in businesses, this results in cutting corners in documenting data. This, in turn, results in not knowing where your data is, what it means, or how data sets map to each other and to business concepts. Mendez had something to propose to remedy this, too: ontologies.


You can't out-tech Big Tech. But you can out-knowledge them in your specific business domain. (Image: Oscar Mendez/Stratio)

His argument is a compelling one. CDOs, data stewards, and business users all have to put lots of effort into cataloging data. Doing this manually is error-prone and does not scale. In addition, oftentimes, by the time the effort is complete, it has to start all over again, because the data landscape has changed. Why not put in most of the effort once, and reuse it?

The most reusable and sophisticated way to do this, as per Mendez, is by using business terms and building an ontology that captures the domain and the expertise of the business. Thus, formal definitions of business terms can be created, which can then be used for matching, machine learning, and other purposes. Adding semantics to data can go a long way.

Ontologies are digital artifacts that capture data meaning and relationships in a reusable way. You can think of them as data schemas on steroids, bringing an array of advanced capabilities with them. For an example of a relatively simple, but powerful and widely used ontology, you can look at schema.org, which is used to classify content on the web and beyond.
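
As a rough illustration of what "schemas on steroids" can mean, the sketch below models a tiny ontology in plain Python. Each business term has an optional parent (a "subclass of" link) and the labels different systems use for it. The terms, labels, and function names are all hypothetical, and a real ontology would normally be expressed in a standard such as RDFS or OWL rather than a Python dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """One business term: its parent term and the labels systems use for it."""
    parent: Optional[str] = None
    labels: List[str] = field(default_factory=list)


# Hypothetical mini-ontology for illustration only.
ONTOLOGY: Dict[str, Term] = {
    "Party": Term(parent=None, labels=["party"]),
    "Customer": Term(parent="Party", labels=["customer", "cust", "client"]),
    "Account": Term(parent=None, labels=["account", "acct"]),
}


def is_a(term: str, ancestor: str) -> bool:
    """True if `term` equals `ancestor` or is a transitive subclass of it."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = ONTOLOGY[current].parent
    return False


def match_column(column: str) -> Optional[str]:
    """Map a raw column name to a business term by matching its tokens to labels."""
    tokens = column.lower().replace("-", "_").split("_")
    for name, term in ONTOLOGY.items():
        if any(label in tokens for label in term.labels):
            return name
    return None


# Columns from different systems resolve to the same business term.
mapping = {col: match_column(col) for col in ["CUST_ID", "client_name", "acct_no"]}
```

The point of the design is reuse: the term definitions are written once, and both the column matching and the subclass reasoning (every Customer is also a Party) are derived from them.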

K is for Knowledge: ontologies and knowledge graphs

Mendez shared demos and use cases of how Stratio is using this in production to automate data mapping, as well as feature selection for machine learning. He referred to a client in finance, for which Stratio used this approach to develop a product released worldwide in 25% of the time and at 20% of the cost originally estimated.

Mendez is not the only one evangelizing this. Executives from the likes of Morgan Stanley are waking up to the benefits of the ontological approach. In his book Software Wasteland, industry veteran Dave McComb makes a point of how the integration hairball brings inefficiency via hidden and not-so-hidden costs for businesses, and advocates for a similar approach.

To draw on more material from the Big Things conference: Derwen's Paco Nathan referred in his presentation on data governance to organizations such as Lyft, LinkedIn, WeWork, and Uber. These organizations are not only embracing the metadata and knowledge-based approach but also releasing open-source frameworks to facilitate this.

In terms of commercial vendors: Gartner just released its latest Magic Quadrant for Metadata Management Solutions. For the first time, two vendors that leverage this approach (data.world and the Semantic Web Company) are included. As reviewers of McComb's book put it: It's not really a question about whether this is happening; it's more about when. 


How do you solve the application and data integration hairball? By leveraging virtualization, metadata, data catalogs, ontologies. (Image: Oscar Mendez/Stratio)

Having extensively used, and written about, these technologies ourselves, we feel there is a final point to be made here. At Big Things, like pretty much everywhere for the last couple of years, the term Knowledge Graph was used extensively. To name just a couple of examples, CaixaBank and Intel presented related work.

Originally, Knowledge Graphs and ontologies were near-synonyms. But like all hyped terms, you can expect the term Knowledge Graph to be used ad nauseam, to the point where it becomes meaningless. Lately, the term has also been expanded to include non-ontological approaches, going by the name of property graphs. These are related, and they do have benefits.

Their benefits, however, are somewhat different at this point. Data integration and virtualization, for example, are not necessarily their strongest points. Again, McComb has outlined some differences between property graphs and knowledge graphs. In broad strokes, property graphs are more suitable for analytics. Knowledge graphs are more suitable for integration.
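
The difference in shape can be sketched in a few lines of Python. This is a simplified illustration, not a summary of McComb's comparison: the sample data, the ex: prefix, and the helper functions degree and instances_of are hypothetical, while rdf:type and rdfs:subClassOf are standard RDF and RDFS vocabulary terms.

```python
from typing import Dict, List, Set, Tuple

# Property graph style: nodes and edges carry key/value properties directly.
nodes: Dict[str, Dict[str, object]] = {
    "c1": {"label": "Customer", "name": "Acme"},
    "a1": {"label": "Account", "number": "A-001"},
}
edges: List[Dict[str, object]] = [
    {"src": "c1", "dst": "a1", "type": "OWNS", "since": 2019},
]

# Triple style, as used by ontology-based knowledge graphs: every fact is a
# (subject, predicate, object) statement, and schema facts sit next to data.
triples: Set[Tuple[str, str, str]] = {
    ("ex:c1", "rdf:type", "ex:Customer"),
    ("ex:c1", "ex:name", "Acme"),
    ("ex:a1", "rdf:type", "ex:Account"),
    ("ex:c1", "ex:owns", "ex:a1"),
    ("ex:Customer", "rdfs:subClassOf", "ex:Party"),
}


def degree(node_id: str) -> int:
    """Analytics-flavoured query: count edges touching a node."""
    return sum(1 for e in edges if node_id in (e["src"], e["dst"]))


def instances_of(cls: str) -> Set[str]:
    """Integration-flavoured query: instances of a class or any of its subclasses."""
    classes = {cls}
    changed = True
    while changed:  # expand the class set until no new subclasses are found
        changed = False
        for s, p, o in triples:
            if p == "rdfs:subClassOf" and o in classes and s not in classes:
                classes.add(s)
                changed = True
    return {s for s, p, o in triples if p == "rdf:type" and o in classes}


parties = instances_of("ex:Party")
```

Here nothing is typed as ex:Party directly; the customer is found only because the subclass statement is part of the graph itself.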

There is an ongoing effort to unify property graphs and knowledge graphs, and we hope to see it come to fruition. Until that happens, however, if you want to benefit from the knowledge-based approach to application development, make sure you choose the most appropriate tool for your use case.