LinkedIn said it will open source an internal application called WhereHows, which is a data mining portal for enterprise information.
Technically, LinkedIn calls WhereHows "a data discovery and lineage portal." From a business perspective, WhereHows is designed to surface data from multiple stores via metadata.
According to LinkedIn, WhereHows has captured the status of 50,000 datasets, 14,000 comments and 35 million job executions, representing a storage footprint topping 15 petabytes.
In a blog post, LinkedIn outlined the reasons it built WhereHows: its big data ecosystem had become too diverse, with many applications each designed to do one specific job. As a result, LinkedIn runs everything from Informatica to Spark to Hive to Oracle to Hadoop to Teradata, as well as a bevy of schedulers. LinkedIn said:
LinkedIn has accumulated a lot of diversity in its big data ecosystem. We have many different sources and sinks of data. We write production pipelines that are driven by different scheduling engines, and we support many different transformation engines that are used to process and create derived data. This sort of specialization is nice because it gives us access to the best tool for the job; however, it creates a new set of problems. It becomes much harder to make sense of the overall data flow and lineage across the different processing frameworks, data platforms, and scheduling systems. This can result in a host of challenges including loss in productivity for employees as they try to find the right datasets to derive insights, operational challenges in discovering and triaging data breakages as well as lost opportunities in discovering and eliminating redundant computation.
Enterprises can relate. Like most companies, LinkedIn had a data warehouse team to aggregate data. The issue is that the data proliferated, as did the systems.
WhereHows integrates with LinkedIn's data processing software and collects metadata from it. Then, the application surfaces that metadata, along with vital details, through a Web app and an application programming interface (API). WhereHows consists of a repository, a Web server and a backend server that grabs metadata from other systems.
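To make that three-part split concrete, here is a minimal Python sketch of the general pattern: a backend job pulls metadata from source systems, a repository stores it, and lookup methods stand in for what a Web app or API would expose. This is not WhereHows code or its API. Every class, function and dataset name below is hypothetical, and the in-memory dictionary is an assumption standing in for a real database.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """One metadata record. All field names are illustrative assumptions."""
    name: str
    source: str
    schema: dict
    upstream: list = field(default_factory=list)


class MetadataRepository:
    """Hypothetical in-memory stand-in for the metadata repository."""

    def __init__(self):
        self._datasets = {}

    def upsert(self, record):
        # Insert a record, or overwrite the existing one with the same name.
        self._datasets[record.name] = record

    def get(self, name):
        return self._datasets.get(name)

    def search(self, keyword):
        # Case-insensitive substring match on dataset names, sorted.
        keyword = keyword.lower()
        return sorted(n for n in self._datasets if keyword in n.lower())

    def lineage(self, name):
        # Walk upstream links and return every transitive parent once.
        seen = []
        stack = [name]
        while stack:
            current = self._datasets.get(stack.pop())
            if current is None:
                continue
            for parent in current.upstream:
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen


def collect(repository, sources):
    """Hypothetical backend job: call each fetcher and store what it returns."""
    count = 0
    for source_name, fetch in sources.items():
        for raw in fetch():
            repository.upsert(DatasetMetadata(
                name=raw["name"],
                source=source_name,
                schema=raw.get("schema", {}),
                upstream=raw.get("upstream", []),
            ))
            count += 1
    return count


# Stub fetchers with made-up datasets; real ones would query each system.
def fetch_hive():
    return [{
        "name": "hive.page_views_daily",
        "schema": {"member_id": "bigint"},
        "upstream": ["hdfs.page_views_raw"],
    }]


def fetch_hdfs():
    return [{"name": "hdfs.page_views_raw", "schema": {"member_id": "long"}}]


repo = MetadataRepository()
collected = collect(repo, {"hive": fetch_hive, "hdfs": fetch_hdfs})
```

In this sketch, `repo.search("page_views")` finds both datasets regardless of which system they live in, and `repo.lineage("hive.page_views_daily")` traces the Hive table back to its raw input. Those are the kinds of discovery and lineage questions the article describes.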
Here's the architecture:
Many enterprises will recognize the systems and issues. LinkedIn has Hadoop, multiple databases, Teradata and various specific applications. The company could move it all to one place, but by the time that project is finished it'll be legacy.
LinkedIn's hope is that the open source community can add to WhereHows, fix bugs and move the metadata ball forward.