In the world of Big Data, there's a lot of talk about unstructured data -- after all, "variety" is one of the three Vs. Often these discussions dwell on log file data, sensor output or media content. But what about data on the Web itself -- not data from Web APIs, but data on Web pages that were designed more for eyeballing than machine-driven query and storage? How can this data be read, especially at scale? Recently, I had a chat with the CTO and Founder of Kapow Software, Stefan Andreasen, who showed me how the company's Katalyst product tames data-rich Web sites not designed for machine-readability.
Scraping the Web If you're a programmer, you know that Web pages are simply visualizations of HTML markup -- in effect every visible Web page is really just a rendering of a big string of text. And because of that, the data you may want out of a Web page can usually be extracted by looking for occurrences of certain text immediately preceding and following that data, and taking what's in between.
Code that performs data extraction through this sort of string manipulation is sometimes said to be performing Web "scraping." This term that pays homage to "screen scraping," a similar, though much older, technique used to extract data from mainframe terminal screen text. Web scraping has significant relevance to Big Data. Even in cases where the bulk of a Big Data set comes from flat files or databases, augmenting that with up-to-date- reference data from the Web can be very attractive, if not outright required.
Unlocking Important Data But not all data is available through downloads, feeds or APIs. This is especially true of government data, various Open Data initiatives notwithstanding. Agencies like the US Patent and Trademark Office (USPTO) and the Federal Securities and Exchange Commission (SEC) have tons of data available online, but API access may require subscriptions from third parties.
Similarly, there's lots of commercial data available online that may not be neatly packaged in code-friendly formats either. Consider airline and hotel frequent flyer/loyalty program promotions. You can log into your account and read about them, but just try getting a list of all such promotions that may apply to a specific property or geographic area, and keeping the list up-to-date. If you're an industry analyst wanting to perform ad hoc analytical queries across such offers, you may be really stuck.
Downside Risk So it's Web scraping to the rescue, right? Not exactly, because Web scraping code can be brittle. If the layout of a data-containing Web page changes -- even by just a little -- the text patterns being searched may be rendered incorrect, and a mission critical process may completely break down. Fixing the broken code may involve manual inspection of the page's new markup, then updating the delimiting text fragments, which would hopefully be stored in a database, but might even be in the code itself.
Such an approach is neither reliable, nor scalable. Writing the code is expensive and updating it is too. What is really needed for this kind of work is a scripting engine which determines the URLs it needs to visit, the data it needs to extract and the processing it must subsequently perform on the data. What's more, allowing the data desired for extraction, and the delimiters around it, to be identified visually, would allow for far faster authoring and updating than would manual inspection of HTML markup.
An engine like this has really been needed for years, but the rise of Big Data has increased the urgency. Because this data is no longer needed just for simple and quick updates. In the era of Big Data, we need to collect lots of this data and analyze it.
Making it Real Kapow Software's Katalyst product meets the spec, and then some. It provides all the wish list items above: visual and interactive declaration of desired URLs, data to extract and delimiting entities in the page. So far, so good. But Katalyst doesn't just build a black box that grabs the data for you. Instead, it actually exposes an API around its extraction processes, thus enabling other code and other tools to extract the data directly.
That's great for public Web sites that you wish to extract data from, but it's also good for adding an API to your own internal Web applications without having to write any code. In effect, Katalyst builds data services around existing Web sites and Web applications, does so without required coding, and makes any breaking layout changes in those products minimally disruptive.
Maybe the nicest thing about Katalyst is that it's designed with data extraction and analysis in mind, and it provides a manageability layer atop all of its data integration processes, making it perfect for Big Data applications where repeatability, manageability, maintainability and scalability are all essential.
Web Data is BI, and Big Data Katalyst isn't just a tweaky programmer's toolkit. It's a real, live data integration tool. Maybe that's why Informatica, a big name in BI which just put out its 9.5 release this week, announced a strategic partnership with Kapow Software. As a result, Informatica PowerExchange for Kapow Katalyst will be made available as part of Informatica 9.5. Version 9.5 is the Big Data release of Informatica, with the ability to treat Hadoop as a standard data source and destination. Integrating with this version of Informatica makes the utility of Katalyst in Big Data applications not merely a provable idea, but a product reality.