Web Data is Big Data


Summary: There's a lot of data on the Web crucial to Big Data analyses, but it's not all neatly packaged into feeds and APIs. Kapow Katalyst brings this data into the Big Data fold.


In the world of Big Data, there's a lot of talk about unstructured data -- after all, "variety" is one of the three Vs.  Often these discussions dwell on log file data, sensor output or media content.  But what about data on the Web itself -- not data from Web APIs, but data on Web pages that were designed more for eyeballing than machine-driven query and storage?  How can this data be read, especially at scale?  Recently, I had a chat with the CTO and Founder of Kapow Software, Stefan Andreasen, who showed me how the company's Katalyst product tames data-rich Web sites not designed for machine-readability.

Scraping the Web

If you're a programmer, you know that Web pages are simply visualizations of HTML markup -- in effect every visible Web page is really just a rendering of a big string of text.  And because of that, the data you may want out of a Web page can usually be extracted by looking for occurrences of certain text immediately preceding and following that data, and taking what's in between.
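The delimiter approach described above can be sketched in a few lines of Python. This is only an illustration: the `extract_between` helper, the `SAMPLE_PAGE` markup and the CSS class names are all made up for the example, not taken from any real site or product.

```python
from typing import List


def extract_between(page: str, before: str, after: str) -> List[str]:
    """Return every substring of `page` found between `before` and `after`."""
    results = []
    pos = 0
    while True:
        start = page.find(before, pos)
        if start == -1:
            break  # no further occurrence of the leading delimiter
        start += len(before)
        end = page.find(after, start)
        if end == -1:
            break  # leading delimiter without a closing one; stop
        results.append(page[start:end].strip())
        pos = end + len(after)
    return results


# Hypothetical markup standing in for a data-rich page.
SAMPLE_PAGE = """
<table>
  <tr><td class="name">Acme Corp</td><td class="price">12.50</td></tr>
  <tr><td class="name">Globex</td><td class="price">7.25</td></tr>
</table>
"""

names = extract_between(SAMPLE_PAGE, '<td class="name">', "</td>")
prices = [float(p) for p in extract_between(SAMPLE_PAGE, '<td class="price">', "</td>")]
```

Here `names` holds the two company names and `prices` the two numeric values, in page order.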

Code that performs data extraction through this sort of string manipulation is sometimes said to be performing Web "scraping."  The term pays homage to "screen scraping," a similar, though much older, technique used to extract data from mainframe terminal screen text.  Web scraping has significant relevance to Big Data.  Even in cases where the bulk of a Big Data set comes from flat files or databases, augmenting that with up-to-date reference data from the Web can be very attractive, if not outright required.

Unlocking Important Data

But not all data is available through downloads, feeds or APIs.  This is especially true of government data, various Open Data initiatives notwithstanding.  Agencies like the US Patent and Trademark Office (USPTO) and the federal Securities and Exchange Commission (SEC) have tons of data available online, but API access may require subscriptions from third parties.

Similarly, there's lots of commercial data available online that may not be neatly packaged in code-friendly formats either.  Consider airline and hotel frequent flyer/loyalty program promotions.  You can log into your account and read about them, but just try getting a list of all such promotions that may apply to a specific property or geographic area, and keeping the list up-to-date.  If you're an industry analyst wanting to perform ad hoc analytical queries across such offers, you may be really stuck.

Downside Risk

So it's Web scraping to the rescue, right?  Not exactly, because Web scraping code can be brittle.  If the layout of a data-containing Web page changes -- even by just a little -- the text patterns being searched may be rendered incorrect, and a mission-critical process may completely break down.  Fixing the broken code may involve manual inspection of the page's new markup, then updating the delimiting text fragments, which would hopefully be stored in a database, but might even be in the code itself.
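To make the brittleness concrete, here is a minimal Python sketch with the delimiters hard-coded in the extraction function. The markup, the `extract_price` helper and the extra CSS class are hypothetical; the point is only that a one-word change to the page defeats the pattern.

```python
from typing import Optional


def extract_price(page: str) -> Optional[str]:
    """Return the text between two hard-coded delimiters, or None if absent."""
    before, after = '<td class="price">', "</td>"
    _, found, rest = page.partition(before)
    if not found:
        return None  # the leading delimiter no longer appears in the page
    value, closed, _ = rest.partition(after)
    return value.strip() if closed else None


# The same data, before and after a small (hypothetical) redesign.
OLD_LAYOUT = '<tr><td class="price">12.50</td></tr>'
NEW_LAYOUT = '<tr><td class="price usd">12.50</td></tr>'  # one extra CSS class

old_result = extract_price(OLD_LAYOUT)  # the price text is found
new_result = extract_price(NEW_LAYOUT)  # None: the delimiter no longer matches
```

The value is still on the page in the new layout, but the scraper returns nothing until someone inspects the markup and edits the delimiter.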

Such an approach is neither reliable nor scalable.  Writing the code is expensive and updating it is too.  What is really needed for this kind of work is a scripting engine which determines the URLs it needs to visit, the data it needs to extract and the processing it must subsequently perform on the data.  What's more, allowing the data desired for extraction, and the delimiters around it, to be identified visually, would allow for far faster authoring and updating than would manual inspection of HTML markup.
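One way to picture such an engine is to keep the URLs, the extraction rules and the post-processing as data rather than code, so that a layout change means editing a rule instead of rewriting a program. The Python sketch below is a toy built on that assumption; `FieldRule`, `ExtractionJob`, `run_job` and the fake pages are all hypothetical names for illustration and say nothing about how any commercial product is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class FieldRule:
    """One value to pull from a page: its name, delimiters, and a converter."""
    name: str
    before: str
    after: str
    convert: Callable[[str], Any] = str


@dataclass
class ExtractionJob:
    """Declarative description of a scrape: where to go and what to take."""
    urls: List[str]
    fields: List[FieldRule]


def between(page: str, before: str, after: str) -> Optional[str]:
    """Return the first text found between the two delimiters, else None."""
    _, found, rest = page.partition(before)
    value, closed, _ = rest.partition(after)
    return value.strip() if found and closed else None


def run_job(job: ExtractionJob, fetch: Callable[[str], str]) -> List[Dict[str, Any]]:
    """Visit each URL, apply every field rule, and return one record per page."""
    records = []
    for url in job.urls:
        page = fetch(url)
        record: Dict[str, Any] = {"url": url}
        for rule in job.fields:
            raw = between(page, rule.before, rule.after)
            record[rule.name] = rule.convert(raw) if raw is not None else None
        records.append(record)
    return records


# Hypothetical pages keyed by URL, standing in for real HTTP fetches.
FAKE_SITE = {
    "http://example.com/a": '<h1>Acme Corp</h1><span id="price">12.50</span>',
    "http://example.com/b": '<h1>Globex</h1><span id="price">7.25</span>',
}

job = ExtractionJob(
    urls=list(FAKE_SITE),
    fields=[
        FieldRule("name", "<h1>", "</h1>"),
        FieldRule("price", '<span id="price">', "</span>", convert=float),
    ],
)

records = run_job(job, fetch=FAKE_SITE.__getitem__)
```

Because `fetch` is passed in, the same job definition could be pointed at a real HTTP client; a visual authoring tool would then only need to produce the `FieldRule` entries.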

An engine like this has been needed for years, but the rise of Big Data has increased the urgency, because this data is no longer needed just for simple and quick updates.  In the era of Big Data, we need to collect lots of this data and analyze it.

Making it Real

Kapow Software's Katalyst product meets the spec, and then some.  It provides all the wish list items above: visual and interactive declaration of desired URLs, data to extract and delimiting entities in the page.  So far, so good.  But Katalyst doesn't just build a black box that grabs the data for you.  Instead, it actually exposes an API around its extraction processes, thus enabling other code and other tools to extract the data directly.

That's great for public Web sites that you wish to extract data from, but it's also good for adding an API to your own internal Web applications without having to write any code.  In effect, Katalyst builds data services around existing Web sites and Web applications, does so without requiring any coding, and makes any breaking layout changes in those products minimally disruptive.
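The general idea of wrapping an extraction process in an API can be illustrated with Python's standard library. To be clear, this is not Katalyst's API, which the article does not describe; the `/price` endpoint, the `site` parameter, the `PAGES` cache and every function name here are assumptions made up for the sketch.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Optional, Tuple
from urllib.parse import parse_qs, urlparse

# Hypothetical cached pages; a real service would fetch them on demand.
PAGES: Dict[str, str] = {
    "acme": '<h1>Acme Corp</h1><span id="price">12.50</span>',
}


def extract_price(page: str) -> Optional[float]:
    """Pull the price out of the page using fixed delimiters."""
    before, after = '<span id="price">', "</span>"
    _, found, rest = page.partition(before)
    value, closed, _ = rest.partition(after)
    return float(value) if found and closed else None


def handle(path: str) -> Tuple[int, str]:
    """Map a request path such as /price?site=acme to (status, JSON body)."""
    parsed = urlparse(path)
    site = parse_qs(parsed.query).get("site", [""])[0]
    if parsed.path != "/price" or site not in PAGES:
        return 404, json.dumps({"error": "unknown resource"})
    price = extract_price(PAGES[site])
    if price is None:
        return 502, json.dumps({"error": "extraction failed"})
    return 200, json.dumps({"site": site, "price": price})


class Handler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper that delegates all logic to handle()."""

    def do_GET(self) -> None:
        status, body = handle(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Start the service on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` and requesting `http://127.0.0.1:8000/price?site=acme` would return the extracted value as JSON. Callers depend only on the endpoint, so a layout change is absorbed by updating `extract_price` in one place.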

Maybe the nicest thing about Katalyst is that it's designed with data extraction and analysis in mind, and it provides a manageability layer atop all of its data integration processes, making it perfect for Big Data applications where repeatability, manageability, maintainability and scalability are all essential.

Web Data is BI, and Big Data

Katalyst isn't just a tweaky programmer's toolkit.  It's a real, live data integration tool.  Maybe that's why Informatica, a big name in BI which just put out its 9.5 release this week, announced a strategic partnership with Kapow Software.  As a result, Informatica PowerExchange for Kapow Katalyst will be made available as part of Informatica 9.5.  Version 9.5 is the Big Data release of Informatica, with the ability to treat Hadoop as a standard data source and destination.  Integrating with this version of Informatica makes the utility of Katalyst in Big Data applications not merely a provable idea, but a product reality.


Andrew Brust

About Andrew Brust

Andrew J. Brust has worked in the software industry for 25 years as a developer, consultant, entrepreneur and CTO, specializing in application development, databases and business intelligence technology.

  • Ah yes, "big data"

    Ah yes, "big data" - looking for a needle in a haystack, even though the needle might not exist at all. Nobody said there's a needle there, you're searching for one just because there's a haystack.
    • Depends on what you're looking for

      Corporate databases, if properly analysed, have a lot to say about behavior; but the data analysed have to be relevant to the question being considered. Customer records can say a great deal about how much people are willing to spend for what, and models based on bank records can be very helpful in determining whether loan applicants will actually be able to afford to pay back what they propose to borrow, but neither is likely to say much about people's political preferences or likelihood to engage in criminal behavior.
      John L. Ries
  • Is Kapow the only one

    Andrew - I thought there are already a bunch of vendors providing such services in addition to Kapow - like BoardReader. It is possible that Kapow is providing some more sophisticated functionality. It would help to do a comparison for the benefit of the readers.
    Trendwise Analytics
    • Is Kapow the Only one

      Also heard about Convertigo, offering transactional web scraping and data extraction based on an Eclipse-powered visual tool to determine Web data to be handled.
  • real world use cases?

    I used Kapow in the distant past, and loved it for scraping.

    My questions are around how usable something like this is. Let's say that I want to run a search and examine some links/data (google, Kayak, wikipedia, etc) every minute. I now have API access to this search and can throw the results in my storage array/cluster. While I could write some validation tests to make sure everything was peachy, and reasonably maintain them across a handful of sites ... the process totally breaks down if I go to say 100 or 1000 sites.

    While Kapow makes it easier to deal with site changes, issues, etc ...they still require manual intervention.

    Let's use Google as an example. Say I want to track the top 100 adwords sites for a specific search every minute. I need to click through a number of screens, scraping content off each one. Should the navigation or content change at all, I still need to go to Kapow and update my scripts.

    Even when dealing with just one site, I can really lose data quickly if anything changes. Say I get an alert that the scrape had issues, and wake up in the middle of the night and fix it in an hour. I can't go back and get that data.

    In some cases this is fine, or the latency of the scrapes allows for more lag time. Unfortunately, one of the tenets of Big Data is never throwing away data. This applies equally to not collecting it in the first place. We generally care less about the data than about the deltas in it. This makes consistency king.

    Am I missing something here?
  • Real world use-cases?

    As Founder of Kapow Software I wanted to give a real-world example.
    Our product, Kapow Katalyst, is now used by more than 500 companies all over the world, many of them running hundreds of business-critical real-time integrations or data extraction/processing jobs. One example is Audi, which uses Kapow Katalyst for real-time integration with data providers, serving real-time, location-based queries directly into the award-winning Google Earth-based navigation system in its high-end cars, like the Audi A8, all controlled on demand by the driver of the car. Audi would NEVER provide that if there was any remote chance it would break. So let me suggest you take a new look at Kapow Katalyst today; the product has come a looooong way since the "distant past". Kapow Katalyst embeds a unique purpose-built Browser Engine that connects directly to REST services already embedded in any Web Apps, as-is. There is no scripting in Kapow Katalyst; it's all point-and-click "programming" similar to flow charts. It's shareable, and instantly available as REST, automated jobs or self-service business apps.
    • Real World Use-cases

      Stefan - this sounds interesting. I did not see any case studies on your website. Would it be possible to share this as a document?
      My email: mohan-at-TrendwiseAnalytics-dot-com
  • Great

    Excellent articles with great info. For more articles, please visit: www.biroadwarrior.com