MarkLogic has underlined its intent to increase links between its NoSQL database and open-source big-data platform Hadoop.
The firm's involvement with Hadoop goes back to November 2011 when it launched a Hadoop connector for its XML-based database, which can deal with large volumes of unstructured data in real time.
Now MarkLogic CEO Gary Bloom says that relationship will develop further with future iterations of the database, the engine behind the BBC's London 2012 Olympics website that handled up to 25,000 transactions per second.
"We'll continue to do more technology in our product to interface more tightly with the Hadoop environment — and in particular around tiered storage," he said.
The significance of tiered storage, according to Bloom, is that all data need not reside on the most expensive EMC disk arrays, providing options for companies to adopt storage at various prices and with varying levels of availability.
"So the service levels associated with them would tend to go down a little as you go further out, although some of those lower tiers have evolved to be very high performance," he said.
"What MarkLogic will do is if I have Hadoop and have layered my data across multiple tiers of storage, I can then search across all that data still — I don't care which tier it's in. If some of that data is offline — well, I just won't be searching that data at that point in time. I have the flexibility to stay up and running even if my archived data on a lower tier is not running."
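The behaviour Bloom describes — one search spanning every tier, with offline tiers simply skipped rather than failing the query — can be sketched in a few lines. The `Tier` class and `search_all_tiers` function below are hypothetical illustrations of the idea, not MarkLogic's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Tier:
    name: str
    online: bool
    documents: list = field(default_factory=list)

def search_all_tiers(tiers, term):
    """Return matching documents from every online tier; an offline
    lower tier is skipped so the search as a whole stays up."""
    hits = []
    for tier in tiers:
        if not tier.online:
            continue  # archived tier is down: keep running, just search less
        hits.extend(doc for doc in tier.documents if term in doc)
    return hits

# Illustrative data layered across fast, mid and archive storage
tiers = [
    Tier("ssd",     True,  ["trade 2013 AAPL", "trade 2013 GOOG"]),
    Tier("sata",    True,  ["trade 2005 AAPL"]),
    Tier("archive", False, ["trade 1995 AAPL"]),  # offline lower tier
]

print(search_all_tiers(tiers, "AAPL"))
# The offline archive tier is skipped, so the 1995 record is not returned.
```

The point of the sketch is the `continue`: the caller never needs to know which tier a document lives in, and an unavailable archive degrades the result set rather than the service.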
Bloom said the relationship with Hadoop has continued to evolve since the appearance of its connector. That launch was followed by the announcement that MarkLogic could run natively on the Hadoop distributed file system, allowing organisations to use Hadoop's batch processing to load large volumes of data and then work with that data in MarkLogic.
"If you don't have a MarkLogic stack above your Hadoop, you essentially have to build your own data-management architecture and search capability because Hadoop by itself is essentially a file system. It's good for storing information on disk and it does it very efficiently," Bloom said.
Hadoop management tools
In February, MarkLogic announced it is distributing Intel's version of Hadoop, together with Intel's tools for managing the Hadoop environment.
According to Bloom, a typical application of the MarkLogic-Hadoop combination might be in a Wall Street-type environment or the financial services sector, where in some cases 10 or 20 years of data has to be stored.
"You can't just put it out on a tape backup and say, 'I've got the data'. You want to be able to search it. Then if you go into customers doing social analytics and other things, you're talking about massive volumes of data, and I want to be able to figure out where to target my products and how to drive revenue — well, I need to be able to search all my data and I can now do it with different classes of storage," Bloom said.
"That's what they're hoping to do with Hadoop. The problem that Hadoop has had is it's a really interesting technology, and a lot of people thought Hadoop would solve what's now emerged as the big-data problem. They thought it was a standalone solution and what they're coming to realise is that Hadoop by itself doesn't do all that much," he said.
"It's really a very advanced file system. It gets the data on disk, it efficiently batch-processes and produces the data and does some pre-processing. But once you have it on disk, you have to be able to search it. That's what the MarkLogic search engine and database are for: to create the environment where you can now actually access all that data."
Funding for MarkLogic sales push
Together with financial services, MarkLogic expects to find further customers in the media, government security, insurance fraud detection and healthcare. It recently raised $25m in new funding to finance a marketing and sales push into new vertical markets.
Bloom said MarkLogic is focusing on three technology themes. Along with tiered storage, cloud computing is a priority, in particular management tools for handling burst capacity to meet peak demand.
"If I'm using Amazon, I don't want to go grab 50 nodes for my peak period of processing at 7pm and then come in at 6am and find out that I still have 50 nodes allocated to me, because I'm going to be paying for them," he said.
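The housekeeping Bloom warns about — releasing burst nodes once the peak window has passed — amounts to a simple policy check. The function below is a minimal sketch of that policy; the node names, baseline size and peak window are assumptions for illustration, not any real Amazon or MarkLogic tooling:

```python
from datetime import time

def nodes_to_release(allocated, baseline, now):
    """Outside an assumed evening peak window, return the burst nodes
    beyond the baseline fleet that should be released so they stop
    accruing charges."""
    peak_start, peak_end = time(18, 0), time(23, 0)  # assumed peak hours
    in_peak = peak_start <= now <= peak_end
    if in_peak or len(allocated) <= baseline:
        return []
    return allocated[baseline:]  # everything beyond the baseline fleet

# Bloom's scenario: 50 nodes grabbed for the 7pm peak, checked at 6am
allocated = [f"node-{i}" for i in range(50)]
print(len(nodes_to_release(allocated, baseline=10, now=time(6, 0))))
# At 6am, the 40 burst nodes beyond the 10-node baseline are flagged.
```

In practice such a check would run on a schedule and feed an actual deallocation call; the sketch only shows the decision that prevents paying for 50 nodes at 6am.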
The third area is adding semantics, exploiting MarkLogic's position as both a database and a search supplier.
"Since we provide our customers the search engine and the database, we're going to do a lot of the semantics work. We're going to pre-process a lot of that as we put the data into the database," he said.
"So as the data is being ingested, we're going to be establishing a lot of the semantics capabilities at that time and then do the final part of the semantics processing in our search engine."
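The ingest-time semantics work Bloom describes can be pictured as extracting relationships while documents are loaded, so the search engine later only has to finish the job. The toy pipeline below is a hypothetical illustration of that split, not MarkLogic's semantics implementation; a real system would use proper NLP and RDF triples rather than a naive word-count pattern:

```python
import re

def extract_triples(doc_text):
    """Pull naive subject-verb-object triples out of simple sentences
    at ingest time; purely illustrative of pre-processing on load."""
    triples = []
    for sentence in re.split(r"[.!?]", doc_text):
        words = sentence.split()
        if len(words) == 3:  # toy pattern: "Subject verb object"
            triples.append(tuple(words))
    return triples

# Triple index built while data is being ingested, queried later
index = {}

def ingest(doc_id, text):
    for triple in extract_triples(text):
        index.setdefault(triple, set()).add(doc_id)

ingest("doc1", "MarkLogic acquired funding. The deal closed very quickly.")
print(index)
```

The design point is where the work happens: the expensive extraction runs once per document at load time, leaving the search engine to do only the final matching over the pre-built index.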