In the Big Data world, we're used to accepting compromises and arbitrary limitations. For example, accepting the benefits of Hadoop means working with the Hadoop Distributed File System, its immutable files, its technique of keeping three copies of everything by default, and the availability/reliability issues of the Hadoop cluster's namenode. As a further illustration of fixed limitations, iterative report design, even in a self-service scenario, almost always requires an open, persistent connection to the data source(s).
We accept these limitations based on an intuitive sense of needing to give in order to get, but that doesn't mean we have to like it, or meekly accept it. What if we could transcend these limitations and still get our work done? Today, in separate product launches, Cleversafe and Jaspersoft seek to provide such breakthroughs.
Cleversafe swaps out HDFS
Assuming it works as advertised, Cleversafe's company name is a fair reflection of its Hadoop architecture. While other HDFS alternatives exist for Hadoop (for example, MapR's Hadoop distro, which can mount HDFS-compatible NFS volumes), Cleversafe's Slicestor appliance nodes retain HDFS' distributed nature and maintain fault tolerance too. Cleversafe does this with "information dispersal" slices: spreading the data around different nodes in the cluster, employing Erasure Coding -- a scheme that allows reconstruction of data from a subset of storage nodes, and eliminates single points of failure without the overhead of HDFS' complete replication.
Meanwhile, the data is also stored in conventional format on the nodes where it is expected to be used for computation. The conventional storage assures fast MapReduce operations, and the striped storage assures fault tolerance, without the need (and network traffic and management overhead) to keep multiple full copies of the data.
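To make the erasure coding idea concrete, here is a minimal sketch in Python. It is not Cleversafe's algorithm: it uses a single XOR parity slice, the simplest possible erasure code, which survives the loss of any one slice. Production dispersal schemes typically use stronger codes, such as Reed-Solomon, that tolerate several simultaneous losses. The function names `disperse` and `reconstruct` are hypothetical, chosen only for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def disperse(data: bytes, k: int) -> list:
    """Split data into k equal slices plus one XOR parity slice.

    Hypothetical helper: a toy stand-in for an information dispersal step.
    """
    slice_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(slice_len * k, b"\0")  # pad so slices are equal length
    slices = [padded[i * slice_len:(i + 1) * slice_len] for i in range(k)]
    parity = reduce(xor_bytes, slices)
    return slices + [parity]


def reconstruct(slices: list, original_len: int) -> bytes:
    """Rebuild the data when at most one slice is missing (marked as None)."""
    missing = [i for i, s in enumerate(slices) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can recover only one lost slice")
    slices = list(slices)
    if missing:
        present = [s for s in slices if s is not None]
        # XOR of all surviving slices equals the lost one.
        slices[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity slice and the padding.
    return b"".join(slices[:-1])[:original_len]


data = b"hello big data world"
stored = disperse(data, k=4)   # 4 data slices + 1 parity slice
stored[2] = None               # simulate a failed storage node
recovered = reconstruct(stored, len(data))
```

In this toy setup, four data slices plus one parity slice occupy 1.25 times the original size, versus 3 times for three full copies, which is the kind of saving the dispersal approach is after.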
Namenode issues disappear as well, since a Cleversafe cluster's accesser nodes federate and cover for each other, and the metadata is split up along with the data itself. Although various high availability namenode technologies are appearing in the major Hadoop distributions now, they nonetheless still use a single central namenode at any given time. Keeping a warm spare around is not the same thing as having metadata/directory services responsibilities shared among a collection of active nodes.
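The difference between a warm spare and shared responsibility can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how Cleversafe's accesser nodes actually coordinate: it uses rendezvous (highest-random-weight) hashing, a generic technique, to spread metadata ownership across all active nodes and to let another node take over when one fails. The node names and the functions `owner_order` and `locate` are hypothetical.

```python
import hashlib


def owner_order(path: str, nodes: list) -> list:
    """Rank every node for a given path using rendezvous hashing.

    Each (node, path) pair gets a deterministic score; the highest score wins.
    """
    def score(node: str) -> str:
        return hashlib.sha256(f"{node}:{path}".encode()).hexdigest()

    return sorted(nodes, key=score, reverse=True)


def locate(path: str, nodes: list, alive: set) -> str:
    """Return the highest-ranked live node responsible for this path's metadata."""
    for node in owner_order(path, nodes):
        if node in alive:
            return node
    raise RuntimeError("no live metadata node available")


nodes = ["accesser-1", "accesser-2", "accesser-3"]  # hypothetical node names
path = "/logs/2012/07/part-00000"

primary = locate(path, nodes, alive=set(nodes))
# If the primary fails, the next-ranked node covers for it automatically.
fallback = locate(path, nodes, alive=set(nodes) - {primary})
```

Every node is active and owns a share of the directory, so losing one node shifts only that node's share to its peers rather than triggering a failover of the whole namespace.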
Although Cleversafe clusters are appliance-based, the appliances nonetheless use commodity processors and storage. The added value comes from tuning and optimization, and the unique storage software subsystem. Cleversafe storage runs about $500 per Terabyte, and can be less depending on total storage size. On the MapReduce side, Cleversafe uses Cloudera's Distribution Including Apache Hadoop (CDH).
Jaspersoft: we don't need no stinkin' connections
While Cleversafe seeks to liberate data specialists from the tyranny of the HDFS namenode, Jaspersoft seeks to do likewise for end-users with respect to original data sources. With its new 4.7 release, Jaspersoft has really focused on the reporting scenario and has taken the position that modifying the design of a report shouldn't require going back to the server, if the report already has the data it needs.
Jaspersoft reports now carry with them a full offline snapshot containing the data set, the original query and the formatting information. From there, users can take advantage of Jaspersoft's browser based report tooling as if they were working in a connected capacity -- the only difference is that they'll be querying the offline cache.
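As a rough illustration of the concept, the sketch below bundles a data set, its original query and formatting information into one serializable object, then re-sorts and filters against the cached rows with no database connection. It is not Jaspersoft's API or file format: the `ReportSnapshot` class, its fields and the sample data are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ReportSnapshot:
    """Hypothetical offline snapshot: data set, original query and formatting."""
    query: str
    columns: list
    rows: list
    formatting: dict = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the whole snapshot so it can travel with the report."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "ReportSnapshot":
        """Restore a snapshot without touching the original data source."""
        return cls(**json.loads(text))

    def sorted_by(self, column: str, descending: bool = False) -> list:
        """Re-sort the cached rows locally."""
        idx = self.columns.index(column)
        return sorted(self.rows, key=lambda r: r[idx], reverse=descending)

    def where(self, column: str, predicate) -> list:
        """Filter the cached rows locally."""
        idx = self.columns.index(column)
        return [r for r in self.rows if predicate(r[idx])]


snapshot = ReportSnapshot(
    query="SELECT region, sales FROM orders_summary",  # illustrative query
    columns=["region", "sales"],
    rows=[["East", 120], ["West", 95], ["North", 140]],
    formatting={"sales": "currency"},
)

# Round-trip through JSON, then redesign the report against the cache only.
offline = ReportSnapshot.from_json(snapshot.to_json())
top_first = offline.sorted_by("sales", descending=True)
big_regions = offline.where("sales", lambda v: v > 100)
```

Because the query text rides along with the data, a connected session could still refresh the snapshot later, while day-to-day redesign work never touches the source.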
What's especially interesting here is that this disconnected cache interactivity capability is to be included in Jaspersoft's free, open source Community Edition. This opens up interesting, royalty-free embedding opportunities for developers. And given that the Community Edition, according to Jaspersoft, is often used to build reports on transactional databases, the availability of the offline snapshot cache will provide end-users with a datamart of sorts, thus easing stress on the production database.
On the Mobile side, Jaspersoft is introducing a native Android app for smartphones running that operating system. The Android native smartphone app joins the native iPhone app Jaspersoft already had on offer. For the tablet form factor, Jaspersoft is sticking with the browser and HTML 5.
As we saw with the many announcements around last month's Hadoop Summit, Big Data companies are working hard to bring Hadoop up to Enterprise quality expectations and the NoSQL and Open Source BI companies are working hard to make their layers stack up as well. As these Enterprise efforts have progressed, so many point solutions have emerged that there is now some risk of fragmentation in the platform.
But I think the likely scenario is one of evolution, where the best new approaches to storage, high availability and batch/online moderation (from amongst the many permutations proffered) will be widely adopted and the less popular approaches may fade away.
This is a normal part of software maturity and an overall good sign for Big Data.