ClearStory CEO: How Apache Spark is helping bring analytics to the average Joe

With a new analytics cloud service unveiled earlier this month, CEO Sharmila Mulligan explains how ClearStory's engine is shifting data insights to ordinary users.
Written by Toby Wolpe, Contributor
Sharmila Mulligan: Businesses are missing top-line numbers and competitive threats. Image: ClearStory Data

The misuse of data analytics is well documented — data being shoehorned to back up entrenched views, used selectively in petty corporate infighting, or simply misinterpreted.

But even when done correctly, with a reasonable hypothesis followed by rigorous testing, sometimes the traditional approach can come up short for the businesses employing it, according to Sharmila Mulligan, CEO of Silicon Valley startup ClearStory Data.

Those shortcomings may arise because conventional analytics is too narrow and cannot cope with the sheer volume of diverse data coming in from multiple sources, or at least cannot do so quickly enough.

"This whole notion of you look at data with an hypothesis or an intuition and you keep trying to force the view or dashboard into that — business after business is suffering from having done it that way," she said.

"[They're] literally missing top-line numbers, missing competitive threats, missing all kinds of things because they've constrained their view. Traditional analytic solutions are not really designed for the volumes and that kind of data variety."

Mulligan, who co-founded ClearStory in late 2011, said two approaches are currently running side by side in data analytics.

"There's the long-running analysis that data scientists do, which is you've got something that you need to go and analyse over a longer period to look for the anomalies and patterns before you can actually consider anything," she said.

"That's a data scientist problem: 'Let's keep observing, observing, run a model, run another model' and that whole thing keeps going. It could be an analysis you don't conclude anything out of for six or eight months. But what you do conclude at the end of it could be a tremendous finding."

On the other side is what Mulligan calls fast-cycle converged data analysis, which is about being able to analyse data during the course of each day. Her view is that companies routinely buy in data sources to augment their own and yet lack the resources in manpower and technology to exploit them.

Her company offers a back-end system based on the Apache Spark open-source data analytics cluster framework and a front-end application, which sits on top of up to 24 internal and external sources of data. Last week, ClearStory unveiled its Collaborative StoryBoards cloud service.

First, the back-end engine conducts data inference and profiling to identify dimensions and semantics for data harmonisation — spotting relationships between data sources.

At that point the blended and harmonised data is presented to the user through the front-end application, which enables a group of staff to explore the same data simultaneously, even adding more data without the need for any additional modelling.
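The profiling and harmonisation step described above can be illustrated with a minimal sketch. It is not ClearStory's engine and does not use Spark: it is a plain-Python toy, assuming each source is a small list of records, that guesses a coarse type for every column and proposes links between columns whose values overlap. The function names (`infer_type`, `profile`, `candidate_links`), the sample data and the 0.5 threshold are all hypothetical.

```python
from itertools import product


def infer_type(values):
    """Guess a coarse type for a column from its raw values."""
    def is_number(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False
    return "numeric" if all(is_number(v) for v in values) else "text"


def profile(rows):
    """Map each column name to its inferred type and its distinct values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {
        name: {"type": infer_type(vals), "values": set(vals)}
        for name, vals in columns.items()
    }


def candidate_links(source_a, source_b, threshold=0.5):
    """Suggest column pairs that may describe the same dimension.

    Two columns are linked when they share a type and the Jaccard
    overlap of their distinct values reaches the (assumed) threshold.
    """
    profile_a, profile_b = profile(source_a), profile(source_b)
    links = []
    for (name_a, col_a), (name_b, col_b) in product(profile_a.items(), profile_b.items()):
        if col_a["type"] != col_b["type"]:
            continue
        union = col_a["values"] | col_b["values"]
        overlap = len(col_a["values"] & col_b["values"]) / len(union) if union else 0.0
        if overlap >= threshold:
            links.append((name_a, name_b, round(overlap, 2)))
    return sorted(links, key=lambda link: -link[2])


# Hypothetical sources: internal sales figures and an external foot-traffic feed.
sales = [
    {"store_id": "S1", "units": "12"},
    {"store_id": "S2", "units": "7"},
    {"store_id": "S3", "units": "9"},
]
traffic = [
    {"store": "S1", "visitors": "340"},
    {"store": "S2", "visitors": "295"},
    {"store": "S4", "visitors": "410"},
]

print(candidate_links(sales, traffic))  # [('store_id', 'store', 0.5)]
```

Here the differently named `store_id` and `store` columns are flagged as a likely shared dimension because half of their combined values match, while the numeric columns are left unlinked. A production system would presumably rely on far richer signals than raw value overlap.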

"That's where collaboration comes in. On the long-cycle, data scientist stuff, if you send someone something in PowerPoint once in six months or three months, it's OK. But when you're looking at things intra-day and daily, you can't afford to have people looking at inconsistent views," Mulligan said.

"By bringing different people together across the organisation through the front-end app and collaborating in real time on these insights, which they can manoeuvre through themselves, they are able to reach observations they couldn't otherwise reach before. You do away with those traditional rigid pre-constrained views."

This approach involves two types of users: the employees who will ultimately consume and analyse the data, and the data stewards who determine which sources are to be used for a given recurring analysis, whether the data lies in repositories or in external or syndicated feeds.

"Most of the users are business frontline users. They could be mid-office or front-office people and they're the ones who have the business questions and the business problems and are looking through the insights to be able to explore deeper and get to the answers," she said.

"At the back-end and how we do data harmonisation, we have invested a lot in IP we built around Spark. We were involved with Spark when it was still a project at Berkeley and the head of Spark is an adviser. We've put a lot into Spark to enable you to do this very fast back-and-forth analysis because there's no way you can do it unless all that data is sitting in a very efficient in-memory layer."

Mulligan said ClearStory technology is particularly being taken up by companies in consumer packaged goods, media and entertainment, healthcare and retail.

"The factors that contribute to better close rates in a store are a whole variety of data signals, from customer service problems, to parking lots being too full so fewer people get into the store, to foot traffic by department, to merchandising. There are a lot of factors beyond the typical things you'd think about that contribute to close rate," she said.

Companies such as food-products firm Danone have a number of people looking at a potential supply-chain problem when they detect a drop in sales from the figures coming in from point-of-sale providers.

"They have all the data from across the whole supply chain to understand is it an on-shelf saleability problem, is it that the inventory didn't arrive, is it because the competitor dropped prices 10 cents, is it because the product is expired but it's sitting on the shelf?" Mulligan said.

User governance is an important issue, with data stewards and experts given different rights to those enjoyed by users further down the reporting line.

"This ability to have the right user permissions to be part of a data story and be able to see things as they update, it's a pretty powerful thing. So many companies have told us how management has concluded the wrong thing because they've received six dashboards that were too late or they interpreted the wrong thing or they didn't even see the insight because of the constraint of how IT is setting up the dashboards," she said.

"All of this becomes more real time and collaborative so that all the people who need to, see it. It takes away all that missing information that happens when you otherwise have a very rigid way of how data and information are passed along from person to person."
