Dispelling the "big data" myth
Summary: Big data is a silly term. Data has always been big. Instead of big rhetoric, we need a big solution.
In the last six months or so, I don't think a week has passed without my reading a headline about "big data" or receiving an email from someone who wants to speak with me about their company that supports "big data."
It's enough to make me want to get up out of my comfy chair and yell, "STOP IT. THERE'S NO SUCH THING AS BIG DATA. DATA HAS ALWAYS BEEN 'BIG.'" Seriously, I'm sick of buzzterms and those who latch onto them. And, "Big Data" is the latest one to "get stuck in my craw" as we'd say in Texas.
Data, for those of you who don't realize it, has always been big.
Here's the definition of big data, conveniently lifted from Wikipedia:
"Big data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.
Here are a few key "big data" points that I want you to remember, so I'm going to bullet point them for you.
- Data has always been big.
- Large data sets are difficult to maneuver, back up, copy, move and manage.
- Traditional relational databases (RDBMSs) have practical limitations.
- The cost of managing huge data sets is extreme.
- There's a solution.
When standard server system disk drives were 90MB (yes, megabytes), "big data" was in the gigabyte range. It was no less cumbersome in 1987 than it is now. Gigabyte-sized disks were expensive. WORM drives (now known as CD-R or DVD-R) held only 600MB each; the drive cost $3,000 and each disk cost $30. Data was big then. I should know. I produced a full CD of data per week from my HP GC/MS* at a laboratory in Dallas. It doesn't take long to tally up the amount of data I produced in a single month.
That was "big" data.
We didn't call it that. We called it "a lot of data." I think we might have interjected at least one expletive into the sentence for emphasis but it was big data.
That volume of data was hard to work with. It was nearly impossible to copy anywhere for analysis and it was expensive to store. It makes me wish I'd said, "Hey Don (my boss's name), what we have here is big data." My guess is that a beaker of methylene chloride would have sailed toward my head for that much silliness.
I did say that there's a solution to "big data." Here it is.
Simply stated, "You're doing it wrong."
If your data, like mine, is so large that you can't manage it efficiently or successfully, you're doing it wrong.
The solution is to rethink and rework data storage technology.
The days of using a single RDBMS to manage all that data are gone. When I really consider it, I'm not sure they ever existed.
Here are my suggestions for managing your "big" data (bullet pointed for your convenience):
- Archive unused or little used data.
- Use traditional RDBMSs for transactional data.
- Use NoSQL for those large, non-shrinkable data volumes.
- Tier your storage to maximize cost efficiency.
- Split data sets into manageable chunks based on function or need.
- Store data in more efficient formats to save space and speed queries.
- Don't over-normalize data.
- Use indexes.
- Query subsets or representational data sets.
- Use disk-to-disk backup for speed and reliability.
- Use the best available technology for storage, retrieval and networking.
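To make two of those suggestions concrete (indexes, and querying a representative subset), here is a minimal Python sketch using the standard-library sqlite3 module. The readings table, its columns, the idx_readings_run index name and the every-tenth-row sampling rule are all hypothetical, picked only for illustration; this is a toy under stated assumptions, not a recommendation for any particular database.

```python
import random
import sqlite3


def build_demo_db(rows=10_000):
    """Build an in-memory table of made-up instrument readings (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings (id INTEGER PRIMARY KEY, run_id INTEGER, value REAL)"
    )
    rng = random.Random(42)  # fixed seed so the demo is repeatable
    conn.executemany(
        "INSERT INTO readings (run_id, value) VALUES (?, ?)",
        ((i % 100, rng.random()) for i in range(rows)),
    )
    conn.commit()
    return conn


def query_plan(conn, sql, params=()):
    """Return SQLite's plan text; the last column of each plan row is the detail."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(str(row[-1]) for row in rows)


conn = build_demo_db()
by_run = "SELECT AVG(value) FROM readings WHERE run_id = ?"

# "Use indexes": compare the plan for the same query before and after indexing.
plan_before = query_plan(conn, by_run, (7,))
conn.execute("CREATE INDEX idx_readings_run ON readings (run_id)")
plan_after = query_plan(conn, by_run, (7,))

# "Query subsets": materialize every tenth row once, then query the small table.
conn.execute(
    "CREATE TABLE readings_sample AS SELECT * FROM readings WHERE id % 10 = 0"
)
full_avg = conn.execute("SELECT AVG(value) FROM readings").fetchone()[0]
sample_avg = conn.execute("SELECT AVG(value) FROM readings_sample").fetchone()[0]

print("before index:", plan_before)
print("after index: ", plan_after)
print("full average:", round(full_avg, 4), "sample average:", round(sample_avg, 4))
```

The exact plan wording varies between SQLite versions, but the second plan should mention the index by name. Whether a one-in-ten sample is representative depends entirely on your data; the sampling rule here is an assumption, not a rule of thumb.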
The bottom line is that we have to change our technology to accommodate so-called big data. We have to change our storage technology, our database structure, our network architecture and our retrieval methods. No, it isn't easy, but what we're doing doesn't work. It's never really worked. Filesystems aren't set up to accommodate huge databases, spanning disk volumes is a dangerous storage method and operating systems aren't up to the task of addressing such huge streams of data on current network infrastructure.
And, you can't keep scaling with our current hardware and software. Theoretical limits say we can but practical limits say that we can't.
We need to explore chip-based storage. We need to develop new data types. We need to research new filesystems and we need new data compression and delivery protocols. We need an RDP or MetaFrame type technology for data delivery across the wire. We have to think differently because data is big.
Data has always been bigger than we can effectively handle. The myth of big data is that it's something new. It isn't. Big data has been with us from the very beginning of digitized information storage and retrieval. We need to face it without buzzwords and marketing hype. We need to face the myth and work on resolving the reality.
What do you think of big data? Do you think there's a solution, and if so, what is it?
*Gas Chromatograph/Mass Spectrometer--I used to be a Chemist and the GC/MS was the best analytic instrument in the lab for organic compounds. Tuning it was a pain but the analysis it provided was a dream come true. I could go on and on.
Talkback
Next to Useless Article
- Archive unused or little used data.
Then how am I going to do a regression analysis on historical data elements? What if I am analyzing census data? "unused" or "little used" data should be properly crafted into a data cube, not written off.
- Split data sets into manageable chunks based on function or need.
Ahh....so now we are nibbling at the edges of data cubes.
- Store data in more efficient formats to save space and speed queries.
Meh again. Although you could say this is a comment about cubes, you could also be referring to storage.
- Don't over-normalize data.
Welllllll....yes..... Big Data is not about OLTP, it is about analysis. Again: nibbling at the edges of data cubes.
- Query subsets or representational data sets.
This goes back to setting up your data cubes appropriately.
Big data isn't about data volume????
It's all about the complexity
I think you missed something somewhere.
As to your database solutions, they are incredibly linear. Linear solutions will not work; try a different perspective.
I Disagree With the Title, In Some Respects
Sorting is O(n log n), meaning sorting 10 million records takes roughly 12 times as long as sorting 1 million records.
Mapping, that is taking a data object and transforming it to another data object, is O(n), so ten times more records take ten times as much time.
Folding, or reducing, which takes a list of objects and reduces them to a single value, is also O(n).
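Those growth rates are easy to check on the back of an envelope. The Python sketch below assumes an O(n log n) comparison sort and counts abstract operations rather than measuring anything; the function names are made up for illustration.

```python
import math
from functools import reduce


def sort_cost(n):
    """Abstract cost of an O(n log n) sort (a unitless operation count)."""
    return n * math.log2(n)


def linear_cost(n):
    """Abstract cost of an O(n) pass, such as a map or a fold."""
    return n


small, large = 1_000_000, 10_000_000
sort_ratio = sort_cost(large) / sort_cost(small)
linear_ratio = linear_cost(large) / linear_cost(small)

# A tiny map and fold, each touching every element exactly once.
records = [3, 1, 4, 1, 5]
doubled = list(map(lambda x: x * 2, records))        # map: one output per input
total = reduce(lambda acc, x: acc + x, records, 0)   # fold: list down to one value

print(f"sort ratio: {sort_ratio:.2f}, linear ratio: {linear_ratio:.0f}")
```

Under these assumptions the sort ratio works out to 10 x 7/6, a bit under 12, while map and fold scale by exactly 10. Real sorts on real hardware will differ once the data no longer fits in memory.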
Moore's law was sufficient to allow single processor/storage systems to scale with data. I suspect that one sees an inflection point and slope increase with regards to data per user with the internet in the mid-90s.
So, very large datasets have to be split up, which imposes latency costs via time costs in network control and communication and via the overhead to divide data and recombine the results. With very large distributed processors, the probability of failure during a process increases, and the controller has to have the ability to requery a subset and the retries add time to a job.
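The failure point can be sketched with a little arithmetic. Assuming each node fails independently with some small probability p during a job (both the independence and the 0.1% figure below are assumptions for illustration, not measurements), the chance that at least one of n nodes fails is 1 - (1 - p)^n:

```python
def job_failure_probability(nodes, p_node):
    """Chance that at least one of `nodes` fails, assuming independent failures."""
    return 1.0 - (1.0 - p_node) ** nodes


def expected_attempts(p_fail):
    """Expected tries until a subset succeeds, if each try fails independently with p_fail."""
    return 1.0 / (1.0 - p_fail)


P_NODE = 0.001  # hypothetical per-node failure probability for one job

for nodes in (1, 10, 100, 1000):
    p_job = job_failure_probability(nodes, P_NODE)
    print(f"{nodes:>5} nodes: P(at least one failure) = {p_job:.4f}")
```

With these made-up numbers, a failure that is rare on one machine becomes more likely than not somewhere around a thousand machines, which is why the controller needs a way to requery just the failed subset.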
Further speed-ups are possible via caching, but caching also adds overhead in terms of storage to contain the caches, time to check if a value is in the cache, and some form of management in order to be sure that the cache is not stale when it matters if it is.
All of this added infrastructure is not necessary in the domains for which I wrangle data. Access would be powerful enough, because the data sets are in the hundreds and not the billions. I actually use PostgreSQL, which has more power than I need but provides more scalability than my clients will need and can be deployed on all the server platforms.
I could transform the dataset management into these large-scale NoSQL / map-reducing frameworks, and I have been looking into them, but I'd be spending more time writing for moving targets: SQL and RDBMSs are well understood, while the new tools are still figuring out where the sweet spots of interfaces, cost/benefit and CAP (http://en.wikipedia.org/wiki/CAP_theorem) are found. I see this as adding hours to my development time, and the result would be achieving 0.33 seconds in response time instead of 1.2 seconds for internal applications. I work with small businesses; they don't need and won't pay for the jet-powered Pregnant Guppy (http://en.wikipedia.org/wiki/Aero_Spacelines_Pregnant_Guppy) for their data transport.
I understand where we agree. Back in the day, the slow-and-steady mainframe, which took up half the second floor, did the payroll overnight every other week. People did plan their processes around the limits of the 100% capacity jobs, and as the limits have increased, the size of the routine jobs increased. Even with the new hardware, programmers quickly found themselves at the edge of the flat world, applying all their creativity to keep the ship from going over into the realm of dragons. Think about weather forecasting which gets better with more and more comprehensive station reports. Seven day forecasts? Back when I was a young grown-up, unimaginable. And, I imagine today, meteorologists have a firm grasp on which modeling jobs are out of reach.
But my point is that something like the CAP theorem wouldn't even be thought of, except that the largest datasets are now orders of magnitude larger than pre-internet datasets and in constant flux. However, the typical (plus/minus one standard deviation) datasets are still manageable without adding the complexities of distributed processing.
If vendors and customers wish to label the suite of tools for handling tera- and peta-record datasets as BigData, to distinguish from the stuff I need (for which the tools are now given away), then I see why. These are not my grandfather's databases (which would have been filing cabinets.)
"traditional" relational database
So-called NoSQL systems achieve their performance by going back to pre-relational methods with all their inflexibility, unreliability and lack of intellectual scalability (doing something remotely difficult swiftly mires you in extreme complexity).
It's NoSQL that's traditional not RDBMSs.
You do not get it
Big Data might not be the most descriptive term, but ...
- Your examples of past data volume issues are close to irrelevant today, with cheap cloud based storage. Many people can now literally keep close to everything.
- The ability to define structure as needed is novel and useful. I don't have to know what I want to do with something when I decide to keep it.
- The ability to operate on such a massive data sink (via MapReduce) turns the DW world on its head.
Up until now, the amount of data we keep around (eventually in the warehouse) has shrunk over time due to costs. With insanely cheap storage and processing, that no longer need be true.
Big Data is....
Data Has Never Been This Big
You might not like the term, but the problem is very real--and getting worse.
The future is relational
Amazon needs reliable order processing so they use an RDBMS.
Moore's law applies here just like everywhere else.
Better implementations of RDBMSs could give far better performance without sacrificing the huge advantages of the relational model.
The disadvantages to reverting to the old-fashioned methods advocated by the NoSQL school (hierarchical, graph or hash table based) far outweigh the arguable performance advantages.
In any case NoSQL tools are fast if you happen to want to access the data from one particular direction, but desperately slow and fiendishly complicated should you wish to get at the data in a way the system designer hasn't allowed for.
Big Data methods are a passing fad; the future is relational!
Question
A good point.
Big Data Yup
Big Data is Bigger Than Big
Fortunately for the first time in IT history, and I wrote my first line of commercial code 39 years ago so have seen a fair amount of that history, we actually have (a) the volume and richness of data, (b) the technology to not just process/manage the data but to map it, integrate it, govern it, and visualize it and learning engines to help with all of this and (c) improving data analyst skills and tools - i.e. the ability to communicate about data in at least a semi-formal language that crosses IT and business lines.
All of these evolutions have come together in roughly the same timeframe, and this has reignited the "BI/Analytics" space really for the first time in a few decades. Big Data is a perfect storm, and it is transforming IT, business and the way IT and business work together. Some of the Big Data projects to date have saved lives and transformed businesses, and Big Data is the next battleground between Web 2.0 businesses and brick-and-mortar style companies.
Yes, there is hype, there are too many providers, there will be consolidations, and there is an aspect of Big Brother to Big Data. And there is stumbling around, not every project yields hoped for results, it requires new skills and new thinking and a learning curve. But Big Data is not a promise, it is already here, albeit still in the relatively early stages. And all kinds of databases and analytical platforms and clusters and nodes can participate.
Big Data is absolutely not a myth, it is indeed the next quantum level for BI/analytics, what those disciplines deliver and how they impact life and business.
Couldn't Agree More
Here are my initial thoughts on Big Data.
http://trackingbigdata.blogspot.com/
@sanskamat