In the last six months or so, I don't think a week has passed without my reading a headline about "big data" or receiving an email from someone wanting to speak with me about their company that supports "big data."
It's enough to make me want to get up out of my comfy chair and yell, "STOP IT. THERE'S NO SUCH THING AS BIG DATA. DATA HAS ALWAYS BEEN 'BIG.'" Seriously, I'm sick of buzzterms and those who latch onto them. And "Big Data" is the latest one to "get stuck in my craw," as we'd say in Texas.
Data, for those of you who don't realize it, has always been big.
Here's the definition of big data, conveniently lifted from Wikipedia:
"Big data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.
Here are a few key "big data" points that I want you to remember, so I'm going to bullet point them for you.
- Data has always been big.
- Large data sets are difficult to maneuver, back up, copy, move and manage.
- Traditional relational databases (RDBMSs) have practical limitations.
- The cost of managing huge data sets is extreme.
- There's a solution.
When standard server system disk drives were 90MB (yes, megabytes), "big data" was in the gigabyte range. It was no less cumbersome in 1987 than it is now. Gigabyte-sized disks were expensive. WORM drives (now known as CD-R or DVD-R) held only 600MB each; the drive cost $3,000 and each disk cost $30. Data was big then. I should know. I produced a full CD of data per week from my HP GC/MS* at a laboratory in Dallas. It doesn't take long to tally up the amount of data I produced in a single month.
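For anyone who wants the tally spelled out, here is a rough back-of-the-envelope sketch. The disc size, disc price and drive size come from the paragraph above; averaging 52 weeks over 12 months is an assumption made only for illustration.

```python
import math

# Figures from the text: one full 600MB disc per week, $30 per disc,
# and 90MB standard server drives.
DISC_MB = 600
DISC_COST_USD = 30
SERVER_DRIVE_MB = 90
# Assumption: 52 weeks per year, averaged evenly over 12 months.
WEEKS_PER_YEAR = 52

yearly_mb = DISC_MB * WEEKS_PER_YEAR
monthly_mb = yearly_mb / 12

# Whole 90MB drives it would take to hold one month of output.
drives_per_month = math.ceil(monthly_mb / SERVER_DRIVE_MB)

# Blank media alone, not counting the $3,000 drive.
yearly_media_cost = DISC_COST_USD * WEEKS_PER_YEAR

print(f"Monthly output: {monthly_mb:.0f} MB (~{monthly_mb / 1000:.1f} GB)")
print(f"90MB drives needed per month: {drives_per_month}")
print(f"Blank discs per year: ${yearly_media_cost}")
```

Under those assumptions, one instrument filled roughly 2.6GB a month, or about 29 of the standard server drives of the day.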
That was "big" data.
We didn't call it that. We called it "a lot of data." I think we might have interjected at least one expletive into the sentence for emphasis, but it was big data.
That volume of data was hard to work with. It was nearly impossible to copy anywhere for analysis and it was expensive to store. It makes me wish I'd said, "Hey Don (my boss's name), what we have here is big data." My guess is that a beaker of methylene chloride would have sailed toward my head for that much silliness.
I did say that there's a solution to "big data." Here it is.
Simply stated, "You're doing it wrong."
If your data, like mine, is so large that you can't manage it efficiently or successfully, you're doing it wrong.
The solution is to rethink and rework data storage technology.
The days of using a single RDBMS to manage all that data are gone. When I really consider it, I'm not sure they ever existed.
Here are my suggestions for managing your "big" data (bullet pointed for your convenience):
- Archive unused or little-used data.
- Use traditional RDBMSs for transactional data.
- Use NoSQL for those large, non-shrinkable data volumes.
- Tier your storage to maximize cost efficiency.
- Split data sets into manageable chunks based on function or need.
- Store data in more efficient formats to save space and speed queries.
- Don't over-normalize data.
- Use indexes.
- Query subsets or representational data sets.
- Use disk-to-disk backup for speed and reliability.
- Use the best available technology for storage, retrieval and networking.
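A few of these suggestions (archiving little-used data, using indexes and querying a subset) can be sketched with nothing more than Python's built-in `sqlite3` module. This is a minimal illustration, not a recommended design: the `readings` table, its columns, the cutoff date and the made-up rows are all hypothetical, and a real system would archive to cheaper storage rather than to a second table in the same database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table of instrument readings; all names and values are invented.
conn.execute(
    "CREATE TABLE readings ("
    "id INTEGER PRIMARY KEY, run_date TEXT, compound TEXT, abundance REAL)"
)
rows = [
    (f"1987-{month:02d}-01", f"compound_{i % 5}", float(i))
    for month in range(1, 13)
    for i in range(100)
]
conn.executemany(
    "INSERT INTO readings (run_date, compound, abundance) VALUES (?, ?, ?)", rows
)

# "Use indexes": index the column that date-range queries filter on.
conn.execute("CREATE INDEX idx_readings_run_date ON readings (run_date)")


def archive_before(conn, cutoff):
    """Move rows older than cutoff into readings_archive; return rows moved."""
    # "WHERE 0" copies the column layout without copying any rows.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings_archive AS "
        "SELECT * FROM readings WHERE 0"
    )
    conn.execute(
        "INSERT INTO readings_archive SELECT * FROM readings WHERE run_date < ?",
        (cutoff,),
    )
    cur = conn.execute("DELETE FROM readings WHERE run_date < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


def sample_mean(conn, every_nth):
    """Average abundance over every nth row instead of the whole table."""
    row = conn.execute(
        "SELECT AVG(abundance) FROM readings WHERE id % ? = 0", (every_nth,)
    ).fetchone()
    return row[0]


# "Archive unused or little-used data": keep only the second half of the year live.
archived = archive_before(conn, "1987-07-01")
live = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
in_archive = conn.execute("SELECT COUNT(*) FROM readings_archive").fetchone()[0]

# "Query subsets": estimate from one row in ten.
estimate = sample_mean(conn, 10)

print(f"archived={archived} live={live} in_archive={in_archive}")
print(f"sampled mean abundance: {estimate}")
```

Taking every nth row is the crudest possible way to pick a subset, and it can skew the answer when the data has a regular pattern, so treat the sampled figure as an estimate rather than a substitute for the full query.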
The bottom line is that we have to change our technology to accommodate so-called big data. We have to change our storage technology, our database structure, our network architecture and our retrieval methods. No, it isn't easy, but what we're doing doesn't work. It's never really worked. Filesystems aren't set up to accommodate huge databases, spanning disk volumes is a dangerous storage method, and operating systems aren't up to the task of addressing such huge streams of data on current network infrastructure.
And we can't keep scaling with our current hardware and software. Theoretical limits say we can, but practical limits say that we can't.
We need to explore chip-based storage. We need to develop new data types. We need to research new filesystems, and we need new data compression and delivery protocols. We need an RDP- or MetaFrame-type technology for data delivery across the wire. We have to think differently because data is big.
Data has always been bigger than we can effectively handle. The myth of big data is that it's something new. It isn't. Big data has been with us from the very beginning of digitized information storage and retrieval. We need to face it without buzzwords and marketing hype. We need to face the myth and work on resolving the reality.
What do you think of big data? Do you think there's a solution, and if so, what is it?
*Gas Chromatograph/Mass Spectrometer--I used to be a chemist and the GC/MS was the best analytical instrument in the lab for organic compounds. Tuning it was a pain, but the analysis it provided was a dream come true. I could go on and on.