Dispelling the "big data" myth

Summary: Big data is a silly term. Data has always been big. Instead of big rhetoric, we need a big solution.

TOPICS: Big Data

In the last six months or so, I don't think a week has passed in which I haven't read a headline about "big data" or received an email from someone wanting to speak with me about their company that supports "big data."

It's enough to make me want to get up out of my comfy chair and yell, "STOP IT. THERE'S NO SUCH THING AS BIG DATA. DATA HAS ALWAYS BEEN 'BIG.'" Seriously, I'm sick of buzzterms and those who latch onto them. And, "Big Data" is the latest one to "get stuck in my craw" as we'd say in Texas.

Data, for those of you who don't realize it, has always been big.

Here's the definition of big data, conveniently lifted from Wikipedia:

"Big data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.

Here are a few key "big data" points that I want you to remember, so I'm going to bullet point them for you.

  • Data has always been big.
  • Large data sets are difficult to maneuver, back up, copy, move, and manage.
  • Traditional relational databases (RDBMSs) have practical limitations.
  • The cost of managing huge data sets is extreme.
  • There's a solution.

When standard server system disk drives were 90MB (yes, megabytes), "big data" was in the gigabyte range. It was no less cumbersome in 1987 than it is now. Gigabyte-sized disks were expensive. WORM drives (the write-once ancestors of today's CD-R and DVD-R) held only 600MB each; the drive cost $3,000 and each disk cost $30. Data was big then. I should know. I produced a full CD of data per week from my HP GC/MS* at a laboratory in Dallas. It doesn't take long to tally up the amount of data I produced in a single month.

That was "big" data.

We didn't call it that. We called it "a lot of data." I think we might have interjected at least one expletive into the sentence for emphasis, but it was big data.

That volume of data was hard to work with. It was nearly impossible to copy anywhere for analysis, and it was expensive to store. It makes me wish I'd said, "Hey Don (my boss's name), what we have here is big data." My guess is that a beaker of methylene chloride would have sailed toward my head for that much silliness.

I did say that there's a solution to "big data." Here it is.

Simply stated, "You're doing it wrong."

If your data, like mine, is so large that you can't manage it efficiently or successfully, you're doing it wrong.

The solution is to rethink and rework data storage technology.

The days of using a single RDBMS to manage all that data are gone. When I really consider it, I'm not sure they ever existed.

Here are my suggestions for managing your "big" data (bullet pointed for your convenience):

  • Archive unused or little used data.
  • Use traditional RDBMSs for transactional data.
  • Use NoSQL for those large, non-shrinkable data volumes.
  • Tier your storage to maximize cost efficiency.
  • Split data sets into manageable chunks based on function or need.
  • Store data in more efficient formats to save space and speed queries.
  • Don't over-normalize data.
  • Use indexes.
  • Query subsets or representational data sets.
  • Use disk-to-disk backup for speed and reliability.
  • Use the best available technology for storage, retrieval and networking.
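The first suggestion above — archiving unused or little-used data, often combined with storage tiering — can be sketched in a few lines. This is a minimal illustration, not a prescription from the article; the record shape, the `last_accessed` field name, and the one-year cutoff are all hypothetical:

```python
from datetime import datetime, timedelta

# Illustrative policy: anything untouched for a year goes to cheap storage.
ARCHIVE_AFTER = timedelta(days=365)

def tier_for(record, now=None):
    """Route a record to 'hot' or 'archive' storage by last access time.

    `record` is assumed to be a dict carrying a `last_accessed` datetime;
    real systems would read this from table metadata or access logs.
    """
    now = now or datetime.utcnow()
    if now - record["last_accessed"] > ARCHIVE_AFTER:
        return "archive"
    return "hot"
```

A batch job could then sweep the table nightly, moving "archive" rows to cheaper, slower storage while keeping the transactional tables small.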

The bottom line is that we have to change our technology to accommodate so-called big data. We have to change our storage technology, our database structure, our network architecture and our retrieval methods. No, it isn't easy, but what we're doing doesn't work. It's never really worked. Filesystems aren't set up to accommodate huge databases, spanning disk volumes is a dangerous storage method, and operating systems aren't up to the task of addressing such huge streams of data on current network infrastructure.

And we can't keep scaling with our current hardware and software. Theoretical limits say we can; practical limits say we can't.

We need to explore chip-based storage. We need to develop new data types. We need to research new filesystems, and we need new data compression and delivery protocols. We need an RDP- or MetaFrame-type technology for data delivery across the wire. We have to think differently because data is big.

Data has always been bigger than we can effectively handle. The myth of big data is that it's something new. It isn't. Big data has been with us from the very beginning of digitized information storage and retrieval. We need to face it without buzzwords and marketing hype. We need to face the myth and work on resolving the reality.

What do you think of big data? Do you think there's a solution, and if so, what is it?

*Gas Chromatograph/Mass Spectrometer--I used to be a Chemist and the GC/MS was the best analytic instrument in the lab for organic compounds. Tuning it was a pain but the analysis it provided was a dream come true. I could go on and on.



Kenneth 'Ken' Hess is a full-time Windows and Linux system administrator with 20 years of experience with Mac, Linux, UNIX, and Windows systems in large multi-data center environments.



Comments
  • Next to Useless Article

    I agree with your title. I disagree with many of your recommendations. I think you missed the fundamental importance of data cubes and their supremacy in "Big Data". "Big Data" is more about drawing business intelligence from disparate data sources and less about the volume of data. "Big Data" could actually be very small. Most of your recommendations were based on managing "volume". Some were counter-intuitive to data analysis - and these are the recommendations that do the reader the greatest disservice.

    • Archive unused or little used data.

    Then how am I going to do a regression analysis on historical data elements? What if I am analyzing census data? "unused" or "little used" data should be properly crafted into a data cube, not written off.

    • Split data sets into manageable chunks based on function or need.

    Ahh....so now we are nibbling at the edges of data cubes.

    • Store data in more efficient formats to save space and speed queries.

    Meh again. Although you could say this is a comment about cubes, you could also be referring to storage.

    • Don't over-normalize data.

    Welllllll....yes..... Big Data is not about OLTP, it is about analysis. Again: nibbling at the edges of data cubes.

    • Query subsets or representational data sets.

    This goes back to setting up your data cubes appropriately.
    Your Non Advocate
    • Big data isn't about data volume?

      It's ALL about the volume. That's why they call it "big."
      • It's all about the complexity

        There is a correlation between volume and complexity. But things like regression analysis of sales data in a geographical market may not be "big". "Big Data" is about making sense of and analyzing data for business value, not about collecting rows in a database table.
        Your Non Advocate
      • I think you missed something somewhere.

        'Big Data' is akin to 'Big Brother' in the sense that businesses are using the data to 'peep' into the habits of their customers. I agree the term is asinine, but your article is pointless in relation to what Big Data is truly referencing.
        As to your database solutions they are incredibly linear, linear solutions will not work, try a different perspective.
  • I Disagree With the Title, In Some Respects

    Just as Newtonian physics becomes useless at very high speeds, techniques for processing data break down with very large data sets.

    Sorting is O(n log n), meaning sorting 10 million records takes roughly 12 times as long as sorting 1 million records.

    Mapping, that is taking a data object and transforming it to another data object, is O(n), so ten times more records take ten times as much time.

    Folding, or reducing, which takes a list of objects and reduces it to a single value, is also O(n).
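    These three claims are easy to check with a few lines of Python (the helper name below is mine, not the commenter's):

    ```python
    import math
    from functools import reduce

    def sort_cost_ratio(n_small, n_large):
        """Ratio of comparison-sort work, modeled as n log n, between two sizes."""
        return (n_large * math.log2(n_large)) / (n_small * math.log2(n_small))

    # Sorting 10 million records vs 1 million: ~11.7x the work, not just 10x.
    ratio = sort_cost_ratio(1_000_000, 10_000_000)

    # Mapping and folding are both single linear passes over the data:
    records = [3, 1, 4, 1, 5]
    mapped = [r * 2 for r in records]           # O(n) transform
    total = reduce(lambda a, b: a + b, mapped)  # O(n) fold to one value
    ```

    The ratio shows why n log n growth is only slightly worse than linear at these sizes — the real trouble starts when the data no longer fits on one machine.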

    Moore's law was sufficient to allow single-processor/storage systems to scale with data. I suspect one sees an inflection point and a slope increase in data per user with the internet in the mid-90s.

    So, very large datasets have to be split up, which imposes latency costs via time costs in network control and communication and via the overhead to divide data and recombine the results. With very large distributed processors, the probability of failure during a process increases, and the controller has to have the ability to requery a subset and the retries add time to a job.
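    The split-process-recombine loop with requeries can be sketched as follows. This is a single-process toy, not a real distributed controller; the function name and retry policy are illustrative:

    ```python
    def run_distributed(data, n_chunks, worker, max_retries=3):
        """Split `data` into chunks, run `worker` on each, and requery
        (retry) any chunk whose worker fails -- the controller overhead
        described above."""
        size = max(1, len(data) // n_chunks)
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        results = []
        for chunk in chunks:
            for attempt in range(max_retries):
                try:
                    results.append(worker(chunk))
                    break
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # give up after max_retries failures
        return results

    # The controller also pays to recombine the partial results:
    partials = run_distributed(list(range(100)), 4, sum)
    total = sum(partials)
    ```

    Every layer here — splitting, retrying, recombining — is pure overhead that a single-machine job never pays.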

    Further speed-ups are possible via caching, but caching also adds overhead in terms of storage to contain the caches, time to check if a value is in the cache, and some form of management in order to be sure that the cache is not stale when it matters if it is.
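    The cache bookkeeping the commenter describes — storage for entries, a freshness check on every read, and expiry management — shows up even in a toy time-to-live cache. The `TTLCache` name and API below are mine:

    ```python
    import time

    class TTLCache:
        """Tiny cache whose entries expire after `ttl_seconds`, so stale
        values are never served. The timestamps and expiry checks are
        exactly the management overhead described above."""

        def __init__(self, ttl_seconds):
            self.ttl = ttl_seconds
            self._store = {}  # key -> (value, stored_at)

        def get(self, key, compute):
            entry = self._store.get(key)
            if entry is not None:
                value, stored_at = entry
                if time.monotonic() - stored_at < self.ttl:
                    return value  # fresh hit: skip the expensive compute
            value = compute(key)  # miss or stale: recompute and restamp
            self._store[key] = (value, time.monotonic())
            return value
    ```

    For read-heavy workloads where a slightly stale answer is acceptable, this trade of memory and bookkeeping for recomputation usually wins; when staleness matters, the invalidation logic grows quickly.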

    All of this added infrastructure is unnecessary in the domains for which I wrangle data. Access would be powerful enough, because the data sets are in the hundreds, not the billions. I actually use PostgreSQL, which has more power than I need but provides more scalability than my clients will ever need and can be deployed on all the server platforms.

    I could move dataset management onto these large-scale NoSQL / map-reduce frameworks (and I have been looking into them), but I'd be spending more time writing for moving targets: SQL and RDBMSs are well understood, while the new tools are still figuring out where the sweet spots of interfaces, cost/benefit and CAP (http://en.wikipedia.org/wiki/CAP_theorem) are found. I see this as adding hours to my development time so that internal applications respond in 0.33 seconds instead of 1.2. I work with small businesses; they don't need, and won't pay for, the jet-powered Pregnant Guppy (http://en.wikipedia.org/wiki/Aero_Spacelines_Pregnant_Guppy) for their data transport.

    I understand where we agree. Back in the day, the slow-and-steady mainframe, which took up half the second floor, did the payroll overnight every other week. People did plan their processes around the limits of the 100% capacity jobs, and as the limits have increased, the size of the routine jobs increased. Even with the new hardware, programmers quickly found themselves at the edge of the flat world, applying all their creativity to keep the ship from going over into the realm of dragons. Think about weather forecasting which gets better with more and more comprehensive station reports. Seven day forecasts? Back when I was a young grown-up, unimaginable. And, I imagine today, meteorologists have a firm grasp on which modeling jobs are out of reach.

    But my point is that something like the CAP theorem wouldn't even be thought of, except that the largest datasets are now magnitudes larger than pre-internet datasets and in constant flux. However the typical (plus/minus one standard deviation) datasets are still manageable without adding the complexities of distributed processing.

    If vendors and customers wish to label the suite of tools for handling tera- and peta-record datasets as BigData, to distinguish from the stuff I need (for which the tools are now given away), then I see why. These are not my grandfather's databases (which would have been filing cabinets.)
  • "traditional" relational database

    A strange term to use for the most up to date management systems we have.

    So called NoSQL systems achieve their performance by going back to pre-relational methods with all their inflexibility, unreliability and lack of intellectual scalability (doing something remotely difficult swiftly mires you in extreme complexity).

    It's NoSQL that's traditional not RDBMSs.
  • You do not get it

    Big Data is not just about the volume. It is more about how to use the data (structured or unstructured) for business intelligence, segmentation, profiling, risk assessment, etc. Read the full McKinsey report about Big Data for more details: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
  • Big Data might not be the most descriptive term, but ...

    The business cases around storage, retention and usage have surely changed.

    - Your examples of past data volume issues are close to irrelevant today, with cheap cloud-based storage. Many people can now literally keep close to everything.
    - The ability to define structure as needed is novel and useful. I don't have to know what I want to do with something when I decide to keep it.
    - The ability to operate on such a massive data sink (via MapReduce) turns the DW world on its head.
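    The MapReduce model mentioned above reduces to two phases: map each input to key/value pairs, then reduce all pairs sharing a key. A minimal single-machine word-count sketch (in a real framework, the map tasks would run in parallel across nodes):

    ```python
    from collections import Counter
    from itertools import chain

    def map_phase(document):
        """Map: emit a (word, 1) pair for each word in one document."""
        return [(word, 1) for word in document.split()]

    def reduce_phase(pairs):
        """Reduce: sum the counts for each distinct word."""
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    docs = ["big data big hype", "data is big"]
    all_pairs = chain.from_iterable(map_phase(d) for d in docs)
    word_counts = reduce_phase(all_pairs)
    # word_counts == {"big": 3, "data": 2, "hype": 1, "is": 1}
    ```

    The point for the warehouse world is that structure (here, the word key) is imposed at query time, not at load time.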

    Up until now, the amount of data we keep around (eventually in the warehouse) has shrunk over time due to costs. With insanely cheap storage and processing, that no longer need be true.
  • Big Data is....

    too broad a term to be meaningful in day-to-day conversation without some sort of additional context such as "big data predictive analytics". Alone it encompasses notions of volume, data types, data processing, analytics, etc and pundits often use it as a catch-all term for any type of handling of enormous volumes of data.
  • Data Has Never Been This Big

    With sites like Google and Facebook gathering and processing hundreds of thousands of terabytes of data per day--if not per hour--it is obvious that conventional solutions no longer work. We are not talking about Moore's Law scalability here. This is far, far greater. Not only that, but time is literally money. Meeting SLAs is paramount. Amazon, for example, has to manage literally millions of SKUs. If the customer clicks on a link and has to wait more than a second it would cost Amazon millions of dollars per month.

    You might not like the term, but the problem is very real--and getting worse.
    • The future is relational

      Google and Facebook only have those volumes of data because they choose to gather it. Accuracy and consistency of data are not issues for Google.

      Amazon needs reliable order processing so they use an RDBMS.

      Moore's law applies here just like everywhere else.

      Better implementations of RDBMSs could give far better performance without sacrificing the huge advantages of the relational model.

      The disadvantages to reverting to the old-fashioned methods advocated by the NoSQL school (hierarchical, graph or hash table based) far outweigh the arguable performance advantages.

      In any case NoSQL tools are fast if you happen to want to access the data from one particular direction, but desperately slow and fiendishly complicated should you wish to get at the data in a way the system designer hasn't allowed for.

      Big Data methods are a passing fad, the future is relational!
  • Question

    I may get something thrown at me for suggesting this, but isn't there any place in your list for carefully choosing what data you really need to collect in the first place?
    • A good point.

      I did ask for solutions at the end of the post. That is certainly one to consider. Thanks.
  • Big Data Yup

    I have to agree with some of this. Big data does exist. I think the Googles, Apples, Microsofts, Facebooks, and other large companies that collect everything about every person and every device that even looks at them have "Big Data".

    I am tired of people throwing around the term "Big Data" when they talk to me. "We have 'Big Data' and need a solution." Really? What "Big Data" do you have? They can't answer; they just read it somewhere. Why do you think you have "Big Data"? "We have a database that is 5TB in size; that is 'Big Data'." No, it is not. If you think that a 5TB database is "Big Data", go find another occupation! Too many folks throw around buzzwords to make themselves feel important and educated when, in fact, they look like idiots.

    Google has "Big Data". When you get to the point that you are collecting TB upon TB of data and need to aggregate it, manipulate it, or data mine it, get back to me. If you have a 5 or 10TB database where 90% of the data is just there because no one took the time to archive it or delete it, or because you don't know why the data even exists in the database, then you don't have "Big Data"; you have "Stupid Data".
  • Big Data is Bigger Than Big

    Going to have to disagree vehemently with the premise and the analysis. The pull of Big Data is business and the ideas of not-for-profits, not technology: The desire to know more from the pool of data out there has always been there, but the data, technologies to deal with it and the ability to communicate data desires in detail have been missing.

    Fortunately for the first time in IT history, and I wrote my first line of commercial code 39 years ago so have seen a fair amount of that history, we actually have (a) the volume and richness of data, (b) the technology to not just process/manage the data but to map it, integrate it, govern it, and visualize it and learning engines to help with all of this and (c) improving data analyst skills and tools - i.e. the ability to communicate about data in at least a semi-formal language that crosses IT and business lines.

    All of these evolutions have come together in roughly the same timeframe, and this has reignited the "BI/Analytics" space for the first time in a few decades. Big Data is a perfect storm; it is transforming IT, business, and the way IT and business work together. Some of the Big Data projects to date have saved lives and transformed businesses, and Big Data is the next battleground between Web 2.0 businesses and brick-and-mortar companies.

    Yes, there is hype, there are too many providers, there will be consolidations, and there is an aspect of Big Brother to Big Data. And there is stumbling around, not every project yields hoped for results, it requires new skills and new thinking and a learning curve. But Big Data is not a promise, it is already here, albeit still in the relatively early stages. And all kinds of databases and analytical platforms and clusters and nodes can participate.

    Big Data is absolutely not a myth, it is indeed the next quantum level for BI/analytics, what those disciplines deliver and how they impact life and business.
  • Couldn't Agree More

    A lot of hype around this topic. But I guess it's a good thing for the industry. I recently started exploring it after ignoring it for some time; I just don't want to sound outdated. I can imagine a Seinfeld episode on this topic...
    Here are my initial thoughts on Big Data.
  • Interesting read - I agree that Big Data must be viewed as both structured and unstructured data. Our Compuverde Gateway and Object Store software enables robust and redundant storage using clusters of standardized servers to store petabytes of accessible data. When used together, the product suite is applicable and useful for both structured and unstructured file data. Here is a link to our YouTube video explaining our technology: http://www.youtube.com/watch?v=9916BeLq4MM
    Stefan Bernbo