Does anyone really understand big data?

Summary: We all hear this and that about big data, but does anyone really understand it? The short answer is 'yes', but the list of those who do is very short. What is it about big data that makes it so hard to understand?

TOPICS: Big Data

The single point that makes understanding big data so elusive is that we, as a technology community, weren't really prepared for big data or its management. You might say, "Yes, we were." But the truth is that we weren't, and here's why: 90 percent of the world's data was generated in the last two years*. That's why we were caught off guard. So much new data is produced every day from social media sites, industrial sensors, satellites, cell phones, photographs, documents, and much more. Every day our data grows by more than 2.5 quintillion bytes (2,500,000,000,000,000,000), or just over two billion (2,328,306,436.5) gigabytes when a gigabyte is counted as 2^30 bytes. That data has to be stored somewhere, even temporarily, and sent through databases and applications for analysis. There's so much new data piling up that its storage, management, and analysis are overwhelming. This is why very few really understand big data.
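To check the arithmetic behind those two numbers, here is a minimal sketch. It assumes the gigabyte figure uses binary gigabytes (2^30 bytes each), which is the reading that reproduces 2,328,306,436.5; the decimal conversion is included only for comparison, and the variable and function names are my own.

```python
# Daily growth figure quoted in the article: 2.5 quintillion bytes.
BYTES_PER_DAY = 2_500_000_000_000_000_000

# Assumption: the article's gigabyte is the binary one (2**30 bytes).
BYTES_PER_BINARY_GB = 2 ** 30    # 1,073,741,824 bytes
BYTES_PER_DECIMAL_GB = 10 ** 9   # 1,000,000,000 bytes


def bytes_to_units(n_bytes, bytes_per_unit):
    """Convert a byte count into a larger unit of the given size."""
    return n_bytes / bytes_per_unit


binary_gb_per_day = bytes_to_units(BYTES_PER_DAY, BYTES_PER_BINARY_GB)
decimal_gb_per_day = bytes_to_units(BYTES_PER_DAY, BYTES_PER_DECIMAL_GB)

print(f"{binary_gb_per_day:,.1f} binary gigabytes per day")
print(f"{decimal_gb_per_day:,.1f} decimal gigabytes per day")
```

Counted in decimal gigabytes, the same growth works out to an even 2.5 billion gigabytes per day, so the difference between the two conventions is roughly 7 percent.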

This huge amount of data is why you're hearing so much about big data and why understanding it is so difficult. As I've said in the past, data has always been big relative to our capacity to store, retrieve, analyze, organize, archive, and purge it, but now the situation is almost out of our collective control.

We know how the data is generated. We know generally why we're generating that data. We know what we're supposed to do with that data but what we don't know is how to handle that much data.

In fact, we're not even sure how to handle the metadata generated by big data. 

As a side note, you might have heard a lot about metadata lately concerning the private information that the NSA has captured and analyzed. Metadata is data about data. It's a strange concept but, simply stated, metadata is a description of your data. You use metadata all the time but might not realize it. For example, when you snap a digital picture, the metadata for that picture includes the size, date, location, dimensions, pixels, and so on.

Other types of metadata:

  • Means of creation of the data
  • Purpose of the data
  • Time and date of creation
  • Creator or author of the data
  • Location on a computer network where the data were created
  • Standards used

To check out the metadata for a photo on Windows, all you have to do is right-click the photo file, select Properties, and then select the Details tab.
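If you'd rather look at metadata from a script than from a dialog box, here is a minimal sketch using only Python's standard library. It reads the metadata that the file system keeps (name, size, last-modified time); the details embedded inside a photo, such as dimensions or location, need an image library and are not shown. The file name and the placeholder contents are hypothetical stand-ins for a real photo.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return a small dict of file-system metadata for the file at `path`."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": modified.isoformat(),
    }


# Stand-in for a photo: a temporary file holding 1,024 placeholder bytes.
with tempfile.TemporaryDirectory() as tmp_dir:
    photo_path = os.path.join(tmp_dir, "example_photo.jpg")  # hypothetical name
    with open(photo_path, "wb") as handle:
        handle.write(b"\x00" * 1024)
    metadata = describe_file(photo_path)

for key, value in metadata.items():
    print(f"{key}: {value}")
```

Notice that none of the values printed are part of the file's contents. They describe the file, which is exactly what makes them metadata.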

You can see that metadata also takes up space but is not the data itself. It is data about data. So we could discuss big metadata as well as big data. Now you probably have a better idea of why our data grows at such a high rate, when you understand that there's more to data than just the data itself.

To clarify, metadata doesn't make big data big, it makes big data bigger.

Now that you have an understanding of data and metadata, you can explore what big data is.

Big data is a lot of data. It's more data than we've ever dealt with before and from more disparate sources. Plus the metadata. It's a lot to think about. It's a lot to store. It's a lot to analyze. And those are the major issues of big data.

When data becomes so big that its sheer size is the problem, it is big data.

Still, you might wonder, what makes big data so difficult to understand?

As I stated above, we have data generated from disparate data sources: cell phones, satellites, electronic sensors, text messages, logfiles, etc. Data from so many sources is very complex. 

To explain further, if all of your data is photographs, then your data is simple. You add complexity when you have multiple data types and multiple data sources. If you run a logistics company, such as UPS**, then you have data coming in from many sources. Let's look at just three of those to consider the complexity: employees, trucks, and packages. Of course, the actual data is far more complex, but those three make a good example.

Data from trucks could include truck location (GPS tracking), fuel consumption, maintenance records, purchase price, insurance records, number of loads delivered, driver name, and so on. Now think about all of the different data points within each of those general areas. Maintenance records could include oil changes, tires, battery, every single replaceable part, damage, mileage, and more. Multiply all of those data points by the thousands of trucks that UPS currently has in operation: 96,394 total vehicles.

To the truck data, add all of the employee information that you can think of. Add in the data for the truck drivers, the truck packers, the truck unpackers, maintenance personnel, medical records for employees, vacation tracking, device tracking, uniform tracking, and any other employee-related data points. That covers 397,100 employees.

Add the third data source, packages, to the mix: package weight, origin, insurance, destination, shipping method, dimensions, pickup information, and connecting points between origin and destination. That's 16.3 million packages per day.

You can see how quickly the data points grow along with the volume of data that UPS deals with. UPS collects a lot of interesting and different data points. Keep in mind that the figures quoted above are statistics, not raw data, and statistics are the result of analyses. Consider the number of database servers, the amount of storage, and the energy cost needed to generate just those few figures.
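To put rough numbers on that multiplication, here is a minimal sketch. The entity counts are the UPS figures quoted above. The field lists are only the examples named in this article, not UPS's actual schema, so treat the totals as a floor for illustration and nothing more.

```python
# Counts come from the article; field lists are illustrative assumptions
# drawn from the examples in the text, not a real UPS data model.
SOURCES = {
    "trucks": {
        "count": 96_394,
        "fields": [
            "location", "fuel_consumption", "maintenance_records",
            "purchase_price", "insurance_records", "loads_delivered",
            "driver_name",
        ],
    },
    "employees": {
        "count": 397_100,
        "fields": [
            "role", "medical_records", "vacation_tracking",
            "device_tracking", "uniform_tracking",
        ],
    },
    "packages_per_day": {
        "count": 16_300_000,
        "fields": [
            "weight", "origin", "insurance", "destination",
            "shipping_method", "dimensions", "pickup_information",
            "connecting_points",
        ],
    },
}


def data_points(source):
    """One data point per field per entity: count multiplied by field count."""
    return source["count"] * len(source["fields"])


totals = {name: data_points(source) for name, source in SOURCES.items()}

for name, total in totals.items():
    print(f"{name}: {total:,} data points")
print(f"combined: {sum(totals.values()):,} data points")
```

Even with a handful of fields per record, the three sources combine to more than 133 million data points, and the package portion repeats every day. Each real field, such as maintenance records, expands into many more data points of its own.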

This is big data. You have to collect, store, analyze, organize, purge, and use the data. It's that process from collection to use to purge that is the great unknown of big data. Big data is complex and difficult to manage.

The management part of big data is where the lack of understanding comes from. Very few people know how to manage that volume and complexity of data. Most companies have grown their own pieced-together solutions, and each department usually tries to manage its own data in various forms. As a result, these companies not only have huge amounts of disparate data; that data is also stored in disparate locations and in disparate data technologies. Big data. Big mess.

Now you should have a better understanding of what big data is, where it comes from, why it's big, and what the problem is with big data. If you still don't have a clue as to how to manage big data or what you'd do with it, join the club; you have a lot of company.

Why do you think big data is so difficult to understand? Or is it? Talk back and let me know.

Note: I used UPS as an example since I know that it generates a lot of data. Currently it manages over 16 petabytes of ones and zeros.

*According to IBM's analyses.

**UPS (United Parcel Service) - I'm using it as an example for illustrative purposes only. I have no affiliation with UPS and I happily use the service.



Kenneth 'Ken' Hess is a full-time Windows and Linux system administrator with 20 years of experience with Mac, Linux, UNIX, and Windows systems in large multi-data center environments.



  • IT shops understand big data well

    The fact is only a tiny slice of companies can benefit from a big data infrastructure spend. Most IT shops can safely skip big data and cloud entirely and be fine.
    • @ammohunt

      Well, you can't skip it if you have it, so I'm not sure what you mean.
  • Brilliant and educational article...

    ...shaping zeros and ones into proper data that can then be transformed into human readable information is an interesting subject.

    Also, the need to ensure that big data is not mainly made up of duplicate data is another challenge.
  • Big Data

    Ken, great insight! It is worth mentioning the HPCC Systems open source offering which provides a single platform that is easy to install, manage and code. Their built-in analytics libraries for Machine Learning and integration tools with Pentaho for great BI capabilities make it easy for users to analyze Big Data. Their online introductory courses allow for students, academia and other developers to quickly get started. For more info visit:
  • Does big data really matter?

    Hess makes a valid point when he states that generally, we know why we're generating data, we know what we're supposed to do with that data, but what we don't know is how to handle that much data. But do we really need that much data to begin with? The short answer is no.

    So how can businesses get more from their existing data through effective and efficient use? The key is to focus on driving rich, broad understanding from the information that they hold, rather than competing with other businesses to collect all possible data. Medium-sized businesses should not be seduced by the technology industry’s hyperbole for big data; instead, they should focus on exploring their existing data more thoroughly, imaginatively and effectively before worrying about the technologically and ethically complex domain of big data. Most companies would be better off spending more time examining outcomes from the existing data they possess; there is little point collecting vast amounts of data until the organisation is able to apply consistent and informative reporting and analytics to it.

    Visual analytics is the swiftest and most accessible route into data analysis, utilising our visual perception and its innate pattern discovery; no-one needs a degree in statistics or mathematics to be an effective visual analyst. The best of the current breed of visual analytics software encourages exploration of data, of all forms and sizes, and enables simple, effective communication of insights discovered to non-technical audiences.

    To make commercial sense of data, businesses must embrace the belief that rich, well organised data enables strategic vision into business operations. Through visual analytics, insight can emerge incredibly quickly, easily and at a relatively low cost, converting existing data into meaningful business intelligence which can drive business decision-making.

    Irrespective of size, data is data and companies should strive to tap into it to draw out every relevant insight.

    Guy Cuthbert,
    Managing Director,
    Atheon Analytics
    Mark Kedgley
  • Raiders of the Lost Ark

    So I am reminded of the last scene in Raiders of the Lost Ark: "Top people" are looking at it, and then there's the pan-back shot of the large government warehouse! Big data is only worth collecting if you are eventually going to do something with it.