The single point that makes big data so elusive is that we, as a technology community, weren't really prepared for big data or its management. You might say, "Yes, we were." But the truth is that we weren't, and here's why: 90 percent of the world's data was generated in the last two years*. That's why we were caught off guard. So much new data is produced every day from social media sites, industrial sensors, satellites, cell phones, photographs, documents, and much more. Every day our data grows by more than 2.5 quintillion bytes (2,500,000,000,000,000,000), or just over two billion (2,328,306,436.5) gigabytes, counting a gigabyte as 2^30 bytes. That data has to be stored somewhere, even temporarily, and sent through databases and applications for analysis. There's so much new data piling up that its storage, management, and analysis are overwhelming. This is why very few people really understand big data.
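For anyone who wants to check the arithmetic, here is a minimal sketch of the conversion. It assumes the 2.5 quintillion figure quoted above and shows both the binary gigabyte (2^30 bytes), which yields the number in the text, and the decimal gigabyte (10^9 bytes). The variable names are my own, for illustration only.

```python
# Back-of-the-envelope conversion of the daily data volume quoted above.
BYTES_PER_DAY = 2_500_000_000_000_000_000  # 2.5 quintillion bytes (figure from the text)

BYTES_PER_GIB = 2 ** 30   # binary gigabyte (gibibyte): 1,073,741,824 bytes
BYTES_PER_GB = 10 ** 9    # decimal gigabyte: 1,000,000,000 bytes

gib_per_day = BYTES_PER_DAY / BYTES_PER_GIB
gb_per_day = BYTES_PER_DAY / BYTES_PER_GB

print(f"{gib_per_day:,.1f} binary gigabytes per day")   # 2,328,306,436.5
print(f"{gb_per_day:,.1f} decimal gigabytes per day")   # 2,500,000,000.0
```

Either way you count it, the answer is billions of gigabytes of new data every single day.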
This huge amount of data is why you're hearing so much about big data and why it is so difficult to understand. As I've said in the past, data has always been big relative to our capacity to store, retrieve, analyze, organize, archive, and purge it, but now the situation is almost out of our collective control.
We know how the data is generated. We know generally why we're generating that data. We know what we're supposed to do with that data. What we don't know is how to handle that much data.
In fact, we're not even sure how to handle the metadata generated by big data.
As a side note, you might have heard a lot about metadata lately concerning the private information that the NSA has captured and analyzed. Metadata is data about data. It's a strange concept but, simply stated, metadata is a description of your data. You use metadata all the time but might not realize it. For example, when you snap a digital picture, the metadata for that picture is the size, date, location, dimensions, pixels, and so on.
Other types of metadata:
- Means of creation of the data
- Purpose of the data
- Time and date of creation
- Creator or author of the data
- Location on a computer network where the data was created
- Standards used
To check out the metadata for a photo on a Windows computer, all you have to do is right-click the photo file, select Properties, and then select the Details tab.
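If you'd rather look at metadata from a script, here is a minimal Python sketch. It reads only the basic filesystem metadata (name, size, and modification time) using the standard library's `os.stat`; camera details such as location live inside the image file itself and would need an image library to read. The `describe_file` helper and the throwaway sample file are my own stand-ins for illustration, not a real photo.

```python
import os
import tempfile
from datetime import datetime, timezone

def describe_file(path):
    """Return a small dict of filesystem metadata for the file at `path`."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": modified.isoformat(),
    }

# Create a throwaway 1 KB file so the example is self-contained
# (a hypothetical stand-in for a real photo).
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(b"\x00" * 1024)
    sample_path = tmp.name

metadata = describe_file(sample_path)
print(metadata)

os.remove(sample_path)  # clean up the sample file
```

Notice that none of the values printed are the picture itself. They only describe it.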
You can see that metadata also takes up space but is not the data itself. It is data about data. So we could discuss big metadata as well as big data. Once you understand that there's more to data than just the data itself, you probably have a better idea of why our data grows at such a high rate.
To clarify, metadata doesn't make big data big; it makes big data bigger.
Now that you have an understanding of data and metadata, you can explore what big data is.
Big data is a lot of data. It's more data than we've ever dealt with before and from more disparate sources. Plus the metadata. It's a lot to think about. It's a lot to store. It's a lot to analyze. And those are the major issues of big data.
When data becomes so big that its sheer size is the problem, it is big data.
Still, you might wonder, what makes big data so difficult to understand?
As I stated above, we have data generated from disparate data sources: cell phones, satellites, electronic sensors, text messages, logfiles, etc. Data from so many sources is very complex.
To explain further, if all of your data is photographs, then your data is simple. You add complexity when you have multiple data types and multiple data sources. If you run a logistics company, such as UPS**, then you have data coming in from many sources. Let's just look at three of those to consider the complexity: employees, trucks, and packages. Of course, their actual data is far more complex but I will take those three as a good example.
Data from trucks could include truck location (GPS tracking), fuel consumption, maintenance records, purchase price, insurance records, number of loads delivered, driver name, and so on. Now think about all of the different data points within each of those general areas. Maintenance records could include oil changes, tires, battery, every single replaceable part, damage, mileage, and more. Multiply all of those data points by the thousands of trucks that UPS currently has in operation: 96,394 total vehicles.
To the truck data, add all of the employee information that you can think of. Add in the data for the truck drivers, the truck packers, the truck unpackers, maintenance personnel, medical records for employees, vacation tracking, device tracking, uniform tracking, and any other employee-related data points. That's 397,100 employees.
Add the third data source, packages, to the mix: package weight, origin, insurance, destination, shipping method, dimensions, pickup information, and connecting points between origin and destination. That's 16.3 million packages per day.
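To put a rough number on just the package data, here is a small back-of-the-envelope sketch. It uses only the package attributes named above and the 16.3 million packages per day figure, and it assumes, purely for illustration, that each attribute is a single value per package. A real system would track far more, so treat the result as a floor, not an estimate of what UPS actually stores.

```python
PACKAGES_PER_DAY = 16_300_000  # figure quoted in the text

# Only the attributes named in the text; the identifier names are illustrative.
PACKAGE_ATTRIBUTES = [
    "weight",
    "origin",
    "insurance",
    "destination",
    "shipping_method",
    "dimensions",
    "pickup_information",
    "connecting_points",
]

# Assumption: one value per attribute per package (a deliberate undercount,
# since a package can pass through several connecting points).
data_points_per_day = PACKAGES_PER_DAY * len(PACKAGE_ATTRIBUTES)

print(f"{data_points_per_day:,} package data points per day")  # 130,400,000
```

That is more than 130 million data points a day from one of the three sources, before a single truck or employee record is counted.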
You can see how quickly the data points grow, along with the volume of data that UPS deals with. UPS collects a lot of interesting and different data points. Keep in mind that the figures quoted above are statistics, not raw data. Statistics are the result of analyses. Consider the number of database servers, the amount of storage, and the energy cost required to produce just those few figures.
This is big data. You have to collect, store, analyze, organize, purge, and use the data. It's that process from collection to use to purge that is the great unknown of big data. Big data is complex and difficult to manage.
The management part of big data is where the lack of understanding comes from. There are very few people who know how to manage that volume and complexity of data. Most companies have grown their own pieced-together solutions. Each department usually tries to manage its own data in various forms. The result is that these companies not only have huge amounts of disparate data, but that data is also stored in disparate locations and in disparate data technologies. Big data. Big mess.
Now you should have a better understanding of what big data is, where it comes from, why it's big, and what the problem is with big data. If you still don't have a clue as to how to manage big data or what you'd do with it, join the club. You have a lot of company.
Why do you think big data is so difficult to understand? Or is it? Talk back and let me know.
*According to IBM's analyses.
**UPS (United Parcel Service) - I'm using it as an example for illustrative purposes only. I have no affiliation with UPS and I happily use the service.