Big data and machine learning are powering some of today's biggest online businesses and services. If you want to adopt these technologies in your own business, you need to understand not only what they do, but also how they do it. Jared Dean's Big Data, Data Mining, and Machine Learning promises to help business leaders understand the potential and provide guidance for those faced with putting the techniques into practice.
The opening chapter does an excellent job of explaining what big data really is (including the sensible comment that big data goes back to just being data when work with it becomes routine), as well as why working with the full data set can be so much more effective than using a sample. It's a lot of work to make sure your sample is representative, and some behaviour simply won't show up unless you look at all the data.
But right from the beginning, a question mark is raised about the book's approach. A note in the introduction (contributed by an analytics-specialist colleague of Jared Dean's) suggests all the algorithms in the book are at least 15 years old and that "fundamentally new algorithms are not needed" — although the author also points out in the first chapter that algorithms have "matured" recently. Now it's true that older algorithms (all the way back to the various forms of statistical regression) do drive a lot of predictive systems, but that ignores the various advances in machine learning that are behind some of the biggest recent developments.
Jared Dean works at SAS, which has been in the predictive analytics business for a long time, so it makes sense that he covers the classics. But it would be a mistake to ignore more recent algorithms, because this is a fast-developing field: even the deep learning that's behind state-of-the-art speech recognition and image matching is going to be within reach for companies wanting to work with big data in the near future.
In fact, Dean spends a lot of time putting information into historical context, from his timeline of big-data-related developments to the potted history of computer components and database technologies. It's definitely worth pointing out that your big data project will fail if you don't have good enough hardware to make it responsive, but the information provided here isn't detailed enough to help you design a system, and as with all hardware advice it will soon be out of date. There's no mention at all of running any of this in the cloud, despite both Google and Microsoft offering cloud-based machine-learning and analytics tools.
The list of relevant software tools also feels somewhat cherry-picked, covering R, Python and a couple of specialist tools, with most of the space devoted to SAS (as you might expect from an author who works for SAS).
Much more useful is the section on predictive analytics, which combines common-sense explanations of the principles and fairly high-level details of the statistical techniques involved. You'll also get a lot of history along the way, from commentary on the social background of the two different mathematicians on whose work regression analysis is based, to the history of neural networks. The latter section has plenty of detail on early work like the Perceptron and brief details of the key 1980s developments in forward and backward propagation, but then jumps abruptly to the present, when neural networks have become widely used, without covering any of the developments that made them popular again.
This is where the concentration on older algorithms can be misleading. Deep learning, the machine learning technique that's currently transforming difficult AI problems like speech recognition and image classification to the point that Google, Microsoft, Facebook and Baidu all employ key researchers in this area, is dismissed in three paragraphs and two out-of-date references. Then we're back to more statistically oriented methods like regression trees and Bayesian network classification, before moving on to segmentation, classification and modelling responses. The section on mining data that's arranged chronologically is excellent, but the details of recommendation and ranking are brief and very focused on the statistical approach.
Throughout the book, well-chosen real-world explanations (from working out what to wear to how much time a short delay leaving a crowded event can add to your journey) help to clarify the complex statistical concepts that make up the majority of the content. This is particularly effective in the chapter on text mining, which works through an extended example about Jeopardy! questions to show how powerful the technique can be. Similarly, the final section of case studies showing how some companies have used big data is very useful, because it goes into the details of how they chose and built their models.
Light on machine learning
Big Data, Data Mining, and Machine Learning ends with a lightning survey of upcoming developments, reiterates Dean's view that 'classic' algorithms in this area are well tested and will serve for a long time, and highlights his scepticism about more recent advances. There is no coverage of entity extraction, feature detection and ranking techniques, which are quickly becoming staples in modern machine learning. The increasingly important principle of combining multiple algorithms to train, evaluate and run your machine learning, comparing the results of different machine learning algorithms to look at the numbers of false positives and negatives you get, is covered in just a couple of pages. And there's nothing about the heuristics you need to apply to machine learning systems to make them fit your problem space correctly.
In fact, despite its title, this book is much more about big data and data mining for predictive analytics than it is about machine learning. It explains what it covers very well, neither skimming at too high a level to be useful nor getting stuck in the weeds of implementations and technical detail. But in a field that's moving as fast as big data and machine learning, this sound but rather traditional approach may soon look a little dated.