Big Data, Data Mining, and Machine Learning, book review: A sound but traditional approach

Summary: This book explains what it covers very well, but in a field that's moving as fast as big data and machine learning, its sound but rather traditional approach may soon look a little dated.

Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners • By Jared Dean • John Wiley & Sons • 288 pages • ISBN 978-1-118-61804-2 • £42.50

Big data and machine learning are powering some of today's biggest online businesses and services. If you want to adopt these technologies in your own business, you need to understand not only what they do, but also how they do it. Jared Dean's Big Data, Data Mining, and Machine Learning promises to help business leaders understand the potential and provide guidance for those faced with putting the techniques into practice.

The opening chapter does an excellent job of explaining what big data really is (including the sensible comment that big data goes back to just being data when work with it becomes routine), as well as why working with the full data set can be so much more effective than using a sample. It's a lot of work to make sure your sample is statistically significant, and some behaviour simply won't show up unless you look at all the data.
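The point about behaviour that only shows up in the full data set can be illustrated with a minimal sketch. It is not taken from the book: the population size, the number of rare records, and the names `make_population` and `count_rare` are all assumptions chosen for illustration.

```python
import random


def make_population(size, rare_ids):
    """Build hypothetical records; a handful are flagged as rare behaviour."""
    return [{"id": i, "rare": i in rare_ids} for i in range(size)]


def count_rare(records):
    """Count the records that show the rare behaviour."""
    return sum(1 for record in records if record["rare"])


# Assumed figures: 100,000 records, 5 of which show the rare behaviour.
rng = random.Random(0)  # fixed seed so the run is repeatable
POPULATION_SIZE = 100_000
rare_ids = set(rng.sample(range(POPULATION_SIZE), 5))
population = make_population(POPULATION_SIZE, rare_ids)

# A 1% simple random sample, drawn without replacement.
sample = rng.sample(population, 1_000)

full_count = count_rare(population)
sample_count = count_rare(sample)
print(f"rare records in full data set: {full_count}")
print(f"rare records in 1% sample:     {sample_count}")
```

With these assumed figures, a 1% sample is expected to contain about 0.05 rare records on average, so most samples will contain none at all, while a scan of the full data set always finds all five.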

But right from the beginning, a question mark is raised about the book's approach. A note in the introduction (contributed by an analytics-specialist colleague of Jared Dean's) suggests all the algorithms in the book are at least 15 years old and that "fundamentally new algorithms are not needed" — although the author also points out in the first chapter that algorithms have "matured" recently. Now it's true that older algorithms (all the way back to the various forms of statistical regression) do drive a lot of predictive systems, but that ignores the various advances in machine learning that are behind some of the biggest recent developments.

Jared Dean works at SAS, which has been in the predictive analytics business for a long time, so it makes sense that he covers the classics. But it would be a mistake to ignore more recent algorithms, because this is a fast-developing field: even the deep learning that's behind state-of-the-art speech recognition and image matching is going to be within reach for companies wanting to work with big data in the near future.

In fact, Dean spends a lot of time putting information into historical context, from his timeline of big-data-related developments to the potted history of computer components and database technologies. It's definitely worth pointing out that your big data project will fail if you don't have good enough hardware to make it responsive, but the information provided here isn't detailed enough to help you design a system, and as with all hardware advice it will soon be out of date. There's no mention at all of running any of this in the cloud, despite both Google and Microsoft offering cloud-based machine-learning and analytics tools.

The list of relevant software tools also feels somewhat cherry-picked, covering R, Python and a couple of specialist tools, with most of the space devoted to SAS (as you might expect from an author who works for SAS).

Much more useful is the section on predictive analytics, which combines common-sense explanations of the principles and fairly high-level details of the statistical techniques involved. You'll also get a lot of history along the way, from commentary on the social background of the two different mathematicians on whose work regression analysis is based, to the history of neural networks. The latter section has plenty of detail on early work like the Perceptron and brief details of the key 1980s developments in forward and backward propagation, but then jumps abruptly to the present, when neural networks have become widely used, without covering any of the developments that made them popular again.

This is where the concentration on older algorithms can be misleading. Deep learning (the machine learning technique that's currently transforming difficult AI problems like speech recognition and image classification, to the point that Google, Microsoft, Facebook and Baidu all employ key researchers in this area) is dismissed in three paragraphs and two out-of-date references. Then we're back to more statistically oriented methods like regression trees and Bayesian network classification, before moving on to segmentation, classification and modelling responses. The section on mining chronologically arranged data is excellent, but the details of recommendation and ranking are brief and very focused on the statistical approach.

Throughout the book, well-chosen real world explanations — from working out what to wear to how much time a short delay leaving a crowded event can add to your journey — help to clarify the complex statistical concepts that make up the majority of the content. This works particularly well in the chapter on text mining, which works through an extended example about Jeopardy! questions to show how powerful this can be. Similarly, the final section of case studies showing how some companies have used big data is very useful, because it goes into the details of how they chose and built their models.

Light on machine learning

Big Data, Data Mining, and Machine Learning ends with a lightning survey of upcoming developments, reiterates Dean's view that 'classic' algorithms in this area are well tested and will serve for a long time, and highlights his scepticism about more recent advances. There is no coverage of entity extraction, feature detection and ranking techniques, which are quickly becoming staples in modern machine learning. The increasingly important principle of combining multiple algorithms to train, evaluate and run your machine learning, comparing the results of different machine learning algorithms to look at the numbers of false positives and negatives you get, is covered in just a couple of pages. And there's nothing about the heuristics you need to apply to machine learning systems to make them fit your problem space correctly.
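Comparing algorithms by their false positives and false negatives can be sketched in a few lines. This is an illustrative example only, not something from the book: the labels, the two sets of predictions, and the function name `confusion_counts` are made up for the purpose.

```python
def confusion_counts(actual, predicted):
    """Tally true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["tp"] += 1  # predicted positive, actually positive
        elif guess and not truth:
            counts["fp"] += 1  # predicted positive, actually negative
        elif truth:
            counts["fn"] += 1  # predicted negative, actually positive
        else:
            counts["tn"] += 1  # predicted negative, actually negative
    return counts


# Hypothetical ground truth and the outputs of two hypothetical models.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
model_a = [1, 0, 0, 1, 0, 1, 1, 0]
model_b = [1, 1, 1, 1, 0, 1, 1, 0]

results = {
    "model_a": confusion_counts(actual, model_a),
    "model_b": confusion_counts(actual, model_b),
}
for name, counts in results.items():
    print(name, counts)
# model_a {'tp': 3, 'fp': 1, 'tn': 3, 'fn': 1}
# model_b {'tp': 4, 'fp': 2, 'tn': 2, 'fn': 0}
```

In this made-up data, the second model misses nothing but raises more false alarms, which is the kind of trade-off such a comparison is meant to expose.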

In fact, despite its title, this book is much more about big data and data mining for predictive analytics than it is about machine learning. It explains what it covers very well, neither skimming at too high a level to be useful nor getting stuck in the weeds of implementations and technical detail. But in a field that's moving as fast as big data and machine learning, this sound but rather traditional approach may soon look a little dated.

Mary Branscombe

About Mary Branscombe

Mary Branscombe is a freelance tech journalist. Mary has been a technology writer for nearly two decades, covering everything from early versions of Windows and Office to the first smartphones, the arrival of the web and most things in between.

Talkback

2 comments
  • Predictive Analytics Without Anything Significant on Deep Learning

    What's this, another Sears catalog? Out-of-date before it goes to print...
    Mike@...
  • deep learning is new?

    (I am the writer of the Foreword who mentioned the "old" algorithms).

    Deep learning is a new label, but the core of "deep learning neural networks" is nothing new. I'm not aware of any algorithmic innovation here; it's a re-awakening of the power of neural networks, for which I'm grateful (I'm a fan of neural networks!). If you look at the winners of competitions, they all use model ensembles of different varieties (trees, neural nets, but even probit/logit, and heterogeneous ensembles). Some winners use deep learning neural networks (some, not most). As a final defense of Jared's approach in the book, deep learning algorithms as they stand now have not made their way into the mainstream; they still are in the domain of data scientists and the bleeding-edge crowd. I fully admit that I have not deployed a newly-fashioned deep learning neural network, but according to the definition (by LeCun and others), the neural networks I have deployed are indeed "deep learning" (some of those deployments were more than 15 years ago, as it turns out!)

    Terminology is important for communication but fields are difficult to put circles around. Predictive analytics and data mining are essentially the same. Machine learning overlaps with data mining considerably; I strongly disagree with the contention in the review and the Wikipedia machine learning page that they are such distinctly different disciplines. Most papers in the Machine Learning Journal would easily find their home in a KDD conference. (The Wikipedia contention that unsupervised learning is a machine learning function and not a data mining function is absurd, and would be a big surprise to data mining software vendors!)

    Other critiques about software are fair; Jared, as you mention, is a SAS employee. But regarding the algorithms, I think you protest too much! Still, you raise great questions in the review, worthy of consideration and ones that I will blog about for sure.
    deanabb