Is Google breast cancer detection AI better than doctors? Not so fast

Google Health, Google's DeepMind unit, and Imperial College London report in this week's issue of the journal Nature that a trio of deep-learning networks can in some cases best human radiologists at reading mammograms. But the fine print shows we're not yet at the point of replacing radiologists.


How much credit do you get if you're "pretty right" -- meaning, more right than wrong?

If you're an artificial intelligence algorithm, you're given a lot of credit. AI programs don't have to have a definitive answer, just a probabilistic one, a percentage likelihood of the right answer, whether the task is performing natural-language translation or diagnosing cancer. 

The latest example of AI's probabilistic achievements appears in this week's issue of Nature, in a paper titled "International evaluation of an AI system for breast cancer screening," authored by a team of 31 scholars from Google's Google Health unit, its DeepMind unit, and Imperial College London, led by Scott Mayer McKinney, Marcin T. Sieniek, Varun Godbole, and Jonathan Godwin. (DeepMind CEO Demis Hassabis is among the authors.)

In addition, a blog post by Google Health scholars Shravya Shetty, M.S., and Daniel Tse, M.D., offers commentary.


Google's Google Health team, its DeepMind unit, and Imperial College London used a trio of deep learning neural networks: from the top, Facebook AI's "RetinaNet" combined with Google's "MobileNetV2," followed by the now-standard ResNet-v2-50 in the middle section, and lastly a ResNet-v1-50 on the bottom layer. Each one picks out suspicious-looking areas of a mammogram in a different way, and the findings are combined to reach a probability judgment of cancer or no cancer. 

Google Health, DeepMind, Imperial College London

The headline news is that Google's system bested both UK and US radiologists at looking at mammograms years after the fact and declaring whether or not cancer was present, demonstrating "an absolute reduction [...] in false positives and [...] in false negatives." The AI even beat a panel of six human radiologists commissioned for the task, who looked at five hundred mammograms and gave their diagnoses. 

The upshot is an important contribution: AI tools that could be very useful to doctors. But that doesn't mean the technology can replace human assessment. It's important to look closely at the numbers, where there are plenty of puts and takes. 

Consider the setting. The scientists gathered data in the UK from three hospitals on women who had been screened for breast cancer between 2012 and 2015 and who met certain criteria, such as age and type of examination -- a total of 13,918 women. That was what they used to train the system. Another 26,000 cases were used to test the system once it was trained. They did the same with data from one US hospital, Northwestern Memorial Hospital, gathered from 2001 to 2018 -- a much smaller sample. (In case you were wondering, the authors acknowledge a similar study done by New York University, the results of which were published earlier this year. According to the Google authors, one of the most important differences is that they included as much as three years' worth of follow-up consultations, whereas the NYU study, like prior studies, was limited to case histories of a year or less.)

Also: AI makes inroads in life sciences in small but significant ways: Lantern Pharma's quest

The scientists trained an ingenious set of three neural networks that each look at mammograms at a different level of detail. The particulars of this deep learning set-up are fascinating and perhaps represent the state of the art in combining machine learning networks. One is ResNet-v1-50, by now a classic image recognition approach, developed by Kaiming He and colleagues at Microsoft in 2015. A second is RetinaNet, developed by Facebook AI Research scholars in 2017. And a third is the MobileNetV2 neural network unveiled by Google scientists last year. It's a wonderful mash-up of approaches that shows how code-sharing and open scientific publication can enrich everyone's work. The details are contained in the supplementary materials linked at the bottom of the main Nature paper. 
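To make the idea concrete, here is a minimal sketch of how such an ensemble might combine per-model cancer scores into one continuous score, as the figure caption above describes. This is purely illustrative -- the function name and the averaging rule are assumptions, not the paper's actual code.

```python
def ensemble_score(per_model_scores):
    """Combine the cancer-likelihood scores from several sub-networks
    (e.g. a RetinaNet-style detector, a MobileNetV2 classifier, and a
    ResNet) into one continuous score in [0, 1] by simple averaging."""
    return sum(per_model_scores) / len(per_model_scores)

# Three sub-networks disagree mildly; the ensemble smooths them out.
combined = ensemble_score([0.82, 0.64, 0.73])
```

The averaging rule here stands in for whatever learned or fixed combination the authors actually used; the point is only that three separate views of the image reduce to a single number.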

Now here comes the tricky part. The "ground truth" against which the trained network is judged is whether a case was a breast cancer case confirmed by subsequent biopsy. The diagnosis, in other words, rested not just on what things looked like in an image but on what subsequent medical testing found by definitively extracting a piece of cancerous tissue. The answer, in that case, was an unequivocal yes or no as to the presence of cancer. 

But the exquisite collection of three deep learning neural networks described above doesn't produce a yes or a no, not really. It produces a score from zero to one -- a "continuous value" rather than a binary judgment. In other words, the AI can be pretty right or pretty wrong, depending on how close to the correct value, zero or one, it comes in any given case.   

To match that probability score to the binary judgments humans render, McKinney and colleagues had to convert the AI's probability score into binary values. They did this via a separate set of validation tests used to pick out individual cut-off points. The comparisons claiming "superiority" to human judgment thus rest on a selection of answers the AI gave within the broader set of answers it produced. 

As the authors explain, "The AI system natively produces a continuous score that represents the likelihood of cancer being present," and so, "to support comparisons with the predictions of human readers, we thresholded this score to produce analogous binary screening decisions," where "threshold" in this case means picking out a single point to compare: "For each clinical benchmark, we used the validation set to choose a distinct operating point; this amounts to a score threshold that separates positive and negative decisions."
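Here is a hedged sketch of what "choosing an operating point" can look like in practice: scanning candidate thresholds on a validation set until the binary decisions hit some target statistic (here, a target specificity). The function name, the data, and the matching criterion are all illustrative assumptions; the paper's actual procedure is described in its supplement.

```python
def choose_threshold(val_scores, val_labels, target_specificity):
    """Return the lowest score threshold whose specificity on the
    validation set is at least target_specificity.

    val_scores: continuous AI scores in [0, 1]
    val_labels: 1 = biopsy-confirmed cancer, 0 = no cancer
    """
    negatives = [s for s, y in zip(val_scores, val_labels) if y == 0]
    for t in sorted(set(val_scores)):
        # specificity = fraction of non-cancer cases scored below t
        tn = sum(1 for s in negatives if s < t)
        if tn / len(negatives) >= target_specificity:
            return t
    return 1.0

# Toy validation set: six cases, two biopsy-confirmed cancers.
scores = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
labels = [0,    0,    0,    1,    0,    1]
t = choose_threshold(scores, labels, target_specificity=0.75)
# Scores at or above t become "recall for further testing" decisions.
decisions = [int(s >= t) for s in scores]  # [0, 0, 0, 1, 1, 1]
```

Note that everything downstream -- the false-positive and false-negative rates being compared with the radiologists -- depends on where this threshold is placed.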

Also: Google's DeepMind follows a mixed path to AI in medicine

When compared on the UK data, the AI did about as well as people at predicting whether something is cancer. The term is "non-inferior," as the report puts it, meaning no worse than human judgment. Where the AI did measurably better was in what's called "specificity," a statistical term meaning the neural nets were a bit better at avoiding false positives -- predicting disease when it isn't there. That's certainly important, because a false diagnosis of cancer means much undue stress and anxiety for women. 
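For readers unfamiliar with the terminology, specificity and its counterpart, sensitivity, are simple ratios over a confusion matrix. The numbers below are made up for illustration, not taken from the study.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN): share of cancers correctly flagged.
    Specificity = TN / (TN + FP): share of healthy cases correctly
    cleared; higher specificity means fewer false positives."""
    return tp / (tp + fn), tn / (tn + fp)

# A reader flags 85 of 100 cancers and clears 900 of 1,000 healthy cases.
sens, spec = sensitivity_specificity(tp=85, fn=15, tn=900, fp=100)
# sens = 0.85, spec = 0.90
```

A model can trade one for the other by moving its decision threshold, which is why the comparison with radiologists has to fix an operating point first.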

But again, pay attention to the fine print. The human score, in this case, came from doctors who had to render a judgment about whether to order further tests based on the mammogram, such as a biopsy. It's conceivable that doctors in the early stages of diagnosis render an overly broad assessment in order to move a patient along to further testing, so as not to risk leaving a cancer undetected. That's a fundamental difference between a doctor deciding where to go next with a patient and a machine guessing the probability of an outcome years down the road. 

Put another way, a doctor sitting in front of a patient is not usually trying to guess probabilities for outcomes years down the road so much as trying to determine what's the next critical step for this patient to take. For example, even if AI determines in a particular case that the probability of cancer is low based on the mammogram, would a patient want their doctor to err on the side of caution, and prescribe a biopsy, to be safe rather than sorry? They might very well appreciate such caution.

In fact, the scientists write in the summary section that the AI also missed several cases that doctors caught and that turned out to be cancerous, even as the AI found cases the doctors missed. This was especially apparent in the additional "reader study," in which six human radiologists looked at five hundred cancer screens. The researchers found "a sample cancer case that was missed by all six radiologists, but correctly identified by the AI system," but also "a sample cancer case that was caught by all six radiologists, but missed by the AI system."

Also: The subtle art of really big data: Recursion Pharma maps the body

Somewhat troublingly, the authors write that it's not entirely clear why the AI succeeds or fails in each case: "Although we were unable to determine clear patterns among these instances, the presence of such edge cases suggests potentially complementary roles for the AI system and human readers in reaching accurate conclusions."

Maybe, but certainly, one wants to know more about how the three deep learning neural nets are making their probability guesses. What are they seeing, so to speak? That question, a question of what the networks are representing, is not addressed in the study, but it's a crucial question for AI in such a sensitive application. 

One big question arising from all of the above: How much effort should go into a system that can predict the probability of a future development of cancer more accurately than some doctors making an initial assessment? If those probability scores can help doctors decide in some "edge cases," as it were, the value of AI assistance could be very high, even if it can't really replace doctors at this point.

On a tangential note, the study, which looked at both UK and US data, offers some puzzling findings about comparative health system quality. In general, the accuracy of the UK doctors seems measurably higher than that of the US doctors in terms of correctly concluding, from an initial reading of the tests, that something would turn out to be cancer. 

Given the disparity of the datasets used -- 13,918 cases in the UK, from three hospitals, versus 3,097 in the US, from just one hospital -- it's hard to know what to make of those disparate results. Apparently, just as intriguing as the AI is the relative ability of human doctors in two different medical systems.