Three years ago, artificial intelligence pioneer Geoffrey Hinton said, “We should stop training radiologists now. It’s just completely obvious that within five years, deep learning is going to do better than radiologists.”
Today, hundreds of startup companies around the world are trying to apply deep learning to radiology. Yet the number of radiologists who have been replaced by AI is approximately zero. (In fact, there is a worldwide shortage of them.)
At least for the short term, that number is likely to remain unchanged. Radiology has proven harder to automate than Hinton — and many others — imagined. The same is true of medicine more generally. There are many proofs of concept, such as automated diagnosis of pneumonia from chest X-rays, but surprisingly few cases in which deep learning (the machine learning technique that currently dominates AI) has delivered the transformations and improvements so often promised.
To begin with, the laboratory evidence for the effectiveness of deep learning is not as sound as it might seem. Positive results, in which machines using AI outdo their human counterparts, tend to get considerable media attention, while negative results, in which machines don’t do as well as humans, are rarely reported in academic journals and get even less media coverage.
Meanwhile, a growing body of literature shows that deep learning is fundamentally vulnerable to “adversarial attacks” and is often easily fooled by spurious associations. An overturned school bus, for example, might be mistaken for a snowplow if it happens to be surrounded by snow. With a few pieces of tape, a stop sign was altered so that a deep learning system mistook it for a speed-limit sign. While these sorts of problems have become well known in the machine learning community, their implications remain less well understood within medicine.
For example, deep-learning algorithms trained on X-ray images to make diagnostic decisions can easily detect which imaging machine was used to make the images. Consider this hypothetical situation: two different models of X-ray machine are used in a hospital — one portable, one installed in a fixed location. Patients who are bedridden due to their conditions must be imaged at the bedside using the portable machine. That means the choice of machine becomes correlated with the presence of the condition. And since the AI algorithm is highly sensitive to which machine was used, it may inadvertently mistake machine-specific information for information about the underlying condition. The same algorithm applied in a hospital that always uses the portable machine may then produce confounded decisions.
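To make the mechanism concrete, here is a minimal, entirely hypothetical sketch (synthetic data, not drawn from any real hospital): a toy “learner” that simply picks whichever single feature best predicts the training labels. Because the portable machine perfectly tracks the condition in the training hospital, the learner latches onto the machine identity rather than the genuine clinical signal — and its accuracy collapses at a hospital that uses the portable machine for everyone.

```python
import random

random.seed(0)

def make_patients(n, portable_tracks_condition):
    """Generate (features, label) pairs.

    Features: [portable_machine, lesion_visible].
    lesion_visible is the causally relevant signal (right 85% of the time);
    portable_machine either perfectly tracks the condition (training hospital)
    or is always 1 (a hospital that only uses the portable machine)."""
    data = []
    for _ in range(n):
        condition = random.random() < 0.5
        lesion = condition if random.random() < 0.85 else not condition
        portable = condition if portable_tracks_condition else True
        data.append(([int(portable), int(lesion)], int(condition)))
    return data

def best_single_feature(data):
    """Toy learner: choose the single feature most predictive on the training set."""
    n_features = len(data[0][0])
    return max(range(n_features),
               key=lambda f: sum(x[f] == y for x, y in data))

def accuracy(data, feature):
    return sum(x[feature] == y for x, y in data) / len(data)

train = make_patients(1000, portable_tracks_condition=True)   # training hospital
deploy = make_patients(1000, portable_tracks_condition=False) # portable-only hospital

chosen = best_single_feature(train)
print("chosen feature:", ["portable_machine", "lesion_visible"][chosen])
print("training accuracy:", accuracy(train, chosen))
print("deployment accuracy:", accuracy(deploy, chosen))
```

On the training data the machine identity is a perfect predictor (accuracy 1.0), so the learner prefers it to the noisier but causal lesion feature; at deployment its accuracy falls to roughly the base rate, while a model using the lesion feature would have held up. Real deep-learning systems are vastly more complex, but the failure mode — rewarding whatever correlates in the training set — is the same.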
In truth, deep learning is deep only in a narrow, technical sense — how many “layers” of quasi-neurons are used in a neural network — not in a conceptual sense. Deep-learning systems excel at finding associations within the training data, but have no ability to differentiate what is causally relevant from what is accidentally correlated, like fuzz on an imaging device. Spurious associations can wind up being heavily over-weighted.
In diagnosing skin cancer from images, for example, a dermatologist might use a ruler to size a lesion only if he or she suspects it is cancerous. In this way, the presence of a ruler becomes associated with a cancer diagnosis in the image data. An AI algorithm may well leverage this association, rather than the visual appearance of the lesion itself, to make cancer decisions. But rulers don’t cause cancer, and a system that relies on them is easily misled.
Radiology is not just about images. Deep-learning systems excel at classifying images but radiologists (and other doctors, such as pathologists) must integrate what they see in an image with other facts about patient history, currently prevalent illnesses, and the like. As Dr. Anand Prabhakar, an emergency radiologist at Massachusetts General Hospital told us, “Although radiologists are experts at imaging pattern recognition, a large fraction of the work involves understanding the pathophysiology of disease so images can be interpreted correctly. For example, pneumonia on a chest X-ray could also have the same appearance as a variety of conditions, including cancer. A typical radiologist will suggest a diagnosis based on the patient’s clinical presentation obtained from the electronic medical record — such as fever, age, gender, smoking history, or bloodwork.”
Much of the AI research done so far has focused primarily on statistical measures computed in isolation on laboratory benchmarks, with too little validation against external data collected under different conditions. A race to proof of concept has not been matched by a methodology rigorous enough to show that laboratory efficacy translates into real-world effectiveness.
In drug trials, it is a given that success in Phase 1 is no guarantee of success in Phase 3, whereas in the current mania for deep learning and AI, a preliminary proof of concept is taken seriously — but prematurely — as cause to revamp medical school curricula.
The core issue is that what an algorithm learns from a narrow training set does not necessarily extend to real-world situations outside the scope of that data.
To be sure, many researchers are working to solve this problem. But more data alone is unlikely to do the trick. What we are really missing are systems that have deeper understanding.
Without deeper understanding, a more realistic use case for existing techniques is one in which deep learning becomes a powerful but closely monitored tool for radiologists in so-called human-in-the-loop fashion, subject to frequent human-performed sanity checks, rather than an autonomous system entrusted with full responsibility. As the authors of one recent study put it, “our models can be used to not substitute but assist radiologists with their work, leading to better outcomes for patients.”
To build algorithms trustworthy enough to be delegated critical health care decisions, AI systems must be able to leverage knowledge of biology and the real world. To avoid being distracted by surgical pen marks on digital images of nevi when diagnosing skin cancer, for example, a system must at least understand what a pen mark looks like, and it would help if it understood that human skin does not spontaneously develop bright violet stripes or dots.
Just as gathering more data alone has not yet magically led to trustworthy autonomous vehicles with “common sense,” more data alone is not likely to lead to medical systems with a grasp of anatomy, physiology, and molecular biology.
Genuine advances will require a sustained research effort into reworking the fundamentals of AI as we know it. The long-promised AI revolution may someday arrive, but for patient safety, it is important we be leery of premature declarations of victory.
Gary Marcus is CEO and founder of Robust.AI, professor emeritus at New York University, and co-author of “Rebooting AI: Building Artificial Intelligence We Can Trust” (Pantheon, September 2019). Max A. Little is associate professor of machine learning at the University of Birmingham, U.K., and author of “Machine Learning for Signal Processing” (Oxford University Press, October 2019).