Dr. Donald Trump

This blog post is by Donald Trump, MD, FACP, CEO and Executive Director of the Inova Dwight and Martha Schar Cancer Institute. Dr. Trump and other members of the Schar Cancer Institute will be blogging on current topics in cancer research and developments.

Blog Abstract

  • The use of sophisticated data mining techniques in medical research is growing.
  • Some have even suggested that these emerging data mining techniques can replace clinical trials.
  • However, data mining techniques have inherent methodological flaws that can lead to misleading conclusions about association and causation.
  • While large database examinations can produce compelling hypotheses for medical research, they should not replace time-tested and proven clinical trials.

There has been considerable press coverage and attention recently about the role of big data – particularly the use of sophisticated data mining techniques – in medical research. Some experts have suggested that such techniques can or should replace well-designed clinical trials. Certainly, the potential that data mining has for moving medical research forward is compelling. However, recent coverage from NPR and The Washington Post about a link between antacid drugs and cardiovascular risk goes too far.

Both articles cover a recent report in PLoS One by Dr. Nigam Shah and colleagues at Stanford University, who used sophisticated “data mining” techniques to search more than 16 million clinical documents from the electronic medical records of 2.9 million individuals to examine whether use of antacid drugs in a class called proton pump inhibitors (PPIs; think Prilosec, Nexium, Prevacid) was associated with cardiovascular risk (CVR). While there is experimental evidence that these drugs may have unfavorable effects on heart muscle cells, no substantial increase in CVR was detected in the large randomized trials done during the development of these drugs. Based on the data analysis, however, Dr. Shah does make the connection between these drugs and CVR*. (You can see more detailed information on the study results below.)

Thankfully, the NPR story did include a caution by Dr. David Juurlink, a University of Toronto drug-safety researcher, who noted that such studies can provide misleading results: for example, were factors such as obesity or cigarette smoking – conditions that lead to PPI use – controlled for? People who smoke cigarettes and are overweight are more likely to need this type of medication. Of course, people who are obese and smoke are also more likely to have a heart attack.

Herein lies the problem: an association between PPIs and heart attacks may be found through data mining, but that does not mean that one caused the other. To that end, Juurlink notes that association does not prove causation. However, associations derived via data mining from such large numbers of observations (2.9 million patients) tend to be viewed as special and intrinsically valid. Consistent with this perception that the value of a scientific finding is influenced by the size of the population studied, The Washington Post reported this story without presenting any “counterpoint” argument such as the one Dr. Juurlink made.
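Dr. Juurlink’s confounding concern is easy to demonstrate with a toy simulation (entirely made-up numbers, not the study’s data): if smoking raises both the chance of PPI use and the chance of a heart attack, a naive comparison will show PPI users at higher risk even though, in the model, PPI use has no causal effect at all.

```python
import random

random.seed(0)

# Hypothetical numbers for illustration only (not from the Stanford study):
# smoking raises both the chance of PPI use and the chance of a heart
# attack (MI). PPI use itself has NO effect on MI risk in this model.
n = 100_000
ppi_mi = ppi_n = no_ppi_mi = no_ppi_n = 0
for _ in range(n):
    smoker = random.random() < 0.25
    ppi = random.random() < (0.40 if smoker else 0.10)  # smoking -> more PPI use
    mi = random.random() < (0.06 if smoker else 0.02)   # smoking -> more MI; PPI plays no role
    if ppi:
        ppi_n += 1
        ppi_mi += mi
    else:
        no_ppi_n += 1
        no_ppi_mi += mi

risk_ppi = ppi_mi / ppi_n
risk_no_ppi = no_ppi_mi / no_ppi_n
print(f"MI risk with PPI:       {risk_ppi:.3f}")
print(f"MI risk without PPI:    {risk_no_ppi:.3f}")
print(f"Apparent relative risk: {risk_ppi / risk_no_ppi:.2f}")  # > 1 despite no causal link
```

Adjusting for the confounder (comparing smokers with smokers and nonsmokers with nonsmokers) would make the apparent excess risk vanish; the difficulty with mined medical records is that confounders like smoking and obesity are often incompletely captured.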

In my view, large database examinations are important, but primarily as hypothesis-generating exercises. I do not believe that big data exercises that identify associations can or should replace well-designed prospective clinical trials. Let’s take a look at how the two approaches compare:

How Do They Compare?

Clinical Trial Research

  • Time tested, proven approach
  • Produces focused and validated research insights
  • Time and resource intensive

Big Data Research

  • Emerging approach
  • Produces broad and general research insights
  • Quick and efficient

While clinical trials certainly have their own set of flaws (they are time and resource intensive, and they often recruit patients who are not the same as the “real world” populations in which the treatment will be used), they remain the gold standard in medical research. Big data analyses are not a replacement for the prospective clinical trial and careful thought by the practicing physician.

What do you think? I’d love to hear your opinion in the comments section.

* In the analysis, the authors found a “1.16 fold increased association (95% CI 1.09-1.24) with myocardial infarction (MI). Survival analysis in a prospective cohort found a two-fold (HR = 2.00; 95% CI 1.07-3.78; P = 0.031) increase in association with cardiovascular mortality.” The analysis technique had an accuracy of 89% and a positive predictive value of 81%: a relatively small effect, detected with a less-than-perfect PPV.
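To put these numbers in perspective, here is a back-of-the-envelope sketch. The 81% positive predictive value and the 1.16-fold association are quoted from the report; the 1,000 flagged mentions are an assumed round number for illustration.

```python
# Hedged sketch: the 81% PPV and the 1.16-fold association come from the
# report; the 1,000 flagged drug-event mentions are an assumed round number.
ppv = 0.81
flagged = 1_000
true_positives = ppv * flagged
false_positives = (1 - ppv) * flagged
print(f"Of {flagged} flagged drug-event mentions: "
      f"~{true_positives:.0f} real, ~{false_positives:.0f} spurious")

# The observed 1.16-fold association (95% CI 1.09-1.24) only barely
# excludes 1.0, so a ~19% false-positive rate in the inputs leaves
# little margin for error.
observed_rr, ci_low, ci_high = 1.16, 1.09, 1.24
print(f"Observed association: {observed_rr} (95% CI {ci_low}-{ci_high})")
```

Non-differential (random) misclassification of this kind tends to bias an association toward the null, but systematic misclassification can push it in either direction, which is one more reason a modest mined signal warrants prospective confirmation.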


  1. Stephanie on June 24, 2015 at 12:17 am

    Dr. Trump, I enjoyed reading your insights from the clinical perspective. I personally think the two techniques go well hand in hand. On one hand, there is enormous potential to make new discoveries with the big data approach; on the other, this emerging technology also has the potential to adversely affect patient care through logical fallacies, so I think it is vitally important to ensure that results are clinically validated using gold standard techniques. It is certainly important to note that association does not mean causation; however, associations can lead to further hypothesis-driven investigative studies using clinical or laboratory research methods.

    I think there are multiple error types that can be made with big data methods. One is technical error: sloppy or inaccurate data can essentially “contaminate” the results and produce inaccurate associations, especially when errors are magnified on such a grand scale. For instance, when conducting PCR, a DNA sample contaminated with a small fragment of another DNA strand can significantly alter molecular diagnostic test results. The data have to be accurate and clean through all steps of data entry, generation, processing and analysis. Another error type is poor experimental design, which can produce logical fallacies not grounded in clinical correlations.

    With that said, I think the use of algorithms built on clinically validated data models could be the standard of care in the future. The parallel I would draw is with sequencing technology. Sanger sequencing is the gold standard; however, I believe there will be a time in the near future in which the accuracy of next-generation sequencing will compare with that of the gold standard, at a fraction of the time and cost.

  2. Donald Trump on June 24, 2015 at 10:01 am

    My thanks to Stephanie Gomez for her comment and insight on this topic. I COMPLETELY agree with the points made; Dr. Gomez has framed the issue extremely well!
    “Big data” analyses are primarily hypothesis generating and should be validated, and her analogies between Sanger and NGS are apt. Whether “big data” analyses will ever be precise enough at the individual person level to dictate clinical care remains to be determined, but it is a valid hypothesis. My anxiety is that some believe big data approaches are less expensive, reflect the real world, and therefore will soon supplant standard clinical trials. NOT in the near future, in my opinion.

Leave a Comment