Statistics can’t save bad science

Big data can’t dig you out of the base rate fallacy.
data science
Published August 21, 2021

Note

This post expands on an idea I first encountered in a sidebar on page 51 of Richard McElreath’s Statistical Rethinking. If you only read a single stats textbook, that book should be it.

The basis of the scientific method is to construct a hypothesis, then test it with an experiment. One rarely discussed wrinkle, however, is that in any realistic scenario there is no experiment we can run that conclusively, exhaustively confirms or refutes that hypothesis. This, of course, is where statistics comes into play. A natural step, either after running the experiment or, preferably, before, is to quantitatively assess how trustworthy the experiment is. Because researchers and engineers are handed a de-facto suite of statistical tests during school, this step is often overlooked. Here’s how it would go.

What we want is an experiment that can tell us whether our hypothesis is right or wrong. Sadly, this sort of experiment, double-blind or not, does not exist. Instead, we’ll settle for the probability that our hypothesis is true, given that the experiment says it is.

\[p(\text{hypothesis is true}|\text{experiment says hypothesis is true})\]

In the traditional context, we cannot compute this expression, because it has two prerequisites:

  1. Designing our experiment in such a way that we can leverage the guarantees given by something like the Central Limit Theorem, in order to assign an underlying distribution to the experimental data. (This is common practice)
  2. Making an assumption about the probability that any given hypothesis is true, otherwise known as the base rate. (This is not common practice)

Let’s address each. The CLT grants us access to a couple of closely related quantities. The first is the probability that the experiment shows our hypothesis to be true, given that it is indeed true. This is commonly referred to as the test’s statistical power.

\[p(\text{experiment says hypothesis is true} | \text{hypothesis is true})\]

The second is the significance level, the probability that the experiment suggests the hypothesis is true when it is actually false.

\[p(\text{experiment says hypothesis is true} | \text{hypothesis is false})\]

Up to this point, we’ve simply copied and pasted the boilerplate frequentist framing familiar to so many in the sciences. We’ve bought ourselves p-values and confidence intervals, but we haven’t gotten any closer to knowing the probability that our hypothesis is true. To get there, we’ll have to conjure up a base rate: the probability that any given hypothesis is true, on average and all else held equal.

\[p(\text{hypothesis is true})\]

For the sake of example, let’s say our base rate is 0.1: of all the hypotheses we might dream up and test, roughly one in ten is actually true.

Okay, with these pieces, we can finally return to our original question: “What’s the probability that my hypothesis is correct if my experiment says it is?”

\[ \begin{aligned} p(\text{hypothesis is true}|\text{experiment says hypothesis is true}) &= \frac{p(\text{experiment says hypothesis is true}|\text{hypothesis is true}) * p(\text{hypothesis is true})}{p(\text{experiment says hypothesis is true}|\text{hypothesis is true}) * p(\text{hypothesis is true}) + p(\text{experiment says hypothesis is true}|\text{hypothesis is false}) * p(\text{hypothesis is false})} \\ &= \frac{\text{power} * \text{base rate}}{(\text{power} * \text{base rate}) + (\text{significance} * (1-\text{base rate}))} \end{aligned} \]
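To keep the bookkeeping straight, here’s a minimal sketch of that formula as a Python function. The name `prob_true_given_positive` and its argument names are mine, purely for illustration.

```python
def prob_true_given_positive(power, significance, base_rate):
    """Bayes' rule: p(hypothesis is true | experiment says hypothesis is true)."""
    true_positives = power * base_rate                 # experiment says true, and it is
    false_positives = significance * (1 - base_rate)   # experiment says true, but it isn't
    return true_positives / (true_positives + false_positives)
```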

Now, let’s plug in our base rate along with the commonly accepted values for power (0.8) and significance (0.05).

\[ \begin{aligned} p(\text{hypothesis is true}|\text{experiment says hypothesis is true}) &=\frac{0.8 * 0.1}{(0.8 * 0.1) + (0.05 * 0.9)}\\ &= 0.64 \end{aligned} \]
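As a quick sanity check of that arithmetic, using the hypothetical function sketched above:

```python
prob_true_given_positive(power=0.8, significance=0.05, base_rate=0.1)  # ≈ 0.64
```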

That’s right: dot your i’s and cross your t’s, and your confidence in the result of the experiment should still be only slightly better than a coin flip. Disappointingly, running bigger, longer experiments, thereby raising the power and driving down the significance level, doesn’t move things as much as we might hope. The most potent lever to pull is the base rate. And improving that requires thinking, not testing. 🤯
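To see just how lopsided the levers are, here’s a small comparison using the same hypothetical function; the alternative values (a power of 0.95, a base rate of 0.25) are illustrative picks of my own.

```python
# Baseline: power = 0.8, significance = 0.05, base rate = 0.1
prob_true_given_positive(0.8, 0.05, 0.1)    # ≈ 0.64

# Raise the power to 0.95 (a much bigger experiment): barely moves
prob_true_given_positive(0.95, 0.05, 0.1)   # ≈ 0.68

# Raise the base rate to 0.25 (better hypotheses going in): a far larger jump
prob_true_given_positive(0.8, 0.05, 0.25)   # ≈ 0.84
```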