Evolution of Hypothesis Testing


This is the second post in the series of “Let’s not hypothesize.” The first post is available here.

This post is written to have a brief look at how the Hypothesis testing seen in most Statistics texts came into being.

My main sources of information are;

1) The empire of Chance

2) The lady tasting tea, and

3) Explorations in statistics: hypothesis tests and P values

I have the evolution separated into three phases.

1) Pre-Fisher:

The Explorations in statistics: hypothesis tests and P values provides a date of 1279 as the origin of Hypothesis testing. The Royal Mint from London used a sample of coins made from each run of the mint which were compared against a known set of standards. I welcome the reader to click on the third reference given above to read this in more detail.


The article also speaks about William Sealy Gosset (Student) and his t-test method. What struck me most was the description of Gosset explaining the significance of a drug in terms of an odds ratio. This was well before the advent of p-values to determine significance of the data.

First let us see what is the probability that [drug A] will on the average give increase of sleep. [Looking up the ratio of the sample mean to the sample standard deviation] in the table for ten experiments we find by interpolating. . .the odds are .887 to .113 that the mean is positive. That is about 8 to 1 and would correspond to the normal curve to about 1.8 times the probable error. It is then very likely that [drug A] gives an increase of sleep, but would occasion no surprise if the results were reversed by further experiments.

2) Sir Ronald Fisher:


It was Sir Ronald Fisher who clearly came up with the idea of a null hypothesis (H0) and the use of a conditional probability p-value to make a decision based on the data found. He termed this as “Significance Testing”. The main distinction here from the texts today, is that Fisher only used Null or Nil Hypothesis. He did not find value in the alternate hypothesis. His thought process was that if the p-value was less than a cut-off point (let’s say .05), this would indicate that either this was due to a very rare event or that the null hypothesis model was wrong. More than likely, it is highly probable that the null hypothesis model was wrong. Fisher did not see a need for an alternate hypothesis nor the need for repeating tests to see how powerful the test was.His method is based on Inductive Inference.

Fisher never also meant to use only .05 as the cut-off value. He viewed p-values as inductive evidence against the null hypothesis.

If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point), or one in a hundred (the 1 per cent. point). Personally, the writer prefers to set a low standard of significance at the 5 per cent. point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.

3) Neyman-Pearson Hypothesis Testing:


The books “Lady tasting tea” and “The empire of chance” go into detail about the “feud” between the great minds Fisher, and Neyman/Pearson.

It was Neyman and Pearson who came up with idea of using an alternate hypothesis (H1) and testing it against the null hypothesis. Additionally, they also created the idea of the power of a test, and introduced the ideas of type I and type II errors. They termed their version as Hypothesis testing.Their version is based on inductive behavior.

They defined alpha, beta and power as follows.

alpha = P(reject H0|H0 is true)

beta = P(fail to reject H0|H0 is false)

power = 1 – beta

Where we are now:

What we use and learn these days is a combined method of Fisher and Neyman/Pearson. The textbook method is generally as follows;

1) define null and alternate hypotheses.

2) set an alpha value of .05, and power value of .80 before the experiment.

3) calculate test statistic and p-value based on the data collected.

4) Reject or retain (fail to reject) null hypothesis based on the p-value.

Critiques of this combined method claim that the combined method utilizes the worst of the two methods. They emphasize the focus on effect size, and the use of confidence intervals to provide better view of the problem at hand, rather than blindly relying on the p-value alone.

Keep on learning…


Let’s not hypothesize – Part 1


Over the last few months, there has been a lot of frenzy in a small portion of the blogosphere over the “ban of P-values” by the Psych magazine¬† Basic and Applied Social Psychology (BASP). You can read the full editorial here.


My goal is to create a series of posts covering the items discussed in the editorial. This will include talking about the evolution of hypothesis testing, p-values and confidence intervals.

Some of the highlights from the editorial are below.

1) p < .05 is too easy and leads to low quality papers:

we believe that the p < .05 bar is too easy to pass and sometimes serves as an excuse for lower quality research.

There has been a lot of papers about the traditional approach of using p < .05 or even <.01 as being arbitrary values. I welcome the reader to check out this webpage by Chris Fraley, which has a collection of articles and papers about Null Hypothesis Significance Testing (NHST) and p-values.


2) Confidence Intervals are no better either:

Analogous to how the NHSTP fails to provide the probability of the null hypothesis, which is needed to provide a strong case for rejecting it, confidence intervals do not provide a strong case for concluding that the population parameter of interest is likely to be within the stated interval.

To me, this is very interesting. I have always relied on confidence intervals to get a bound on the uncertainty around my statistic. The magazine banned the use of Confidence Intervals as well.

Interestingly enough, Bayesian procedures are not “banned”.

Bayesian procedures are neither required nor banned from BASP.
I will be very interested in seeing how this impacts the other fields outside Social Psychology. It is true that many scholars have challenged the idea of using p-values, and offered suggestions to include power, confidence intervals, etc. But this editorial challenges all of that.
Keep on learning…