This is the second post in the “Let’s not hypothesize” series. The first post is available here.
This post takes a brief look at how the hypothesis testing seen in most statistics texts came into being.
My main sources of information are:
1) The Empire of Chance,
2) The Lady Tasting Tea, and
3) Explorations in statistics: hypothesis tests and P values.
I have separated the evolution into three phases.
The article “Explorations in statistics: hypothesis tests and P values” gives 1279 as the date of origin of hypothesis testing. The Royal Mint in London took a sample of coins from each run of the mint and compared it against a known set of standards. I invite the reader to click on the third reference given above to read about this in more detail.
The article also discusses William Sealy Gosset (“Student”) and his t-test. What struck me most was Gosset describing the significance of a drug’s effect in terms of odds. This was well before the advent of p-values for determining the significance of data.
First let us see what is the probability that [drug A] will on the average give increase of sleep. [Looking up the ratio of the sample mean to the sample standard deviation] in the table for ten experiments we find by interpolating. . .the odds are .887 to .113 that the mean is positive. That is about 8 to 1 and would correspond to the normal curve to about 1.8 times the probable error. It is then very likely that [drug A] gives an increase of sleep, but would occasion no surprise if the results were reversed by further experiments.
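The arithmetic in Gosset’s passage can be checked with a few lines of Python. This is only a sketch of my own, not anything from the article. It assumes the standard definition of the “probable error” as the half-width of the central 50% of a normal distribution (about 0.6745 standard deviations), and the variable names are mine.

```python
from statistics import NormalDist

# Probability quoted by Gosset that the mean increase in sleep is positive.
p_positive = 0.887

# Odds in favour of a positive mean: .887 to .113, "about 8 to 1".
odds = p_positive / (1 - p_positive)

# Assumption: one "probable error" is the 75th percentile of the standard
# normal, i.e. half of all observations fall within one probable error.
probable_error_in_sd = NormalDist().inv_cdf(0.75)

# Normal deviate with the same one-sided probability, in probable errors.
z = NormalDist().inv_cdf(p_positive)
probable_errors = z / probable_error_in_sd

print(f"odds: {odds:.2f} to 1")
print(f"probable errors: {probable_errors:.2f}")
```

Both figures come out close to the “about 8 to 1” and “about 1.8 times the probable error” in the quote.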
2) Sir Ronald Fisher:
It was Sir Ronald Fisher who clearly formulated the idea of a null hypothesis (H0) and the use of a p-value, a probability conditional on the null hypothesis being true, to make a decision based on the data. He termed this “Significance Testing”. The main distinction from today’s texts is that Fisher used only the null (or nil) hypothesis. His reasoning was that if the p-value fell below a cut-off point (let’s say .05), then either a very rare event had occurred or the null hypothesis model was wrong, and the latter is the more likely explanation. Fisher saw no need for an alternative hypothesis, nor for repeating tests to see how powerful the test was. His method is based on inductive inference.
Fisher also never meant .05 to be the only cut-off value. He viewed p-values as inductive evidence against the null hypothesis.
If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point), or one in a hundred (the 1 per cent. point). Personally, the writer prefers to set a low standard of significance at the 5 per cent. point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.
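To make Fisher’s approach concrete, here is a small sketch built on the tea-tasting experiment that gives one of my source books its title. I am assuming the design as it is usually described: eight cups, four with milk poured first, and the lady must pick out those four. The function name `p_value` is mine. Note that only a null hypothesis (she is guessing) is needed, and that the same p-value can be read against several of the levels Fisher mentions.

```python
from math import comb

def p_value(correct, cups_per_kind=4):
    """P(at least `correct` milk-first cups identified) under H0: pure guessing."""
    # Number of equally likely ways to choose which cups are "milk first".
    total = comb(2 * cups_per_kind, cups_per_kind)
    # Ways to get k of the true milk-first cups and the rest wrong.
    favourable = sum(
        comb(cups_per_kind, k) * comb(cups_per_kind, cups_per_kind - k)
        for k in range(correct, cups_per_kind + 1)
    )
    return favourable / total

p_all_correct = p_value(4)   # 1 in 70
p_three_or_more = p_value(3) # 17 in 70

for level in (0.05, 0.02, 0.01):
    print(f"all correct, significant at {level:.0%}: {p_all_correct < level}")
```

With all four cups right the p-value is 1/70, which passes the 5 and 2 per cent points but not the 1 per cent point. Three out of four gives 17/70, which is not significant at any of them.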
3) Neyman-Pearson Hypothesis Testing:
The books “The Lady Tasting Tea” and “The Empire of Chance” go into detail about the “feud” between the great minds of Fisher and Neyman/Pearson.
It was Neyman and Pearson who came up with the idea of using an alternative hypothesis (H1) and testing it against the null hypothesis. They also introduced the power of a test and the ideas of type I and type II errors. They termed their version “Hypothesis Testing”. Their version is based on inductive behavior.
They defined alpha, beta and power as follows.
alpha = P(reject H0|H0 is true)
beta = P(fail to reject H0|H0 is false)
power = 1 – beta
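These definitions are long-run frequencies, so they can be approximated by simulation. The sketch below is my own illustration and rests on several assumptions: a one-sided z-test of H0: mean = 0 with a known standard deviation of 1, a sample size of 25, and a true mean of 0.5 when H0 is false. The function name `rejection_rate` and all the numbers are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def rejection_rate(true_mean, n=25, sigma=1.0, alpha=0.05, trials=10_000, seed=0):
    """Fraction of simulated experiments in which H0: mean = 0 is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = fmean(sample) / (sigma / n ** 0.5)  # z statistic under H0
        if z > z_crit:
            rejections += 1
    return rejections / trials

# alpha = P(reject H0 | H0 is true): simulate with the true mean equal to 0.
alpha_hat = rejection_rate(true_mean=0.0)

# power = P(reject H0 | H0 is false): simulate with an assumed true mean of 0.5.
power_hat = rejection_rate(true_mean=0.5, seed=1)

# beta = P(fail to reject H0 | H0 is false) = 1 - power.
beta_hat = 1 - power_hat

print(f"alpha ~ {alpha_hat:.3f}, beta ~ {beta_hat:.3f}, power ~ {power_hat:.3f}")
```

Under these assumptions the estimated alpha should land near .05 and the power near .80, with some simulation noise.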
Where we are now:
What we learn and use these days is a hybrid of the Fisher and Neyman/Pearson methods. The textbook method is generally as follows:
1) Define the null and alternative hypotheses.
2) Set an alpha value of .05 and a power value of .80 before the experiment.
3) Calculate the test statistic and p-value from the data collected.
4) Reject or retain (fail to reject) the null hypothesis based on the p-value.
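The four steps can be sketched in Python as follows. This is a minimal illustration, not a recommended analysis: I am assuming a one-sided z-test with a known standard deviation, an effect size of 0.5 chosen for planning, and a made-up observed sample mean of 0.4. None of these numbers come from real data.

```python
from math import ceil
from statistics import NormalDist

std_normal = NormalDist()

# 1) Hypotheses (assumed for illustration): H0: mean = 0, H1: mean > 0.

# 2) Fix alpha and the target power before the experiment, then derive the
#    sample size needed to detect the assumed effect.
alpha, target_power = 0.05, 0.80
sigma = 1.0            # assumed known standard deviation
assumed_effect = 0.5   # hypothetical smallest effect worth detecting
z_alpha = std_normal.inv_cdf(1 - alpha)
z_beta = std_normal.inv_cdf(target_power)
n = ceil(((z_alpha + z_beta) * sigma / assumed_effect) ** 2)

# 3) Test statistic and p-value from the (hypothetical) data.
sample_mean = 0.4      # made-up observed mean of the n measurements
z = sample_mean / (sigma / n ** 0.5)
p_value = 1 - std_normal.cdf(z)

# 4) Decision based on the p-value.
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"n = {n}, z = {z:.2f}, p = {p_value:.4f} -> {decision}")
```

Notice how the sketch mixes the two schools: alpha and power are fixed in advance in the Neyman/Pearson manner, while the final call is made on a Fisher-style p-value.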
Critics of this combined method claim that it takes the worst of the two approaches. They recommend focusing on effect size and using confidence intervals to give a better view of the problem at hand, rather than relying blindly on the p-value alone.
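As a rough sketch of what reporting an effect size and a confidence interval might look like, here is a small example. The numbers are hypothetical: an observed mean of 0.4 from 25 measurements with a known standard deviation of 1, compared against a null value of 0. The effect size used is the standardized mean difference, which is one common choice among several.

```python
from statistics import NormalDist

# Hypothetical summary of an experiment (not real data).
sample_mean, sigma, n = 0.4, 1.0, 25

# Standardized effect size: distance from the null value 0 in sd units.
effect_size = sample_mean / sigma

# 95% confidence interval for the mean, assuming sigma is known.
std_error = sigma / n ** 0.5
z = NormalDist().inv_cdf(0.975)
ci_low = sample_mean - z * std_error
ci_high = sample_mean + z * std_error

print(f"effect size = {effect_size:.2f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The interval runs from just above 0 to nearly 0.8. A bare “p < .05” would hide how uncertain that estimate of the effect still is.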
Keep on learning…