(Sigma is the Greek letter σ)

Medical science and especially epidemiology treat 2-sigma as a kind of ‘gold standard’. This article attempts to explain what that means and why it is a scandal.

There has always been a magical appeal in the notion of extracting information from apparently random events. Early examples include the ‘I Ching’. However, in recent centuries, the development of probability theory has provided a rigorous mathematical understanding of the nature of chance, and with it a basis from which to develop methods for analysing statistical data.

Amongst others, Sir Ronald (Aylmer) Fisher, often known as the ‘father of modern statistics’, developed a method for designing statistical tests and arriving at a result which would provide a level of ‘confidence’ in the conclusion. The calculations involved in these tests were complex and he envisioned them being deployed by professional, qualified statisticians equipped with log tables and slide rules. He suggested that a confidence level of 95% might be a useful benchmark for a one-off randomised controlled trial. Without going into detail, this equates to 2-sigma (1.96 to be precise).

There are many different types of statistical test that can use these methods. However, the most common one in junk science is a measure called ‘relative risk’ (RR). Note that some studies calculate an ‘odds ratio’ (OR) instead. The OR is very similar to an RR but will generally produce a figure further from 1.

The idea is that you compare the incidence of something (e.g. disease or death) in a group exposed to a substance under test, with a group that is not exposed (the control group). The resulting ratio is the RR. This is expressed as a point estimate, along with a range of values within which there is confidence that the real value lies. This is called a confidence interval, normally written as “CI = (n to m) at 95%”.
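
For anyone who wants to see the arithmetic, here is a minimal sketch of one common recipe: the RR from a 2x2 table, a 95% interval built as a 1.96-sigma band on the log scale, and the OR from the same table for comparison. The counts are made up purely for illustration and the method (the log-RR normal approximation) is just one of several in use.

```python
import math

# Hypothetical 2x2 table, purely for illustration.
exposed_cases, exposed_total = 30, 1000    # exposed group
control_cases, control_total = 20, 1000    # control (non-exposed) group

# Relative risk: ratio of the incidence in the two groups.
risk_exposed = exposed_cases / exposed_total
risk_control = control_cases / control_total
rr = risk_exposed / risk_control

# One common recipe for the 95% CI: a 1.96-sigma interval on the log scale.
se_log_rr = math.sqrt(1/exposed_cases - 1/exposed_total
                      + 1/control_cases - 1/control_total)
low = math.exp(math.log(rr) - 1.96 * se_log_rr)
high = math.exp(math.log(rr) + 1.96 * se_log_rr)

# Odds ratio from the same table, for comparison.
odds_exposed = exposed_cases / (exposed_total - exposed_cases)
odds_control = control_cases / (control_total - control_cases)
odds_ratio = odds_exposed / odds_control

print(f"RR = {rr:.2f}, CI = ({low:.2f} to {high:.2f}) at 95%")
print(f"OR = {odds_ratio:.2f}  (slightly further from 1 than the RR)")
```

With these made-up numbers the interval straddles 1, so the result would not count as ‘significant’ even at 2-sigma.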

To run a statistical test it is critical that other factors are excluded. Fisher laid great emphasis on this. The only way to exclude other factors is to run a ‘randomised controlled double blind trial’. ‘Randomised/controlled’ means that the test population is split into an ‘exposed’ group and a ‘non-exposed’ group (the control group), and that the allocation to each group is not biased by any other factors, known or unknown, hence ‘random’. ‘Double blind’ refers to the important point that neither the researcher nor the subject knows who is in which group, at least until the trial ends.

It is also vital that the trial is fully defined before it starts. For example, if a trial is specified as running for one year then it must run for exactly one year, neither cut short nor extended.

To understand the importance of this, consider a drunken walk. The drunk randomly takes a step to either the right or the left, so a trial to see whether he is biased to one side might consist of looking at, say, 100 steps and reporting where he ended up. If the researchers have the option of earlier or later termination then they could simply wait until he was a long way off centre and report that as their result. Obviously he might well veer a long way off centre during the trial, but that is not a valid test. The only thing that is valid is his final position at the pre-specified end.
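
A quick simulation makes the point. This is only a sketch: the number of walks and the threshold for ‘a long way off centre’ are arbitrary choices for illustration.

```python
import random

random.seed(1)
walks, steps = 10_000, 100
threshold = 16          # arbitrary definition of "a long way off centre"

far_at_end = far_at_some_point = 0
for _ in range(walks):
    position, worst = 0, 0
    for _ in range(steps):
        position += random.choice((-1, 1))   # one random step left or right
        worst = max(worst, abs(position))
    far_at_end += abs(position) >= threshold
    far_at_some_point += worst >= threshold

print(f"far off centre at the pre-specified end:  {far_at_end / walks:.1%}")
print(f"far off centre at some point during walk: {far_at_some_point / walks:.1%}")
```

A researcher free to stop whenever the second condition is met will report ‘far off centre’ markedly more often than one who honours the pre-specified end point.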

The trouble is that, with the advent of computers, anyone can perform such a test. All you need is a statistical package or even a humble spreadsheet (see below). What’s more, epidemiologists don’t bother with ‘randomised controlled trials’ and instead just observe data from the real world with all its inherent complexity. This introduces confounding factors and, potentially, researcher bias.

So whereas Fisher anticipated a small number of experienced statisticians performing occasional tests and interpreting the results with care, we now have thousands of epidemiologists, with little or no understanding of the statistical limitations, performing vast numbers of tests at the click of a button. It is no wonder that there is so much junk science around! At best, 5% of their results can be expected to be in error, and in reality many more. This is partly because they can decide which results to publish, and partly for many other reasons, including premature termination (examined below).

There is a very simple improvement that could be made though. Particle physicists insist on at least 3, 4 or 5-sigma tests. The difference in confidence is enormous. A 3-sigma test gives a confidence level of 99.7%, 4-sigma roughly 99.994% and 5-sigma roughly 99.99994%.
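
These figures are easy to check. The sketch below assumes, as the article does throughout, a two-sided interval under a normal distribution, and converts a sigma threshold into the corresponding confidence level.

```python
import math

def confidence(sigma):
    # Fraction of a normal distribution lying within +/- sigma
    # standard deviations (two-sided).
    return math.erf(sigma / math.sqrt(2))

for s in (1.96, 3, 4, 5):
    print(f"{s}-sigma: {confidence(s):.6%}")
# roughly 95.0004%, 99.7300%, 99.9937% and 99.99994% respectively
```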

So how to design such tests?

All you need to do is to increase the sample size. If you have seen the results of a trial that is at 2-sigma (95%) then, treating that as suggestive, you simply run your own version but roughly double the size of the sample. Then, if there was any substance in the original, you should achieve a significant result at 3-sigma (99.7%).

Specifically, to replicate a 2-sigma test at 3-sigma, just multiply the sample size by 9/4 (3 squared divided by 2 squared). To move from 2 to 4 the factor is 16/4 and so forth.
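
The reasoning behind the rule of thumb is that the standard error of an estimate shrinks as one over the square root of the sample size, so pushing the same true effect from z_old to z_new standard errors requires the sample to grow by (z_new / z_old) squared. A small sketch:

```python
def size_multiplier(z_old, z_new):
    # Standard errors scale as 1/sqrt(n), so the same effect reaches
    # z_new sigma instead of z_old sigma when n grows by this factor.
    return (z_new / z_old) ** 2

print(size_multiplier(2, 3))      # 2.25 -- the 9/4 in the text
print(size_multiplier(2, 4))      # 4.0  -- the 16/4 in the text
print(size_multiplier(1.96, 3))   # ~2.34 if the exact 1.96 threshold is used
```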

For those who wish to verify this for themselves, here is a rather scruffy spreadsheet. It includes a little detail on the intermediate steps in calculating confidence intervals and RR/OR values, and also a little toolbox for calculating sample sizes so as to re-test at different confidence (sigma) levels.

Note that medics almost exclusively use 1.96 sigma. So there are hundreds of passive smoking studies all attempting to reach this magic number, when what should have been done was to treat the first few as indicative and then run a better test with a larger sample.

Pharmaceutical companies use these same flawed ‘standards’ for drug trials too. And there is no (advance) registry of drug trials to constrain their abuse of method.

See also: medcalc

Finally, please remember that all these tests can do is to find, or fail to find, correlations and correlation does not imply causation.

Premature Termination

One of many flaws in the medical use of statistics is that researchers often have the freedom to terminate a trial early. That may not sound like such a big deal, but in principle it could reduce the confidence level from 95% to more like 50%. Even at 3-sigma this is a problem, because it could reduce the actual confidence level to around 95%. Of course, in the real world researchers would not have quite as much flexibility as to when to stop, so the above figures should be seen as a purely theoretical worst-case scenario.

To verify this for yourself, try this rather rough and ready programme here. You will need to have JavaScript enabled in your browser. Please feel free to examine, save and edit the code for your own use: view the html source in your browser (‘view-source’ in IE, or ‘Tools – Developer – Page Source’ in Firefox), copy and paste it into a text document using Windows Notepad (or better still Notepad++) and save it as e.g. myprog.html. Then you can edit it and test it by opening the file with your browser.

The programme runs 2000 trials of 5000 coin tosses each. Note that with more coin tosses, the confidence level decreases even further. The earliest termination allowed is after 20 tosses, since it is only from that point that the normal approximation can reasonably be said to hold (not that medics would worry about such niceties).
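
For those who would rather not dig through html source, here is a rough equivalent sketch in Python. It is not the original programme, and the runs are shorter (2000 trials of up to 1000 tosses) purely so that it finishes quickly. It lets a ‘researcher’ stop a run of fair-coin tosses the moment the running total looks significant at 1.96-sigma, any time after the twentieth toss, and counts how often a fair coin ends up declared biased.

```python
import math
import random

random.seed(1)
trials, max_tosses, min_tosses = 2000, 1000, 20

declared_biased = 0
for _ in range(trials):
    heads = 0
    for n in range(1, max_tosses + 1):
        heads += random.getrandbits(1)          # one fair coin toss
        if n >= min_tosses:
            # Normal approximation to the binomial: mean n/2, sd sqrt(n)/2.
            z = (heads - n / 2) / (math.sqrt(n) / 2)
            if abs(z) > 1.96:                   # looks "significant" -- stop here
                declared_biased += 1
                break

print(f"fair coins declared biased: {declared_biased / trials:.1%}")
# At a genuine 95% confidence level this should be about 5%; with the
# freedom to stop early it comes out several times higher, and the longer
# the permitted run, the worse it gets.
```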