Astronomy is easy with a ground-breaking instrument: just point it at an object and new features jump out. This happened with the current generation of x-ray telescopes, XMM-Newton and the Chandra X-ray Observatory, which are able to resolve emission lines from highly-ionized atoms in the hotter regions of the universe. Where earlier instruments saw a smeared hump in the spectrum, these instruments see a forest of lines standing above a smooth continuum. For bright sources these lines are easy to measure.
The tricky measurements in astronomy are of weak features, those so faint that we cannot be certain they exist. Elements of chance enter into the detection: the number of x-rays that reach the detector in a second has some randomness to it, as does the precise energy we measure for each x-ray. Here we must resort to statistics to decide whether a line we see is real or a statistical fluke. But statistical analysis has its pitfalls, and one of the largest is that it produces a surprising number of false detections.
Back when the controversy over the origin of gamma-ray bursts was raging, over whether their sources lie within our galaxy or outside of it, a measurement appeared to show gamma-ray bursts clustering towards the center of the galaxy, which suggested that the sources are galactic. But the significance of the effect was not great: in the jargon of statistics, the measured clustering deviated by 2.5 standard deviations (usually called 2.5 sigma, or 2.5σ) from a random distribution of bursts on the sky. The chance of getting this level of clustering towards the center of the galaxy by randomly throwing sources on the sky is less than 1%. You would look at this and say that clearly it must be a real effect, and in fact at the conference where this result was presented as an upper limit on the amount of clustering rather than as a detection, one researcher did get up and say, “Why are we calling this an upper limit on clustering? I say we should call this a 2.5 sigma detection!” But the effect disappeared as more gamma-ray bursts were observed, so this apparently significant result was nothing more than a statistical fluke.
This odd occurrence, the detection of something improbable that subsequently disappears, is surprisingly common. In astronomy, three standard deviations from the expected result is generally accepted as the minimum criterion for accepting a signal as significant. The chance of a three-sigma signal arising through statistical fluctuations is about 0.3%, or one in 370, but I have the sense that far more than one in 370 three-sigma results are false detections. While I have no numbers to back this feeling up, my gut sense is that nearly half of all three-sigma detections turn out to be statistical flukes.
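For readers who want to check these figures, here is a minimal sketch in Python, using only the standard library, that converts a sigma level into the corresponding Gaussian tail probability. The choice between one-sided and two-sided conventions is mine; the "less than 1%" quoted above for the 2.5-sigma clustering corresponds to the one-sided value.

    from math import erfc, sqrt

    def tail_probability(n_sigma, two_sided=True):
        """Chance of a Gaussian fluctuation at least n_sigma from the mean."""
        p = erfc(n_sigma / sqrt(2.0))   # area in both tails of the bell curve
        return p if two_sided else p / 2.0

    for n in (2.5, 3.0):
        print("%.1f sigma: %.3f%% two-sided, %.3f%% one-sided"
              % (n, 100 * tail_probability(n), 100 * tail_probability(n, False)))
    # 2.5 sigma: 1.242% two-sided, 0.621% one-sided
    # 3.0 sigma: 0.270% two-sided, 0.135% one-sided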
A high rate of false positives is not confined to astronomy; a recent paper published in the medical literature examines this effect for highly-publicized studies. This study found that about 15% of the studies turned out to be wrong, and another 15% found an effect that was stronger than that found by follow-up studies. As a Wall Street Journal editorial makes clear, these false detections can harm the public and damage the reputation of the medical profession. In the astronomy profession, they simply harm the reputation of the researcher.
Why should this happen at all? The reason is that the probability we use to judge the significance of a detection does not match the way we conduct research. The probability refers to a single experiment, while in the scientific community many different experiments are performed by many different researchers.
I recall as a graduate student watching a fellow graduate student search for an effect by plotting a set of data in every conceivable manner. One after another, she pulled up scatter plots of one variable against another for data selected under a variety of criteria. I asked her what she was doing. “I'm looking for a signal,” she replied. I suspect that if she had found one, she would not have taken the total number of tests she had performed into account when deciding whether the signal was significant. But in a sense the whole scientific community is operating under precisely this protocol.
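A minimal simulation, with illustrative parameters of my own choosing rather than anything from the actual episode, shows why such a search is dangerous: scanning a hundred scatter plots of pure noise turns up a "three-sigma" correlation far more often than the single-plot odds suggest.

    import random

    random.seed(1)
    n_points = 50    # data points in each scatter plot
    n_plots  = 100   # scatter plots examined in one search
    n_trials = 200   # complete searches simulated

    def correlation_sigma(n):
        """Correlation of two pure-noise variables, in approximate sigma."""
        xs = [random.gauss(0, 1) for _ in range(n)]
        ys = [random.gauss(0, 1) for _ in range(n)]
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return abs(sxy / (sxx * syy) ** 0.5) * n ** 0.5

    hits = sum(any(correlation_sigma(n_points) > 3.0 for _ in range(n_plots))
               for _ in range(n_trials))
    print("Searches finding a 'signal' in pure noise: %.0f%%"
          % (100.0 * hits / n_trials))
    # typically prints near 20%, against roughly 0.27% for any single plot

With a hundred looks at the data, a once-in-several-hundred fluctuation becomes roughly a one-in-five event.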
The whole process is similar to gamblers walking into a casino and betting on the roulette wheel. The chances of a single gambler winning on a single number with a single spin are low, but as the number of spins increases, his odds of winning on that number go up. And if we consider the odds over all of the gamblers in the casino, we find that despite the long odds, some number of gamblers will win on any given spin. In science, each experiment is one more spin of the roulette wheel, and the more experiments done by the community, even if each member of the community performs only one experiment, the more likely it is that an unusual event will arise through chance. Unlikely results are precisely the results that get published in the scientific journals, so it is inevitable that a certain fraction of published results will be false detections.
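The arithmetic behind the analogy is simple: if a single experiment has a probability p of producing a fluke, the chance that at least one of N independent experiments produces one is 1 - (1 - p)^N, which climbs quickly with N. A short sketch:

    p = 0.0027   # chance of a 3-sigma fluke in a single experiment
    for n_experiments in (1, 10, 100, 1000):
        chance = 1.0 - (1.0 - p) ** n_experiments
        print("%4d experiments: %5.1f%% chance of at least one fluke"
              % (n_experiments, 100.0 * chance))
    #    1 experiments:   0.3%
    #   10 experiments:   2.7%
    #  100 experiments:  23.7%
    # 1000 experiments:  93.3%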
What fraction of detections will be false? Nature sets the number of real signals that can be detected, and they can be few or many, but science itself sets the number of statistical flukes that will be detected, through the number of tests performed within the community. Because there is no way to count the number of tests done within the community, we cannot calculate the number of fluke detections to expect. This means that virtually every detection could be a signal, or every detection could be a fluke. The only way to distinguish between the two is to treat a low-probability event as a prediction rather than as a detection, and test whether the signal persists in new data. The statistical probability of obtaining a highly significant signal in the second test is then an accurate measure of the chances of obtaining the result through a random fluke. This is one reason why we continually revisit an experiment, for even the best researcher can be unlucky and produce a fluke result.
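Here is a minimal sketch of this prediction-and-retest logic, using a made-up observable (the mean of one hundred noisy measurements) purely for illustration. Flukes are harvested from a first round of pure-noise data, and each is then checked against fresh, independent data:

    import random

    random.seed(2)

    def sigma_of_mean(n):
        """Mean of n unit-variance noise samples, in standard errors."""
        return abs(sum(random.gauss(0, 1) for _ in range(n))) / n ** 0.5

    detections = [1 for _ in range(100000) if sigma_of_mean(100) > 3.0]
    confirmed  = [1 for _ in detections if sigma_of_mean(100) > 3.0]
    print("first-pass 3-sigma detections:", len(detections))   # about 270
    print("confirmed on independent data:", len(confirmed))    # usually 0 or 1

No matter how many searches produced the original list of detections, the retest probability for each survivor is again the bare 0.27%, which is why a confirming observation carries real statistical weight.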
Freddie Wilkinson