Chapter 2 Spotlight D: Statistical significance

There are many books that aim to teach statistics to people who feel less than confident in the topic.  If your professor recommends one of them then clearly that is your best starting point.  

I got my understanding of descriptive statistics by reading Darrell Huff’s How to Lie with Statistics  – the 1973 edition with fun little pictures by Mel Calman. Huff’s book has never been out of print since it was first published in 1954 so you can pick up second-hand copies easily and any library ought to be able to lend you one. I lost mine, possibly because I lent it to someone who failed to return it.

Two books helped me to understand more about the ‘why’ of statistics. I felt that I was able to get more out of them once I already had a relatively good grasp on basic descriptive statistics and a beginner’s acquaintance with inferential statistics: 

R.A. Fisher, the inventor of p-values, never specified that the requirement for statistical significance was 0.05 ('5 in a hundred', or a 95% confidence level). He did give one example of a test with a 'one in a hundred' outcome, which is p < 0.01 (Salsburg 2002).

Statistical significance (full version)

In the Introduction to my book, I talked about surveys as a quantitative method. In other words, the result of a survey is a number that you can use to make a decision. 

The question can then arise: Is this result statistically significant?  

Statistical significance is different from significance in practice 

Here are my definitions of significance, based on The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, P.D. Ellis,  Cambridge University Press 2010.

A statistically significant result is one that is unlikely to be the result of chance.  

A result that is significant in practice is one that is meaningful in the real world. 

Statistical significance is closely related to sampling error: the choice to ask a sample rather than everyone in your defined group of people. There are many different ways of doing the calculations and they are called ‘statistical tests’.  

A statistical test for significance is a calculation based on the data you collect, sampling theory, and assumptions about the structure of that data and where it comes from. 

 image missing here

Figure S3.1 Sampling error happens when the sample you ask is smaller than the list you sample from 

For example, there are different tests for data that is continuous or discrete, and for paired observations or single observations. There are tools online to help you choose the right test, such as: Choosing the Right Test. 
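
To make that concrete, here is a small sketch in Python's scipy library, with made-up numbers of my own: the same set of scores needs a different test, and gives a different answer, depending on whether you treat the observations as paired or independent.

```python
# Sketch with made-up numbers: the same scores call for different tests
# depending on whether the observations are paired or independent.
from scipy import stats

# Satisfaction scores (1-10) before and after a redesign
before = [6, 5, 7, 6, 8, 5, 6, 7, 5, 6]
after  = [7, 6, 7, 8, 8, 6, 7, 8, 6, 7]

# If the SAME ten people answered both times, the observations are paired:
paired = stats.ttest_rel(before, after)

# If two DIFFERENT groups of ten people answered, the observations are independent:
independent = stats.ttest_ind(before, after)

print(f"paired t-test:      p = {paired.pvalue:.4f}")
print(f"independent t-test: p = {independent.pvalue:.4f}")
# The paired test gives a much smaller p-value here because it strips out the
# person-to-person variation - same data, different test, different answer.
```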

In contrast, the whole Survey Octopus is about all the choices that you make to achieve a result that is significant in practice. One thing that makes a result useful is considering the effect: 

An effect is something happening that is not the result of chance. 

A result can be statistically significant but not significant in practice 

Any statistical test can result in a false negative:  a false negative result is one that fails to find an effect. Also known as: “missing something that is there” or Type 2 error. 

The probability of a false negative is called β (Greek letter beta). 

Statistical power is the probability that a test will correctly identify a genuine effect (Ellis 2010). It is equal to 1 – β. 

The more datapoints that you have for your test, the more likely you are to identify an effect, and the more powerful your test will be – statistically, but not necessarily usefully.  

Here’s an example from a real experiment on an electronic customer satisfaction survey, aimed at persuading people to switch from completing a survey on their smartphone to completing it on a PC (Unintended Mobile Respondents, G. Peterson, CASRO Technology Conference, New York, 2012, quoted in Peterson, G., J. Griffin, J. LaFrance and J. Li (2017), “Smartphone Participation in Web Surveys”, in P. P. Biemer, E. D. de Leeuw, S. Eckman, B. Edwards, F. Kreuter, L. Lyberg, C. Tucker and B. T. West (eds), Total Survey Error in Practice, Hoboken, New Jersey, Wiley, 203-233). There were three “treatments”: 

  • “Invitation” – a message in the survey invitation telling them that the survey worked better on a PC than on other devices 
  • “Introductory screen” – a similar message, but in the first screen of the survey 
  • “Control” – no messages. 

The researchers kept track of the proportion of people who answered who switched from smartphone to a PC.  

                                  Treatment
                                  Invitation    Introductory screen    Control
Number in each group              3,115         3,154                  3,042
Did not switch device             99.7%         99.1%                  99.6%
Switched from smartphone to PC    0.3%          0.9%                   0.4%

In this example, the sample sizes are so large that the differences between groups are statistically significant, but the effect size is too small to have any significance in practice. It demonstrates the risk of picking a large sample size without considering the size of effect that is useful for making the decision. But we avoided a false negative: there was an effect, and we found it. 
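
If you would like to check the 'statistically significant, but tiny' claim for yourself, here is a rough sketch in Python. I have reconstructed approximate counts from the rounded percentages in the table, and used a chi-squared test; the original analysis may well have used something different.

```python
# Rough check of "statistically significant, but tiny", using counts
# reconstructed from the rounded percentages above - treat them as approximate.
from scipy.stats import chi2_contingency

# Rows: Invitation, Introductory screen, Control
# Columns: switched to PC, did not switch
observed = [
    [9,  3106],   # ~0.3% of 3,115
    [28, 3126],   # ~0.9% of 3,154
    [12, 3030],   # ~0.4% of 3,042
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.1f}, p = {p:.4f}")
# p comes out well below 0.05, so the differences count as 'statistically
# significant' - yet the biggest difference is only 0.6 percentage points,
# far too small to matter for the decision.
```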

The opposite error is a false positive: a false positive result is one that detects an effect when in fact the variation in the data is due to chance. Also known as: “seeing something that is not there” or Type 1 error. 

The probability of a false positive is called α (Greek letter alpha).   

                                   What is actually happening
What we observe in the data        Chance only                        An effect
Looks like something is there      Seeing something that is not       Correct
                                   there: false positive (α)
Looks like nothing is there        Correct                            Missing something that is
                                                                      there: false negative (β)

 A result can be significant in practice but not statistically significant 

The worry about seeing an effect that is not there dominates statistical significance testing, which can lead to rejecting something that lacks statistical significance but is significant in practice. 

For example, the statistician Professor Sir David Spiegelhalter tweeted about a medical paper, shown in Figure S3.2. 

 Image missing here

Figure S3.2 

“This paper motivates the call for the end of significance. A 25% mortality reduction, but because P=0.06 (two-sided), they declare it ‘did not reduce’ mortality. Appalling.” Courtesy of https://twitter.com/d_spiegel/status/1110477993317679104 

If you would like to read the paper yourself, it is: Hernández, G., G. A. Ospina-Tascón, L. P. Damiani, E. Estenssoro, A. Dubin, J. Hurtado, G. Friedman, R. Castro, L. Alegría, J.-L. Teboul, M. Cecconi, G. Ferri, M. Jibaja, R. Pairumani, P. Fernández, D. Barahona, V. Granda-Luna, A. B. Cavalcanti, J. Bakker, for the ANDROMEDA-SHOCK Investigators and the Latin America Intensive Care Network (2019). Effect of a Resuscitation Strategy Targeting Peripheral Perfusion Status vs Serum Lactate Levels on 28-Day Mortality Among Patients With Septic Shock: The ANDROMEDA-SHOCK Randomized Clinical Trial. JAMA 321(7): 654-664. They compared two treatments for people with septic shock. For one treatment, 74 out of 212 patients died; for the other, 92 out of 212 patients died. If I were a relative of one of the extra 18 patients who died on the less-good treatment, I would hope that the experimenters did more work before rejecting the better treatment because of a statistical quirk.  
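
As a back-of-the-envelope illustration, here is what a simple test on those raw death counts looks like in Python, using Fisher's exact test. It is not the paper's own analysis, so the number will not match their P = 0.06 exactly, but it lands in the same territory.

```python
# A rough check (not the paper's own analysis) on the raw counts:
# 74 deaths out of 212 patients on one treatment, 92 out of 212 on the other.
from scipy.stats import fisher_exact

table = [
    [74, 212 - 74],   # deaths, survivors in one group
    [92, 212 - 92],   # deaths, survivors in the other group
]

_, p = fisher_exact(table)
print(f"two-sided p = {p:.3f}")
# p lands a little above 0.05, so a rigid "p < 0.05" rule labels a
# difference of 18 deaths "not significant" - which is exactly the objection.
```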

Which brings me to “p=0.06” in Professor Spiegelhalter’s tweet. Let’s look at that now.  

Statistical significance relies on a p-value 

Somewhat arbitrarily, the definition of ‘unlikely to be the result of chance’ is usually given as “p < 0.05”, so we get: 

  • A statistically significant result is one where p < 0.05 

p is the usual abbreviation for the p-value:  

  • The p-value is the probability that the data would be at least as extreme as those observed, if the null hypothesis was true (Vickers 2010) 

When we choose to define “unlikely” as “< 0.05”, we are accepting the risk of seeing an effect that is not there at a rate of 0.05, or 5 in 100 (5%). The value 0.05 is the α. 

The smaller the probability that you choose for α, the less risk of a false positive. The professor’s objection to the paper was that the authors rejected a treatment on the basis of a result that would be “unlikely to be the result of chance” if you allowed the risk of seeing something that is not there to be 6%, but not if it had to be better than 5%. 
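
If accepting 'a 5% risk of seeing something that is not there' feels abstract, here is a small simulation sketch of my own in Python: run lots of tests where nothing real is going on, and about 5% of them come out 'significant' anyway.

```python
# Simulation sketch: when there is genuinely no effect, a "p < 0.05" rule
# still flags one about 5% of the time - which is what alpha = 0.05 means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
runs = 10_000
false_positives = 0

for _ in range(runs):
    # Two groups drawn from the SAME distribution, so any difference is chance
    a = rng.normal(loc=50, scale=10, size=100)
    b = rng.normal(loc=50, scale=10, size=100)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"False positive rate: {false_positives / runs:.3f}")   # close to 0.05
```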

Another concept that is closely related to α is ‘confidence level’. 

  • The confidence level is 1 – α, expressed as a percentage.  

If you choose α as 0.05, that’s the same as a confidence level of 95% or a 1-in-20 chance that you may get a false positive. To make it less likely that you will get a false positive, you need a smaller α – typically, sample size calculators will offer 0.01, equivalent to a confidence level of 99% or a 1-in-100 chance that you may get a false positive.  

Statistical significance does not guarantee good decisions  

You will see from the two examples that although statistical significance can tell us something about the risk of a false positive (seeing an effect that is not there), there is nothing in the definition that considers a false negative (missing an effect that is there). 

There’s also nothing in the definition about the size of the effect. A big sample size can find a very small effect. A small sample size might miss a very big effect.  

“Statistical significance” says nothing about the quality of the decision 

Statistical significance is about the mathematics of the decision process, not the quality of the decision. For “quality of decision” we need significance in practice, which comes mostly from the other topics in our Survey Octopus. 

Statistical significance says nothing about the choice of test 

The amount of data you collect makes a difference 

We started on statistical significance because we wondered how many people we need to ask to get a statistically significant result. 

Statisticians like means and Normal distributions 

[Note: the whole grain example seems to be missing here.]

Statisticians refer to the shape of a dataset as “the distribution”. You may have recognized that the distributions of all six sets of test results, three handfuls and 33 handfuls for each farmer, are looking somewhat like the bell-shaped curve of the Normal distribution.  

I didn’t make that happen by the way I made up the examples. It’s the way that means work when you take a random sample: 

  • you’re more likely to get frequent values than infrequent ones 
  • the frequent values have the biggest influence on the mean of the entire data set 
  • the bigger your sample, the more likely that the mean of your sample is pretty close to the actual mean of the entire data set.

And you can prove it, too: the Central Limit Theorem tells us that, conveniently, as your samples get bigger, the distribution of their means gets closer and closer to a Normal distribution. 
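
If you would like to watch that happen, here is a small simulation sketch of my own in Python: take handfuls of different sizes from a decidedly non-Normal distribution and look at how their means behave.

```python
# Sketch: the means of random samples head towards a Normal distribution,
# and bigger samples give means that cluster closer to the true mean.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 1.0   # mean of the (very skewed) exponential distribution below

for sample_size in (3, 33, 333):
    # Take 5,000 samples of this size and record the mean of each one
    means = rng.exponential(scale=true_mean, size=(5_000, sample_size)).mean(axis=1)
    print(f"sample size {sample_size:>3}:  mean of the means = {means.mean():.3f},  "
          f"spread (sd) of the means = {means.std():.3f}")

# The mean of the sample means stays near the true mean (1.0), the spread
# shrinks roughly as 1/sqrt(n), and a histogram of `means` looks more and
# more like the bell curve as the sample size grows.
```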

A distribution says how many you have of a value 

To pick a test, we need to understand the distribution we are working with. Here’s a definition: 

  • A distribution describes how many there are of each possible value in a dataset. 

Distributions come in all sorts of different shapes. We met one when we looked at how the zone of response might be affected by people with strong feelings:  

 image missing here

Figure S3.15  The distribution of this zone of response has two peaks reflecting strong feelings 

The Normal distribution is a common one and there are many more.  

 image missing here

Figure S3.16  Plushies portraying 10 of the distributions most often used by statisticians. Normal is the green one at the front. Image credit: www.nausicaadistribution.com 

Look at the distribution 

Let’s have a look at the distributions for the three batches of barley that we considered for our brewer. They are all pretty close to Normal distributions, with just about the same widthways spread (standard deviation) but with the peak (which for a Normal distribution is always the same as the mean) in different places.  

 image missing here

Figure S3.17   All the handfuls from each of the farmers.  

Statistical tests do not protect you from wrong assumptions 

Before we discussed variances and standard deviations, I mentioned that a successful test of statistical significance has three parts: 

  1. The mathematical manipulation – the statistical test – that you do on the data. 
  2. The assumptions that you make about the data and whether those assumptions match the underlying mathematics. 
  3. The amount of data that you give to the mathematics.

If your stakeholders are genuinely interested in statistical significance, it’s important to have a discussion with them about what tests they consider to be appropriate for the sort of data that you are collecting, and the decisions that you are making. If you want some advice of your own, many universities have helpful pages for their students about choosing tests, such as https://www.sheffield.ac.uk/mash/statistics/what_test .

Most of all, statistical tests generally assume that you have a random sample. If you don’t – for example, if you chose to sample by ‘snowball up’ – then the tests will still deliver something that looks like a result, even though they have not really worked.  

The probability of the data given the hypothesis says nothing about the hypothesis 

Let’s have another quick look at those definitions. 

  • A statistically significant result is one that is unlikely to be the result of chance. 
  • A statistically significant result is one where p < 0.05. 
  • The p-value is the probability that the data would be at least as extreme as those observed, if the null hypothesis was true (Vickers 2010).

We are aiming to reject the null hypothesis by looking at the probability of the data given that the null hypothesis is true. 

It’s crucial not to confuse “the probability of the data given the hypothesis” and “the probability of the hypothesis given the data”.  

Personally, I find it easier to sort out conundrums like ‘probability of x given y’ compared to ‘probability of y given x’ by thinking: am I given the Popes or the Catholics?  

As I write, there are around 1.3 billion Catholics and two of them are Popes: Pope Francis (current) and Pope Benedict (retired). Try this: 

A. “The probability that we’re looking at a Pope, given that we sampled from Catholics.”  

B. “The probability that we’re looking at a Catholic, given that we sampled from Popes.” 

Answers: 

A is: 2 chances in 1.3 billion
B is: 2 chances in 2 

Whatever you think about Popes or Catholics, I hope you can see that the two probabilities are dramatically different.  

The bottom line for us on statistical significance is that we can’t prove that the null hypothesis is true from our data. We can only look at the data we have and decide whether or not that data is probable if the null hypothesis was true.  

Before you claim statistical significance, use this checklist 

Let’s recap what we established earlier: to know whether something is statistically significant, you must first decide: 

  • What the effect is 
  • What the null hypothesis is 
  • Your α, your view of the amount of risk you can tolerate of “seeing something that is not there” (a false positive or type 1 error).  

And now we’ve added on: 

  • What the distribution looks like 
  • The assumptions you’ve made about what you are testing relative to the distribution 
  • Whether the statistical test that is appropriate for the data you have collected does in fact work correctly given the distribution that you have and your assumptions.  

Finally, we also always need to be clear on these related questions before we collect any data:  

  • Does this help me to answer the Most Crucial Question? 
  • How big an effect do we need for the result to be useful? 
  • What statistical power do we need? Or in everyday terms, how are we going to set our power, 1 – β, the probability that we will not miss the effect if it is big enough?  
  • Can we afford the cost of getting the number of people that our effect size and β require us to find to respond to our survey? (There is a small sketch of this calculation after this list.) 
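
If you would like to see how that last question works out in numbers, here is a minimal sketch using Python's statsmodels library. It assumes the simplest case of comparing two group means; your own survey may need a different calculation.

```python
# Sketch, assuming the simplest case of comparing two group means:
# how many responses per group you need for a given effect size, alpha, and power.
from statsmodels.stats.power import tt_ind_solve_power

effect_size = 0.3   # smallest effect worth acting on, in standard deviations (Cohen's d)
alpha = 0.05        # the risk of a false positive you will tolerate
power = 0.8         # 1 - beta: the chance of catching the effect if it really is there

n_per_group = tt_ind_solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f"About {n_per_group:.0f} responses per group")   # roughly 175 per group
```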

Many leading statisticians reject the over-use of statistical significance 

After all of that, you may be cheered to know that in 2019, over 800 scientists and statisticians signed a call to rethink “statistical significance” and put it in its place as a statistical tool with only limited applications. A summary of their views was published in Nature (shown in Figure S3.21) under the headline Scientists Rise Up Against Statistical Significance (Amrhein, V., S. Greenland and B. McShane, 2019, www.nature.com/articles/d41586-019-00857-9), and an entire special issue of The American Statistician – volume 73, 2019, supplement 1 – was dedicated to detailed discussion of the reasons for this and the implications.  

For example, the leading editorial in the special issue recommends ATOM: 

Accept uncertainty. Be thoughtful, open, and modest (Wasserstein, Schirm et al. 2019) 

Personally, I treat that as an instruction to go back to “significance in practice” and think about the whole Survey Octopus. 

 image missing here

Figure S3.21 Scientists rise up against statistical significance https://www.nature.com/articles/d41586-019-00857-9 

Confidence level is another way of looking at it 

[See comment earlier about the actual details of the example needed in this section.] This takes us back to our brewer, looking at a result of 85 and wondering which sort of batch it was more likely to have come from: one with a peak at about 84, one with a peak at about 89, or one with a peak at about 93.  

 image missing here

Figure S3.23 A handful at 85 compared to three possible batches 

To express this mathematically, we have to decide what we mean by “likely” – and that’s the ‘confidence level’ (usually 95%) that we met a few pages ago [do you need to reference the book page number here?]. If you’ve got some data, and you can make a guess at the standard deviation of the distribution that it came from, then you can work out a confidence interval.  

To get to a confidence interval, you make these assumptions: 

  • The most likely distribution that this result came from is the one with the mean the same as the current result. 
  • It’s appropriate to use the standard deviation of the data that we have as an estimate for the standard deviation of the whole population. 

When you work outwards from the data you’ve got to an estimate of what’s true for the population, the range you get is called a confidence interval.  
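
Here is a minimal sketch in Python of building a confidence interval from a small sample, using the two assumptions above; the numbers are made up for illustration.

```python
# Sketch: a 95% confidence interval for a population mean, built from a small
# sample using the two assumptions above. The numbers are made up.
import numpy as np
from scipy import stats

sample = np.array([88.0, 91.0, 90.0, 93.0, 89.0, 92.0, 87.0, 90.0])

mean = sample.mean()
sd = sample.std(ddof=1)                    # sample sd as the estimate for the population sd
se = sd / np.sqrt(len(sample))             # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)   # two-sided, 95% confidence level

print(f"{mean:.1f} ± {t_crit * se:.1f}")   # prints "90.0 ± 1.7"
# Intervals built this way capture the true population mean in about 95% of samples.
```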

Confidence intervals have more flexibility, and fewer of the oddities that made all those statisticians reject statistical significance – so the same statisticians recommend using them, and who am I to disagree?  

Margin of error flips the confidence interval around 

If we look at what happens with larger numbers in our eventual sample, you can see that as you increase the sample size, the confidence interval gets narrower. 

Mean    Standard deviation    Sample size    Confidence interval
90      3.6                   3              90 ± 75
90      3.6                   33             90 ± 10
90      3.6                   100            90 ± 6
90      3.6                   500            90 ± 3

Instead of asking about statistical significance where something either passes or fails against your required confidence level, it’s possible to turn the idea around and say: how precise a result do we need to make a decision?  

The margin of error is the level of precision that you need in your result. 

Which brings us to the last item in our checklist for a sample size for statistical significance. Here’s our to-do list: 

  • population size  
  • confidence level 
  • margin of error. 

So you, or your stakeholders, can say: “We want the confidence interval we get eventually to be no more than ± x at 95% confidence level”, plug those numbers into a sample size calculator – and there you are, your sample size.  
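
If you are curious about what the calculator is doing behind the scenes, here is a sketch in Python of the usual formula for a proportion, with a finite population correction. Calculators differ in their details, so treat this as an illustration rather than the definitive method.

```python
# Sketch of what a typical sample size calculator does for a proportion:
# the normal-approximation formula plus a finite population correction.
# Different calculators vary, so treat this as an illustration only.
import math

def sample_size(population_size, confidence_level=0.95, margin_of_error=0.05):
    # z-score for the chosen confidence level (1.96 for 95%, 2.576 for 99%)
    z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}[confidence_level]
    p = 0.5                                       # worst case: gives the biggest sample
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    n = n0 / (1 + (n0 - 1) / population_size)     # finite population correction
    return math.ceil(n)

print(sample_size(population_size=10_000))                        # about 370 for ±5% at 95%
print(sample_size(population_size=10_000,
                  confidence_level=0.99, margin_of_error=0.03))   # tighter requirements, bigger sample
```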

Be warned: you’ll still need a random sample. If you’re planning to ‘snowball up’, then you won’t have a random sample and you’ll have to work out the sample size you need some other way.  

So, to recap, the things that you need to use a sample size calculator are: 

  • population size  
  • confidence level  
  • margin of error  

Let’s do an easy one first to get us started. 

The population size for a survey is the number of people in your defined group.  

For example, if your defined group is “current customers” then your population size is the number of current customers. That’s one dealt with.  

  • population size  
  • confidence level 
  • margin of error