That’s a question every psychology student has asked at one time or another! Well, I’ll tell you.

In order to understand Bonferroni, there is some prerequisite knowledge you need to possess. You need to understand what null hypothesis significance testing is, p values, and Type I/Type II errors. If you understand these things, read on. If not, read on also, but this will make less sense to you (I haven’t yet covered these things on this blog, so you’ll have to do some Googling, or buy my book where it is all explained in the absolute best way that is humanly possible. Ahem).

You’re still with me! That’s good. I wonder what percentage of readers have already pressed the back button? Hmm.

So, Bonferroni correction. You know that with a p value set at .05 we’re looking for a less than 5% chance of getting our result (or greater) by chance, assuming the null hypothesis is true. 5% is an arbitrary significance level (or ‘alpha’); not so high that we’re making too many Type I errors (assuming an effect where there isn’t one), but not so low that we’re making too many Type II errors (assuming there isn’t an effect where there is one).

Imagine that we did 20 studies, and in each one we got a p value of exactly .05. A 5% chance of a fluke result in each of 20 studies means it’s odds on that at least one of these results really was a fluke. Now think about how many thousands of studies have been done over the years! This demonstrates the importance of replicating studies – fluke findings have definitely happened and will continue to happen.
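If you want to check that “odds on” claim yourself, here’s a quick sketch (assuming the 20 tests are independent): the chance of at least one fluke is 1 minus the chance of no flukes at all.

```python
# Chance of at least one false positive ("fluke") across n independent
# tests, each run at significance level alpha.
def familywise_error(alpha, n_tests):
    return 1 - (1 - alpha) ** n_tests

print(familywise_error(0.05, 1))   # 0.05 for a single study
print(familywise_error(0.05, 20))  # ~0.64 - better than even odds
```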

However this situation isn’t limited to findings spread over multiple papers. Sometimes in larger papers with several studies and/or analyses rolled into one, you might get a similar predicament. Simply, the more tests you do in a paper, the more chance there is that one of them will have come about through pure chance.

This would be a bad thing – a theory that is modified as a result of an incorrect finding would, of course, be a weaker reflection of reality, and any decisions that were made based on that theory (academic or not) would also be weaker.

So, we need a way to play a little safer when doing multiple tests and comparisons, and we do this by changing the alpha – we look for lower p values than we normally would before we’re happy to say that something is statistically significant.

This is what Bonferroni correction does – alters the alpha. You simply divide .05 by the number of tests that you’re doing, and go by that. If you’re doing five tests, you look for .05 / 5 = .01. If you’re doing 24 tests, you look for .05 / 24 ≈ .002.
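In code form the whole correction is one division – a trivial sketch, but it makes the arithmetic above concrete:

```python
def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

print(bonferroni_alpha(5))   # 0.01
print(bonferroni_alpha(24))  # ~0.00208, i.e. roughly .002
```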

Bonferroni correction might strike you as a little conservative – and it is. It is quite a strict measure to take, and although it does quite a good job of protecting you from Type I errors, it leaves you a little more vulnerable to Type II errors. Again, this is yet another reason that studies need to be replicated.

There you go! An answer to an age-old question. Up next; does the light in the fridge stay on when the door is closed??

Dear Warren

Your post is commendable to say the least. You are my hero when it comes to Bonferroni corrections. Very easy, very lucid, and very accessible explanation.

Thank you very much.

Best and quirkiest explanation ever. I love it! Thanks for helping me with my exam prep!

Thank you- by far, the simplest explanation and the easiest to understand. Love statistics!

“5% is an arbitrary significance level (or ‘alpha’); not so high that we’re making too many Type I errors (assuming an effect where there isn’t one), but not so low that we’re making too many Type II errors (assuming there isn’t an effect where there is one)”

Concerning type 1 and 2 errors, it’s the opposite, no?

Nope. If alpha is higher, you’ll tend to make more type ones, if alpha is lower, you’ll tend to make more type twos.
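If you’d rather see that tradeoff in action than take my word for it, here’s a little simulation. It’s only a sketch – a one-sample z-test on made-up normal data, with the effect size (0.5) and sample size (25) invented purely for illustration:

```python
import math
import random

random.seed(1)

def z_test_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided p value for a one-sample z-test (known sigma)."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

def rejection_rate(alpha, true_mean, n=25, trials=2000):
    """Fraction of simulated studies where H0 (mean = 0) is rejected."""
    hits = sum(
        z_test_p([random.gauss(true_mean, 1) for _ in range(n)]) < alpha
        for _ in range(trials)
    )
    return hits / trials

# With no real effect, the rejection rate IS the Type I error rate:
print(rejection_rate(0.05, true_mean=0.0))   # ~0.05 - more false alarms
print(rejection_rate(0.005, true_mean=0.0))  # ~0.005 - fewer false alarms

# With a real effect, 1 - rejection rate is the Type II error rate:
print(1 - rejection_rate(0.05, true_mean=0.5))   # lower Type II rate...
print(1 - rejection_rate(0.005, true_mean=0.5))  # ...than at the stricter alpha
```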

Great post!

One thing that I’ve been unclear on with Bonferronis is what counts as “multiple tests”. If I am running 20 tests in my study, but 10 of those tests are to do with one explanatory variable (weight, levels 1-3 for example), and 10 are to do with another explanatory variable, where do I apply the correction? Do I apply a correction with n=20 to all values? or a correction of n=10 to each explanatory variable? or some other correction taking into account the 3 levels of each test?

This was a really good explanation of WHAT a Bonferroni correction is. It would be really great to see an expansion in another post (or in this one) about how the correction can be applied – that would be really helpful!

Dean

That’s a really good question, and statistically speaking it shouldn’t matter. If you do 20 studies of different topics, doing one test per paper, and get p = .05 for each of them, statistically it’s odds on that one of them is a false positive. However, you’ll note that research reports don’t tend to do this, and that’s because there are arguments against it.

I have never come across a hard and fast rule for this and I don’t believe there is one, so I can’t answer your question, but I can ramble on for a bit.

In my opinion, certain tests in a paper aren’t a problem. For example, researchers might do a t test on the ages between groups, or run a few bivariate correlations of demographics against the DV, but not include these when correcting p values. That’s fair enough in my book, since with these tests you’re just looking for large differences that might upset the results, rather than trying to protect yourself against false positives. So when I have my devil hat on and I’m looking for flaws in a paper, I don’t worry about that.

Note also that Bonferroni increases your chance of making a Type II error, so it has its own pitfalls too.

I’ve noticed that in many papers the researchers have adjusted for multiple comparisons on each separate test. So say you have 2 ANOVAs with 10 follow up tests each, what you’d probably see is an adjustment of 0.05 / 10 for each set of tests, rather than 0.05 / 20 for all.

Some argue that planned comparisons don’t need to be adjusted, some say adjustments aren’t needed when the previous research strongly points to there being an effect, some say a different method of adjusting would be preferred in cases like that, some say in certain types of research adjustment is less necessary than in others, and on and on and on.

It’s more important, I think, to be aware of the consequences of adjusting and not adjusting, to be cautious about drawing conclusions from results that haven’t been corrected – especially when there have been no replications – and to keep your skeptical hat on.

Hope this helps even a little bit.

http://www.jerrydallal.com/LHSP/mc.htm

http://www.ncbi.nlm.nih.gov/pubmed/2081237

Just double checking,

If I am studying the performance of the same group of people using 4 different procedures (A,B,C,D), and I have to compare the performance, what’s the number of tests and final p value?

Tests: AB, AC, AD, BC, BD and CD (6) tests?

therefore, p after Bonferroni would be 0.05/6=0.0087?

Hamilton.

I make it 0.0083, but yep, that’s right.
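For anyone wanting to avoid listing the pairs out by hand: the number of pairwise comparisons among k groups is k(k−1)/2 (“k choose 2”). A quick sketch:

```python
from math import comb

def pairwise_bonferroni(n_groups, alpha=0.05):
    """Number of pairwise comparisons among n groups, plus corrected alpha."""
    n_tests = comb(n_groups, 2)  # k(k-1)/2
    return n_tests, alpha / n_tests

print(pairwise_bonferroni(4))  # (6, 0.00833...) - the A/B/C/D example above
```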

Hmm. So I’m doing a gene-gene vs. gene-environment analysis. I have 3 environments and 10 genes to look at with only one dependent variable (Y), and I plan to do a regression of each on Y (I think that’s what I’m supposed to be doing). Does that mean my Bonferroni calculation would be .05/13?

If you’re doing 13 separate regressions, yes.

I’m measuring 16 different items in 3 groups of people using ANOVA. Will the Bonferroni correction be 0.003? How do I report the analysis done? Can I use the term “Bonferroni corrected ANOVA test”?

If you’re doing 16 separate tests, yes, 0.003125.

But are you doing follow up comparisons? Or just 16 ANOVAs?

Personally I would say “x tests were conducted with alpha adjusted to x for multiple comparisons.”

One quick question. I have used a chi-square test to compare use of different habitats (9 habitats). Is the correction made to the alpha value for the original test or to each pair-wise comparison? Or both? I assume that the correction would be 0.05/9?

Thanks

I’ll be honest with you Rachel, I haven’t used a Chi Square since I was sat in Research Methods classes 6-7 years ago, but yes you’d divide your alpha by the number of pairwise comparisons you were doing.

THANK YOU!!!!!!!!!!!!!!!!!

YOU’RE WELCOME!!!!!!

I hope I can ask this question of you – I would greatly appreciate help. I ran a one-way ANOVA in SPSS and I have 3 groups. I had to disregard some of my results because the replications were inadequate and are disqualified for any statistical analysis. I have a first group, a second group, and a third group, and they are ordinal values because it’s ages of 5, 10 and 40. Each group started with n=8, for a total of 24 subjects, but after eliminations there are fewer values, so it’s something like n=5, n=7, n=6. I ran a Bonferroni multiple comparisons test in SPSS and I get significance values of p=0.1. However, given what you’re saying, I have only performed one test, so my alpha value is .05 for me to reach significance. Is that right?

If you’re doing multiple comparisons on 3 groups that’s three tests – group 1 versus group 2, group 1 versus group 3 and group 2 versus group 3.

However, if you tick the “Bonferroni” box in SPSS, it automatically adjusts your values so you can compare them to .05. So just look for p values under .05 as normal.

So it would be correct to do 0.05 / 3 as my p value? This is for biological testing of gene expression. I suppose I would have just performed a t-test if I thought that was the way to go about testing the difference between two groups, but seeing as I am using all the groups for analysis I should reduce the p value. Thanks heaps for your help

To clarify;

If you’re pressing the Bonferroni tick box in SPSS when performing an ANOVA, then look for p values less than 0.05.

If you perform three separate t tests and want to adjust using Bonferroni correction manually, then yes, you’d use 0.05 / 3 ≈ 0.0167 as your alpha.

Hi – just wanted to ask another question on this. I have performed ANOVA & found significance, therefore conducted post hoc paired samples t tests on 20 pairs (10 items being tested at pre to post, and post to follow-up stage of intervention). Have used Bonferroni adjustment due to multiple comparisons – and as you said, it states within the SPSS output that the mean difference is significant at the .05 level (with a * next to each relevant pair). The significance level on a couple of items (*) is greater than .0025 (.05/20) however [e.g. .028 & .006]. Is this still significant? Because I thought the significance level for these would be <= .0025 (I hope this makes sense…)

If you use Bonferroni correction in SPSS you always look for 0.05, because it automatically adjusts the p values for you to make them easier to interpret. Try running the analysis without selecting to correct for Bonferroni, and you’ll see that the same ones reach significance at your calculated alpha as do at 0.05 with Bonferroni turned on.
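To spell out what SPSS is doing behind the scenes (as far as I understand it): instead of lowering the alpha, it multiplies each raw p value by the number of comparisons, capping at 1. Comparing those adjusted p values to .05 is equivalent to comparing the raw ones to .05 divided by the number of tests. A sketch, with made-up p values:

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p values: multiply each raw p by the number
    of comparisons, capped at 1. Adjusted p < .05 is then equivalent
    to raw p < .05 / m."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.004, 0.028, 0.006, 0.2]  # hypothetical raw p values, m = 4
print(bonferroni_adjust(raw))     # [0.016, 0.112, 0.024, 0.8]
```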

thank you so much – whoever you are!!

Can I just ask another question… are paired samples t tests + Bonferroni appropriate post hoc tests to run? Is this stringent enough or should I be running Tukey or Scheffé?

You have a great explanation on Bonferroni! But one thing I want to confirm with you about my study. I have a non-significant p-value in comparisons of 3 sample means by using ANOVA. That means I can state this n.s. p-value without proceeding further to do Bonferroni, right?

Yeah that’s right — if it’s not significant before Bonferroni, there’s no need to use it.

This is quite honestly the most lucid explanation of a statistical procedure I have ever read. The fact that I did not need to read it more than once is testament to its Nobel-worthy qualities. Many thanks.

Great post!

But I’m not totally sure if this example applies and makes sense.

With two configurations:

A) one fire alarm system triggering 5% of the times when there’s no fire and

B) 10 fire alarms sensors in the same room triggering .05% when no fire

I would bug the fireman for no reason the same number of times, right?

And in configuration B) I would also have a greater risk of nothing triggering when there’s actually something burning?

emainorbit

Actually that’s

B) 10 fire alarm sensors in the same room triggering 0.5% when no fire

Bonferroni won’t make up for distraction anyway

Thanks for the kind words! You’ve pretty much got it, emainorbit, but let me elaborate a little. The more fire alarms (statistical tests), the more likely you are to bug the fireman (assume an effect where there isn’t one, or make a Type I error in other words).

So you test your fire alarm to see if it goes off without a fire. You do this 20 times and it goes off once, so 5%. If you have 10 fire alarms in the same room, each one having a 5% chance of going off even without a fire, there’s a much higher chance you’ll bug the fireman, because each one has its own individual 5% chance of going off. So, roughly, you can add up the percentages – 5 * 10 ≈ 50% chance of bugging the fireman.

Where this analogy breaks down is that in psychology a 5% chance of a Type I error is considered acceptable, while in a fire alarm you want a 0% chance of it going off. But let’s just say you’re doing some kind of training drill for the firemen, and you actually want the alarms to have a 5% chance of going off.

5% is a decent amount, you reckon – they won’t be able to predict when it will go off, could be sooner, could be later, so you’ll be able to test how quickly they get ready and get to the scene. But if you have 10 fire alarms in the room, you actually have about a 50% chance. So what you have to do is adjust each fire alarm so that it has a .5% chance of going off. .5 * 10 = 5%, so you’re back to the probability you originally wanted. This is Bonferroni correction.
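One small caveat on my own arithmetic: strictly speaking the chance with 10 uncorrected alarms is 1 − (0.95)¹⁰ ≈ 40%, not a flat 50% – adding the percentages is only an approximation (with 21+ alarms it would absurdly exceed 100%). But the correction works just as described, as a quick simulation shows (a sketch with invented numbers):

```python
import random

random.seed(42)

def fireman_bugged(per_alarm_rate, n_alarms, drills=100_000):
    """Fraction of no-fire drills in which at least one alarm goes off."""
    false_calls = sum(
        any(random.random() < per_alarm_rate for _ in range(n_alarms))
        for _ in range(drills)
    )
    return false_calls / drills

print(fireman_bugged(0.05, 10))       # ~0.40: ten uncorrected alarms
print(fireman_bugged(0.05 / 10, 10))  # ~0.05: back to the rate we wanted
```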

Yes, that’s pretty much what I thought, even if in your words the analogy looks much better! If everyone in town used this kind of sensor, no one would want to be the fireman :).

Anyway, the only point I’d like to make is that a 0% chance of making a Type I error may be a utopian target, whether in psychology tests, medical trials, or electronic sensors, so people always have to deal with these two kinds of error.

My concern is that since a Type II error is usually “more dangerous” than a Type I, maybe the tradeoff Bonferroni correction poses (being a little more vulnerable to Type II errors) should never be forgotten.

Fantastic explanation! You’ve simplified it so much and make it more interesting to learn statistic. Thank you very much!

Hi,

I was wondering if you can help. Your explanations of things are amazing! I am running correlations for 5 personality traits and attitudes towards 2 different types of therapy (PCT and CBT), and the attitude of PCT vs CBT. I am seeing whether personality traits can predict preference for therapy. I have also included age and gender, although the hypothesis is only linked to personality traits and therapy type; age and gender were used to see if other correlations to therapy were present. Thus, my correlation table currently is age, gender, 5 personality traits and 3 attitude scales. Of the 5 personality traits and 3 attitude scales there are 18 correlation tests. If I include the age and gender ones we add another 17 tests, totalling 35 tests. (I am not interested in how the 5 personality traits are correlated with each other.) I have all the correlations in a large table, but I feel that as I am not interested in age and gender as the main findings, this would penalise the p-value? Or would I include them in the Bonferroni regardless, if I am to report results related to gender and age as other findings? My number of participants is n=526.

1/ What is the number of tests as a threshold in order to use the Bonferroni test?

2/ I’m not sure how many tests I have to calculate Bonferroni at – 18 or 35?

3/ With Bonferroni, alpha would therefore be 0.05/18 = 0.003? Or 0.05/35 = 0.001?

4/ Currently some correlations are significant without Bonferroni correction, but after Bonferroni correction it would appear nothing would be below the new alpha value. When reporting results do I report both, i.e. report what it was before and after Bonferroni correction, or just report it with the Bonferroni correction?

Pretty lengthy, but would really appreciate any help.

Thanks

“I have all the correlations in a large table, but I feel that as I am not interested in age and gender as the main findings this would penalise the p-value?”

You should have planned what your follow up test would be based on your hypotheses, not hedging your bets to get a good p value on a certain test. Do whatever tests you have based your hypotheses on, and let the chips fall where they may. That covers your first 3 questions — but read the rest of the comments as this has been covered before — there aren’t hard and fast rules that cover every situation, or if there are I don’t know them. For your fourth question, report only your corrected values.

Good luck!

If I have a total of 45 tests on a Pearson’s correlation, my Bonferroni correction is alpha = 0.05/45 = 0.001. This subsequently makes my findings insignificant. Is there any point in doing a linear regression analysis on the findings that were significant before Bonferroni correction? And if I do carry out a regression analysis on these, would I need to Bonferroni correct these also?

Do I need to use Bonferroni correction for paired t-tests? I have 6 pairs but I compare only pre and post for each group. I don’t do independent t-tests. Thanks!

Ella

Hi, I have a problem with Bonferroni correction. I’m doing multiple unpaired t-tests (I run a lot of tests and compare their means to one negative control mean) and wanted to correct them somehow, but the number of comparisons exceeds 100, so using Bonferroni makes all of them insignificant. Is there any better way to correct them, or should I maybe use another way of testing altogether?

Hi,

This explanation is very helpful; however, can I ask how you would go about conducting a Bonferroni correction for a series of bivariate Pearson correlations among a large number of variables and with a large sample size? This is what I am currently doing and, because of the large sample size and large number of variables, the correlation coefficients have an increased likelihood of appearing significant when they may not be so.

Unlike with ANOVAs and other GLM procedures, there is no regular option in the bivariate correlation dialogue box in SPSS to command a Bonferroni correction to the correlations, either through dividing the alpha value or multiplying the p values. I have also searched high and low to see if a syntax command exists that could be applied, but have been unable to locate one.

Is manual calculation the only way a Bonferroni can be applied? I would really like to avoid this if I can, as I have over 130 comparisons to calculate for… is there a way of doing this in SPSS, even if through syntax?

Holy crap. In one frigging sentence you explained something that most people turn in to a book. I hope you teach.

Absolutely intelligently written article. Thank you!

Thank you for the explanation. Bonferroni makes much more sense to me now. Can I please check I have understood correctly for my situation? I have 11 subjects being tested, all doing an identical exercise routine on 3 separate occasions. The first and second occasions are (as far as possible) conducted in identical ambient conditions, in order to validate the approach (I hope to find no significant difference in independent variables – hence providing some evidence of validity of the method). The third occasion is in a much hotter room. The aim is to see whether any of the independent variables are significantly different in the hotter room, and if so by how much. The three independent variables are mean heart rate, mean energy expenditure and mean power output, so these are measured in each ambient condition. I am performing a repeated measures ANOVA for each of the 3 independent variables. Do I need to use Bonferroni adjustment? If so, what should the correction be?

Oops – in all cases in my post above, where I said “independent” variables, I meant to say “dependent”. The main independent variable is temperature.

Warren and other readers, there is good news – you can apply sequential Bonferroni corrections that are less conservative and improve test power. Please check the paper:

Rice, W. R. 1989. Analyzing tables of statistical tests. Evolution 43: 223-225.

Cheers, Lyuba
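For anyone curious, the sequential (Holm-style) procedure Lyuba mentions is easy to sketch: sort the p values from smallest to largest, test the smallest against alpha/m, the next against alpha/(m−1), and so on, stopping at the first failure. Every test plain Bonferroni rejects, this also rejects, so you never lose power. A sketch in Python (the p values are made up):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's sequential Bonferroni: compare the smallest p to alpha/m,
    the next smallest to alpha/(m-1), and so on; stop at the first
    failure. Tests rejected before the stop are declared significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

# Plain Bonferroni (alpha/4 = .0125) would reject only the first test;
# Holm also rejects the second.
print(holm_bonferroni([0.004, 0.015, 0.030, 0.600]))  # [True, True, False, False]
```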