The link between error bars and statistical significance
By Dr. Harvey Motulsky
When you view data in a publication or presentation, you may be tempted to draw conclusions about the statistical significance of differences between group means by looking at whether the error bars overlap. Let's look at two contrasting examples.
What can you conclude when standard error bars do not overlap?
When standard error (SE) bars do not overlap, you cannot be sure that the difference between two means is statistically significant. Even though the error bars do not overlap in experiment 1, the difference is not statistically significant (P=0.09 by unpaired t test). The same caution applies when you compare proportions with a chi-square test.
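A small numerical sketch can make this concrete. The data below are made up for illustration (they are not the values from experiment 1), and the critical value 2.776 is the tabulated two-tailed 5% cutoff for a t distribution with 4 degrees of freedom. The sketch assumes an unpaired t test with equal group sizes, in which case the standard error of the difference is the square root of the sum of the squared SEs.

```python
import math
import statistics

# Hypothetical measurements, three per group (illustration only).
control = [1.0, 2.0, 3.0]
treated = [2.5, 3.5, 4.5]

def mean_and_se(values):
    """Return the mean and the standard error of the mean."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return m, se

m1, se1 = mean_and_se(control)
m2, se2 = mean_and_se(treated)

# SE bars overlap when the gap between the means is no larger than
# the sum of the two bar lengths.
se_bars_overlap = abs(m2 - m1) <= se1 + se2

# Unpaired t statistic; with equal group sizes the pooled SE of the
# difference equals sqrt(se1^2 + se2^2).
t_stat = (m2 - m1) / math.sqrt(se1 ** 2 + se2 ** 2)

T_CRIT_DF4 = 2.776  # tabulated two-tailed 5% critical t, df = 3 + 3 - 2
significant = abs(t_stat) > T_CRIT_DF4

print("SE bars overlap:", se_bars_overlap)
print("t =", round(t_stat, 2), "significant:", significant)
```

With these numbers the SE bars stay clear of each other, yet t is only about 1.84, well short of the cutoff.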
What can you conclude when standard error bars do overlap?
No surprises here. When SE bars overlap (as in experiment 2), you can be sure the difference between the two means is not statistically significant (P > 0.05).
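One way to see why overlap rules out significance is the short derivation below. It is a sketch that rests on an assumption the article does not spell out: that the standard error of the difference equals the square root of the sum of the squared SEs, which holds for an unpaired t test with equal group sizes (and for Welch's unequal-variance test). The symbols for the two group means and their standard errors are introduced here for illustration.

```latex
\begin{aligned}
\text{SE bars overlap}
  &\iff |\bar{x}_1 - \bar{x}_2| \le SE_1 + SE_2 \\
SE_1 + SE_2
  &\le \sqrt{2}\,\sqrt{SE_1^2 + SE_2^2} \\
\Longrightarrow\quad
|t| = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{SE_1^2 + SE_2^2}}
  &\le \sqrt{2} \approx 1.41
\end{aligned}
```

The two-tailed 5% critical value of t is never below 1.96, whatever the sample size, so a t ratio of at most 1.41 cannot reach significance.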
What if you are comparing more than two groups?
Post tests following one-way ANOVA account for multiple comparisons, so they yield higher P values than t tests comparing just two groups. The same rules therefore apply. If two SE error bars overlap, you can be sure that a post test comparing those two groups will find no statistical significance. However, if two SE error bars do not overlap, you can't tell whether a post test will, or will not, find a statistically significant difference.
What if the error bars do not represent the SE?
Error bars that represent the 95% confidence interval (CI) of a mean are wider than SE error bars -- about twice as wide with large sample sizes and even wider with small sample sizes. If 95% CI error bars do not overlap, you can be sure the difference is statistically significant (P < 0.05). However, the converse is not true -- you may or may not have statistical significance when the 95% confidence intervals overlap.
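The following sketch illustrates both points with made-up data: how much wider a 95% CI bar is than an SE bar, and a case where the 95% CIs overlap even though the difference is significant. The critical values 4.303 (2 degrees of freedom) and 2.776 (4 degrees of freedom) are taken from a standard t table. As an assumption for this example, the test is an unpaired t test with equal group sizes.

```python
import math
import statistics
from statistics import NormalDist

T_CRIT_DF2 = 4.303  # two-tailed 5% critical t for one group of n = 3
T_CRIT_DF4 = 2.776  # two-tailed 5% critical t for the unpaired test, df = 4

# Hypothetical measurements (illustration only).
control = [1.0, 2.0, 3.0]
treated = [3.5, 4.5, 5.5]

def mean_and_se(values):
    """Return the mean and the standard error of the mean."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return m, se

m1, se1 = mean_and_se(control)
m2, se2 = mean_and_se(treated)

# Half-width of each 95% CI bar: critical t times the SE.
half1 = T_CRIT_DF2 * se1
half2 = T_CRIT_DF2 * se2
ci_bars_overlap = abs(m2 - m1) <= half1 + half2

# Unpaired t statistic (equal n, so SE of difference = sqrt(se1^2 + se2^2)).
t_stat = (m2 - m1) / math.sqrt(se1 ** 2 + se2 ** 2)
significant = abs(t_stat) > T_CRIT_DF4

# For very large samples the multiplier approaches the normal quantile.
large_n_multiplier = NormalDist().inv_cdf(0.975)

print("CI bars overlap:", ci_bars_overlap, "| significant:", significant)
print("CI/SE ratio, n = 3:", T_CRIT_DF2,
      "| large n:", round(large_n_multiplier, 2))
```

With three values per group the CI bar is more than four times the SE bar, compared with roughly twice for large samples, which is why overlapping CI bars leave the question of significance open.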
Some graphs and tables show the mean with the standard deviation (SD) rather than the SE. The SD quantifies variability, but does not account for sample size. To assess statistical significance, you must take into account sample size as well as variability. Therefore, observing whether SD error bars overlap or not tells you nothing about whether the difference is, or is not, statistically significant.
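To illustrate, the sketch below works from hypothetical summary statistics (means, a common SD, and a per-group sample size) and shows SD bars pointing the wrong way in both directions. It assumes an unpaired t test with equal group sizes and equal SDs. The critical values are tabulated two-tailed 5% cutoffs, with the value for 98 degrees of freedom rounded up slightly.

```python
import math

def sd_bars_overlap(mean1, mean2, sd):
    """True when mean +/- SD bars of two groups with equal SD overlap."""
    return abs(mean2 - mean1) <= 2 * sd

def unpaired_t(mean1, mean2, sd, n):
    """t statistic for two groups of size n that share the same SD."""
    se_diff = sd * math.sqrt(2.0 / n)
    return (mean2 - mean1) / se_diff

T_CRIT_DF98 = 1.985  # approximate cutoff for n = 50 per group
T_CRIT_DF2 = 4.303   # cutoff for n = 2 per group

# Case A: SD bars overlap, yet the large sample makes the difference significant.
overlap_a = sd_bars_overlap(10.0, 12.0, 3.0)
t_a = unpaired_t(10.0, 12.0, 3.0, n=50)
significant_a = abs(t_a) > T_CRIT_DF98

# Case B: SD bars do not overlap, yet two values per group are too few.
overlap_b = sd_bars_overlap(10.0, 17.0, 3.0)
t_b = unpaired_t(10.0, 17.0, 3.0, n=2)
significant_b = abs(t_b) > T_CRIT_DF2

print("Case A: overlap", overlap_a, "| significant", significant_a)
print("Case B: overlap", overlap_b, "| significant", significant_b)
```

The SD is the same in both cases; only the sample size changes the verdict.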
What if the groups were matched and analyzed with a paired t test?
All the comments above assume you are performing an unpaired t test. When you analyze matched data with a paired t test, it doesn't matter how much scatter each group has -- what matters is the consistency of the changes or differences. Whether or not the error bars for each group overlap tells you nothing about the P value of a paired t test.
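A made-up before/after data set shows how this plays out. The subjects differ widely from one another, so each group has large scatter and the SE bars overlap heavily, but every subject changes by a similar amount. The critical value 2.776 is the tabulated two-tailed 5% cutoff for 4 degrees of freedom (five pairs); all names and numbers are hypothetical.

```python
import math
import statistics

# Hypothetical matched measurements on five subjects (illustration only).
before = [10.0, 20.0, 30.0, 40.0, 50.0]
after = [11.0, 22.0, 32.0, 43.0, 52.0]
n = len(before)

def standard_error(values):
    return statistics.stdev(values) / math.sqrt(len(values))

# Per-group SE bars overlap because subjects differ so much from each other.
gap = abs(statistics.mean(after) - statistics.mean(before))
se_bars_overlap = gap <= standard_error(before) + standard_error(after)

# The paired t test looks only at the within-subject differences.
diffs = [a - b for a, b in zip(after, before)]
paired_t = statistics.mean(diffs) / standard_error(diffs)

T_CRIT_DF4 = 2.776  # tabulated two-tailed 5% critical t, df = n - 1
paired_significant = abs(paired_t) > T_CRIT_DF4

print("SE bars overlap:", se_bars_overlap)
print("paired t =", round(paired_t, 2), "significant:", paired_significant)
```

Here the group bars overlap almost completely, while the paired t ratio is above 6 because the differences are so consistent.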
What if the error bars represent the confidence interval of the difference between means?
This figure depicts two experiments, A and B. In each experiment, control and treatment measurements were obtained. The graph shows the difference between control and treatment for each experiment. A positive number denotes an increase; a negative number denotes a decrease. The error bars show 95% confidence intervals for those differences. (Note that we are not comparing experiment A with experiment B, but rather are asking whether each experiment shows convincing evidence that the treatment has an effect.)
In experiment A, the 95% confidence interval for the difference between the two means does not include zero. Therefore you can conclude that the P value for the comparison must be less than 0.05 and that the difference must be statistically significant (using the traditional 0.05 cutoff). The 95% confidence interval in experiment B includes zero, so the P value must be greater than 0.05, and you can conclude that the difference is not statistically significant.
This rule works for both paired and unpaired t tests. Note that the confidence interval for the difference between the two means is computed very differently for the two tests.
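As a rough sketch of how the two computations differ, the example below analyzes the same made-up numbers both ways and checks whether each 95% confidence interval for the difference includes zero. It assumes equal group sizes and a pooled variance for the unpaired case. The critical values 2.306 (8 degrees of freedom) and 2.776 (4 degrees of freedom) come from a standard t table, and the data and variable names are hypothetical.

```python
import math
import statistics

T_CRIT_DF8 = 2.306  # unpaired: df = 5 + 5 - 2
T_CRIT_DF4 = 2.776  # paired: df = 5 - 1

# Hypothetical data; treat the columns as matched pairs for the paired case.
control = [5.0, 7.0, 6.0, 9.0, 8.0]
treated = [6.0, 9.0, 7.0, 11.0, 10.0]
n = len(control)

mean_diff = statistics.mean(treated) - statistics.mean(control)

# Unpaired: the SE of the difference comes from the pooled within-group variance.
pooled_var = (statistics.variance(control) + statistics.variance(treated)) / 2
se_unpaired = math.sqrt(pooled_var * 2 / n)
ci_unpaired = (mean_diff - T_CRIT_DF8 * se_unpaired,
               mean_diff + T_CRIT_DF8 * se_unpaired)

# Paired: the SE of the difference comes from the scatter of the pair differences.
diffs = [t - c for c, t in zip(control, treated)]
se_paired = statistics.stdev(diffs) / math.sqrt(n)
ci_paired = (mean_diff - T_CRIT_DF4 * se_paired,
             mean_diff + T_CRIT_DF4 * se_paired)

def excludes_zero(ci):
    low, high = ci
    return low > 0 or high < 0

print("unpaired 95% CI:", ci_unpaired, "excludes zero:", excludes_zero(ci_unpaired))
print("paired 95% CI:", ci_paired, "excludes zero:", excludes_zero(ci_paired))
```

For these numbers the unpaired interval spans zero while the paired interval does not. Whichever test suits the design, the rule is read the same way: an interval that excludes zero corresponds to P < 0.05.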
The link between error bars and statistical significance is weaker than many wish to believe. It is worth remembering, though, that if two SE error bars overlap, you can conclude that the difference is not statistically significant; the converse is not true.