Input should be the pursued alpha level, a decimal number between zero and one, in the top box. The number of comparisons, a positive integer, is given in the second box.

Optionally, one can set the mean r (correlation): zero gives full Bonferroni correction, a value between 0 and 1 gives partial Bonferroni correction.

A further option is to give the degrees of freedom, to obtain the critical value for t instead of the critical value for z. The degrees of freedom should be the number of cases in the study minus one. If degrees of freedom are given, a t-value is also given for the comparison of k>2 independent means according to Scheffé's method; the degrees of freedom in that case should likewise be the number of cases minus one.
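The two basic per-test corrections behind the calculator can be sketched as follows; this is an illustration of the standard formulas, not SISA's actual code:

```python
# Per-test alpha so the family-wise error rate stays at `alpha` over n tests.
def bonferroni_alpha(alpha, n):
    """Bonferroni: simply divide alpha by the number of tests."""
    return alpha / n

def sidak_alpha(alpha, n):
    """Sidak: exact for n independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n)

print(round(bonferroni_alpha(0.05, 7), 8))  # 0.00714286, row 1 of the table below
print(round(sidak_alpha(0.05, 7), 8))       # 0.00730083, row 1 Holm-S
```

The two values reproduce the first row of the Holm-B and Holm-S columns in the table.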

The Holm method is a top-down stepwise procedure whereby the Bonferroni correction is recalculated at each step for the hypotheses still left to test. Observed p-values are compared with critical p-values step by step; as soon as a non-significant p-value is found, all remaining p-values are declared non-significant. This rather cryptic description is illustrated in the table.

order | Holm-B | Holm-S | B-H | your-p | Holm (top-down) | B-H (bottom-up) |
------|--------|--------|-----|--------|-----------------|-----------------|
1 | 0.00714286 | 0.00730083 | 0.00714286 | 0.00001 | smallest | smallest |
2 | 0.00833333 | 0.00851244 | 0.01428571 | 0.00099 | ↓ | ↑ |
3 | 0.01 | 0.01020622 | 0.02142857 | 0.003 | ↓ | ↑ |
4 | 0.0125 | 0.01274146 | 0.02857143 | 0.0145 | ↓ | ↑ |
5 | 0.01666667 | 0.01695243 | 0.03571429 | 0.016 | ↓ | ↑ |
6 | 0.025 | 0.02532057 | 0.04285714 | 0.06 | ↓ | ↑ |
7 | 0.05 | 0.05 | 0.05 | 0.13 | largest | largest |

For the Holm method, order your p-values from smallest to largest in the "your-p" column. Start in the first row: here your-p is smaller than the critical p, so the p-value is declared statistically significant. Continue doing this until you reach the 4th row, where your-p is larger than the critical p. Declare this p-value and all larger p-values non-significant, including the p-value in the fifth row, even though it is lower than its critical p-value. Holm-B is based on the Bonferroni procedure, Holm-S on the Sidak procedure.
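The step-down logic just described can be sketched in a few lines; here with the Holm-Bonferroni critical values and the table's p-values:

```python
# Step-down Holm(-Bonferroni) procedure: a sketch, not SISA's actual code.
def holm(pvalues, alpha=0.05):
    """Return a significance flag per p-value, in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # smallest p first
    flags = [False] * n
    for step, i in enumerate(order):
        if pvalues[i] <= alpha / (n - step):  # critical p grows each step
            flags[i] = True
        else:
            break  # first non-significant p: all remaining stay non-significant
    return flags

your_p = [0.00001, 0.00099, 0.003, 0.0145, 0.016, 0.06, 0.13]
print(holm(your_p))  # rows 1-3 True, rows 4-7 False, as in the walkthrough
```

Note that row 5 (p=0.016) stays non-significant although it is below its own critical value, because the procedure stopped at row 4.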

Bonferroni adjustment procedures (generally called "family-wise error rate" (FWER) procedures) are correct if you want to control the occurrence of even a single false positive, i.e. one or more results incorrectly declared significant, in a family of N tests. False discovery rate (FDR) procedures are based on the notion of having a prior expected proportion of false positives among the k tests declared significant. If the p-values are from independent tests, the cumulative binomial distribution gives you the FWER for a given FDR. Say you expect 5% of 10 positive outcomes to be false positives; then there is a 40% probability of one or more false positives, an 8.6% probability of two or more false positives, etc. ^{$}
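The 40% and 8.6% figures follow from the binomial distribution mentioned in the footnote; a quick check:

```python
from math import comb

# Probability of at least k false positives among n positive results,
# if each is independently false with probability p (here p = FDR).
def prob_at_least(k, n, p):
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

print(round(prob_at_least(1, 10, 0.05), 3))  # 0.401: one or more false positives
print(round(prob_at_least(2, 10, 0.05), 3))  # 0.086: two or more
```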

One such FDR procedure is the Benjamini-Hochberg procedure. The p in the alpha box now stands for proportion, no longer for probability: it is the proportion of false positive results among the results declared statistically significant. This proportion is sometimes set higher than the usual 0.05, depending on the cost-benefit of a particular outcome (McDonald, 2009); however, this would increase the number of false positives. In the Benjamini-Hochberg procedure you work bottom-up. The p-value in the 7th row is larger than the critical p-value and is declared non-significant. Same for the 6th row. However, in the 5th row your-p is smaller than the critical p-value. You stop testing and declare this p-value and all smaller p-values statistically significant. This test requires the p-values to be independent. Thus, the test is valid if you compare the p-value of the comparison A<->B with the p-value of the comparison C<->D. However, the test is mostly not valid when comparing the p-value of A<->B with the p-value of A<->C: if A is larger than B it is more likely to be larger than C as well.
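The bottom-up (step-up) logic can be sketched as follows, again with the table's p-values; a minimal illustration, not SISA's actual code:

```python
# Step-up Benjamini-Hochberg procedure; q is the accepted false discovery
# proportion among the results declared significant.
def benjamini_hochberg(pvalues, q=0.05):
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # smallest p first
    flags = [False] * n
    # Walk from the largest p downward; the first p at or below its critical
    # value q*rank/n makes it, and all smaller p-values, significant.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        if pvalues[i] <= q * rank / n:
            for j in order[:rank]:
                flags[j] = True
            break
    return flags

your_p = [0.00001, 0.00099, 0.003, 0.0145, 0.016, 0.06, 0.13]
print(benjamini_hochberg(your_p))  # rows 1-5 True, rows 6-7 False
```

Unlike Holm, row 5 is now significant: the procedure stops at the first success rather than the first failure.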

Benjamini and Yekutieli proposed the divisor "c" for the case that the assumptions of the B-H procedure are not met. In the case of independence or positive correlation of the tests c=1, so nothing changes. In the case of arbitrary or unknown dependence you divide the B-H critical values by the Benjamini-Yekutieli divisor given in the output; the step-up method otherwise works the same. For the table above the Benjamini-Yekutieli divisor is 2.593 and the critical values become 0.00714/2.593=0.0028; 0.01429/2.593=0.0055; and then 0.0083; 0.011; 0.0138; 0.0165 and 0.0193 respectively. With the Benjamini-Yekutieli correction only the p-values in the first three rows are declared statistically significant.
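The divisor is the harmonic sum c(N) = 1 + 1/2 + ... + 1/N; the values quoted above can be checked directly:

```python
# Benjamini-Yekutieli divisor for N tests under arbitrary dependence.
def by_divisor(n):
    return sum(1.0 / i for i in range(1, n + 1))

c = by_divisor(7)
print(round(c, 3))  # 2.593, as in the text
# B-H critical values q*rank/N divided by c, for rank 1..7:
print([round(0.05 * r / 7 / c, 4) for r in range(1, 8)])
# [0.0028, 0.0055, 0.0083, 0.011, 0.0138, 0.0165, 0.0193]
```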

With no or moderate dependence and no or moderate correlation the best choice would be the Holm-Sidak. With high correlation use the correlation-adjusted Sidak, if it is more powerful than the Holm. Note that the correlation-adjusted Bonferroni and Sidak are less well researched and not generally accepted. Further, consider that in many cases the best choice is not to use any of these procedures and to leave the p-values uncorrected; read the discussion below. The Benjamini-Hochberg is a conceptually unusual procedure, only to be applied to answer the specific question of the false discovery rate.

The Bonferroni correction/adjustment procedure is the most basic of SISA's procedures; however, Bonferroni correction concerns an issue about which there is much, and ongoing, discussion. It concerns the question whether, when more than one test is done in a particular study, the alpha level should be adjusted downward to account for chance capitalization.

The alpha level is the chance taken by researchers of making a type one error. The type one error is the error of incorrectly declaring a difference, effect or relationship to be true because chance produced the observed state of events. Customarily the alpha level is set at 0.05, or: in no more than one in twenty statistical tests will the test show 'something' while in fact there is nothing. When more than one statistical test is done, the chance of finding at least one test statistically significant due to chance fluctuation in the total experiment, and of incorrectly declaring a difference or relationship to be true, increases. In five tests the chance of finding at least one difference or relationship significant due to chance fluctuation equals 0.22, or about one in five. In ten tests this chance increases to 0.40, about two in five. Using the Bonferroni method the alpha level of each individual test is adjusted downward to ensure that the overall, experiment-wise, risk for a number of tests remains 0.05. Even if more than one test is done, the risk of finding a difference or effect incorrectly significant continues to be 0.05.
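The 0.22 and 0.40 figures follow from the complement rule for independent tests:

```python
# Chance of at least one falsely significant result in n independent tests.
def familywise_risk(n, alpha=0.05):
    return 1.0 - (1.0 - alpha) ** n

print(round(familywise_risk(5), 3))   # 0.226: five tests
print(round(familywise_risk(10), 3))  # 0.401: ten tests
```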

Although the logic is beautiful, there is a serious drawback. If the chance of incorrectly producing a difference, making a type one error, on an individual test is reduced, the chance of making a type two error is increased: the chance that no effect or difference is declared while in fact there is one. Thus, by reducing for individual tests the chance of type one errors, i.e. the chance of introducing ineffective medical treatments or ineffective improvements, the chance of type two errors is increased, i.e. the chance that effective treatments or improved production methods are not discovered. So, when is Bonferroni correction used correctly and when is it used incorrectly?

Scenario one. If a crosstabulation of two variables produces more than two means in the dependent variable, and multiple tests are done to compare those means, Bonferroni adjustment should be applied. Mostly this concerns "one-way" analysis of variance. This is the case, for example, if we want to compare three religious groups on their attitudes towards alcohol use, or four groups of medical specialists on their usage of pain relief strategies after surgery. There is an extensive literature on this case and there is a multitude of different tests and methods to lower the experiment-wise error rate. However advanced and well thought out many of these methods are, Bonferroni correction will often be the best choice. One of the more advanced methods, Scheffé's method, is produced at the bottom of the table if degrees of freedom are given for the study's number of cases. Scheffé's method is not very powerful; there are more powerful methods available in many statistical packages. Scheffé's method has the advantage that if the overall F-test in a one-way analysis of variance is not significant, then none of the individual comparisons will be significant. Most of the methods to adjust for multiple comparisons of k means are based on the assumption that you want to compare any mean with any other mean; that is, these methods mostly presume that you want to do c=k*(k-1)/2 comparisons among k means. Thus, if you want to compare Christians with other religions, but not the other religions against each other, the simple Bonferroni method is better.
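The all-pairs assumption c=k*(k-1)/2 and the resulting per-comparison Bonferroni alpha can be illustrated for the two examples in the paragraph:

```python
# Number of all-pairs comparisons among k means, and the Bonferroni
# per-comparison alpha that keeps the experiment-wise rate at `alpha`.
def pairwise_alpha(k, alpha=0.05):
    c = k * (k - 1) // 2
    return c, alpha / c

print(pairwise_alpha(3))  # three religious groups: 3 comparisons
print(pairwise_alpha(4))  # four specialist groups: 6 comparisons
```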

Scenario two. If a single hypothesis of no effect is tested using more than one test, and the hypothesis is rejected if one of the tests shows statistical significance, Bonferroni correction should be applied. For example, if in a factory there are five points where quality control is applied on samples of a product, and the product is rejected for the market if a sample is below the benchmark on only one of these five tests, then the chance of rejection at each of the control points should be adjusted downward to keep the overall chance of incorrect rejection at a predefined level. In a randomized controlled trial (RCT) a group of patients on a new anti-diabetic drug is compared with a group of patients on placebo. To study the new drug's effectiveness, blood sugar level is determined in three different locations on the patients' bodies. If a statistically significant difference between the treatment and the control group is found on only one of these tests, the drug is considered effective. Each of the tests should be made less sensitive to ensure that the risk of a false positive, the risk of incorrectly declaring the drug effective and giving future patients pointless medication, does not become unacceptably high due to repeated testing. Basically, scenario two is not considered problematic and you should apply Bonferroni correction in such cases.

Scenario two with correlated multiple outcomes. If you test for the significance of a hypothesis using variables that are mutually correlated, the Bonferroni correction is too conservative. For example, suppose that in an RCT a number of outcome variables are fully correlated. In that case knowledge of the outcome of a single test of a difference between the control and experimental group on a single variable would be sufficient to know the outcome of the tests on the other outcome variables, and the usual Bonferroni correction would be far too conservative. In the case of correlated outcome variables a corrected alpha is required which is in between no correction at all and full (Bonferroni) correction. SISA allows you to add the mean correlation between variables as a parameter. For this you take the usual triangular matrix (without the diagonal) of the correlations between the outcome variables, sum the correlations and divide the result by the number of correlations used. A mean correlation of zero ('0') gives you full Bonferroni adjustment, a mean correlation of one gives no adjustment at all; for other values of the correlation you will get a corrected alpha in between the two extremes. Note that in the classic quality control example discussed above, using repeated independent samples, correlation should not be considered. In the example about multiple outcomes in an RCT, with multiple measurements on a single subject, correlation must be considered.
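SISA's exact formula is not given here; a commonly cited approximation along these lines (described in the Sankoh, Huque & Dubey reference below) shrinks the effective number of tests to N^(1-r̄), where r̄ is the mean correlation. A sketch under that assumption:

```python
# Correlation-adjusted Sidak alpha using the effective-number-of-tests
# approximation N_eff = N**(1 - mean_r). This is an assumed formula for
# illustration; SISA's actual implementation may differ.
def adjusted_alpha(alpha, n, mean_r):
    """mean_r: mean of the off-diagonal correlations between outcomes."""
    n_eff = n ** (1.0 - mean_r)  # shrinks toward 1 as correlation rises
    return 1.0 - (1.0 - alpha) ** (1.0 / n_eff)

print(round(adjusted_alpha(0.05, 5, 0.0), 4))  # 0.0102: full (Sidak) correction
print(round(adjusted_alpha(0.05, 5, 1.0), 4))  # 0.05: no correction at all
print(round(adjusted_alpha(0.05, 5, 0.5), 4))  # 0.0227: in between
```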

One of the problems with scenario two is that one could argue that, in the case of Bonferroni correction, all null hypotheses that are the subject of the adjustment should be rejected together if only one of the tests is significant. This is known as "the global null hypothesis". For example, in the case of the blood sugar tests mentioned above, the drug will be declared effective if only one test shows statistical significance, disregarding the fact that two tests might not be significant. Few scientists who apply Bonferroni adjustment are prepared to do this; they generally like to keep the option open of considering tests on their individual merit, which brings us to scenario three.

Scenario three is much more disputed. This is the case when in a single study more than one hypothesis is evaluated, each hypothesis with a single test. If the alpha level of each test is set at 0.05, about one in twenty of the hypotheses tested will be found significant due to chance fluctuation alone. For example, in a lifestyle study blood pressure, television viewing behavior, leisure-time physical activity, and cigarette smoking are studied. Explaining variables are age, gender, occupation and ethnic background. Now, if one is interested in the general question whether the background variables are related to the lifestyle variables, and to that end a number of comparisons are made, this is scenario two and Bonferroni correction should be used. However, if one is interested in the specific relationship between, say, gender and television viewing, and the specific hypothesis is tested that the respondents' gender is not predictive of television viewing behavior, then Bonferroni correction should __not__ be used. Most statisticians are of the opinion that the study of a single topic or hypothesis should, in the case of using pre-defined statements and existing theory, not be affected by what goes on in other places in the world, or in other parts of the study concerned, for that matter. Each little study done in the context of a larger study should be considered on its own merits. However, this point of view is not universally supported, and particularly in medicine there is an opinion that each test in a study should be considered in the light of the number of tests done in the study as a whole.

Scenario four concerns the situation where non-predefined hypotheses are pursued using many tests, one test for each hypothesis. Basically this concerns data 'dredging' or 'fishing': many among us will recognize *correlation variables=all* or *t-test groups=sex(2) variables=all*. Above all, this should not be done. Bonferroni correction is difficult in this situation, as the alpha level would have to be lowered very considerably (potentially by a factor of r*(r-1)/2, where r is the number of variables), and most standard statistical packages are not able to provide small enough p-values to do it. SISA's advice, if you want to go ahead with it anyway, is to test at the 0.05 level for each test. After a relationship has been found, and this relationship is theoretically meaningful, the relationship should be confirmed in a separate study. This can be done after new data is collected or, in the same study, by using the 'split sample' method: the sample is split in two, one half is used to do the 'dredging', the other half is used to confirm the relationships found. The disadvantage of the split sample method is that you lose power (use the power procedure to estimate how much). A Bayesian method can be used if you want to formally incorporate the result of the original study or dredging in the confirmation process. But don't put too high a value on your original finding.

Bender R, Lange S. Adjusting for multiple testing: when and how? *Journal of Clinical Epidemiology* 2001;54(4):343-349.

McDonald JH. Multiple comparisons. In: *Handbook of Biological Statistics*. 2nd ed. Baltimore, MD: Sparky House Publishing; 2009.

Perneger TV. What's wrong with Bonferroni adjustments. *British Medical Journal* 1998;316:1236-1238.

Rothman KJ. No adjustments are needed for multiple comparisons. *Epidemiology* 1990;1(1):43-46.

Sankoh AJ, Huque MF, Dubey SD. Some comments on frequently used multiple endpoint adjustment methods in clinical trials. *Statistics in Medicine* 1997;16:2529-2542.

$: When a proportion FDR of N positively tested independent results is expected to be false, the probability of at least one false positive result is FWER=1-(1-FDR)^{N}; conversely, FDR=1-(1-FWER)^{1/N}.