SISA Research Paper
Estimating the variance of a proportion when the data are clustered. Comparing the variance correction factor and the jackknife resampling technique.
The effect of clustering of observations is increasingly considered in the analysis of survey data. Often this clustering is caused by the method of collecting data, such as a two-stage sampling design. In the first stage, groups (primary sampling units such as schools or hospitals) are sampled; in the second stage, individuals such as pupils or patients are sampled within these groups for study. If both the groups and the individuals within the groups are sampled randomly, important sample estimates such as means and proportions will be unbiased. However, the estimates of the variance and the standard errors will often be too low, and the confidence intervals too narrow. The extent to which the variance is affected depends on the level of clustering of individuals in the groups from which they were sampled. For example, if pupil characteristics are completely determined by the schools, and schools therefore differ with regard to pupil characteristics, confidence intervals at the pupil level would have to be relatively wide. If, however, the schools do not differ with regard to pupil characteristics, the confidence intervals would be narrower and comparable with confidence intervals obtained with simple random sampling. The level of clustering is often expressed in the intracorrelation coefficient, which takes on the value zero in the case of no clustering and the value one if the clustering is total.
Two methods are often used to correct the sample variance and the confidence interval for clustering effects. The first is based on some type of analysis of variance: the between and within group variance components are estimated, followed by the calculation of the intracorrelation coefficient and the determination of a variance correction factor. The second uses resampling techniques, such as the bootstrap or the jackknife, to construct a number of subsamples from the full sample and to compute the statistic of interest for each subsample. The mean square error of the subsample estimates around the full sample estimate provides the estimate of the variance of the statistic.
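As an illustration of the first approach, the sketch below computes an intracorrelation coefficient from between and within group mean squares for binary data. It uses one common analysis-of-variance estimator; this is an assumption, and it is not necessarily the exact variant used for the calculations in this paper. The function name anova_icc and the example counts are hypothetical.

```python
def anova_icc(successes, sizes):
    """ANOVA-type intracorrelation estimate for clustered binary data.

    successes[i] is the number of "successes" in cluster i and sizes[i]
    is the number of observations in cluster i. Assumed estimator:
    rho = (MSB - MSW) / (MSB + (m0 - 1) * MSW).
    """
    k = len(sizes)
    n_total = sum(sizes)
    p = sum(successes) / n_total                      # marginal proportion
    props = [x / n for x, n in zip(successes, sizes)]  # cluster proportions
    # Between-cluster and within-cluster mean squares.
    msb = sum(n * (pi - p) ** 2 for pi, n in zip(props, sizes)) / (k - 1)
    msw = sum(n * pi * (1 - pi) for pi, n in zip(props, sizes)) / (n_total - k)
    # Adjusted average cluster size, which allows for unequal cluster sizes.
    m0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)
    denom = msb + (m0 - 1) * msw
    if denom == 0:
        raise ValueError("no variation in the data; rho is undefined")
    return (msb - msw) / denom


# Hypothetical data: complete clustering, every cluster is all 1s or all 0s.
rho_total = anova_icc([10, 0, 10, 0], [10, 10, 10, 10])
# Hypothetical data: identical cluster proportions give a negative estimate.
rho_none = anova_icc([5, 5, 5], [10, 10, 10])
```

With this estimator the value can fall below zero when the clusters are more alike than chance would suggest, so in practice a negative estimate is often truncated to zero before it is used.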
In this paper the variance correction factor is compared with the resampling technique for the estimation of a variance and confidence interval around a single binomial proportion using simulated data. In practical research it could concern the proportion of cigarette smokers in a population or the proportion of deaths after an operation. The intent is to assess the feasibility of both techniques for the practical researcher.
Consider k groups with on average m observations per group and a total sample size of N = k*m. The variance correction factor Cm is included in the usual calculation for the variance of the overall sample binomial estimator ^p: var(^p) = Cm p(1-p) / N. Cm takes on the value 1 if the intracorrelation coefficient rho equals 0, and the value m if rho equals 1. This satisfies the condition that var(^p) lies between the lower bound for the variance if there is no intracorrelation, var(^p)1 = p(1-p) / N, and the upper bound in the case of complete intracorrelation, var(^p)2 = p(1-p) / k. Calculations are done according to methods proposed by Donald and Donner (1987), Donner and Klar (1994), and Fleiss (1982). A spreadsheet which will enable the reader to replicate the calculations is available here.
Resampling is done in this study according to the jackknife-n technique and the table program as implemented in the computer program Wesvar IV (2001). We estimated the confidence interval as 1.96 times the standard error, rather than with the t-value used by the Wesvar computer program. Note that the jackknife technique also needs to satisfy var(^p)1 < var(^p) < var(^p)2.
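The bounds described above can be checked numerically. The sketch below assumes the common form Cm = 1 + (m - 1)*rho for the correction factor; this form is an assumption that matches the two endpoints stated in the text (Cm = 1 at rho = 0 and Cm = m at rho = 1), and the paper's spreadsheet may differ in detail. The function names and the input values are hypothetical.

```python
import math


def correction_factor(m, rho):
    """Variance correction factor, assuming Cm = 1 + (m - 1) * rho."""
    return 1.0 + (m - 1.0) * rho


def corrected_variance(p, k, m, rho):
    """var(^p) = Cm * p * (1 - p) / N with N = k * m."""
    n = k * m
    return correction_factor(m, rho) * p * (1.0 - p) / n


# Hypothetical design: 6 clusters of 50 observations, proportion 0.4.
p, k, m = 0.4, 6, 50
var_low = corrected_variance(p, k, m, 0.0)   # equals p(1-p)/N
var_high = corrected_variance(p, k, m, 1.0)  # equals p(1-p)/k
var_mid = corrected_variance(p, k, m, 0.2)   # somewhere in between

# 95% confidence interval using 1.96 times the standard error.
se_mid = math.sqrt(var_mid)
ci_mid = (p - 1.96 * se_mid, p + 1.96 * se_mid)
```

For any rho between 0 and 1 the corrected variance stays between var_low and var_high, which is the condition the text requires.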
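To make the resampling idea concrete, the following is a minimal sketch of a delete-one-cluster jackknife for a proportion. It assumes a single stratum and the scaling factor (k - 1)/k, and it is not claimed to reproduce the Wesvar implementation. The function name jackknife_se and the counts are hypothetical.

```python
import math


def jackknife_se(successes, sizes):
    """Delete-one-cluster jackknife standard error of a proportion.

    Each replicate drops one cluster and recomputes the proportion; the
    variance is the scaled sum of squared deviations of the replicate
    estimates around the full sample estimate.
    """
    k = len(sizes)
    total_x, total_n = sum(successes), sum(sizes)
    p_full = total_x / total_n
    replicates = [
        (total_x - x) / (total_n - n) for x, n in zip(successes, sizes)
    ]
    variance = (k - 1) / k * sum((p_i - p_full) ** 2 for p_i in replicates)
    return p_full, math.sqrt(variance)


# Hypothetical data: successes and sample sizes for six clusters.
successes = [7, 12, 3, 20, 9, 2]
sizes = [8, 40, 30, 60, 50, 14]
p_hat, se = jackknife_se(successes, sizes)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)  # normal-based interval
```

With equal cluster sizes this estimate reduces to the sample variance of the cluster proportions divided by k, which is a convenient sanity check.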
Table 1: Variance estimate for hypothetical data. Simulation one, high intracorrelation.
Table 1 shows a set of hypothetical data with a relatively high level of intracorrelation. This data set could be data on the number of deaths after an operation, with each row showing data from a single hospital. As can be seen, the proportions of "success" differ strongly between "hospitals": the hospital in row 1 has a proportion of 0.875, or 88%, success, the hospital in row 6 only 14.3%. The intracorrelation coefficient (rho) equals 0.42. The low standard error, the simple random sample estimate with N = 311 in the denominator of the variance calculation for the marginal proportion of 0.379, equals 0.028; the high standard error, using the number of clusters for the denominator, equals 0.198. The correct standard error must be somewhere in between these two values. The variance correction method and the jackknife method produce reasonably similar estimates at around 0.12. In a simple random sampling situation an N of 12.4 for the variance correction estimate, and of 17.6 for the jackknife estimate, would produce the respective standard errors for a proportion of 37.9%.
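The simple random sample equivalent N can be obtained by inverting the binomial standard error formula, se = sqrt(p(1-p)/N). The sketch below assumes this inversion is how the equivalent N is defined; the function names are hypothetical, and the inputs are the rounded values quoted in the text.

```python
import math


def equivalent_n(p, se):
    """Simple random sample size that would give standard error se."""
    return p * (1.0 - p) / se ** 2


def se_for_n(p, n):
    """Binomial standard error of a proportion p with sample size n."""
    return math.sqrt(p * (1.0 - p) / n)


p = 0.379
# The low standard error uses the full sample size in the denominator.
se_low = se_for_n(p, 311)
# The high standard error quoted in the text corresponds to an
# equivalent N close to the number of clusters.
n_high = equivalent_n(p, 0.198)
# Standard errors implied by the equivalent N values quoted in the text.
se_correction = se_for_n(p, 12.4)
se_jackknife = se_for_n(p, 17.6)
```

Because the quoted standard errors are rounded, an equivalent N recomputed from them will only approximately match the values in the text.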
Table 2: Variance estimate for hypothetical data. Simulation two, low intracorrelation.
Table 2 has low intracorrelation at 0.081. However, the corrected standard errors are already more than twice as large as the low, simple random sample estimate of 0.026 with N = 311. The variance correction method gives a standard error of 0.062 with an approximate simple random sample equivalent ^N of 55; the jackknife gives a standard error of 0.058 with an approximate ^N of 63. The confidence intervals produced by the two variance correction methods differ very little.
Table 3: Variance estimate for hypothetical data. Simulation three, high intracorrelation with skewed marginal.
Table 3 shows a situation of high intracorrelation with a skewed marginal. The variance correction method estimate is almost as large as the high standard error. The jackknife estimate is larger than the high estimate and falls outside the valid range between the low and the high estimate of the standard error. Further study showed that the variance correction estimate would also fall outside this range if the marginal became even more skewed.
Table 4: Variance estimate for hypothetical data. Simulation four, low intracorrelation with skewed marginal.
Lastly, Table 4 shows a situation with a skewed marginal and low intracorrelation. The intracorrelation is negative, which is not a valid value, as 0 is the lowest possible value that can occur (in the case of no intracorrelation). The standard error for the variance correction method, at 0.012, is lower than the low estimate of the standard error at 0.023. The standard error for the jackknife lies between the low and the high standard error estimate but, given the probably very low level of intracorrelation, the estimate seems conservative and too close to the high standard error.
Variance estimation correction should always be considered when data are clustered. In this paper two techniques for correcting the estimation of a variance around a single proportion have been compared: variance correction by way of calculating a correction factor on the basis of the intracorrelation coefficient, and the jackknife resampling technique. The first important finding of this study is that variance correction in proportional data should be considered even when there is little clustering in the data. The effect of the correction always seems quite pronounced, even when the intracorrelation and the level of clustering are low. Both techniques of variance correction perform equally well in "reasonable" data, and the choice of technique should be based on practical grounds. Jackknife resampling techniques are more flexible and allow for more complex designs to be considered, such as unequal weighting between clusters, post-stratification, and standardisation. However, to some non-statisticians the technique will seem difficult and the available software cumbersome. Both methods become unreliable if the marginal of the table is skewed, the jackknife being affected particularly seriously when there is also high intracorrelation. The experiments used levels of skewing, i.e. differences in the size of the (primary) sampling units, and of intracorrelation that can reasonably be expected to occur in a practical research situation. Even with moderate skewness, corrected estimates should therefore be compared with estimates which use the number of clusters from which samples have been taken as the denominator in the calculations.
Donald A, Donner A. Adjustments to the Mantel-Haenszel Chi-square statistic and odds ratio variance estimator when the data are clustered. Stat Med 1987;6:491-499.
Donner A, Klar N. Methods for comparing event rates in intervention studies when the unit of allocation is a cluster. Am J Epidemiol 1994;140:279-289.
Fleiss JL. Statistical methods for rates and proportions, 2nd edition. New York [etc.]: John Wiley 1982.
Westat Inc. Wesvar IV. 2000. <http://www.westat.com/wesvar/> (1 May 2001)
Wolter KM. Introduction to variance estimation. New York [etc.]: Springer-Verlag 1985.