REVIEW ARTICLE Year : 2010  Volume : 2  Issue : 7  Page : 288292 Assessing the regression to the mean for nonnormal populations via kernel estimators Majnu John^{1}, Abbas F Jawad^{2}, ^{1} Division of Biostatistics and Epidemiology, Department of Public Health, Weill Cornell Medical College, New York, USA ^{2} Division of Biostatistics and Epidemiology, Department of Pediatrics, University of Pennsylvania, Philadelphia, PA, USA Correspondence Address: Background : Part of the change over time of a response in longitudinal studies may be attributed to the regression to the mean. The component of change due to regression to the mean is more pronounced in the subjects with extreme initial values. Das and Mulder proposed a nonparametric approach to estimate the regression to the mean. Aim : In this paper, Das and Mulder«SQ»s method is made dataadaptive for empirical distributions via kernel estimation approaches, while retaining the original assumptions made by them. Results : We use the best approaches for kernel density and hazard function estimation in our methods. This makes our approach extremely user friendly for a practitioner via the state of the art procedures and packages available in statistical softwares such as SAS and R for kernel density and hazard function estimation. We also estimate the standard error of our estimates of regression to the mean via nonparametric bootstrap methods. Finally, our methods are illustrated by analyzing the percent predicted FEV1 measurements available from the Cystic Fibrosis Foundation«SQ»s National Patient Registry. Conclusion : The kernel based approach presented in this article is a userfriendly method to assess the regression to the mean in nonnormal populations.
Introduction In longitudinal clinical studies, where the change over time of an outcome is of interest, it is sometimes seen that the change is dependent on the initial value. The change over time may be significantly larger in subjects who had really higher (or lower) initial values as compared to subjects whose initial values were closer to the population mean. This differential effect may partly be due to a phenomenon known as regression to the mean. In other words, since there is always withinsubject variability, the initial value may have been high (or low) for certain subjects just by random chance, and upon remeasurement the outcome values for these subjects would have just 'regressed' back to the true mean. For example, this phenomenon may be illustrated via the analysis results of data in a registry based on cystic fibrosis (CF) patients that was conducted by investigators at Children's Hospital of Philadelphia [1] . One of the goals of the study was to determine the rate of change of the pulmonary function as measured by percent of predicted forced expiratory volume in 1 second (FEV 1 %), over a 4year period in a large cohort of children with cystic fibrosis. CF foundation's National CF Patient Registry data collected from 1991 to 1995 for 968 children (461 male) aged 5 to 8 years with pancreatic insufficiency and FEV1% between 60% and 140% were analyzed longitudinally. The significant decline in FEV1% was found to be dependent on baseline FEV1%; children with initial FEV1% ≥ 90 declined 2.6 U/y more than those with initial FEV1% < 90. Various methods have been proposed in the literature to account for regression to the mean. These approaches may be broadly classified into two categories: 1) methods where the regression to the mean is estimated and then subtracted from or added to the observed change to estimate the true change; 2) methods which check for the relationship between change and initial value via correlation or regression, where the correlation coefficient or the regression coefficient as the case may be adjusted for the regression towards mean. Methods of the first type have been proposed and studied in [2],[3],[4],[5],[6],[7],[8] . Methods of the second type have been proposed in [9],[10] . In this paper we propose a method of the first type. We present an estimator for the regression to mean when the underlying distribution for the measurements are nonnormal. Gardener, James and Davis [2],[3],[4] discussed estimators for regression to mean when the measurements are normally distributed with mean and variance that remain constant over time. Chinn and Heller [5] presented similar results when the mean and variance were assumed to vary over time. Das and Mulder [6] investigated the effect when the measurements were not necessarily normal, however their method is not directly applicable to empirical distributions. Their method was made more usable for a practitioner/analyst by a method proposed in Sheath and Dobson [7] where the regression to mean was estimated for empirical distributions using Edgeworth series and saddle point approximations. Muller, Abramson and Azari [8] propose an elegant nonparametric method for estimating the regression to the mean under fewer assumptions related to the underlying distributions. In this paper, Das and Mulder's method [6] is made dataadaptive for empirical distributions via kernel estimation approaches, while retaining the original assumptions made by them. We adopted the best approaches for kernel density and hazard function estimation in our methods. This makes our approach extremely user friendly for a practitioner via the state of the art procedures and packages available in statistical softwares such as SAS and R for kernel density and hazard function estimation. An estimate for regression to the mean in nonnormal populations In this paper we restrict our attention to the studies which measure only one additional value other than the initial value. Following the notational convention in [7] , let X 1i and X 2i be random variables representing the initial and followup measurements on the i th subject, i = 1, ... n, where X 1i = Mi + and X 2i = Mi + . Here Mi, , and are mutually independent random variables representing the true value, measurement error or withinsubject variability for the initial and followup measurements respectively. We also assume that the Mi's are i.i.d. with mean μ and variance θ2 = ρσ2 and e l and e 2 have the normal distributions N(0,Δ2), where Δ2 = (1  ρ)σ2 . For the initial measurements X 1i's let denote the cutoff point for 'large' values; that is all the subjects with initial measurements greater than are considered subjects with high initial value. We further assume that all the X 1i's are i.i.d. with a probability density function and distribution function G(x). Then the expression for evaluating the regression to the mean for the subjects with high initial values, as proposed by Das and Mulder [6] is, [INLINE:1] We propose to estimate by two different approaches as outlined in sections 3 and 4. Let g be a kernel density estimator for g and let be the empirical distribution function for X 1i's greater than . Then in section 3, we propose an estimator[INSIDE:7] of[INSIDE:8] given by [INLINE:2] Let us denote [INSIDE:1]. When X l is a nonnegative variable, is similar to the hazard rate function used in survival analysis literature. We may find an estimator of via the state of the art kernel estimation techniques for hazard rate function. This approach is outlined in section Estimation of regression to the mean via a kernel estimator for u. Estimation of regression to the mean via a kernel estimator for g Kernel density estimates have been studied extensively in literature and used in various applications, ever since they were proposed in [11, 12] For the initial values, X 1i, i = 1, ... n, the kernel density estimator of the probability density function g is given by [INLINE:3] where [INSIDE:2] for a 'kernel function' K (often taken to a symmetric probability density) and a 'bandwidth' h (the smoothing parameter). (See [13] for a good introduction to the ideas.) A common way of measuring the estimation error of is via the mean integrated squared error (MISE), [INLINE:4] where[INSIDE:9] denotes the integration over the real line. Asymptotic mean integrated square error (AMISE) given by [INLINE:5] is a more useful criterion which provides simple insight into "good" bandwidths. Here[INSIDE:3] and [INSIDE:4] The minimizer of AMISE (h) given by [INLINE:6] is a good approximation to hMISE , the minimizer of MISE, in many circumstances. Various methods that have been proposed for optimal bandwidth selection may be broadly classified as first generation and second generation methods [14] . The most popular among the first generation methods are rules of thumb [13] , least squares crossvalidation [15],[16],[17] , and biased crossvalidation [18] . Second generation methods include solvetheequation plugin approach [19],[20] and smoothed bootstrap [21],[22],[23] . As shown in [14] , the method proposed in [20] is a stable and consistent method that outperforms all the other methods mentioned above. The idea behind this method is to choose the bandwidth that is a solution of the fixed point equation [INLINE:7] This is the bandwidth selection method that we use for the kernel density estimate . This is also the default method in many statistical software packages including PROC KDE in SAS version 9.0. Estimation of regression to the mean via a kernel estimator for u For nonnegative X 1i, then the function defined in section 2 is become the hazard function, if we interchange the 'time to event' variable with X 1i and assume no 'censoring'. Kernel based estimation methods of hazard function have received considerable attention in the statistical literature. Kernel estimators for the hazard function was initially proposed in [24] and was further studied in [25] . [26] gave a generalized form to these estimators and studied the asymptotic properties of the generalized estimator. Adapting the generalized form to our notation for regression to mean estimator, we have [INLINE:8] as a kernel estimator for . Here X 1(i)'s denote the ordered sample of X 1i's. As in the case of kernel density estimation, an important problem when estimating the hazard function by kernel methods is the choice of the smoothing parameter (that is, the bandwidth) h. Various methods have been proposed in the statistical literature for choosing the optimal bandwidth. A maximum likelihood crossvalidation method for uncensored hazard estimation was proposed in [27] . Later, [28] (via simulation results) and [29] (via theoretical insights) have shown that least squares cross validation as a more appropriate approach. A least squares crossvalidation methods proposed in [30] for density estimation was extended to uncensored hazard function estimation in [31] and in [32] . Later [33] extended these methods via a bootstrap selection of the smoothing parameter. All the methods mentioned above are fixedbandwidth fixedkernel methods for estimating the hazard function. When the data is not evenly distributed over the range of interest, the degree of smoothing achieved via a fixedbandwidth method will not be uniform. Using a varying bandwidth estimator which balances the local variance and local bias as suggested in [34] rectifies the nonadaptive behavior of fixedbandwidth estimators. Another approach is to incorporate the idea of 'nearest neighbor' into the definition of bandwidth [35] . Fixed kernel estimators do not take into account the socalled boundary effects (that is, the bias problems) near the endpoints of the support of the hazard function. One way to overcome this problem is to change the kernels at the boundary [36],[37],[38] . [34] presented a class of boundary kernels which gave rise to smaller leading constants of the asymptotic mean squared error when estimating than other boundary kernels considered previously. [39] confirmed the advantages of methods proposed in [34] and [35] over the previously published methods for estimating hazard function via kernel estimators. These two methods are available in a R package named muhaz downloadable from cran.rproject.org. Bootstrap estimates of standard error Bootstrap techniques have gained tremendous popularity ever since it was introduced in the seminal paper [40] . Bootstrap methods are well known for estimating the standard errors and bias. A good introduction to the bootstrap world can be seen in [41] and in [42] . Let us denote by our estimate of the regression to the mean via any of the methods proposed in sections 3 and 4. We may obtain the bootstrap estimate of standard error and bias via the following steps. First we obtain nonparametric bootstrap samples of the data. A nonparametric bootstrap sample is obtained by sampling with replacement from the empirical distribution of the original data. For each of the B bootstrap samples, we calculate our estimate of regression to the mean, denoted by[INSIDE:5] The bootstrap estimate of the standard error of is simply the standard deviation of[INSIDE:6] That is, the standard error of is obtained as [INLINE:9] Application to percent predicted FEV1 measurements in cystic fibrosis patients We illustrate the methods described in this paper using percent predicted FEV 1 measurements of 461 males in the CF registry obtained in the years 1991 and 1992. The correlation between the measurements from years 1991 and 1992 was found to be 0.55 and variance of the 1991 measurements was 323.3. (See [Figure 1]a and b.) The PROC MIXED procedure in SAS with a random intercept was used separately for subjects with the initial value (that is, 1991 measurement) greater than and less than or equal to 90. For the subjects with initial value above 90, the intercept was 106.76 and the slope was 8.69 with standard errors 0.97 and 1.38 respectively and for the subjects with initial value below or equal to 90, the intercept was 77.00 and the slope was 1.37 with standard errors 1.08 and 1.54 respectively. (See [Figure 1]c.) We suspect that the regression to the mean plays a significant role in the change seen in subjects with high initial value. {Figure 1} We assess the regression to the mean in subjects with initial measurements greater than 90 via the two methods described in this paper. For the method in the section 3, we used the Gaussian kernel along with the plugin bandwidth selection method described in [20] to estimate g. This was done via PROC KDE in SAS. The regression to the mean via this method was estimated to be 4.78 with bootstrap standard error of 0.85. Since the percent predicted FEV1 measurements is a nonnegative variable, we were also able to estimate the regression to the mean via the method described in section 4. We used the varying kernel, varying bandwidth method suggested in [34] with the Epanechnikov boundary kernel to estimate u. This was done using the muhaz R package. (See section 4 for more details.) The regression to mean estimate via this method was seen to be 5.21 with bootstrap standard error of 1.02. The regression to the mean estimate by both the methods are comparable in this case. Via the first method we see that 55% of the change in percent predicted FEV1 between 1991 and 1992 is due to the regression to the mean effect and by the second method we see that 60% of the change is attributable to regression to the mean. Concluding remarks We propose two approaches for estimating Das and Mulder's expression of regression to the mean for nonnormal population. Our first approach is based on a kernel density estimation method and our second approach is based on kernel estimation approaches for hazard rate function. The estimates in both the methods may be easily obtained via the state of the art techniques available in popular statistical software such as SAS and R. We also calculate the bootstrap standard error for estimates. Applying our methods to FEV % data from Cystic Fibrosis foundation's National Patient Registry showed that both the regression to the mean estimates are within the margin of bootstrap standard error estimates. As in Das and Mulder's original paper, our methods are restricted to calculating the regression to the mean when there are only two time points. Extension of these methods to longitudinal studies with more than two time points is an area of future research. Acknowledgements Partial support for the first author came from Clinical Translational Science Center grant (UL1RR024996). References


