Jacob Simmering

jacobsimmering.com

Mar 20, 2014

Stop using bivariate correlations for variable selection

Something I've never understood is the widespread calculation and reporting of univariate and bivariate statistics in applied work, especially when it comes to model selection. Bivariate statistics are, at best, useless for multivariate model selection and, at worst, harmful. Since nearly all questions of interest are high-dimensional, the use of bivariate statistics is worrisome.

What typically happens is a researcher sits down with their statistical software of choice and computes the correlation between the response variable and each of the possible predictors. From there, they toss out potential predictors that either have a low correlation with the response or whose correlation is not significant.
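In code, that screening step looks something like the following sketch on made-up data (the data and the cutoff here are placeholders of my own, not part of the simulations later in this post):

set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)      # five candidate predictors
y <- X %*% rep(1, 5) + rnorm(100)          # a response that depends on all of them
marginalCor <- cor(y, X)                   # one bivariate correlation per predictor
keep <- abs(marginalCor) >= 0.2            # arbitrary "large enough" cutoff
screenedX <- X[, keep, drop = FALSE]       # predictors that survive the screen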

The concern here is that the marginal correlation between the response and a predictor can be nearly zero, or not statistically significant, even when that predictor is an important element in the data generating pathway.
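To see why, here is a small hypothetical example of my own (not part of the simulation below): the second predictor has a real effect on the response, but its marginal correlation with the response is engineered to be essentially zero through its correlation with the first predictor.

library(MASS)
set.seed(1)
## x1 and x2 are correlated (rho = 0.5) and y depends on both
X <- mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2))
x1 <- X[, 1]
x2 <- X[, 2]
y <- x1 - 0.5 * x2 + rnorm(1000)
cor(y, x2)                               # roughly zero; a bivariate screen drops x2
coef(summary(lm(y ~ x1 + x2)))["x2", ]   # but x2 is clearly non-zero in the full model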

Consider the following simulation, which illustrates the difference between the marginal relationship between the response and a predictor and the true multivariable relationship. The function genY generates the response using user-provided values for \( \beta \) and adds in a little white noise.

genY <- function(betas, X, stdPercent = 0.25) {
    ## build a model matrix with an intercept and compute the linear predictor
    modelMatrix <- cbind(1, X)
    yhat <- modelMatrix %*% betas
    ## add white noise scaled to the mean of the linear predictor
    y <- yhat + rnorm(nrow(yhat), 0, stdPercent * mean(yhat))
    y
}

The function cor returns the correlation between two variables and works with a vector and a matrix argument, but cor.test requires two vectors. For this, I wrote a simple function sigCor that returns TRUE if the correlation between the vectors x and y is significant at a user-provided threshold and FALSE otherwise. This function is designed to work with apply to filter our predictor set.

sigCor <- function(x, y, threshold = 0.2) {
    ## TRUE if the bivariate correlation between x and y is significant
    ## at the supplied p value threshold
    p <- cor.test(x, y)$p.value
    p <= threshold
}

The main function then generates a random covariance matrix and draws from a multivariate normal centered at 0 with that covariance matrix. The columns of the matrix X are then checked for correlation with the response y and filtered in two ways: once keeping only the variables whose correlation with y exceeds some threshold in absolute value, and once keeping only the variables whose correlation has a p value below some threshold. The function returns the number of coefficients (the retained variables plus the intercept) for the full model and for each selection method.

simulateSelection <- function(nx, n, corLim = 0.2, pLim = 0.2) {
    require(MASS)
    require(clusterGeneration)
    ## all true coefficients (intercept plus nx slopes) are 1
    betas <- rep(1, nx + 1)
    mu <- rep(0, nx)
    ## random covariance matrix for the predictors
    sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma
    X <- mvrnorm(n, mu, sigma)
    y <- genY(betas, X)
    ## filter on the size of the bivariate correlation
    xcor <- cor(y, X)
    limitedX <- X[, abs(xcor) >= corLim, drop = FALSE]
    ## filter on the significance of the bivariate correlation
    pX <- X[, apply(X, 2, sigCor, y, pLim), drop = FALSE]
    ## number of coefficients (including the intercept) under each approach
    nCoef <- c(ncol(X) + 1, ncol(limitedX) + 1, ncol(pX) + 1)
    nCoef
}

If we keep the defaults of corLim = 0.2 and pLim = 0.2 (keep any predictor whose correlation with the response exceeds 0.2 in absolute value, and any predictor whose correlation has a p value of at most 0.2) and use replicate to run the function 1000 times with 10 predictors and \( n = 1000 \), we can calculate the probability that each method selects the correct model. (Note that we set all of the values of betas to 1, so all 11 coefficient estimates should be non-zero and all of the predictors are correlated with the outcome.)

results <- replicate(1000, (simulateSelection(10, 1000)))
correctModelFraction <- apply(results == 11, 1, mean)
correctModelFraction
## [1] 1.000 0.026 0.573

Using a filter based on the size of the correlation yields the correct model only 2.6% of the time. Using a relatively high p value as a threshold does better, returning the correct model 57.3% of the time.

We can explore the effect of varying the value of pLim using a few simple modifications to the above code.

simulateSelectionP <- function(nx, n, pLim) {
    require(MASS)
    require(clusterGeneration)
    betas <- rep(1, nx + 1)
    mu <- rep(0, nx)
    sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma
    X <- mvrnorm(n, mu, sigma)
    y <- genY(betas, X)
    ## keep only the predictors that pass the p value filter
    pX <- X[, apply(X, 2, sigCor, y, pLim), drop = FALSE]
    ## TRUE only if every predictor survived the filter
    correct <- ncol(pX) == nx
    correct
}
varyP <- function(nx, n, p) {
    fractionCorrect <- numeric(100 * length(p))
    j <- 1
    for (i in 1:length(p)) {
        ## reset the seed so each threshold sees the same simulated data sets
        set.seed(42)
        fractionCorrect[(j):(j + 99)] <- replicate(100, simulateSelectionP(nx, 
            n, p[i]))
        j <- j + 100
    }
    fractionCorrect
}

Using these new functions, we can calculate the fraction of the time that a threshold of p yields the correct model. If we vary p from \( 0.01 \) to \( 0.50 \) and plot the resulting curve,

library(ggplot2)
prCorrect <- varyP(10, 1000, seq(0.01, 0.5, length.out = 20))
results <- data.frame(prCorrect = prCorrect, limit = rep(seq(0.01, 0.5, length.out = 20), 
    each = 100))
## mean and standard deviation of the success indicator at each threshold
meanByLimit <- aggregate(results$prCorrect, list(results$limit), mean)
sdByLimit <- aggregate(results$prCorrect, list(results$limit), sd)
meanByLimit$sd <- sdByLimit$x
## normal-approximation 95% confidence band for the mean of 100 replicates
meanByLimit$upper <- meanByLimit$x + qnorm(0.975) * meanByLimit$sd/sqrt(100)
meanByLimit$lower <- meanByLimit$x - qnorm(0.975) * meanByLimit$sd/sqrt(100)
ggplot(meanByLimit, aes(x = Group.1, y = x, ymin = lower, ymax = upper)) + geom_line() + 
    geom_ribbon(alpha = 0.25) + theme_bw() + scale_x_continuous("Threshold for p value") + 
    scale_y_continuous("Fraction of models in simulation that are correct")

we see that, as expected, the probability of getting the correct model goes up as we increase the threshold value. We have about a 75% shot at getting the right model when we set the limit on the p value to 0.50, compared to under 30% at a p value of 0.01.
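Those two endpoints can be read straight off the summary table built above:

## first and last thresholds with their estimated success fractions
meanByLimit[c(1, nrow(meanByLimit)), c("Group.1", "x")]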

However, this raises an interesting question: if we have to increase the p value threshold this far, what is the value of using the bivariate correlation at all? We can estimate how often a predictor that is pure noise, completely unrelated to the response, would pass the filter at each threshold.

randomRate <- function(n, pLim) {
    ## x and y are independent, so any "significant" correlation is spurious
    x <- rnorm(n)
    y <- rnorm(n)
    sigCor(x, y, pLim)
}
p <- seq(0.01, 0.5, length.out = 20)
rate <- numeric(length(p))
for (i in 1:length(p)) {
    ## fraction of 1000 independent pairs that pass the filter at threshold p[i]
    rate[i] <- mean(replicate(1000, randomRate(1000, p[i])))
}
rbind(p, rate)
##       [,1]    [,2]    [,3]    [,4]   [,5]   [,6]   [,7]   [,8]   [,9]
## p    0.010 0.03579 0.06158 0.08737 0.1132 0.1389 0.1647 0.1905 0.2163
## rate 0.009 0.03900 0.06400 0.10100 0.1110 0.1290 0.1660 0.1870 0.2210
##       [,10]  [,11]  [,12]  [,13]  [,14]  [,15]  [,16]  [,17]  [,18]  [,19]
## p    0.2421 0.2679 0.2937 0.3195 0.3453 0.3711 0.3968 0.4226 0.4484 0.4742
## rate 0.2250 0.2920 0.3180 0.3180 0.3210 0.4010 0.3720 0.4560 0.4130 0.4660
##      [,20]
## p      0.5
## rate   0.5

As the threshold increases, the value of the bivariate correlation as a filter goes down, and goes down fast. At the threshold required for a 50% chance of the model containing all relevant predictors, each irrelevant variable has nearly a 20% chance of being improperly included.
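To make that trade-off concrete, here is one way to line the two simulations up (this assumes meanByLimit and the p and rate vectors from above are still in the workspace; both were built on the same seq(0.01, 0.5, length.out = 20) grid of thresholds):

comparison <- data.frame(limit       = meanByLimit$Group.1,
                         prCorrect   = meanByLimit$x,
                         prFalseKeep = rate)
## the first threshold with at least a 50% chance of recovering the full model,
## alongside the corresponding chance of keeping a pure-noise predictor
comparison[comparison$prCorrect >= 0.5, ][1, ]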

The real problem comes into play here: the bivariate comparison selects the wrong variables by over-emphasizing the relationship between the marginal distributions. Because the selected model is incomplete and important variables are omitted, the resulting parameter estimates are biased and inaccurate.
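A rough way to see the bias is to refit the model after dropping the predictor with the weakest marginal correlation, the one a bivariate screen would discard first. This is a sketch reusing genY and the all-ones betas from the simulations above; how far the remaining estimates drift from 1 depends on the covariance matrix that happens to be drawn.

library(MASS)
library(clusterGeneration)
set.seed(42)
sigma <- genPositiveDefMat(10, covMethod = "unifcorrmat")$Sigma
X <- mvrnorm(1000, rep(0, 10), sigma)
y <- drop(genY(rep(1, 11), X))
weakest <- which.min(abs(cor(y, X)))   # the predictor a bivariate screen drops first
coef(lm(y ~ X))                        # every true coefficient is 1
coef(lm(y ~ X[, -weakest]))            # the remaining estimates absorb the omitted variable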

The bivariate comparison is a terrible way to select relevant variables for a high-dimensional model because the function of interest relates all of the predictors to the outcome. It can neither rule a predictor into a model nor rule it out. A better approach is to build the model using domain knowledge and a model of the expected data generating process. If the process is to remain strictly data driven, methods like cross-validation or AIC/BIC provide better measures of model quality and predictor importance than the bivariate correlation coefficient.
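As one illustration of a data driven alternative (a sketch of the idea, not an endorsement of stepwise selection as the best method), backwards selection by AIC with step() evaluates each predictor in the context of the full model rather than through its marginal correlation. This reuses the X and y simulated in the sketch above.

## combine the simulated response and predictors into a data frame
dat <- data.frame(y = y, X)
full <- lm(y ~ ., data = dat)
## drop predictors only when doing so improves AIC for the whole model
aicModel <- step(full, direction = "backward", trace = 0)
coef(aicModel)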

Tags: statistics r modelSelection regression

2 Comments

From: Jacob
2014-03-22 22:53

Phil,
I totally agree the answer is in the data. Most people I work with are better off using domain knowledge because they have a lot more of that than they have of statistics. However, I really like methods like cross-validation, AIC and BIC as tools for model selection, including variable selection. I haven't finished the video on your post or completely digested your code yet, but your methods appear to be more complicated than simple bivariate analysis.
My concern that I'm attempting to get at with this post is that bivariate correlations may not be informative about the model in k dimensions. People in public health (my "domain" background) often use simple correlation coefficients to include/exclude variables from the model. I don't think that is an effective or harmless way to do variable selection in many cases.

From: Phil
2014-03-22 21:01

Hi Jacob,
I don't believe domain knowledge should need to be used as all domain knowledge is data driven in the first place - the answer is in the data, so let's use it. I describe a data driven approach to variable selection in my blog post below.
http://www.anotherdataminingblog.blogspot.co.uk/2013/10/techniques-to-improve-accuracy-of-your_17.html
