Cross-Validation

Cross-validation helps us evaluate how accurately a predictive model will perform on data it has not seen before. One way of doing this is to test the model using data that was not used to build it. This could be newly collected data or a subset of the data you already have that was purposely held out when building the model. This held-out data is called the test set, while the data used for parameter estimation or model building is called the training set. The predictive accuracy of a model developed on the training set can then be measured on the test set using a metric such as MSE (Mean Squared Error). In practice, a practitioner will usually repeat this process a number of times and summarize the results with an overall metric such as an average MSE.
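
As a simple illustration of a single test-training split, here is a minimal sketch using base R and the built-in mtcars data; the 80/20 split and the mpg ~ wt model are only example choices, not part of the methods discussed below.

set.seed(1)
# hold out 20% of the rows as a test set; use the remainder for training
test_rows <- sample(nrow(mtcars), size = round(0.2 * nrow(mtcars)))
test_set  <- mtcars[test_rows, ]
train_set <- mtcars[-test_rows, ]
# build the model with the training set only
fit <- lm(mpg ~ wt, data = train_set)
# measure predictive accuracy on the test set with MSE
mean((test_set$mpg - predict(fit, newdata = test_set))^2)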

The training set typically has a much larger number of observations than the test set. However, there are many different approaches to how someone could split the data into these two sets. Two options we will explore here are the crossv_kfold() and crossv_mc() functions from the modelr package.

k-Fold Cross Validation

crossv_kfold() randomly splits the data into k exclusive partitions and then uses each partition as a test set while combining the other k-1 partitions to form the training set. This produces k total test-training pairs which can be further analyzed by looking at an overall metric such as average MSE for all of the k test sets.

crossv_kfold(data, k = 5)

Argument meanings:

  • data: a data frame containing the data to be partitioned into test and training sets
  • k: number of folds or test-training pairs the data will be partitioned into (if not stated, a default of k = 5 partitions is used)

Let's look at an example using the iris data set. This data set has 150 observations and 5 variables - sepal length and width, petal length and width, and the species of the iris. Assume we want to assess how well we can predict the width of an iris petal when knowing only the length of the petal. Further assume we believe that partitioning the 150 observations into eight distinct groups will result in acceptably sized test sets and a manageable number of estimated models to explore.

library(modelr)     # load necessary package
set.seed(1)         # use to replicate results for this example
cv_kf <- crossv_kfold(iris, k = 8)
cv_kf

## # A tibble: 8 x 3
##            train           test   .id
##           <list>         <list> <chr>
## 1 <S3: resample> <S3: resample>     1
## 2 <S3: resample> <S3: resample>     2
## 3 <S3: resample> <S3: resample>     3
## 4 <S3: resample> <S3: resample>     4
## 5 <S3: resample> <S3: resample>     5
## 6 <S3: resample> <S3: resample>     6
## 7 <S3: resample> <S3: resample>     7
## 8 <S3: resample> <S3: resample>     8

The result cv_kf is a tibble, which is a tidy form of a data frame. Each of the 8 rows corresponds to a different partition: the first and second columns hold the training and test sets respectively, and the third column is simply an identifying label. We can also see that the elements of the train and test columns are resample objects, but for the sake of simplicity we can treat them like list objects. This means we can extract a specific training or test set by indexing into cv_kf as if it were a data frame made up of lists.

# extract row indices for first training set
idc1 <- cv_kf$train[[1]]$idx
# training data set for first partition
train1 <- iris[idc1, ]
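
The corresponding test set for the first partition can be extracted in the same way. A brief sketch (idc1_test and test1 are just names chosen here for illustration):

# extract row indices for first test set
idc1_test <- cv_kf$test[[1]]$idx
# test data set for first partition
test1 <- iris[idc1_test, ]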

Next, let's use the first training set to fit a simple regression model using the lm() function. For more on using lm() please refer to the regression page.

model1 <- lm(Petal.Width ~ Petal.Length, data = train1)
# model summary
summary(model1)

## Call:
## lm(formula = Petal.Width ~ Petal.Length, data = train1)

## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43822 -0.12255 -0.02255  0.13631  0.65541 

## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.353357   0.040852   -8.65 1.76e-14 ***
## Petal.Length  0.411361   0.009857   41.73  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Residual standard error: 0.1977 on 129 degrees of freedom
## Multiple R-squared:  0.931,    Adjusted R-squared:  0.9305  
## F-statistic:  1742 on 1 and 129 DF,  p-value: < 2.2e-16
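
Before automating this across all eight partitions, we can check the predictive accuracy of this first model on its held-out data. A minimal sketch, assuming the test1 data frame extracted earlier (the resulting MSE value is not shown here):

# predict petal width for the held-out test observations
pred1 <- predict(model1, newdata = test1)
# mean squared error for the first test set
mse1 <- mean((test1$Petal.Width - pred1)^2)
mse1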

To learn how to automate this for all the partitions please refer to the tutorials for using for loops, lapply(), or the map() function from the purrr package.
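
As one possible illustration (not necessarily the approach taken in those tutorials), the sketch below uses map() and map2_dbl() from purrr to fit the same model to every training set and then average the MSE over all eight test sets; the object names are only examples.

library(purrr)
# fit the model to each of the 8 training sets
models <- map(cv_kf$train, ~ lm(Petal.Width ~ Petal.Length, data = as.data.frame(.x)))
# compute the MSE on each corresponding test set
mses <- map2_dbl(models, cv_kf$test, function(mod, tst) {
  tst_df <- as.data.frame(tst)
  mean((tst_df$Petal.Width - predict(mod, newdata = tst_df))^2)
})
# overall metric: average MSE across the 8 folds
mean(mses)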

Monte Carlo Approach

crossv_mc() randomly selects a proportion, test, of the observations for the test set and uses the remaining proportion, 1 - test, for the training set. This procedure is repeated n times to form n test-training pairs. There is no default value for n, so the user must supply it or R will return an error message.

crossv_mc(data, n, test = 0.2)

Argument meanings:

  • data: a data frame containing the data to be split into test and training sets
  • n: number of train and test pairs to make
  • test: proportion of data to be held for the test set; by default 20% of the data is used for testing

Let's look at an example where we want to hold out 30% of the 150 observations for testing and repeat this 20 times.

# load necessary package
library(modelr)
# randomly select 45 of the 150 rows (30%) as a test set, repeated 20 times to form 20 test-training pairs
cv_mc <- crossv_mc(iris, n = 20, test = 0.3)
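
The object cv_mc has the same structure as the k-fold result (train, test, and .id columns of resample objects), so the fit-and-evaluate pattern from the previous section carries over directly. A short sketch, assuming purrr is loaded and reusing the same illustrative model:

# fit the model to each of the 20 training sets and average the test MSE
mc_models <- map(cv_mc$train, ~ lm(Petal.Width ~ Petal.Length, data = as.data.frame(.x)))
mc_mses <- map2_dbl(mc_models, cv_mc$test, function(mod, tst) {
  tst_df <- as.data.frame(tst)
  mean((tst_df$Petal.Width - predict(mod, newdata = tst_df))^2)
})
mean(mc_mses)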

Note: One possible drawback of the Monte Carlo approach is that, across the n repetitions, some rows may never be selected for a test set while others may appear in many of them; unlike k-fold cross-validation, there is no guarantee that every observation is used for testing exactly once.

Need a Refresher?

Go back to the beginner tutorials.