This vignette provides a guide to performing cross validation with modelr and maxcovr. If there is demand for this feature in the future, I will incorporate cross validation into maxcovr itself. In the meantime, here's a vignette.

Performing cross validation on max_coverage

We will stick with the previous example, using the york and york_crime datasets.
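The setup chunk is not shown in the rendered output, so here is a minimal sketch of the assumed setup, mirroring the earlier max_coverage example: load the packages and split the york building data into existing and candidate facilities. The grade == "I" split is an assumption carried over from that example.

# Assumed setup, mirroring the earlier max_coverage example
library(maxcovr)
library(dplyr)
library(modelr)
library(purrr)

# buildings assumed to already have a facility (grade I listed buildings)
york_selected <- york %>% filter(grade == "I")

# candidate buildings for new facilities
york_unselected <- york %>% filter(grade != "I")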

## Warning: package 'dplyr' was built under R version 3.5.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Thanks to the modelr package, it is relatively straightforward to perform cross validation.
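A sketch of how the folds below might be created, assuming five folds of the york_crime users, with the resample objects returned by crossv_kfold converted to tibbles:

# Split the users (york_crime) into 5 folds, then convert the resample
# objects into tibbles for easier handling
mc_cv <- crossv_kfold(york_crime, k = 5) %>%
    mutate(train = map(train, as_tibble),
           test  = map(test, as_tibble))

mc_cv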

This creates a data frame with test and training sets:

## # A tibble: 5 x 3
##   train                 test                .id  
##   <list>                <list>              <chr>
## 1 <tibble [1,451 × 12]> <tibble [363 × 12]> 1    
## 2 <tibble [1,451 × 12]> <tibble [363 × 12]> 2    
## 3 <tibble [1,451 × 12]> <tibble [363 × 12]> 3    
## 4 <tibble [1,451 × 12]> <tibble [363 × 12]> 4    
## 5 <tibble [1,452 × 12]> <tibble [362 × 12]> 5

We then fit the model to each training set using map_df() from the purrr package.
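A sketch of that fitting step, assuming the york_selected and york_unselected objects from the setup above, with n_added = 20 and a 100 m distance cutoff to match the summaries that follow:

# Fit max_coverage() to each training fold, timing the whole run
system.time(
    mc_cv_fit <- map_df(mc_cv$train,
                        ~ max_coverage(existing_facility = york_selected,
                                       proposed_facility = york_unselected,
                                       user = .x,
                                       n_added = 20,
                                       distance_cutoff = 100))
)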

##    user  system elapsed 
##   4.913   0.650   5.604

Then we can use the summary_mc_cv function to extract the summaries from each fold. This summary takes the facilities placed using the training set of users, and then counts what percentage of users in the test set are covered by those facilities.
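A sketch of that step; the argument order for summary_mc_cv, taking the fitted fold models and the fold data, is an assumption here:

# Score each training-fold model against its matching test set
summarised_cv <- summary_mc_cv(mc_cv_fit, mc_cv)

summarised_cv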

| n_added | n_fold | distance_within | n_cov | pct_cov   | n_not_cov | pct_not_cov | dist_avg | dist_sd  |
|--------:|-------:|----------------:|------:|----------:|----------:|------------:|---------:|---------:|
| 20      | 1      | 100             | 48    | 0.1322314 | 315       | 0.8677686   | 1197.384 | 1561.703 |
| 20      | 2      | 100             | 40    | 0.1101928 | 323       | 0.8898072   | 1068.079 | 1370.440 |
| 20      | 3      | 100             | 40    | 0.1101928 | 323       | 0.8898072   | 1233.018 | 1482.182 |
| 20      | 4      | 100             | 40    | 0.1101928 | 323       | 0.8898072   | 1301.847 | 1580.649 |
| 20      | 5      | 100             | 41    | 0.1132597 | 321       | 0.8867403   | 1325.891 | 1860.367 |

Eyeballing the values, it looks like the percent coverage stays at around 10%, but we can plot it to get a better idea. We can also overlay the coverage obtained using the full dataset to see how we are performing.
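A sketch of such a plot; full_fit_pct_cov is a hypothetical value standing in for the percent coverage of the model fitted to the full dataset:

library(ggplot2)

# Percent of test users covered in each fold, with the full-data coverage
# overlaid as a dashed reference line (full_fit_pct_cov is hypothetical)
ggplot(summarised_cv,
       aes(x = n_fold,
           y = pct_cov)) +
    geom_point() +
    geom_hline(yintercept = full_fit_pct_cov,
               linetype = "dashed") +
    theme_minimal()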

Here we see that the pct_coverage doesn’t seem to change much across the folds.

Coming up next, we will explore how to perform cross validation as we increase the number of facilities added.

Ideally, there should be a way to do this using purrr, so we don't have to fit 5 separate models, but perhaps this will change when we enable n_added to take a vector of values.
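The five timings below come from those five separate fits. The actual n_added values are not shown in the output, so the sequence used in this sketch (20 to 100 facilities) is an assumption for illustration:

# Helper to fit max_coverage() to every training fold for a given n_added
fit_fold_models <- function(n_added) {
    map_df(mc_cv$train,
           ~ max_coverage(existing_facility = york_selected,
                          proposed_facility = york_unselected,
                          user = .x,
                          n_added = n_added,
                          distance_cutoff = 100))
}

# Five separate fits, one per candidate number of facilities (values assumed)
system.time(mc_cv_n20  <- fit_fold_models(20))
system.time(mc_cv_n40  <- fit_fold_models(40))
system.time(mc_cv_n60  <- fit_fold_models(60))
system.time(mc_cv_n80  <- fit_fold_models(80))
system.time(mc_cv_n100 <- fit_fold_models(100))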

##    user  system elapsed 
##   4.404   0.387   4.812
##    user  system elapsed 
##   4.129   0.394   4.537
##    user  system elapsed 
##   4.493   0.386   4.944
##    user  system elapsed 
##   4.491   0.416   4.973
##    user  system elapsed 
##   4.257   0.396   4.668
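To compare the fits, each set of fold models is summarised against its test sets and the results are bound together. This sketch assumes that is how bound_testing_summaries, used in the plots below, was built:

# Summarise each set of fold models on the test sets, then bind the rows
bound_testing_summaries <- list(mc_cv_n20,
                                mc_cv_n40,
                                mc_cv_n60,
                                mc_cv_n80,
                                mc_cv_n100) %>%
    map_df(~ summary_mc_cv(.x, mc_cv))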

It looks like the more facilities we add, the better the coverage…mostly.

Let’s look at this another way, with boxplots for the number of facilities added.

# Boxplots of the number of test users covered, for each number of
# facilities added
ggplot(bound_testing_summaries,
       aes(x = factor(n_added),
           y = n_cov)) +
    geom_boxplot() +
    theme_minimal()

We can also compare the percent coverage for the test and training datasets.
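One way to make that comparison, assuming a bound_training_summaries object built in the same way as bound_testing_summaries but summarising the training sets (hypothetical here):

# Compare percent coverage on the test and training sets side by side
# (bound_training_summaries is a hypothetical, analogous object)
bind_rows(test = bound_testing_summaries,
          training = bound_training_summaries,
          .id = "set") %>%
    ggplot(aes(x = factor(n_added),
               y = pct_cov,
               colour = set)) +
    geom_boxplot() +
    theme_minimal()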