Estimator of the Mean Squared Prediction Error Using Cross-Validation
Source: R/crossval.R
Estimator of the mean squared prediction error of different learners using cross-validation.
Usage
crossval(
  y,
  X,
  learners,
  cv_folds = 10,
  cluster_variable = seq_along(y),
  cv_subsamples = NULL,
  silent = FALSE,
  parallel = NULL
)
Arguments
- y
The outcome variable.
- X
A (sparse) matrix of predictive variables.
- learners
learners is a list of lists, each containing three named elements:
  - what: The base learner function. The function must be such that it predicts a named input y using a named input X.
  - args: Optional arguments to be passed to what.
  - assign_X: An optional vector of column indices corresponding to variables in X that are passed to the base learner.
Omission of the args element results in default arguments being used in what. Omission of assign_X results in inclusion of all predictive variables in X.
- cv_folds
Number of folds used for cross-validation.
- cluster_variable
A vector of cluster indices.
- cv_subsamples
List of vectors with sample indices for cross-validation.
- silent
Boolean to silence estimation updates.
- parallel
An optional named list with parallel processing options. When NULL (the default), computation is sequential. Supported fields:
  - cores: Number of cores to use.
  - export: Character vector of object names to export to parallel workers (for custom learners that reference global objects).
  - packages: Character vector of additional package names to load on workers (for custom learners that use packages not imported by ddml).
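The cluster_variable and cv_subsamples arguments above both determine how observations are split into folds. As a rough illustration, the following base R sketch builds a list of index vectors in which all observations of a cluster fall in the same fold. The helper make_cluster_folds is hypothetical and not part of ddml, and the sketch assumes that each element of cv_subsamples holds the indices of one held-out fold, which the documentation above does not spell out.

```r
# Hypothetical helper (not part of ddml): assign whole clusters to folds.
make_cluster_folds <- function(cluster_variable, cv_folds) {
  clusters <- unique(cluster_variable)
  # Randomly assign each cluster to one of cv_folds folds of near-equal size
  fold_of_cluster <- sample(rep_len(seq_len(cv_folds), length(clusters)))
  # Map the cluster-level assignment back to the observations
  fold_of_obs <- fold_of_cluster[match(cluster_variable, clusters)]
  # One vector of observation indices per fold
  lapply(seq_len(cv_folds), function(k) which(fold_of_obs == k))
}

set.seed(1)
cluster_id <- rep(1:20, each = 5)  # 100 observations in 20 clusters
folds <- make_cluster_folds(cluster_id, cv_folds = 4)
lengths(folds)
#> [1] 25 25 25 25
```

With the default cluster_variable = seq_along(y), every observation is its own cluster, so the same construction reduces to an ordinary random K-fold split.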
Value
crossval returns a list containing the following components:
- mspe
A vector of MSPE estimates, each corresponding to a base learner (in chronological order).
- r2
A vector of cross-validated \(R^2\) values, each corresponding to a base learner (in chronological order).
- cv_resid
A matrix of out-of-sample residuals, each column corresponding to a base learner (in chronological order).
- cv_subsamples
Pass-through of cv_subsamples. See above.
Details
crossval estimates the mean squared prediction error
(MSPE) of \(J\) base learners via \(K\)-fold
cross-validation. It is the inner workhorse of the stacking
machinery used by ensemble_weights to determine
ensemble weights.
Given a generic conditional expectation function \(f_0(\cdot)\) (e.g., \(E[Y\vert X]\), \(E[D\vert X]\)), let \(\{I_1, \ldots, I_K\}\) be a \(K\)-fold partition of \(\{1, \ldots, n\}\) and let \(\hat{f}_j^{(-k)}\) denote learner \(j\) trained on all observations outside fold \(I_k\). The out-of-sample residual for observation \(i \in I_k\) is
\(\hat{e}_{i,j} = y_i - \hat{f}_j^{(-k)}(X_i).\)
Since every observation belongs to exactly one fold, this yields a complete \(n \times J\) residual matrix. The cross-validated MSPE for learner \(j\) is
\(\widehat{\textrm{MSPE}}_j = n^{-1} \sum_{i=1}^{n} \hat{e}_{i,j}^2,\)
and the cross-validated \(R^2\) is
\(\hat{R}^2_j = 1 - \widehat{\textrm{MSPE}}_j \,/\, \hat{\sigma}^2_y,\)
where \(\hat{\sigma}^2_y\) is the sample variance of \(y\).
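The formulas above can be traced step by step in base R. The sketch below is an illustration under stated assumptions, not the ddml implementation: the simulated data, the two toy learners (mean_only and linear), and the convention that a learner returns a prediction function are all made up for this example.

```r
set.seed(123)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- 1 + X[, "x1"] - 0.5 * X[, "x2"] + rnorm(n)

# K-fold partition {I_1, ..., I_K} of {1, ..., n}
K <- 5
fold_id <- sample(rep_len(seq_len(K), n))

# Two illustrative learners: each is fit on training data and
# returns a function that predicts for new rows of X
learners <- list(
  mean_only = function(y, X) {
    m <- mean(y)
    function(X_new) rep(m, nrow(X_new))
  },
  linear = function(y, X) {
    beta <- lm.fit(cbind(1, X), y)$coefficients  # least squares fit
    function(X_new) drop(cbind(1, X_new) %*% beta)
  }
)

# n x J matrix of out-of-sample residuals e_{i,j}
cv_resid <- matrix(NA_real_, n, length(learners),
                   dimnames = list(NULL, names(learners)))
for (k in seq_len(K)) {
  test <- which(fold_id == k)
  for (j in seq_along(learners)) {
    # Train learner j on all observations outside fold k
    predict_j <- learners[[j]](y[-test], X[-test, , drop = FALSE])
    cv_resid[test, j] <- y[test] - predict_j(X[test, , drop = FALSE])
  }
}

mspe <- colMeans(cv_resid^2)  # n^{-1} * sum_i e_{i,j}^2
r2 <- 1 - mspe / var(y)       # var() is the sample variance of y
```

Because every observation is held out exactly once, cv_resid has no missing entries. A learner that predicts no better than the training mean ends up with a cross-validated \(R^2\) near zero, and possibly slightly below it.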
See also
Other utilities:
crosspred(),
ddml(),
diagnostics(),
ensemble(),
ensemble_weights(),
shortstacking()
Examples
# Construct variables from the included Angrist & Evans (1998) data
y = AE98[, "worked"]
X = AE98[, c("morekids", "age", "agefst", "black", "hisp", "othrace", "educ")]
# Compare ols, lasso, and ridge using 4-fold cross-validation
cv_res <- crossval(y, X,
                   learners = list(list(what = ols),
                                   list(what = mdl_glmnet),
                                   list(what = mdl_glmnet,
                                        args = list(alpha = 0))),
                   cv_folds = 4,
                   silent = TRUE)
cv_res$mspe
#> [1] 0.2363514 0.2363506 0.2363641
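One way to read this output is to pick the learner with the smallest MSPE. The sketch below copies the three values printed above into a named vector so that it runs without the package. The names ols, lasso, and ridge are labels added here, following the order in which the learners were supplied.

```r
# MSPE values copied from the output above, labelled in the order supplied
mspe <- c(ols = 0.2363514, lasso = 0.2363506, ridge = 0.2363641)

# Learner with the smallest cross-validated MSPE
best <- names(which.min(mspe))
best
#> [1] "lasso"

# Gap of each learner to the best one, relative to the best MSPE
rel_gap <- (mspe - min(mspe)) / min(mspe)
```

In this example the estimates differ only from the sixth decimal place onward, so the three learners are practically indistinguishable in terms of MSPE.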