Introduction

This article illustrates key features of ddml by estimating the effect of 401k participation on retirement savings. In particular, the example showcases the following features:

  • Estimation using a single machine learner
  • Estimation using multiple machine learners via short-stacking
  • Estimation using multiple machine learners and different sets of control variables
  • Estimation using different sets of machine learners for continuous and binary outcome and treatment variables

After introducing the data, the article illustrates each of these features in turn.

Data Construction

The 401k data considered here come from the 1991 wave of the Survey of Income and Program Participation (SIPP). We use the data as prepared for the application in Chernozhukov et al. (2018).

SIPP91 <- readRDS("data/SIPP91.rds")

The endogenous variable of interest is an indicator equal to 1 if the person participates in a 401k. We are interested in estimating its effect on net financial assets while controlling for various individual characteristics (e.g., age and income).

nobs <- nrow(SIPP91)
y <- as.matrix(SIPP91$net_tfa)
D <- as.matrix(SIPP91$p401)
X <- as.matrix(SIPP91[, c("age", "inc", "educ", "fsize",
                      "marr", "twoearn", "db", "pira", "hown")])

Estimation using a Single Machine Learner

The simplest Double/Debiased Machine Learning estimator for the causal effect of 401k participation on financial assets considers a single machine learner to (nonparametrically) control for individual characteristics. We consider gradient boosting in the code snippet below (see ?ddml_plm and ?mdl_xgboost for details).

A comparison with a simple multiple linear regression estimate indicates a gap between the point estimates of roughly one and a half standard errors (see the short calculation after the regression output below), suggesting that there are at least some non-linearities present in the data.

# PLM with gradient boosting
xgboost_fit <- ddml_plm(y = y, D = D, X = X,
                        learners = list(what = mdl_xgboost,
                                        args = list(nrounds = 50)),
                        sample_folds = 5,
                        silent = TRUE)
summary(xgboost_fit)
#> DDML estimation: Partially Linear Model 
#> Obs: 9915   Folds: 5
#> 
#>             Estimate Std. Error z value Pr(>|z|)    
#> D1             13916       1544    9.01   <2e-16 ***
#> (Intercept)     -541        636   -0.85      0.4    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Comparison to multiple linear regression
lm_fit <- lm(y ~ D + X)
summary(lm_fit)
#> 
#> Call:
#> lm(formula = y ~ D + X)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -507235  -17014   -2134   10406 1430000 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -3.343e+04  4.265e+03  -7.837  5.1e-15 ***
#> D            1.160e+04  1.345e+03   8.624  < 2e-16 ***
#> Xage         6.351e+02  5.938e+01  10.696  < 2e-16 ***
#> Xinc         9.143e-01  3.015e-02  30.330  < 2e-16 ***
#> Xeduc       -6.190e+02  2.278e+02  -2.717 0.006597 ** 
#> Xfsize      -1.007e+03  4.487e+02  -2.245 0.024776 *  
#> Xmarr        9.172e+02  1.791e+03   0.512 0.608532    
#> Xtwoearn    -1.936e+04  1.572e+03 -12.312  < 2e-16 ***
#> Xdb         -4.915e+03  1.335e+03  -3.683 0.000232 ***
#> Xpira        2.892e+04  1.464e+03  19.748  < 2e-16 ***
#> Xhown        8.830e+02  1.320e+03   0.669 0.503420    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 55580 on 9904 degrees of freedom
#> Multiple R-squared:  0.2352, Adjusted R-squared:  0.2344 
#> F-statistic: 304.6 on 10 and 9904 DF,  p-value: < 2.2e-16
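
As a rough back-of-the-envelope check of the comparison mentioned above, we can express the gap between the two point estimates in units of the DDML standard error. The DDML estimate and its standard error are transcribed from the DDML summary; the OLS coefficient is extracted from lm_fit:

# Gap between the DDML and OLS point estimates of the 401k effect,
# in units of the DDML standard error (13916 and 1544 transcribed from above)
(13916 - coef(lm_fit)["D"]) / 1544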

Estimation using Multiple Machine Learners

Instead of considering just a single – and possibly ill-tuned – machine learner to control for covariates, it is often helpful to combine a diverse set of learners. The code snippet below considers a set of six learners (see ?ols, ?mdl_glmnet, ?mdl_ranger, and ?mdl_xgboost for details). The learners are combined with weights chosen to minimize the out-of-sample mean squared prediction error in the reduced forms of the outcome and of the endogenous variable on the controls, respectively. To reduce the computational burden, we consider short-stacking (see also vignette("stacking")).

The estimate is nearly identical to the estimate using just gradient boosting.

# Specify set of learners
learners <- list(list(what = ols),
                 list(what = mdl_glmnet),
                 list(what = mdl_ranger,
                      args = list(num.trees = 100,
                                  max.depth = 4)),
                 list(what = mdl_ranger,
                      args = list(num.trees = 100,
                                  max.depth = 8)),
                 list(what = mdl_xgboost,
                      args = list(nrounds = 50)),
                 list(what = mdl_xgboost,
                      args = list(nrounds = 150)))

# PLM with short-stacking
stacking_fit <- ddml_plm(y = y, D = D, X = X,
                         learners = learners,
                         ensemble_type = "nnls",
                         sample_folds = 5,
                         shortstack = TRUE,
                         silent = TRUE)
summary(stacking_fit)
#> DDML estimation: Partially Linear Model 
#> Obs: 9915   Folds: 5  Stacking: short-stack
#> 
#>             Estimate Std. Error z value Pr(>|z|)    
#> D1             13769       1516    9.08   <2e-16 ***
#> (Intercept)    -1194        537   -2.22    0.026 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To better understand which machine learners contribute to the final estimate, we can inspect the stacking diagnostics. The weights show that the deeper random forest carries most of the weight in both reduced forms, with linear regression contributing as well and the shallower random forest adding weight in the reduced form for the treatment. The lasso and the gradient boosting learners are assigned little to no weight.

diagnostics(stacking_fit)
#> Stacking diagnostics: Partially Linear Model 
#> Obs: 9915 
#> 
#>   y_X:
#>    learner       mspe     r2 weight_nnls
#>  learner_1 3125114171 0.2255      0.1926
#>  learner_2 3126016675 0.2253      0.0000
#>  learner_3 3045999690 0.2451      0.0000
#>  learner_4 2910378985 0.2787      0.7992
#>  learner_5 3435902221 0.1485      0.0824
#>  learner_6 3716930203 0.0789      0.0000
#>       nnls 2890726991 0.2836          NA
#> 
#>   D1_X:
#>    learner   mspe      r2 weight_nnls
#>  learner_1 0.1725  0.1072      0.2333
#>  learner_2 0.1725  0.1071      0.0000
#>  learner_3 0.1717  0.1112      0.1916
#>  learner_4 0.1711  0.1144      0.5891
#>  learner_5 0.1860  0.0371      0.0000
#>  learner_6 0.2045 -0.0588      0.0000
#>       nnls 0.1707  0.1163          NA
#> 
#> Note: Ensemble MSPE and R2 for short-stacking rely on full-sample weights
#>        and represent in-sample fit over cross-fitted base predictions.

Estimation with Different Sets of Control Variables

The large weight assigned to the random forests suggests that non-linearities and potential interactions may be important. While random forests and gradient boosting construct such non-linearities adaptively, linear methods such as the lasso require these terms to be pre-specified in order to capture more complex patterns in the data. A better set of machine learners would thus use different sets of variables for different learners:

  • linear regression and lasso with an extended set of control variables
  • random forests and gradient boosting with the original set of control variables

We extend the control variables using a simple third-order polynomial expansion of the non-binary control variables (with interactions). These are combined with the original control variables. In addition, we construct two sets of indices that reference the two sets of controls.

X_series <- poly(X[, c("age", "inc", "educ", "fsize")], degree = 3)
X_c <- cbind(X, X_series)
X_baseline <- 1:ncol(X)   # original controls (incl. indicators)
X_extended <- 5:ncol(X_c) # indicators & series expansion
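
An optional dimension check confirms the size of the expanded control matrix and of the two index sets (poly with degree = 3 adds all polynomial and interaction terms of the four non-binary controls up to total degree three):

# Number of controls after the expansion, and size of each index set
ncol(X_c)
length(X_baseline)
length(X_extended)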

ddml supports passing different sets of control variables to different base learners via the assign_X argument, which takes the column indices of the control matrix corresponding to the desired variables.

# Specify base learners with different sets of controls
learners <- list(list(what = ols,
                      assign_X = X_baseline),
                 list(what = ols,
                      assign_X = X_extended),
                 list(what = mdl_glmnet,
                      assign_X = X_extended),
                 list(what = mdl_ranger,
                      args = list(num.trees = 100,
                                  max.depth = 4),
                      assign_X = X_baseline),
                 list(what = mdl_ranger,
                      args = list(num.trees = 100,
                                  max.depth = 8),
                      assign_X = X_baseline),
                 list(what = mdl_xgboost,
                      args = list(nrounds = 50),
                      assign_X = X_baseline),
                 list(what = mdl_xgboost,
                      args = list(nrounds = 150),
                      assign_X = X_baseline))

# PLM with short-stacking
stacking_fit <- ddml_plm(y = y, D = D, X = X_c,
                         learners = learners,
                         ensemble_type = "nnls",
                         sample_folds = 5,
                         shortstack = TRUE,
                         silent = TRUE)
summary(stacking_fit)
#> DDML estimation: Partially Linear Model 
#> Obs: 9915   Folds: 5  Stacking: short-stack
#> 
#>             Estimate Std. Error z value Pr(>|z|)    
#> D1             14120       1500    9.42   <2e-16 ***
#> (Intercept)     -960        532   -1.80    0.071 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To assess whether the additional effort of expanding the control variables changed the composition of the reduced form estimates, we can again inspect the stacking diagnostics. Indeed, the new base learner that combines the lasso with the polynomial expansion of the controls is assigned the largest weight, followed by the random forest estimators.

diagnostics(stacking_fit)
#> Stacking diagnostics: Partially Linear Model 
#> Obs: 9915 
#> 
#>   y_X:
#>    learner       mspe      r2 weight_nnls
#>  learner_1 3125864729  0.2253      0.0000
#>  learner_2 2877766026  0.2868      0.0000
#>  learner_3 2858270329  0.2916      0.7204
#>  learner_4 3021160887  0.2513      0.0000
#>  learner_5 2926909974  0.2746      0.3365
#>  learner_6 4121257708 -0.0213      0.0000
#>  learner_7 4614176089 -0.1435      0.0000
#>       nnls 2838742034  0.2965          NA
#> 
#>   D1_X:
#>    learner   mspe      r2 weight_nnls
#>  learner_1 0.1727  0.1062      0.0123
#>  learner_2 0.1719  0.1104      0.0370
#>  learner_3 0.1714  0.1130      0.3921
#>  learner_4 0.1718  0.1106      0.0773
#>  learner_5 0.1710  0.1147      0.3848
#>  learner_6 0.1825  0.0554      0.1035
#>  learner_7 0.2001 -0.0360      0.0038
#>       nnls 0.1705  0.1176          NA
#> 
#> Note: Ensemble MSPE and R2 for short-stacking rely on full-sample weights
#>        and represent in-sample fit over cross-fitted base predictions.

Estimation with Continuous and Binary Outcome and Treatment Variables

Thus far, we have considered the same set of base learners for the reduced form of the outcome on the controls and the reduced form of the endogenous variable on the controls. In contrast to net financial assets, which is (approximately) continuous, participation in a 401k is a binary indicator. It may thus be sensible to consider reduced form estimators whose predictions lie in the unit interval for the latter reduced form. For example, we could consider logistic regression instead of linear regression.

Equipped with the included mdl_glm wrapper for generalized linear models (see also ?mdl_glm), we now specify a second set of base-learners for the reduced form of the binary endogenous variable of interest on the controls. As in the previous section, we consider different sets of control variables as well. When estimating the Double/Debiased Machine Learning estimator, we pass the second set of base learners via the learners_DX argument.

# Specify an additional set of learners for the reduced form E[D|X]
learners_DX <- list(list(what = mdl_glm,
                         args = list(family = "binomial"),
                         assign_X = X_baseline),
                    list(what = mdl_glm,
                         args = list(family = "binomial"),
                         assign_X = X_extended),
                    list(what = mdl_glmnet,
                         args = list(family = "binomial"),
                         assign_X = X_extended),
                    list(what = mdl_ranger,
                         args = list(num.trees = 100,
                                     max.depth = 4),
                         assign_X = X_baseline),
                    list(what = mdl_ranger,
                         args = list(num.trees = 100,
                                     max.depth = 8),
                         assign_X = X_baseline),
                    list(what = mdl_xgboost,
                         args = list(nrounds = 50),
                         assign_X = X_baseline),
                    list(what = mdl_xgboost,
                         args = list(nrounds = 150),
                         assign_X = X_baseline))

# PLM with short-stacking and different sets of learners
stacking_fit <- ddml_plm(y = y, D = D, X = X_c,
                         learners = learners,
                         learners_DX = learners_DX,
                         ensemble_type = "nnls",
                         sample_folds = 5,
                         shortstack = TRUE,
                         silent = TRUE)
summary(stacking_fit)
#> DDML estimation: Partially Linear Model 
#> Obs: 9915   Folds: 5  Stacking: short-stack
#> 
#>             Estimate Std. Error z value Pr(>|z|)    
#> D1             14025       1490    9.41   <2e-16 ***
#> (Intercept)     -387        534   -0.72     0.47    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The diagnostics show that the new logistic regression base learners are assigned somewhat larger weights in the reduced form for the treatment than their linear counterparts were previously:

diagnostics(stacking_fit)
#> Stacking diagnostics: Partially Linear Model 
#> Obs: 9915 
#> 
#>   y_X:
#>    learner       mspe      r2 weight_nnls
#>  learner_1 3118730135  0.2271      0.0652
#>  learner_2 2902039913  0.2808      0.0000
#>  learner_3 2871909268  0.2883      0.8547
#>  learner_4 3089111001  0.2344      0.0000
#>  learner_5 3026275768  0.2500      0.0347
#>  learner_6 3906568364  0.0319      0.0656
#>  learner_7 4253407158 -0.0541      0.0000
#>       nnls 2863776446  0.2903          NA
#> 
#>   D1_X:
#>    learner   mspe      r2 weight_nnls
#>  learner_1 0.1735  0.1020      0.0673
#>  learner_2 0.1717  0.1114      0.3660
#>  learner_3 0.1715  0.1125      0.0000
#>  learner_4 0.1718  0.1106      0.1588
#>  learner_5 0.1714  0.1127      0.3949
#>  learner_6 0.1856  0.0394      0.0203
#>  learner_7 0.2037 -0.0544      0.0000
#>       nnls 0.1707  0.1162          NA
#> 
#> Note: Ensemble MSPE and R2 for short-stacking rely on full-sample weights
#>        and represent in-sample fit over cross-fitted base predictions.

Advanced: Rapid Refits with the Passthrough API

Estimating multiple machine learners via cross-fitting can be computationally expensive. Suppose that, after inspecting stacking_fit above, you additionally want to see the results had you used the singlebest ensemble type instead of nnls.

Rather than waiting for all base learners to be refit from scratch, ddml allows you to extract the out-of-sample predictions from the initial fit and pass them directly into a new ddml_plm call via the fitted argument.

Because stacking_fit already stores the cross-fitted predictions and the sample splits (in its fitted and splits components), we can bypass the costly cross-fitting step entirely.

# Recompute the structural parameter instantly using 'singlebest'
stacking_refit <- ddml_plm(
  y = y, D = D, X = X_c,
  learners = learners,
  learners_DX = learners_DX,
  ensemble_type = "singlebest",
  sample_folds = 5,
  fitted = stacking_fit$fitted, # Pass the predictions directly
  splits = stacking_fit$splits, # Pass the splits used directly
  silent = TRUE
)
summary(stacking_refit)
#> DDML estimation: Partially Linear Model 
#> Obs: 9915   Folds: 5
#> 
#>             Estimate Std. Error z value Pr(>|z|)    
#> D1          13963.03    1492.40    9.36   <2e-16 ***
#> (Intercept)    -5.36     535.14   -0.01     0.99    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bonus: Instrumental Variable Estimation

Instead of considering the partially linear regression model, we may also be interested in an instrumental variable approach. A commonly used instrument for participation in a 401k is 401k eligibility.

The code snippet below estimates the partially linear IV model using the two sets of machine learners introduced in the previous sections. Because the instrument is also binary, we reuse the set of learners that includes the logistic regression wrappers for the reduced form of 401k eligibility on the controls, passed via the learners_ZX argument.

# 401k eligibility as the instrument
Z <- as.matrix(SIPP91$e401)

# PLIV with short-stacking and different sets of learners
stacking_IV_fit <- ddml_pliv(y = y, D = D, Z = Z, X = X_c,
                         learners = learners,
                         learners_DX = learners_DX,
                         learners_ZX = learners_DX,
                         ensemble_type = "nnls",
                         sample_folds = 5,
                         shortstack = TRUE,
                         silent = TRUE)
summary(stacking_IV_fit)
#> DDML estimation: Partially Linear IV Model 
#> Obs: 9915   Folds: 5  Stacking: short-stack
#> 
#>             Estimate Std. Error z value Pr(>|z|)    
#> D1             13151       1888    6.96  3.3e-12 ***
#> (Intercept)     -966        530   -1.83    0.068 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The stacking diagnostics for the IV model now also include the reduced form of the instrument on the controls (Z1_X):

diagnostics(stacking_IV_fit)
#> Stacking diagnostics: Partially Linear IV Model 
#> Obs: 9915 
#> 
#>   y_X:
#>    learner       mspe     r2 weight_nnls
#>  learner_1 3120947021 0.2266      0.0000
#>  learner_2 2864489352 0.2901      0.0000
#>  learner_3 2836320542 0.2971      0.7226
#>  learner_4 3061626328 0.2413      0.0000
#>  learner_5 2914205541 0.2778      0.2624
#>  learner_6 3518126794 0.1281      0.0626
#>  learner_7 3808805316 0.0561      0.0099
#>       nnls 2812746541 0.3029          NA
#> 
#>   D1_X:
#>    learner   mspe      r2 weight_nnls
#>  learner_1 0.1736  0.1013      0.0370
#>  learner_2 0.1718  0.1110      0.2420
#>  learner_3 0.1714  0.1126      0.1068
#>  learner_4 0.1719  0.1102      0.0433
#>  learner_5 0.1709  0.1152      0.5302
#>  learner_6 0.1840  0.0478      0.0469
#>  learner_7 0.2024 -0.0475      0.0000
#>       nnls 0.1705  0.1174          NA
#> 
#>   Z1_X:
#>    learner   mspe     r2 weight_nnls
#>  learner_1 0.2014 0.1375      0.0000
#>  learner_2 0.1973 0.1550      0.3248
#>  learner_3 0.1972 0.1555      0.0000
#>  learner_4 0.1977 0.1533      0.0235
#>  learner_5 0.1959 0.1608      0.6553
#>  learner_6 0.2121 0.0916      0.0000
#>  learner_7 0.2313 0.0092      0.0000
#>       nnls 0.1955 0.1627          NA
#> 
#> Note: Ensemble MSPE and R2 for short-stacking rely on full-sample weights
#>        and represent in-sample fit over cross-fitted base predictions.

References

Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C B, Newey W, Robins J (2018). “Double/debiased machine learning for treatment and structural parameters.” The Econometrics Journal, 21(1), C1-C68.