# Example on the Effect of 401k Participation

Source: `vignettes/articles/example_401k.Rmd`

## Introduction

This article illustrates key features of `ddml` by estimating the effect of 401k participation on retirement savings. In particular, the example showcases the following features:

- Estimation using a single machine learner
- Estimation using multiple machine learners via short-stacking
- Estimation using multiple machine learners *and* different sets of control variables
- Estimation using different sets of machine learners for continuous and binary outcome and treatment variables

After introducing the data, the article illustrates each of these features in turn.

## Data construction

The 401k data considered here are from the 1991 Survey of Income and Program Participation (SIPP). We use the data from the application in Chernozhukov et al. (2018).

```
SIPP91 <- readRDS("data/SIPP91.rds")
```

The endogenous variable of interest is an indicator equal to 1 if the person participates in a 401k. We are interested in estimating its effect on net financial assets and control for various individual characteristics (e.g., age and income).
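The rendered snippets below use `y`, `D`, and `X` without showing their construction. Under the column naming of the Chernozhukov et al. (2018) SIPP extract, they could be built along the following lines; the outcome and treatment column names (`net_tfa`, `p401`) and the list of indicator controls are assumptions here, while `age`, `inc`, `educ`, and `fsize` are the non-binary controls used later in the article:

```
# load the package and construct the outcome, treatment, and controls
# (column names other than age, inc, educ, fsize are assumed)
library(ddml)

y <- as.matrix(SIPP91$net_tfa)  # net total financial assets
D <- as.matrix(SIPP91$p401)     # 401k participation indicator
X <- as.matrix(SIPP91[, c("age", "inc", "educ", "fsize",
                          "marr", "twoearn", "db", "pira", "hown")])
```

The ordering places the four non-binary controls in the first columns, which matters for the index sets constructed in a later section.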

## Estimation using a Single Machine Learner

The simplest double/debiased machine learning estimator for the
causal effect of 401k participation on financial assets considers a
single machine learner to (nonparametrically) control for individual
characteristics. We consider gradient boosting in the code snippet below
(see `?ddml_plm` and `?mdl_xgboost` for details).

A comparison with a simple multiple linear regression estimate indicates a difference in the point estimates of about one standard deviation, suggesting that there are at least some non-linearities present in the data.

```
# PLM with gradient boosting
xgboost_fit <- ddml_plm(y = y, D = D, X = X,
                        learners = list(what = mdl_xgboost,
                                        args = list(nrounds = 300)),
                        sample_folds = 10,
                        silent = TRUE)
round(summary(xgboost_fit)[2, , 1], 2)
#> PLM estimation results:
#> 
#> Estimate Std. Error t value Pr(>|t|)
#> 13915.72    1603.56    8.68     0.00

# Comparison to multiple linear regression
lm_fit <- lm(y ~ D + X)
round(summary(lm_fit)$coefficients[2, ], 2)
#> Estimate Std. Error t value Pr(>|t|)
#> 11600.89    1345.12    8.62     0.00
```

## Estimation using Multiple Machine Learners

Instead of considering just a single – and possibly ill-tuned –
machine learner to control for covariates, it is often helpful to
combine a diverse set of learners. The code snippet below considers a
set of six learners (see `?ols`, `?mdl_glmnet`, and `?mdl_ranger` for details). The learners are combined to minimize out-of-sample mean squared prediction error in the reduced forms of the outcome and the endogenous variable on the controls, respectively. To reduce the computational burden, we consider short-stacking (see also `vignette("stacking")`).

The estimate is nearly identical to the estimate using just gradient boosting.

```
# Specify set of learners
learners <- list(list(fun = ols),
                 list(fun = mdl_glmnet),
                 list(fun = mdl_ranger,
                      args = list(num.trees = 250,
                                  max.depth = 4)),
                 list(fun = mdl_ranger,
                      args = list(num.trees = 250,
                                  max.depth = 12)),
                 list(fun = mdl_xgboost,
                      args = list(nrounds = 100)),
                 list(fun = mdl_xgboost,
                      args = list(nrounds = 300)))

# PLM with short-stacking
stacking_fit <- ddml_plm(y = y, D = D, X = X,
                         learners = learners,
                         ensemble_type = "nnls",
                         sample_folds = 10,
                         shortstack = TRUE,
                         silent = TRUE)
round(summary(stacking_fit)[2, , 1], 2)
#> PLM estimation results:
#> 
#> Estimate Std. Error t value Pr(>|t|)
#> 13636.00    1526.01    8.94     0.00
```

To better understand which machine learners contribute to the final estimate, we can take a look at the assigned stacking weights. Each row corresponds to a base learner (in the order of specification) while the columns indicate the corresponding reduced form. The weights show that the two random forest estimators contribute the most in both reduced forms, followed by the lasso estimator. Neither gradient boosting nor linear regression is assigned substantial weight.

```
sapply(stacking_fit$weights, round, 4)
#>         y_X   D1_X
#> [1,] 0.0048 0.0197
#> [2,] 0.2113 0.2817
#> [3,] 0.2973 0.3400
#> [4,] 0.5770 0.3504
#> [5,] 0.0000 0.0000
#> [6,] 0.0000 0.0238
```
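Because the rows of the weight matrix are only numbered, it can help to attach descriptive row names when reading it; the labels below are not produced by `ddml` but simply mirror the order in which the six learners were specified:

```
# label the weight matrix rows by base learner (labels chosen here
# to match the specification order of the learners above)
w <- sapply(stacking_fit$weights, round, 4)
rownames(w) <- c("ols", "lasso", "ranger (depth 4)", "ranger (depth 12)",
                 "xgboost (100 rounds)", "xgboost (300 rounds)")
w
```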

## Estimation with Different Sets of Control Variables

The large weight assigned to random forests suggests that non-linearities and potential interactions may be important. While random forests and gradient boosting adaptively construct such non-linearities, linear methods such as the lasso need these variables to be pre-specified to capture more complex patterns in the data. A better approach would thus consider different sets of variables for different machine learners:

- linear regression and lasso with an extended set of control variables
- random forests and gradient boosting with the original set of control variables

We extend the control variables using a simple third-order polynomial expansion of the non-binary control variables (with interactions). These are combined with the original control variables. In addition, we construct two sets of indices that reference the two sets of controls.

```
X_series <- poly(X[, c("age", "inc", "educ", "fsize")], degree = 3)
X_c <- cbind(X, X_series)
X_baseline <- 1:dim(X)[2] # baseline variables and indicators
X_extended <- 5:dim(X_c)[2] # indicators & series expansion
```

`ddml` supports passing different sets of control variables to different base learners via the `assign_X` argument, which takes the column indices of the desired variables in the control matrix.

```
# Specify base learners with different sets of controls
learners <- list(list(fun = ols,
                      assign_X = X_baseline),
                 list(fun = ols,
                      assign_X = X_extended),
                 list(fun = mdl_glmnet,
                      assign_X = X_extended),
                 list(fun = mdl_ranger,
                      args = list(num.trees = 250,
                                  max.depth = 4),
                      assign_X = X_baseline),
                 list(fun = mdl_ranger,
                      args = list(num.trees = 250,
                                  max.depth = 12),
                      assign_X = X_baseline),
                 list(fun = mdl_xgboost,
                      args = list(nrounds = 100),
                      assign_X = X_baseline),
                 list(fun = mdl_xgboost,
                      args = list(nrounds = 300),
                      assign_X = X_baseline))

# PLM with short-stacking
stacking_fit <- ddml_plm(y = y, D = D, X = X_c,
                         learners = learners,
                         ensemble_type = "nnls",
                         sample_folds = 10,
                         shortstack = TRUE,
                         silent = TRUE)
round(summary(stacking_fit)[2, , 1], 2)
#> PLM estimation results:
#> 
#> Estimate Std. Error t value Pr(>|t|)
#> 14363.22    1500.04    9.58     0.00
```

To assess whether the additional effort in expanding the control variables had an effect on the composition of the reduced form estimates, we can again inspect the stacking weights. Indeed, the new base learner that combines lasso with the polynomial expansion of the controls is assigned the largest weight, followed by the random forest estimators.

```
sapply(stacking_fit$weights, round, 4)
#>         y_X   D1_X
#> [1,] 0.0176 0.0000
#> [2,] 0.0000 0.1219
#> [3,] 0.7368 0.4056
#> [4,] 0.0000 0.1693
#> [5,] 0.2702 0.2613
#> [6,] 0.0022 0.0522
#> [7,] 0.0000 0.0048
```

## Estimation with Continuous and Binary Outcome and Treatment Variables

Thus far, we have considered the same set of base learners for the reduced form of the outcome on the controls and the reduced form of the endogenous variable on the controls. In contrast to net financial assets, which is (approximately) continuous, participating in a 401k is a binary indicator. It may thus be sensible to consider reduced form estimators with support on the unit interval for the latter reduced form. For example, we could consider logistic regression instead of linear regression.

Equipped with the included `mdl_glm` wrapper for generalized linear models (see also `?mdl_glm`), we now specify a second set of base learners for the reduced form of the binary endogenous variable of interest on the controls. As in the previous section, we consider different sets of control variables as well. When estimating the double/debiased machine learning estimator, we pass the second set of base learners via the `learners_DX` argument.

```
# Specify an additional set of learners for the reduced form E[D|X]
learners_DX <- list(list(fun = mdl_glm,
                         args = list(family = "binomial"),
                         assign_X = X_baseline),
                    list(fun = mdl_glm,
                         args = list(family = "binomial"),
                         assign_X = X_extended),
                    list(fun = mdl_glmnet,
                         args = list(family = "binomial"),
                         assign_X = X_extended),
                    list(fun = mdl_ranger,
                         args = list(num.trees = 250,
                                     max.depth = 4),
                         assign_X = X_baseline),
                    list(fun = mdl_ranger,
                         args = list(num.trees = 250,
                                     max.depth = 12),
                         assign_X = X_baseline),
                    list(fun = mdl_xgboost,
                         args = list(nrounds = 100),
                         assign_X = X_baseline),
                    list(fun = mdl_xgboost,
                         args = list(nrounds = 300),
                         assign_X = X_baseline))

# PLM with short-stacking and different sets of learners
stacking_fit <- ddml_plm(y = y, D = D, X = X_c,
                         learners = learners,
                         learners_DX = learners_DX,
                         ensemble_type = "nnls",
                         sample_folds = 10,
                         shortstack = TRUE,
                         silent = TRUE)
round(summary(stacking_fit)[2, , 1], 2)
#> PLM estimation results:
#> 
#> Estimate Std. Error t value Pr(>|t|)
#> 14097.13    1483.65    9.50     0.00
```

The weights associated with the new base learners are now slightly larger:

```
sapply(stacking_fit$weights, round, 4)
#>         y_X   D1_X
#> [1,] 0.0132 0.0627
#> [2,] 0.0000 0.2663
#> [3,] 0.7411 0.0086
#> [4,] 0.0000 0.4221
#> [5,] 0.2690 0.2552
#> [6,] 0.0000 0.0000
#> [7,] 0.0000 0.0302
```

## Bonus: Instrumental Variable Estimation

Instead of considering the partially linear regression model, we may
also be interested in an instrumental variable approach. A commonly used
instrument for participation in a 401k is 401k *eligibility*.

The code snippet below estimates the partially linear IV model using the two sets of machine learners introduced in the previous sections. Because the instrument is binary, we use the generalized linear model learners to estimate the reduced form of 401k eligibility on the controls.

```
# 401k eligibility as the instrument
Z <- as.matrix(SIPP91$e401)

# PLIV with short-stacking and different sets of learners
stacking_IV_fit <- ddml_pliv(y = y, D = D, Z = Z, X = X_c,
                             learners = learners,
                             learners_DX = learners_DX,
                             learners_ZX = learners_DX,
                             ensemble_type = "nnls",
                             sample_folds = 10,
                             shortstack = TRUE,
                             silent = TRUE)
round(summary(stacking_IV_fit)[2, , 1], 2)
#> PLIV estimation results:
#> 
#> Estimate Std. Error t value Pr(>|t|)
#> 12986.88    1871.81    6.94     0.00

# Stacking weights associated with each base learner
sapply(stacking_IV_fit$weights, round, 4)
#>         y_X   D1_X   Z1_X
#> [1,] 0.0000 0.0647 0.0000
#> [2,] 0.0000 0.2271 0.2414
#> [3,] 0.7474 0.0080 0.0157
#> [4,] 0.0000 0.4030 0.5337
#> [5,] 0.2953 0.3036 0.2501
#> [6,] 0.0000 0.0351 0.0000
#> [7,] 0.0000 0.0000 0.0000
```