Introduction
ddml supports sparse matrices from the Matrix package by default. This article illustrates double/debiased machine learning estimation with sparse matrices using the prominent study of Angrist and Krueger (1991) (AK91, hereafter) on returns to education.
Data Construction
One of the coefficients of interest in AK91 is the effect of years of education on the log weekly wage of American men born between 1930 and 1939. The authors instrument for years of schooling with quarter-of-birth indicators (QOB). This strategy is motivated by mandatory attendance laws that determine the age at which children may drop out of school. Since children born in later quarters reach the legal dropout age only after completing more years of schooling, a higher QOB should imply more years of schooling.
Although the data is quite large, a need for sparse matrices only arises when the QOB instruments are interacted with other variables. Popular control variables are place of birth (POB) and year of birth (YOB). Depending on whether these are separately or jointly interacted with QOB, this results in 180 and 1530 instruments, respectively. The code snippet below constructs these two sets of instruments as well as the matrix of controls. We use sparse.model.matrix() to construct sparse matrix objects as supported by the Matrix package.
# Load required packages
library(ddml)
library(Matrix)
# Load data
AK91 <- readRDS("data/AK91.rds")
# Obtain instrument matrix for 180 IVs
dat_IV180 <- sparse.model.matrix(~ YOB*POB + QOB*YOB + QOB*POB, data = AK91)
which_instrument <- which(grepl("QOB", colnames(dat_IV180), fixed = TRUE))
Z_IV180 <- dat_IV180[, which_instrument]
# Obtain instrument matrix for 1530 IVs
dat_IV1530 <- sparse.model.matrix(~ QOB * YOB * POB, data = AK91)
which_instrument <- which(grepl("QOB", colnames(dat_IV1530), fixed = TRUE))
Z_IV1530 <- dat_IV1530[, which_instrument]
# Obtain control matrix
X <- dat_IV1530[, -which_instrument]
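As a quick sanity check of the instrument logic described above, we can tabulate average years of completed schooling by quarter of birth. This is a minimal sketch that only assumes the EDUC and QOB columns already used in this article; under the AK91 argument, later quarters should show slightly more schooling on average.
# Average years of schooling by quarter of birth (sketch): under the AK91
# logic, later quarters of birth should show slightly more schooling on average
aggregate(EDUC ~ QOB, data = AK91, FUN = mean)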
To illustrate the need for sparse matrices when working with the AK91 data, consider the difference in memory needed to hold the control matrix alone. The sparse version requires only 34.3Mb, while the dense version requires 1.3Gb (= 1300Mb)! The instrument matrix with 1530 instruments requires even more space and, in dense form, regularly fails to load into memory.
# Memory needed for the sparse control matrix
format(object.size(X), units = "Mb")
#> [1] "34.3 Mb"
# Memory needed for the dense control matrix
format(object.size(as.matrix(X)), units = "Mb")
#> Warning in asMethod(object): sparse->dense coercion: allocating vector of size
#> 1.3 GiB
#> [1] "1302.3 Mb"
Estimation with Sparse Matrices
The syntax for estimation with sparse matrices in ddml is exactly the same as for estimation with dense matrices. In the following, we replicate AK91 using the simplified sets of instruments and controls constructed above.
We begin by estimating the returns to schooling using the set of 180 instruments. Following the convention in the returns-to-education literature, our estimator selects only among the instruments but does not regularize the coefficients corresponding to the control variables. This is achieved by formulating different base learners for the first and second stage reduced forms (see ?ddml_fpliv), and by setting the penalty.factor of the control variables to zero (see ?mdl_glmnet).
learners_XZ <- list(list(fun = ols),
                    list(fun = mdl_glmnet,
                         args = list(cv = FALSE,
                                     penalty.factor = c(rep(0, 510),
                                                        rep(1, 180)))))
stacking_180IV_fit <- ddml_fpliv(y = AK91$LWKLYWGE, D = AK91$EDUC,
                                 Z = Z_IV180, X = X,
                                 learners = list(list(fun = ols)),
                                 learners_DX = list(list(fun = ols)),
                                 learners_DXZ = learners_XZ,
                                 ensemble_type = c("nnls1"),
                                 shortstack = TRUE,
                                 sample_folds = 2,
                                 silent = TRUE)
summary(stacking_180IV_fit, type = 'HC1')
#> FPLIV estimation results:
#>
#> , , nnls1
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6.14e-05 0.00113 -0.0542 9.57e-01
#> D_r 1.12e-01 0.02145 5.2258 1.73e-07
The exercise can be repeated with the larger set of 1530 instruments. Without support for sparse matrices, this estimation step would require a prohibitively large amount of memory.
learners_XZ <- list(list(fun = ols),
                    list(fun = mdl_glmnet,
                         args = list(cv = FALSE,
                                     penalty.factor = c(rep(0, 510),
                                                        rep(1, 1530)))))
stacking_1530IV_fit <- ddml_fpliv(y = AK91$LWKLYWGE, D = AK91$EDUC,
                                  Z = Z_IV1530, X = X,
                                  learners = list(list(fun = ols)),
                                  learners_DX = list(list(fun = ols)),
                                  learners_DXZ = learners_XZ,
                                  ensemble_type = c("nnls1"),
                                  shortstack = TRUE,
                                  sample_folds = 2,
                                  silent = TRUE)
summary(stacking_1530IV_fit, type = 'HC1')
#> FPLIV estimation results:
#>
#> , , nnls1
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.98e-05 0.00111 0.045 0.9641
#> D_r 6.30e-02 0.03574 1.764 0.0778
The coefficients corresponding to the two sets of instruments are quite different. Leveraging the ddml functionality that allows different sets of input variables to be assigned to each learner, we construct a stacking estimator that considers both first-stage regressions simultaneously.
# Combine the two instrument sets and construct column indices for each set
Z_c <- cbind(Z_IV180, Z_IV1530); colnames(Z_c) <- 1:(180 + 1530)
set_IV180 <- 1:180; set_IV1530 <- 181:(180 + 1530)
learners_XZ <- list(list(fun = ols,
                         assign_Z = set_IV180),
                    list(fun = ols,
                         assign_Z = set_IV1530),
                    list(fun = mdl_glmnet,
                         args = list(cv = FALSE,
                                     penalty.factor = c(rep(0, 510),
                                                        rep(1, 180))),
                         assign_Z = set_IV180),
                    list(fun = mdl_glmnet,
                         args = list(cv = FALSE,
                                     penalty.factor = c(rep(0, 510),
                                                        rep(1, 1530))),
                         assign_Z = set_IV1530))
stacking_fit <- ddml_fpliv(y = AK91$LWKLYWGE, D = AK91$EDUC,
                           Z = Z_c, X = X,
                           learners = list(list(fun = ols)),
                           learners_DX = list(list(fun = ols)),
                           learners_DXZ = learners_XZ,
                           ensemble_type = c("nnls1"),
                           shortstack = TRUE,
                           sample_folds = 2,
                           silent = TRUE)
summary(stacking_fit, type = 'HC1')
#> FPLIV estimation results:
#>
#> , , nnls1
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.000143 0.00115 0.124 9.01e-01
#> D_r 0.124080 0.02012 6.167 6.97e-10
The resulting coefficient is close to the coefficient based on the 180 instruments. The stacking weights confirm that the first-stage estimators using the 180 instruments contribute almost exclusively to the final estimate, suggesting that the expansion to 1530 instruments has little benefit for the OLS- or lasso-based first-stage fits.
round(stacking_fit$weights$D1_XZ, 4)
#> nnls1
#> [1,] 0.7000
#> [2,] 0.0000
#> [3,] 0.2925
#> [4,] 0.0075