Neyman Orthogonality in Linear Regression
Introduction
This article builds intuition for Neyman orthogonality—a key component of Double/Debiased Machine Learning (DML). Throughout, we focus on ordinary least squares.
Consider estimating the coefficient $\theta$ on a variable of interest $D$ in a linear model with controls $X$. Introductory econometrics courses typically discuss partialling out $X$ before estimating $\theta$. Two variants of this idea lead to the same point estimate:
- Procedure A (partial only $D$): residualize $D$ with respect to $X$, then regress $Y$ on the residual $\tilde{D}$.
- Procedure B (partial both $Y$ and $D$): residualize both $Y$ and $D$ with respect to $X$, then regress $\tilde{Y}$ on $\tilde{D}$.
It is tempting to take the first-step coefficients from partialling out $X$ from $D$ and $Y$ as given, and compute standard errors only for the second-step regression coefficient. As it turns out, this naive approach results in valid inference in Procedure B, but is incorrect for Procedure A. Here, we aim to explain why one procedure is immune to first-step estimation error while the other is not—a property conveniently summarized by the notion of Neyman orthogonality.
The article covers:
- Setting up the two procedures and comparing their output to full OLS
- Deriving how first-step estimation error propagates—or doesn’t—into inference about $\theta$
- Introducing Neyman orthogonality and its implications
- Discussing the bridge from OLS to nonparametric/ML first steps and ddml_plm
Two-Step Procedures with Identical Coefficient Estimates
To fix ideas, we generate data from a linear model with a single control:

$$Y = \theta_0 D + \beta_0 X + U, \qquad D = \alpha X + V,$$

where $X$, $V$, and $U$ are independent standard normal draws and $\alpha \neq 0$ so that $X$ is correlated with both $D$ and $Y$.
library(ddml)
set.seed(48105)
n <- 500
theta_0 <- 1
beta_0 <- 2
alpha <- 0.8
X <- rnorm(n)
V <- rnorm(n)
D <- alpha * X + V
U <- rnorm(n)
Y <- theta_0 * D + beta_0 * X + U
Procedure A: Residualize Only $D$
For our first procedure, we partial out $X$ from $D$ using lm() and resid(), then regress $Y$ on the residual $\tilde{D}$:
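# First step: residualize D with respect to X
D_tilde <- resid(lm(D ~ X))
# Second step: regress Y on the residualized D
fit_A <- lm(Y ~ D_tilde)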
Procedure B: Residualize Both $Y$ and $D$
For our second procedure, we partial out $X$ from both $Y$ and $D$, then regress the residuals on each other (again using lm() and resid()):
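# First step: residualize both Y and D with respect to X
Y_tilde <- resid(lm(Y ~ X))
D_tilde <- resid(lm(D ~ X))
# Second step: regress the residualized Y on the residualized D (no intercept)
fit_B <- lm(Y_tilde ~ D_tilde - 1)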
Comparing to Full OLS
It is useful to compare both two-step procedures with conventional multiple regression.
library(sandwich)
library(lmtest)
fit_full <- lm(Y ~ D + X)
theta_A <- coef(fit_A)["D_tilde"]
theta_B <- coef(fit_B)["D_tilde"]
theta_full <- coef(fit_full)["D"]
se_A <- sqrt(diag(vcovHC(fit_A, type = "HC0")))["D_tilde"]
se_B <- sqrt(diag(vcovHC(fit_B, type = "HC0")))["D_tilde"]
se_full <- sqrt(diag(vcovHC(fit_full, type = "HC0")))["D"]
cat("Full OLS: lm(Y ~ D + X)\n")
#> Full OLS: lm(Y ~ D + X)
cat(" theta_hat:", round(theta_full, 4),
" SE:", round(se_full, 4), "\n\n")
#> theta_hat: 0.9988 SE: 0.047
cat("Procedure A: lm(Y ~ D_tilde)\n")
#> Procedure A: lm(Y ~ D_tilde)
cat(" theta_hat:", round(theta_A, 4),
" SE:", round(se_A, 4), "\n\n")
#> theta_hat: 0.9988 SE: 0.1188
cat("Procedure B: lm(Y_tilde ~ D_tilde - 1)\n")
#> Procedure B: lm(Y_tilde ~ D_tilde - 1)
cat(" theta_hat:", round(theta_B, 4),
" SE:", round(se_B, 4), "\n")
#> theta_hat: 0.9988 SE: 0.047
All three point estimates are identical. This is no accident: $\tilde{D}$ is orthogonal to $X$ by construction, so omitting $X$ does not bias the coefficient on $\tilde{D}$.
But the standard errors tell a different story. Procedure B matches full OLS exactly—both point estimate and standard error. Procedure A gives a standard error that is far too large.
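Which of these reflects the actual sampling variability? Since Procedures A and B (and full OLS) produce the identical point estimate, they share the same sampling distribution. A quick Monte Carlo check (a sketch with an arbitrary 1,000 replications) makes this concrete:
# Redraw the data and re-estimate theta many times; A and B share the same
# point estimate, so they share the same sampling variability.
theta_sims <- replicate(1000, {
  X <- rnorm(n); V <- rnorm(n); U <- rnorm(n)
  D <- alpha * X + V
  Y <- theta_0 * D + beta_0 * X + U
  D_tilde <- resid(lm(D ~ X))
  coef(lm(Y ~ D_tilde))["D_tilde"]
})
sd(theta_sims)  # close to se_full and se_B, far below se_A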
In both procedures, the software computes the sandwich standard error treating first-step coefficients as fixed. If both procedures ignore first-step estimation uncertainty, why does Procedure A fail while Procedure B succeeds?
The First-Order Impact of Nuisance Estimation
Estimation of $\theta$ in each procedure is based on setting the sample average of a score function $\psi(W; \theta, \eta)$ to zero, where $W = (Y, D, X)$ and $\eta$ denotes nuisance parameters.
For Procedure A, the score is

$$\psi^A(W; \theta, \gamma) = \tilde{D}\,\bigl(Y - \tilde{D}\,\theta\bigr),$$

where $\tilde{D} = D - \gamma X$ and the nuisance parameter is $\gamma$ (the coefficient from the regression of $D$ on $X$).
For Procedure B, the score is

$$\psi^B(W; \theta, \gamma, \delta) = \tilde{D}\,\bigl(\tilde{Y} - \tilde{D}\,\theta\bigr),$$

where $\tilde{D} = D - \gamma X$, $\tilde{Y} = Y - \delta X$, and the nuisance parameter is $\eta = (\gamma, \delta)$ (the coefficients from the regressions of $D$ and $Y$ on $X$, respectively).
Given the score, software computes the sandwich standard error by treating $\eta$ as known:

$$\widehat{\mathrm{se}}(\hat{\theta}) = \sqrt{\frac{1}{n}\;\hat{J}^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}\hat{\psi}_i^{\,2}\right)\hat{J}^{-1}},$$

where $\hat{J} = \frac{1}{n}\sum_{i=1}^{n}\partial_\theta \psi_i = -\frac{1}{n}\sum_{i=1}^{n}\tilde{D}_i^{\,2}$ is the sample Jacobian. Both procedures share the same Jacobian $\hat{J}$, so the SE difference traces entirely to the squared scores in the numerator. At $\theta = \theta_0$, Procedure A’s score residual $Y - \tilde{D}\theta_0$ still contains terms involving $X$, inflating the numerator. Procedure B’s score residual $\tilde{Y} - \tilde{D}\theta_0 = U$ contains only the structural error—hence its smaller SE.
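We can recompute these naive sandwich standard errors by hand from the scores (a short sketch; J_hat, psi_A, and psi_B are ad hoc names):
# Shared Jacobian (up to sign); only the squared scores differ across procedures.
J_hat <- mean(D_tilde^2)
psi_A <- D_tilde * resid(fit_A)  # lm residual; still contains X-related variation
psi_B <- D_tilde * resid(fit_B)  # lm residual; essentially the structural error U
c(A = sqrt(mean(psi_A^2) / J_hat^2 / n),
  B = sqrt(mean(psi_B^2) / J_hat^2 / n))
# These reproduce se_A and se_B from vcovHC(type = "HC0") above.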
Whether the asymptotic distribution of $\hat{\theta}$ depends on the estimated nuisance parameter $\hat{\eta}$ can be assessed via a standard Taylor expansion of the sample moment around the true parameters (see, e.g., Ahrens et al., 2025):

$$\frac{1}{n}\sum_{i=1}^{n}\psi(W_i; \theta_0, \hat{\eta}) \approx \frac{1}{n}\sum_{i=1}^{n}\psi(W_i; \theta_0, \eta_0) + \underbrace{\partial_\eta\,\mathrm{E}\bigl[\psi(W; \theta_0, \eta)\bigr]\Big|_{\eta = \eta_0}}_{J_\eta}\,(\hat{\eta} - \eta_0)$$

plus higher order terms. The term $J_\eta\,(\hat{\eta} - \eta_0)$ captures the first-order impact of estimating the nuisance parameter $\eta$. If $J_\eta$ vanishes, inference about $\theta$ can proceed as if $\eta_0$ were known.
Procedure A: $J_\eta$ Does Not Vanish
Procedure A’s score depends on the nuisance parameter $\gamma$ through $\tilde{D} = D - \gamma X$. The pathwise derivative of the expected score with respect to the nuisance parameter is:

$$\partial_\gamma\,\mathrm{E}\bigl[\psi^A(W; \theta_0, \gamma)\bigr]\Big|_{\gamma = \gamma_0} = -\mathrm{E}\bigl[X\,(Y - \tilde{D}\theta_0)\bigr] = -\delta_0\,\mathrm{E}[X^2],$$

where $\delta_0$ is the coefficient from the regression of $Y$ on $X$ (the remaining term $\theta_0\,\mathrm{E}[X\tilde{D}]$ drops out because $\mathrm{E}[X\tilde{D}] = 0$ at $\gamma = \gamma_0$). This is non-zero whenever $Y$ is related to $X$.
As a consequence, estimation error in $\hat{\gamma}$ propagates directly into $\hat{\theta}$. The term $J_\eta\,(\hat{\eta} - \eta_0)$ does not vanish, and the naive standard error—which ignores the variability of $\hat{\gamma}$—is wrong.
Procedure B: $J_\eta$ Vanishes
Procedure B’s score depends on two nuisance parameters: $\gamma$ (from the regression of $D$ on $X$) and $\delta$ (from the regression of $Y$ on $X$). The pathwise derivatives of the expected score are:

$$\partial_\delta\,\mathrm{E}\bigl[\psi^B(W; \theta_0, \gamma_0, \delta)\bigr]\Big|_{\delta = \delta_0} = -\mathrm{E}[X\tilde{D}] = 0,$$

because $\mathrm{E}[X\tilde{D}] = 0$ by the projection of $D$ on $X$, and

$$\partial_\gamma\,\mathrm{E}\bigl[\psi^B(W; \theta_0, \gamma, \delta_0)\bigr]\Big|_{\gamma = \gamma_0} = -\mathrm{E}\bigl[X\,(\tilde{Y} - \tilde{D}\theta_0)\bigr] + \theta_0\,\mathrm{E}[X\tilde{D}] = 0,$$

because $\tilde{Y} - \tilde{D}\theta_0 = U$ and both $U$ and $\tilde{D}$ are orthogonal to $X$.
Both pathwise derivatives vanish. The term $J_\eta\,(\hat{\eta} - \eta_0)$ in the expansion is zero, so first-step estimation of $\gamma$ and $\delta$ does not affect the asymptotic distribution of $\hat{\theta}$—even to first order. The naive standard error is correct as-is.
Numerical Verification
We can verify the pathwise derivatives (up to sign) numerically:
J_gamma_A <- mean(X * (Y - D_tilde * theta_0))
J_gamma_B <- mean(X * (Y_tilde - D_tilde * theta_0))
cat("Pathwise derivatives (should vanish for B only):\n")
#> Pathwise derivatives (should vanish for B only):
cat(" Procedure A: E[X*(Y - Dt*theta)] =",
round(J_gamma_A, 4), "\n")
#> Procedure A: E[X*(Y - Dt*theta)] = 2.672
cat(" Procedure B: E[X*(Yt - Dt*theta)] =",
round(J_gamma_B, 4), "\n")
#> Procedure B: E[X*(Yt - Dt*theta)] = 0
Neyman Orthogonality
The property that makes Procedure B’s standard error correct is called Neyman orthogonality. Formally, a score $\psi(W; \theta, \eta)$ is Neyman orthogonal at $(\theta_0, \eta_0)$ if small perturbations of the nuisance parameter $\eta$ away from $\eta_0$ do not create first-order changes in the expected score:

$$\partial_\eta\,\mathrm{E}\bigl[\psi(W; \theta_0, \eta)\bigr]\Big|_{\eta = \eta_0} = 0.$$
Procedure B’s score satisfies this condition: the expected score is insensitive to perturbations in both $\gamma$ and $\delta$. As a result, replacing the true nuisance parameters with first-step estimates does not distort inference about $\theta$.
Procedure A’s score does not satisfy Neyman orthogonality: it is sensitive to perturbations in $\gamma$, and the standard error computed by treating $\hat{\gamma}$ as known is incorrect.
An important practical implication is that estimators based on Neyman orthogonal scores yield inference about $\theta$ that does not depend on the detailed statistical properties of the nuisance estimator $\hat{\eta}$. This is useful even in classical low-dimensional settings, where it avoids cumbersome variance adjustments to account for first-step estimation. It becomes particularly important when modern flexible methods—including machine learning—are used for nuisance estimation, as only coarse convergence rates are currently available for many promising ML methods. By alleviating the first-order impact of nuisance estimation, Neyman orthogonality makes it possible to combine such methods with standard asymptotic approximations.
From OLS to Machine Learning
In our linear example, both nuisance parameters—$\gamma$ and $\delta$—are estimated by OLS. Partialling out $X$ from both $Y$ and $D$ yields the same estimate and standard error as full OLS of $Y$ on $D$ and $X$.
The partially linear regression model generalizes this. Instead of assuming linearity, we allow $X$ to enter flexibly:

$$Y = \theta_0 D + g_0(X) + U, \qquad \mathrm{E}[U \mid D, X] = 0,$$

where $g_0$ is an unknown, potentially complex, function of $X$. The nuisance parameters become $\mathrm{E}[Y \mid X]$ and $\mathrm{E}[D \mid X]$—conditional expectation functions that can be estimated by machine learning.
The Neyman orthogonal score for the partially linear model is

$$\psi(W; \theta, \eta) = \bigl(D - \mathrm{E}[D \mid X]\bigr)\Bigl(Y - \mathrm{E}[Y \mid X] - \bigl(D - \mathrm{E}[D \mid X]\bigr)\theta\Bigr),$$

which is the flexible analog of Procedure B’s partialling-out score. As in our linear example, a naive score that adjusts only $D$ (analogous to Procedure A) is not Neyman orthogonal and leads to invalid inference.
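Mirroring the Procedure B calculation, one can check that perturbing either conditional expectation leaves the expected score unchanged to first order. Writing $m_0(X) = \mathrm{E}[D \mid X]$ and $\ell_0(X) = \mathrm{E}[Y \mid X]$ (shorthand introduced here for brevity) and perturbing them in directions $\delta_m$ and $\delta_\ell$:

$$\partial_r\,\mathrm{E}\bigl[\psi(W; \theta_0, \ell_0 + r\,\delta_\ell, m_0)\bigr]\Big|_{r=0} = -\mathrm{E}\bigl[(D - m_0(X))\,\delta_\ell(X)\bigr] = 0,$$

because $\mathrm{E}[\,D - m_0(X) \mid X\,] = 0$, and

$$\partial_r\,\mathrm{E}\bigl[\psi(W; \theta_0, \ell_0, m_0 + r\,\delta_m)\bigr]\Big|_{r=0} = -\mathrm{E}\bigl[\delta_m(X)\,U\bigr] + \theta_0\,\mathrm{E}\bigl[\delta_m(X)\,(D - m_0(X))\bigr] = 0,$$

because $\mathrm{E}[U \mid X] = 0$ and $\mathrm{E}[\,D - m_0(X) \mid X\,] = 0$. These are the exact analogs of the two vanishing derivatives for Procedure B.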
DML combines the Neyman orthogonal score with cross-fitting—a form of sample splitting that addresses a second source of bias (overfitting bias) arising when the nuisance estimator and the observations used in the moment condition are statistically dependent. Together, these two ingredients form the core of the DML framework.
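To make the recipe concrete, here is a minimal sketch of 2-fold cross-fitting for the partialling-out score, using lm() as a stand-in for an arbitrary first-step learner (the fold count and the names fold, D_res, Y_res, theta_cf are illustrative only):
# Minimal 2-fold cross-fitting sketch; in practice the first step would be an
# ML learner rather than lm().
K <- 2
fold <- sample(rep(1:K, length.out = n))
D_res <- Y_res <- numeric(n)
for (k in 1:K) {
  train <- fold != k
  test  <- fold == k
  # Fit the nuisance regressions on the training folds only...
  fit_D <- lm(D ~ X, subset = train)
  fit_Y <- lm(Y ~ X, subset = train)
  # ...and residualize the held-out fold with out-of-fold predictions.
  D_res[test] <- D[test] - predict(fit_D, newdata = data.frame(X = X[test]))
  Y_res[test] <- Y[test] - predict(fit_Y, newdata = data.frame(X = X[test]))
}
theta_cf <- sum(D_res * Y_res) / sum(D_res^2)
theta_cf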
The ddml package implements this approach via ddml_plm:
# Example: DML with gradient boosting (not run)
fit_dml <- ddml_plm(Y, D, X,
learners = list(what = mdl_xgboost),
sample_folds = 5)
summary(fit_dml)
See ?ddml_plm for details and vignette("ddml") for a full introduction.
References
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C B, Newey W, Robins J (2018). “Double/debiased machine learning for treatment and structural parameters.” The Econometrics Journal, 21(1), C1-C68.
Ahrens A, Chernozhukov V, Hansen C B, Kozbur D, Schaffer M E, Wiemann T (2025). “An Introduction to Double/Debiased Machine Learning.” Forthcoming.