
Introduction

This article builds intuition for Neyman orthogonality—a key component of Double/Debiased Machine Learning (DML). Throughout, we focus on ordinary least squares.

Consider estimating the coefficient $\theta_0$ on a variable of interest $D$ in a linear model with controls $X$. Introductory econometrics courses typically discuss partialling out $X$ before estimating $\theta_0$. Two variants of this idea lead to the same point estimate:

  • Procedure A (partial $D$ only): residualize $D$ with respect to $X$, then regress $Y$ on $\tilde{D}$.
  • Procedure B (partial both): residualize both $Y$ and $D$ with respect to $X$, then regress $\tilde{Y}$ on $\tilde{D}$.

It is tempting to take the first-step coefficients from partialling out $X$ from $D$ and $Y$ as given, and compute standard errors only for the second-step regression coefficient. As it turns out, this naive approach results in valid inference in Procedure B, but is incorrect for Procedure A. Here, we aim to explain why one procedure is immune to first-step estimation error while the other is not, a property conveniently summarized by the notion of Neyman orthogonality.

The article covers:

  1. Setting up the two procedures and comparing their output to full OLS
  2. Deriving how first-step estimation error propagates (or doesn't) into inference about $\theta_0$
  3. Introducing Neyman orthogonality and its implications
  4. Discussing the bridge from OLS to nonparametric/ML first steps and ddml_plm

Two-Step Procedures with Identical Coefficient Estimates

To fix ideas, we generate data from a linear model with a single control:

$$Y_i = \theta_0 D_i + \beta_0 X_i + U_i,$$

where $D_i = \alpha X_i + V_i$, so that $X$ is correlated with both $D$ and $Y$.

library(ddml)
set.seed(48105)

n <- 500
theta_0 <- 1   # coefficient of interest
beta_0 <- 2    # effect of the control on Y
alpha <- 0.8   # effect of the control on D

X <- rnorm(n)                       # control
V <- rnorm(n)
D <- alpha * X + V                  # variable of interest, correlated with X
U <- rnorm(n)
Y <- theta_0 * D + beta_0 * X + U   # outcome

Procedure A: Residualize $D$ Only

For our first procedure, we partial out $X$ from $D$, then regress $Y$ on the residual:

D_tilde <- residuals(lm(D ~ X))  # first step: residualize D on X
fit_A <- lm(Y ~ D_tilde)         # second step: regress Y on the residual

Procedure B: Residualize Both $Y$ and $D$

For our second procedure, we partial out $X$ from both $D$ and $Y$, then regress the residuals on each other:

Y_tilde <- residuals(lm(Y ~ X))     # also residualize Y on X
fit_B <- lm(Y_tilde ~ D_tilde - 1)  # regress residual on residual (no intercept)

Comparing to Full OLS

It is useful to compare both two-step procedures with conventional multiple regression.

library(sandwich)
library(lmtest)

fit_full <- lm(Y ~ D + X)

theta_A <- coef(fit_A)["D_tilde"]
theta_B <- coef(fit_B)["D_tilde"]
theta_full <- coef(fit_full)["D"]

se_A <- sqrt(diag(vcovHC(fit_A, type = "HC0")))["D_tilde"]
se_B <- sqrt(diag(vcovHC(fit_B, type = "HC0")))["D_tilde"]
se_full <- sqrt(diag(vcovHC(fit_full, type = "HC0")))["D"]

cat("Full OLS:      lm(Y ~ D + X)\n")
#> Full OLS:      lm(Y ~ D + X)
cat("  theta_hat:", round(theta_full, 4),
    "  SE:", round(se_full, 4), "\n\n")
#>   theta_hat: 0.9988   SE: 0.047

cat("Procedure A:   lm(Y ~ D_tilde)\n")
#> Procedure A:   lm(Y ~ D_tilde)
cat("  theta_hat:", round(theta_A, 4),
    "  SE:", round(se_A, 4), "\n\n")
#>   theta_hat: 0.9988   SE: 0.1188

cat("Procedure B:   lm(Y_tilde ~ D_tilde - 1)\n")
#> Procedure B:   lm(Y_tilde ~ D_tilde - 1)
cat("  theta_hat:", round(theta_B, 4),
    "  SE:", round(se_B, 4), "\n")
#>   theta_hat: 0.9988   SE: 0.047

All three point estimates are identical. This is no accident: $\tilde{D}$ is orthogonal to $X$ by construction, so omitting $X$ does not bias the coefficient on $\tilde{D}$.
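We can verify this orthogonality in the sample directly. The following quick check is illustrative only and not part of the procedure:

# OLS residuals are orthogonal to the regressors by construction, so both
# sums below should be zero up to floating-point error.
sum(X * D_tilde)
sum(D_tilde)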

But the standard errors tell a different story. Procedure B matches full OLS exactly—both point estimate and standard error. Procedure A gives a standard error that is far too large.

In both procedures, the software computes the sandwich standard error treating first-step coefficients as fixed. If both procedures ignore first-step estimation uncertainty, why does Procedure A fail while Procedure B succeeds?

The First-Order Impact of Nuisance Estimation

Estimation of $\theta_0$ in each procedure is based on setting the sample average of a score function to zero, $\frac{1}{n}\sum_{i=1}^n m(W_i; \theta, \eta) = 0$, where $\eta$ denotes nuisance parameters.

For Procedure A, the score is

$$m_A(W_i; \theta, \eta_D) = \tilde{D}_i(Y_i - \tilde{D}_i\theta),$$

where $\tilde{D}_i = D_i - X_i'\eta_D$ and the nuisance parameter is $\eta = \eta_D$ (the coefficient from the regression of $D$ on $X$).

For Procedure B, the score is

$$m_B(W_i; \theta, \eta) = \tilde{D}_i(\tilde{Y}_i - \tilde{D}_i\theta),$$

where $\tilde{Y}_i = Y_i - X_i'\eta_Y$ and the nuisance parameter is $\eta = (\eta_Y, \eta_D)$ (the coefficients from the regressions of $Y$ and $D$ on $X$, respectively).
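Both second-step regressions indeed set their sample scores to zero at the respective point estimates. A quick illustrative check, using the estimates computed above:

# Each average score is numerically zero at the corresponding estimate
# (recall that theta_A and theta_B coincide).
mean(D_tilde * (Y - D_tilde * theta_A))        # Procedure A's score
mean(D_tilde * (Y_tilde - D_tilde * theta_B))  # Procedure B's score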

Given the score, software computes the sandwich standard error by treating $\hat\eta$ as known:

$$\widehat{\text{SE}}(\hat\theta) = \sqrt{\hat{J}^{-2} \cdot \frac{1}{n^2} \sum_{i=1}^n m(W_i;\, \hat\theta,\, \hat\eta)^2}\,,$$

where $\hat{J} = \frac{1}{n}\sum_i \frac{\partial}{\partial\theta} m(W_i;\, \hat\theta,\, \hat\eta)$ is the sample Jacobian. Both procedures share the same Jacobian, $\hat{J} \approx -E[\tilde{D}^2]$, so the difference in standard errors traces entirely to the squared scores in the numerator. At $\theta_0$, Procedure A's score residual $Y_i - \tilde{D}_i\theta_0$ still contains terms involving $X_i$, inflating the numerator. Procedure B's score residual $\tilde{Y}_i - \tilde{D}_i\theta_0 = U_i$ contains only the structural error, hence its smaller SE.
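We can rebuild both naive standard errors from this formula by hand. The computation below is an illustrative check; it plugs the fitted residuals into the squared scores and reproduces the vcovHC values reported above:

# Naive sandwich SEs assembled from the shared Jacobian and squared scores.
J_hat <- -mean(D_tilde^2)           # sample Jacobian, same for A and B
m_A <- D_tilde * residuals(fit_A)   # residual still carries the X-related terms
m_B <- D_tilde * residuals(fit_B)   # residual is essentially U
sqrt(mean(m_A^2) / n) / abs(J_hat)  # matches se_A
sqrt(mean(m_B^2) / n) / abs(J_hat)  # matches se_B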

Whether the asymptotic distribution of $\hat\theta$ depends on $\hat\eta$ can be assessed via a standard Taylor expansion of the sample moment around the true parameters $(\theta_0, \eta_0)$ (see, e.g., Ahrens et al., 2025):

$$\sqrt{n}(\hat\theta - \theta_0) = -J_\theta^{-1}\Biggl[ \underbrace{\frac{1}{\sqrt{n}}\sum_i m(W_i; \theta_0, \eta_0)}_{\text{CLT}} + \underbrace{\frac{1}{\sqrt{n}}\sum_i \frac{\partial}{\partial \eta} m(W_i; \theta_0, \eta_0)\, (\hat\eta - \eta_0)}_{(\star):\ \text{first-order impact of nuisance estimation}} \Biggr]$$

plus higher-order terms. The term $(\star)$ captures the first-order impact of estimating the nuisance parameter $\eta_0$. If $(\star)$ vanishes, inference about $\theta_0$ can proceed as if $\eta_0$ were known.

Procedure A: $(\star)$ Does Not Vanish

Procedure A’s score depends on $\eta_D$ through $\tilde{D}_i = D_i - X_i'\eta_D$. The pathwise derivative of the expected score with respect to the nuisance parameter is:

$$\left.\frac{\partial}{\partial \eta_D'}\, E[m_A(W;\, \theta_0, \eta_D)]\right|_{\eta_{D,0}} = -E[XX']\,\beta_{Y|X},$$

where $\beta_{Y|X} = \theta_0\alpha + \beta_0$ is the coefficient from the regression of $Y$ on $X$. This is non-zero whenever $X$ is related to $Y$.

As a consequence, estimation error in $\hat\eta_D$ propagates directly into $\hat\theta_A$. The term $(\star)$ does not vanish, and the naive standard error, which ignores $(\star)$, is wrong.
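As a quick illustrative sanity check on the derivative above, our simulated design implies $\beta_{Y|X} = \theta_0\alpha + \beta_0 = 2.8$, which the regression of $Y$ on $X$ recovers approximately:

# Population value is theta_0 * alpha + beta_0 = 2.8 in this design.
coef(lm(Y ~ X))["X"]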

Procedure B: $(\star)$ Vanishes

Procedure B’s score depends on two nuisance parameters: $\eta_Y$ (from $Y$ on $X$) and $\eta_D$ (from $D$ on $X$). The pathwise derivatives of the expected score are:

$$\frac{\partial}{\partial \eta_Y'} E[m_B] = -E[\tilde{D}\, X'] = 0$$

because $\tilde{D} \perp X$ by construction of the projection of $D$ on $X$, and

$$\frac{\partial}{\partial \eta_D'} E[m_B] = -E[X\tilde{Y}] + 2\theta_0 E[X\tilde{D}] = -E[XU] + \theta_0 E[X\tilde{D}] = 0$$

because $\tilde{Y} = \tilde{D}\theta_0 + U$ and both $U$ and $\tilde{D}$ are orthogonal to $X$.

Both pathwise derivatives vanish. The term $(\star)$ in the expansion is zero, so first-step estimation of $\hat\eta_Y$ and $\hat\eta_D$ does not affect the asymptotic distribution of $\hat\theta_B$, even to first order. The naive standard error is correct as-is.

Numerical Verification

We can verify the pathwise derivatives numerically:

# Sample analogs of the pathwise derivatives (up to sign); the E_n[X * D_tilde]
# terms drop out exactly because D_tilde is an OLS residual on X.
J_gamma_A <- mean(X * (Y - D_tilde * theta_0))
J_gamma_B <- mean(X * (Y_tilde - D_tilde * theta_0))

cat("Pathwise derivatives (should vanish for B only):\n")
#> Pathwise derivatives (should vanish for B only):
cat("  Procedure A: E[X*(Y  - Dt*theta)] =",
    round(J_gamma_A, 4), "\n")
#>   Procedure A: E[X*(Y  - Dt*theta)] = 2.672
cat("  Procedure B: E[X*(Yt - Dt*theta)] =",
    round(J_gamma_B, 4), "\n")
#>   Procedure B: E[X*(Yt - Dt*theta)] = 0
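Beyond the pathwise derivatives, a small Monte Carlo (an illustration, not part of the original verification) confirms the consequence for inference: across simulated datasets both procedures produce the same point estimate, so they share one sampling distribution, and only Procedure B's naive SE tracks its spread.

# Monte Carlo: empirical sd of theta_hat vs. the average naive SE from each
# procedure, under the same DGP as above.
one_rep <- function() {
  X <- rnorm(n); D <- alpha * X + rnorm(n)
  Y <- theta_0 * D + beta_0 * X + rnorm(n)
  D_t <- residuals(lm(D ~ X)); Y_t <- residuals(lm(Y ~ X))
  fit_a <- lm(Y ~ D_t); fit_b <- lm(Y_t ~ D_t - 1)
  c(theta = coef(fit_b)[["D_t"]],
    se_a = sqrt(diag(vcovHC(fit_a, type = "HC0")))[["D_t"]],
    se_b = sqrt(diag(vcovHC(fit_b, type = "HC0")))[["D_t"]])
}
sims <- replicate(1000, one_rep())
sd(sims["theta", ])                  # sampling sd of the shared point estimate
rowMeans(sims[c("se_a", "se_b"), ])  # only se_b matches the sd above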

Neyman Orthogonality

The property that makes Procedure B’s standard error correct is called Neyman orthogonality. Formally, a score $m(W;\, \theta, \eta)$ is Neyman orthogonal at $(\theta_0, \eta_0)$ if small perturbations of the nuisance parameter away from $\eta_0$ do not create first-order changes in the expected score:

$$\left.\frac{\partial}{\partial \lambda} E\bigl[m(W;\,\theta_0,\,\eta_0 + \lambda(\eta - \eta_0)) \bigr]\right|_{\lambda = 0} = 0, \quad \forall\, \eta.$$

Procedure B’s score satisfies this condition: the expected score is insensitive to perturbations in both $\eta_Y$ and $\eta_D$. As a result, replacing the true nuisance parameters with first-step estimates does not distort inference about $\theta_0$.

Procedure A’s score does not satisfy Neyman orthogonality: it is sensitive to perturbations in $\eta_D$, and the standard error computed by treating $\hat\eta_D$ as known is incorrect.
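We can also probe the definition by brute force, perturbing the first-step coefficients along a fixed direction and tracing each sample score at $\theta_0$. The direction delta below is an arbitrary illustrative choice:

# Finite-difference view of Neyman orthogonality: shift the first-step
# coefficients by lambda * delta and recompute each average score at theta_0.
eta_D_hat <- coef(lm(D ~ X))
eta_Y_hat <- coef(lm(Y ~ X))
delta <- c(0, 0.5)  # arbitrary perturbation direction (intercept, slope)

score_at <- function(lambda) {
  Dt <- D - drop(cbind(1, X) %*% (eta_D_hat + lambda * delta))
  Yt <- Y - drop(cbind(1, X) %*% (eta_Y_hat + lambda * delta))
  c(A = mean(Dt * (Y - Dt * theta_0)),
    B = mean(Dt * (Yt - Dt * theta_0)))
}

sapply(c(-0.1, -0.05, 0, 0.05, 0.1), score_at)

Procedure A’s score moves linearly with lambda, while Procedure B’s is flat to first order (only a small quadratic bend), exactly as the derivative condition requires.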

An important practical implication is that estimators based on Neyman orthogonal scores yield inference about $\theta_0$ that does not depend on the detailed statistical properties of the nuisance estimator $\hat\eta$. This is useful even in classical low-dimensional settings, where it avoids cumbersome variance adjustments to account for first-step estimation. It becomes particularly important when modern flexible methods, including machine learning, are used for nuisance estimation, as only coarse convergence rates are currently available for many promising ML methods. By alleviating the first-order impact of nuisance estimation, Neyman orthogonality makes it possible to combine such methods with standard asymptotic approximations.

From OLS to Machine Learning

In our linear example, both nuisance parameters, $\eta_Y = \operatorname{argmin}_\eta\, E[(Y - X'\eta)^2]$ and $\eta_D = \operatorname{argmin}_\eta\, E[(D - X'\eta)^2]$, are estimated by OLS. Partialling out both $Y$ and $D$ yields the same estimate and standard error as full OLS of $Y$ on $(D, X)$.

The partially linear regression model generalizes this. Instead of assuming linearity, we allow $X$ to enter flexibly:

$$Y_i = \theta_0 D_i + g_0(X_i) + \varepsilon_i,$$

where $g_0(\cdot) = E[Y - \theta_0 D \mid X]$ is an unknown, potentially complex, function of $X$. The nuisance parameters become $\ell_0(X) = E[Y \mid X]$ and $r_0(X) = E[D \mid X]$, conditional expectation functions that can be estimated by machine learning.

The Neyman orthogonal score for the partially linear model is

$$m_{\mathrm{PLM}}(W;\, \theta, \eta) = \bigl[(Y - \ell(X)) - \theta(D - r(X))\bigr](D - r(X)),$$

which is the flexible analog of Procedure B’s partialling-out score. As in our linear example, a naive score that adjusts only $D$ (analogous to Procedure A) is not Neyman orthogonal and leads to invalid inference.

DML combines the Neyman orthogonal score with cross-fitting, a form of sample splitting that addresses a second source of bias (overfitting bias) arising when the nuisance estimator $\hat\eta$ and the observations used in the moment condition are statistically dependent. Together, these two ingredients form the core of the DML framework.
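To make the cross-fitting step concrete, here is a minimal hand-rolled sketch for the partially linear score. It uses OLS as a stand-in learner so the snippet runs on our simulated data; in practice any ML learner can be plugged in, and ddml_plm (shown next) automates the whole construction:

# Minimal cross-fitting sketch: estimate l_0 and r_0 on K-1 folds, predict on
# the held-out fold, then solve the Neyman orthogonal moment condition.
K <- 5
fold <- sample(rep(1:K, length.out = n))
df <- data.frame(Y = Y, D = D, X = X)
l_hat <- r_hat <- numeric(n)
for (k in 1:K) {
  hold <- fold == k
  l_hat[hold] <- predict(lm(Y ~ X, data = df[!hold, ]), newdata = df[hold, ])
  r_hat[hold] <- predict(lm(D ~ X, data = df[!hold, ]), newdata = df[hold, ])
}
theta_cf <- sum((D - r_hat) * (Y - l_hat)) / sum((D - r_hat)^2)
theta_cf  # close to theta_0 = 1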

The ddml package implements this approach via ddml_plm:

# Example: DML with gradient boosting (not run)
fit_dml <- ddml_plm(Y, D, X,
                    learners = list(what = mdl_xgboost),
                    sample_folds = 5)
summary(fit_dml)

See ?ddml_plm for details and vignette("ddml") for a full introduction.

References

Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C B, Newey W, Robins J (2018). “Double/debiased machine learning for treatment and structural parameters.” The Econometrics Journal, 21(1), C1-C68.

Ahrens A, Chernozhukov V, Hansen C B, Kozbur D, Schaffer M E, Wiemann T (2025). “An Introduction to Double/Debiased Machine Learning.” Forthcoming.