
Introduction

This article builds intuition for Neyman orthogonality—a key component of Double/Debiased Machine Learning (DML). Throughout, we focus on ordinary least squares.

Consider estimating the coefficient $\theta_0$ on a variable of interest $D$ in a linear model with controls $X$. Introductory econometrics courses typically discuss partialling out $X$ before estimating $\theta_0$. Two variants of this idea lead to the same point estimate:

  • Procedure A (partial $D$ only): residualize $D$ with respect to $X$, then regress $Y$ on $\tilde{D}$.
  • Procedure B (partial both): residualize both $Y$ and $D$ with respect to $X$, then regress $\tilde{Y}$ on $\tilde{D}$.

It is tempting to take the first-step coefficients from partialling out $X$ from $D$ and $Y$ as given, and compute standard errors only for the second-step regression coefficient. As it turns out, this naive approach results in valid inference in Procedure B, but is incorrect for Procedure A. Here, we aim to explain why one procedure is immune to first-step estimation error while the other is not, a property conveniently summarized by the notion of Neyman orthogonality.

The article covers:

  1. Setting up the two procedures and comparing their output to full OLS
  2. Deriving how first-step estimation error propagates (or doesn't) into inference about $\theta_0$
  3. Introducing Neyman orthogonality and its implications
  4. Discussing the bridge from OLS to nonparametric/ML first steps and ddml_plm

Two-Step Procedures with Identical Coefficient Estimates

To fix ideas, we generate data from a linear model with a single control:

$$Y_i = \theta_0 D_i + \beta_0 X_i + U_i,$$

where $D_i = \alpha X_i + V_i$, so that $X$ is correlated with both $D$ and $Y$.

library(ddml)
set.seed(48105)

n <- 500
theta_0 <- 1   # coefficient of interest
beta_0 <- 2    # effect of the control on Y
alpha <- 0.8   # effect of the control on D

X <- rnorm(n)                       # control
V <- rnorm(n)
D <- alpha * X + V                  # variable of interest, correlated with X
U <- rnorm(n)
Y <- theta_0 * D + beta_0 * X + U   # outcome

Procedure A: Residualize $D$ Only

For our first procedure, we partial out $X$ from $D$, then regress $Y$ on the residual:

D_tilde <- residuals(lm(D ~ X))  # first step: residualize D on X
fit_A <- lm(Y ~ D_tilde)         # second step: regress Y on the residual

Procedure B: Residualize Both $Y$ and $D$

For our second procedure, we partial out $X$ from both $D$ and $Y$, then regress the residuals on each other:

Y_tilde <- residuals(lm(Y ~ X))     # also residualize Y on X
fit_B <- lm(Y_tilde ~ D_tilde - 1)  # regress residual on residual (no intercept)

Comparing to Full OLS

It is useful to compare both two-step procedures with conventional multiple regression.

library(sandwich)
library(lmtest)

fit_full <- lm(Y ~ D + X)

theta_A <- coef(fit_A)["D_tilde"]
theta_B <- coef(fit_B)["D_tilde"]
theta_full <- coef(fit_full)["D"]

se_A <- sqrt(diag(vcovHC(fit_A, type = "HC0")))["D_tilde"]
se_B <- sqrt(diag(vcovHC(fit_B, type = "HC0")))["D_tilde"]
se_full <- sqrt(diag(vcovHC(fit_full, type = "HC0")))["D"]

cat("Full OLS:      lm(Y ~ D + X)\n")
#> Full OLS:      lm(Y ~ D + X)
cat("  theta_hat:", round(theta_full, 4),
    "  SE:", round(se_full, 4), "\n\n")
#>   theta_hat: 0.9988   SE: 0.047

cat("Procedure A:   lm(Y ~ D_tilde)\n")
#> Procedure A:   lm(Y ~ D_tilde)
cat("  theta_hat:", round(theta_A, 4),
    "  SE:", round(se_A, 4), "\n\n")
#>   theta_hat: 0.9988   SE: 0.1188

cat("Procedure B:   lm(Y_tilde ~ D_tilde - 1)\n")
#> Procedure B:   lm(Y_tilde ~ D_tilde - 1)
cat("  theta_hat:", round(theta_B, 4),
    "  SE:", round(se_B, 4), "\n")
#>   theta_hat: 0.9988   SE: 0.047

All three point estimates are identical. This is no accident: $\tilde{D}$ is orthogonal to $X$ by construction, so omitting $X$ does not bias the coefficient on $\tilde{D}$.
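We can verify this orthogonality in the sample directly. The following quick check is illustrative only and not part of the procedure:

# OLS residuals are orthogonal to the regressors by construction, so both
# sums below should be zero up to floating-point error.
sum(X * D_tilde)
sum(D_tilde)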

But the standard errors tell a different story. Procedure B matches full OLS exactly—both point estimate and standard error. Procedure A gives a standard error that is far too large.

In both procedures, the software computes the sandwich standard error treating first-step coefficients as fixed. If both procedures ignore first-step estimation uncertainty, why does Procedure A fail while Procedure B succeeds?

The First-Order Impact of Nuisance Estimation

Estimation of $\theta_0$ in each procedure is based on setting the sample average of a score function to zero, $\frac{1}{n}\sum_{i=1}^n m(W_i; \theta, \eta) = 0$, where $\eta$ denotes nuisance parameters.

For Procedure A, the score is

$$m_A(W_i; \theta, \eta_D) = \tilde{D}_i(Y_i - \tilde{D}_i\theta),$$

where $\tilde{D}_i = D_i - X_i'\eta_D$ and the nuisance parameter is $\eta = \eta_D$ (the coefficient from the regression of $D$ on $X$).

For Procedure B, the score is

$$m_B(W_i; \theta, \eta) = \tilde{D}_i(\tilde{Y}_i - \tilde{D}_i\theta),$$

where $\tilde{Y}_i = Y_i - X_i'\eta_Y$ and the nuisance parameter is $\eta = (\eta_Y, \eta_D)$ (the coefficients from the regressions of $Y$ and $D$ on $X$, respectively).
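Both second-step regressions indeed set their sample scores to zero at the respective point estimates. A quick illustrative check, using the estimates computed above:

# Each average score is numerically zero at the corresponding estimate
# (recall that theta_A and theta_B coincide).
mean(D_tilde * (Y - D_tilde * theta_A))        # Procedure A's score
mean(D_tilde * (Y_tilde - D_tilde * theta_B))  # Procedure B's score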

Given the score, software computes the sandwich standard error by treating $\hat\eta$ as known:

$$\widehat{\text{SE}}(\hat\theta) = \sqrt{\hat{J}^{-2} \cdot \frac{1}{n^2} \sum_{i=1}^n m(W_i;\, \hat\theta,\, \hat\eta)^2}\,,$$

where $\hat{J} = \frac{1}{n}\sum_i \frac{\partial}{\partial\theta} m(W_i;\, \hat\theta,\, \hat\eta)$ is the sample Jacobian. Both procedures share the same Jacobian, $\hat{J} \approx -E[\tilde{D}^2]$, so the difference in standard errors traces entirely to the squared scores in the numerator. At $\theta_0$, Procedure A's score residual $Y_i - \tilde{D}_i\theta_0$ still contains terms involving $X_i$, inflating the numerator. Procedure B's score residual $\tilde{Y}_i - \tilde{D}_i\theta_0 = U_i$ contains only the structural error, hence its smaller SE.
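We can rebuild both naive standard errors from this formula by hand. The computation below is an illustrative check; it plugs the fitted residuals into the squared scores and reproduces the vcovHC values reported above:

# Naive sandwich SEs assembled from the shared Jacobian and squared scores.
J_hat <- -mean(D_tilde^2)           # sample Jacobian, same for A and B
m_A <- D_tilde * residuals(fit_A)   # residual still carries the X-related terms
m_B <- D_tilde * residuals(fit_B)   # residual is essentially U
sqrt(mean(m_A^2) / n) / abs(J_hat)  # matches se_A
sqrt(mean(m_B^2) / n) / abs(J_hat)  # matches se_B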

Whether the asymptotic distribution of $\hat\theta$ depends on $\hat\eta$ can be assessed via a standard Taylor expansion of the sample moment around the true parameters $(\theta_0, \eta_0)$ (see, e.g., Ahrens et al., 2025):

$$\sqrt{n}(\hat\theta - \theta_0) = -J_\theta^{-1}\Biggl[ \underbrace{\frac{1}{\sqrt{n}}\sum_i m(W_i; \theta_0, \eta_0)}_{\text{CLT}} + \underbrace{\frac{1}{\sqrt{n}}\sum_i \frac{\partial}{\partial \eta} m(W_i; \theta_0, \eta_0)\, (\hat\eta - \eta_0)}_{(\star):\ \text{first-order impact of nuisance estimation}} \Biggr]$$

plus higher-order terms. The term $(\star)$ captures the first-order impact of estimating the nuisance parameter $\eta_0$. If $(\star)$ vanishes, inference about $\theta_0$ can proceed as if $\eta_0$ were known.

Procedure A: $(\star)$ Does Not Vanish

Procedure A’s score depends on $\eta_D$ through $\tilde{D}_i = D_i - X_i'\eta_D$. The pathwise derivative of the expected score with respect to the nuisance parameter is:

$$\left.\frac{\partial}{\partial \eta_D'}\, E[m_A(W;\, \theta_0, \eta_D)]\right|_{\eta_{D,0}} = -E[XX']\,\beta_{Y|X},$$

where $\beta_{Y|X} = \theta_0\alpha + \beta_0$ is the coefficient from the regression of $Y$ on $X$. This is non-zero whenever $X$ is related to $Y$.

As a consequence, estimation error in $\hat\eta_D$ propagates directly into $\hat\theta_A$. The term $(\star)$ does not vanish, and the naive standard error, which ignores $(\star)$, is wrong.
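As a quick illustrative sanity check on the derivative above, our simulated design implies $\beta_{Y|X} = \theta_0\alpha + \beta_0 = 2.8$, which the regression of $Y$ on $X$ recovers approximately:

# Population value is theta_0 * alpha + beta_0 = 2.8 in this design.
coef(lm(Y ~ X))["X"]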

Procedure B: $(\star)$ Vanishes

Procedure B’s score depends on two nuisance parameters: $\eta_Y$ (from $Y$ on $X$) and $\eta_D$ (from $D$ on $X$). The pathwise derivatives of the expected score are:

$$\frac{\partial}{\partial \eta_Y'} E[m_B] = -E[\tilde{D}\, X'] = 0$$

because $\tilde{D} \perp X$ by construction of the projection of $D$ on $X$, and

$$\frac{\partial}{\partial \eta_D'} E[m_B] = -E[X\tilde{Y}] + 2\theta_0 E[X\tilde{D}] = -E[XU] + \theta_0 E[X\tilde{D}] = 0$$

because $\tilde{Y} = \tilde{D}\theta_0 + U$ and both $U$ and $\tilde{D}$ are orthogonal to $X$.

Both pathwise derivatives vanish. The term $(\star)$ in the expansion is zero, so first-step estimation of $\hat\eta_Y$ and $\hat\eta_D$ does not affect the asymptotic distribution of $\hat\theta_B$, even to first order. The naive standard error is correct as-is.

Numerical Verification

We can verify the pathwise derivatives numerically:

# Sample analogs of the pathwise derivatives (up to sign); the E_n[X * D_tilde]
# terms drop out exactly because D_tilde is an OLS residual on X.
J_gamma_A <- mean(X * (Y - D_tilde * theta_0))
J_gamma_B <- mean(X * (Y_tilde - D_tilde * theta_0))

cat("Pathwise derivatives (should vanish for B only):\n")
#> Pathwise derivatives (should vanish for B only):
cat("  Procedure A: E[X*(Y  - Dt*theta)] =",
    round(J_gamma_A, 4), "\n")
#>   Procedure A: E[X*(Y  - Dt*theta)] = 2.672
cat("  Procedure B: E[X*(Yt - Dt*theta)] =",
    round(J_gamma_B, 4), "\n")
#>   Procedure B: E[X*(Yt - Dt*theta)] = 0
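Beyond the pathwise derivatives, a small Monte Carlo (an illustration, not part of the original verification) confirms the consequence for inference: across simulated datasets both procedures produce the same point estimate, so they share one sampling distribution, and only Procedure B's naive SE tracks its spread.

# Monte Carlo: empirical sd of theta_hat vs. the average naive SE from each
# procedure, under the same DGP as above.
one_rep <- function() {
  X <- rnorm(n); D <- alpha * X + rnorm(n)
  Y <- theta_0 * D + beta_0 * X + rnorm(n)
  D_t <- residuals(lm(D ~ X)); Y_t <- residuals(lm(Y ~ X))
  fit_a <- lm(Y ~ D_t); fit_b <- lm(Y_t ~ D_t - 1)
  c(theta = coef(fit_b)[["D_t"]],
    se_a = sqrt(diag(vcovHC(fit_a, type = "HC0")))[["D_t"]],
    se_b = sqrt(diag(vcovHC(fit_b, type = "HC0")))[["D_t"]])
}
sims <- replicate(1000, one_rep())
sd(sims["theta", ])                  # sampling sd of the shared point estimate
rowMeans(sims[c("se_a", "se_b"), ])  # only se_b matches the sd above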

Neyman Orthogonality

The property that makes Procedure B’s standard error correct is called Neyman orthogonality. Formally, a score $m(W;\, \theta, \eta)$ is Neyman orthogonal at $(\theta_0, \eta_0)$ if small perturbations of the nuisance parameter away from $\eta_0$ do not create first-order changes in the expected score:

$$\left.\frac{\partial}{\partial \lambda} E\bigl[m(W;\,\theta_0,\,\eta_0 + \lambda(\eta - \eta_0)) \bigr]\right|_{\lambda = 0} = 0, \quad \forall\, \eta.$$

Procedure B’s score satisfies this condition: the expected score is insensitive to perturbations in both $\eta_Y$ and $\eta_D$. As a result, replacing the true nuisance parameters with first-step estimates does not distort inference about $\theta_0$.

Procedure A’s score does not satisfy Neyman orthogonality: it is sensitive to perturbations in $\eta_D$, and the standard error computed by treating $\hat\eta_D$ as known is incorrect.
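We can also probe the definition by brute force, perturbing the first-step coefficients along a fixed direction and tracing each sample score at $\theta_0$. The direction delta below is an arbitrary illustrative choice:

# Finite-difference view of Neyman orthogonality: shift the first-step
# coefficients by lambda * delta and recompute each average score at theta_0.
eta_D_hat <- coef(lm(D ~ X))
eta_Y_hat <- coef(lm(Y ~ X))
delta <- c(0, 0.5)  # arbitrary perturbation direction (intercept, slope)

score_at <- function(lambda) {
  Dt <- D - drop(cbind(1, X) %*% (eta_D_hat + lambda * delta))
  Yt <- Y - drop(cbind(1, X) %*% (eta_Y_hat + lambda * delta))
  c(A = mean(Dt * (Y - Dt * theta_0)),
    B = mean(Dt * (Yt - Dt * theta_0)))
}

sapply(c(-0.1, -0.05, 0, 0.05, 0.1), score_at)

Procedure A’s score moves linearly with lambda, while Procedure B’s is flat to first order (only a small quadratic bend), exactly as the derivative condition requires.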

An important practical implication is that estimators based on Neyman orthogonal scores yield inference about $\theta_0$ that does not depend on the detailed statistical properties of the nuisance estimator $\hat\eta$. This is useful even in classical low-dimensional settings, where it avoids cumbersome variance adjustments to account for first-step estimation. It becomes particularly important when modern flexible methods, including machine learning, are used for nuisance estimation, as only coarse convergence rates are currently available for many promising ML methods. By alleviating the first-order impact of nuisance estimation, Neyman orthogonality makes it possible to combine such methods with standard asymptotic approximations.

From OLS to Machine Learning

In our linear example, both nuisance parameters, $\eta_Y = \operatorname{argmin}_\eta\, E[(Y - X'\eta)^2]$ and $\eta_D = \operatorname{argmin}_\eta\, E[(D - X'\eta)^2]$, are estimated by OLS. Partialling out both $Y$ and $D$ yields the same estimate and standard error as full OLS of $Y$ on $(D, X)$.

The partially linear regression model generalizes this. Instead of assuming linearity, we allow $X$ to enter flexibly:

$$Y_i = \theta_0 D_i + g_0(X_i) + \varepsilon_i,$$

where $g_0(\cdot) = E[Y - \theta_0 D \mid X]$ is an unknown, potentially complex, function of $X$. The nuisance parameters become $\ell_0(X) = E[Y \mid X]$ and $r_0(X) = E[D \mid X]$, conditional expectation functions that can be estimated by machine learning.

The Neyman orthogonal score for the partially linear model is

$$m_{\mathrm{PLM}}(W;\, \theta, \eta) = \bigl[(Y - \ell(X)) - \theta(D - r(X))\bigr](D - r(X)),$$

which is the flexible analog of Procedure B’s partialling-out score. As in our linear example, a naive score that adjusts only $D$ (analogous to Procedure A) is not Neyman orthogonal and leads to invalid inference.

DML combines the Neyman orthogonal score with cross-fitting, a form of sample splitting that addresses a second source of bias (overfitting bias) arising when the nuisance estimator $\hat\eta$ and the observations used in the moment condition are statistically dependent. Together, these two ingredients form the core of the DML framework.
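To make the cross-fitting step concrete, here is a minimal hand-rolled sketch for the partially linear score. It uses OLS as a stand-in learner so the snippet runs on our simulated data; in practice any ML learner can be plugged in, and ddml_plm (shown next) automates the whole construction:

# Minimal cross-fitting sketch: estimate l_0 and r_0 on K-1 folds, predict on
# the held-out fold, then solve the Neyman orthogonal moment condition.
K <- 5
fold <- sample(rep(1:K, length.out = n))
df <- data.frame(Y = Y, D = D, X = X)
l_hat <- r_hat <- numeric(n)
for (k in 1:K) {
  hold <- fold == k
  l_hat[hold] <- predict(lm(Y ~ X, data = df[!hold, ]), newdata = df[hold, ])
  r_hat[hold] <- predict(lm(D ~ X, data = df[!hold, ]), newdata = df[hold, ])
}
theta_cf <- sum((D - r_hat) * (Y - l_hat)) / sum((D - r_hat)^2)
theta_cf  # close to theta_0 = 1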

The ddml package implements this approach via ddml_plm:

# Example: DML with gradient boosting (not run)
fit_dml <- ddml_plm(Y, D, X,
                    learners = list(what = mdl_xgboost),
                    sample_folds = 5)
summary(fit_dml)

See ?ddml_plm for details and vignette("ddml") for a full introduction.

References

Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C B, Newey W, Robins J (2018). “Double/debiased machine learning for treatment and structural parameters.” The Econometrics Journal, 21(1), C1-C68.

Ahrens A, Chernozhukov V, Hansen C B, Kozbur D, Schaffer M E, Wiemann T (2025). “An Introduction to Double/Debiased Machine Learning.” Forthcoming.