Tim Armstrong

April 2023

High dimensional problem \(\Rightarrow\) regularized estimator \(\Rightarrow\) bias-variance tradeoff

Want to choose tuning parameter to optimize bias-variance tradeoff (for MSE, CI length, etc.)

We’ll present a general theory for doing this *in finite samples* using *minimax linear (or affine) estimators* and *bias-aware fixed-length CIs*

Gives (near) optimal estimates and CIs when the parameter space is *convex* and *centrosymmetric* (some results extend to asymmetry)

Finite sample results use idealized setting (normal errors, etc.). Can use this to show an asymptotic property called *honesty* when normality is relaxed.

Bias-variance tradeoff depends on “regularity parameter” (bound on derivative or norm/number of regressors, etc.)

Problem of *adaptation*: can we solve this problem automatically without prior knowledge of the regularity parameter?

For inference on a single parameter: *no*

For inference on many parameters: sometimes *yes*, but with a *different notion of coverage*

- High dimensional culture
  - dominates nonparametric/semiparametric/regularized regression literatures
  - use asymptotics to get optimal rates/constants
  - if a problem is too complicated for exact results, settle for asymptotic results
- Low dimensional culture
  - dominates weak IV/moment inequality/“nonstandard inference” literatures
  - suspicion of asymptotics without finite sample results in idealized setting
  - if a problem is too complicated for exact results, solve a different problem (e.g. assume all covariates are discrete)
- Both are subcultures of “data modeling culture” critiqued by Breiman (2001) (although many of his critiques are aimed specifically at low dimensional data modeling)

As a graduate student, I derived optimality results for conditional moment inequalities using tools from the “high dimensional culture” (bias-variance tradeoffs, optimal rates)

People working in the moment inequality literature were resistant: they were used to the low dimensional paradigm, and didn’t think that ideas from the high dimensional paradigm could be relevant.

Some of the critiques I heard were tantamount to rejecting high dimensional and nonparametric statistics entirely (e.g. “I don’t think bias-variance tradeoffs from nonparametric statistics are relevant in finite samples”)

I started working on finite sample nonparametrics in order to address these critiques. I discovered that there was indeed some work on this that had mostly been forgotten or ignored.

- Estimating a single parameter
  - minimax linear estimators
  - (near) optimality; impossibility of adaptation
- Estimating many parameters
  - adaptive estimation
  - average coverage intervals

Observe \(\{(x_i,Y_i)\}_{i=1}^n\) with \[ Y_i = f(x_i) + u_i,\quad E(u_i)=0 \]

Treat \(x_1,\ldots,x_n\) as fixed (if they are random, replace \(E(\cdot)\) with \(E(\cdot |x_1,\ldots,x_n)\)).

Subscript to denote dependence on \(f\) when needed: e.g. \(E_f Y_i=f(x_i)\).

Specify a (possibly infinite dimensional) parameter space \(\mathcal{F}\) such that \(f\) is assumed to be in \(\mathcal{F}\).

Interested in a *scalar* parameter \(Lf\) that is a *linear functional* of \(f\).

All other aspects of the regression function \(f\) other than \(Lf\) are *nuisance parameters*

Consider case where \(x_i\in\mathbb{R}\) for simplicity

No parametric assumptions on \(f\), just assume that it’s “smooth”

To formalize “smooth”, specify some order \(p\) of the derivative and a bound \(C\) such that the \(p\)th derivative is bounded by \(C\): \[ \mathcal{F}=\{f: |f^{(p)}(x)|\le C \text{ a.e.}\} \] This is called a *Hölder regularity class*.

Interested in \(Lf=f(x_0)\) (or \(Lf=f'(x_0)\), etc.) where \(x_0\) is some prespecified point

Popular in *regression discontinuity* (difference of \(f\) at two boundary points)

Linear model \[ f(w_i,z_i) = w_i\beta + z_i'\gamma \] where \(x_i=(w_i,z_i')'\), \(w_i\in \mathbb{R}\), \(z_i\in\mathbb{R}^k\)

Interested in \(\beta\) (“treatment effect of \(w_i\) after controlling for \(z_i\)”)

Assume \(\|\gamma\|\le C\) for some norm \(\|\cdot\|\) and constant \(C\) (e.g. \(\|\gamma\|=\sum_{j=1}^k |\gamma_j|\)).

Falls into our framework with \(Lf=\beta=\frac{d}{dw}f(w,z)\) and \[ \mathcal{F}=\{f(w,z)=w\beta+z'\gamma: \|\gamma\|\le C\} \]

Common motivation for assumption \(\|\gamma\|\le C\): if \(k+1>n\) (more regressors than observations), we can’t do anything without some additional assumptions (OLS not even defined)

Thus the heading “high dimensional controls”

But what if \(k\ll n\) (say \(k=2\) or \(k=3\))?

We might still gain efficiency by assuming a bound \(\|\gamma\|\le C\). How to do so optimally?

Same as “high dimensional controls”, but replace \(z_i'\gamma\) with \(g(z_i)\): \[ f(w_i,z_i)=w_i\beta + g(z_i) \]

Still interested in \(Lf=\beta=\frac{d}{dw}f(w,z)\).

Specify a smoothness class for \(g(\cdot)\) as in nonparametric regression (\(p\)th derivative bounded by \(C\)). Falls into our framework with \[ \mathcal{F}=\{f: f(w,z)=w\beta+g(z),\, |g^{(p)}(z)|\le C \text{ a.e.}\} \]

Main takeaway: suppose

- \(Lf\) is a *linear* functional (usual definition of linear transformation: \(L(a f+g)=a Lf+Lg\))
- \(\mathcal{F}\) is a *convex* set (usual definition: \(f,g\in\mathcal{F}\), \(\lambda\in[0,1]\) implies \(\lambda f+(1-\lambda)g\in\mathcal{F}\))
- \(\mathcal{F}\) is *centrosymmetric* (\(f\in\mathcal{F}\) implies \(-f\in\mathcal{F}\))

These conditions hold in all of the above examples

Then *minimax linear estimators* and *bias-aware fixed-length CIs* are (near) optimal

We now describe these procedures (skipping optimality theory), using nonparametric regression as an example.

A *linear estimator* \(\hat L\) is a linear function of \(Y=(Y_1,\ldots,Y_n)'\): \[ \hat L = w'Y = \sum_{i=1}^n w_i Y_i \] where \(w=w(X)\) is an \(n\times 1\) vector of *estimation weights* that can depend on the covariates \(X=(x_1,\ldots,x_n)'\).

Example: linear regression model \(f(x_i)=x_i'\beta\), estimate \(Lf=\beta_1\) (the first coefficient) using OLS: \[ \hat L = e_1'(X'X)^{-1}X'Y \] where \(e_1=(1,0,\ldots,0)\).

This is a linear estimator with \(w'=e_1'(X'X)^{-1}X'\).

Recall our model: interested in \(Lf\) where \[ Y_i = f(x_i) + u_i, \quad E(u_i)=0 \] \(x_i\)s are nonrandom.

Use linearity to get bias of \(\hat L=\sum_{i=1}^n w_iY_i\): \[ E_f \hat L - Lf = \sum_{i=1}^n w_iE_fY_i - Lf = \sum_{i=1}^n w_i f(x_i) - Lf =: \operatorname{bias}_f(w) \]

For finite sample results, assume \(u_i\stackrel{iid}{\sim} N(0,\sigma^2)\). Then \(\operatorname{var}(\hat L) = \sigma^2 w'w\) and \[ \hat L - Lf \sim N\left(\operatorname{bias}_f(w), \operatorname{var}(\hat L)\right). \]

Can’t compute bias since \(f\) is unknown. However, under our assumption \(f\in\mathcal{F}\), we can compute the *worst-case bias*: \[ \overline{\operatorname{bias}}_{\mathcal{F}}(w) = \sup_{f\in\mathcal{F}} |\operatorname{bias}_f(w)| \]

Consider the \(z\)-statistic \[ \frac{\hat L - Lf}{\operatorname{sd}(\hat L)} \sim N(t,1) \quad \text{where} \quad |t|\le \frac{\overline{\operatorname{bias}}_{\mathcal{F}}(w)}{\operatorname{sd}(\hat L)} \]

Let \(\operatorname{cv}_\alpha(t)= \left\{ 1-\alpha \text{ quantile of } |N(t,1)| \right\}\). Leads to bias-aware fixed length CI: \[ \hat L \pm \operatorname{cv}_\alpha\left(\frac{\overline{\operatorname{bias}}_{\mathcal{F}}(w)}{\operatorname{sd}(\hat L)}\right)\cdot \operatorname{sd}(\hat L) \]
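
As a rough sketch (an illustration under the idealized normal model above, not the `RDHonest`/`ebci` implementations), the critical value and interval can be computed as follows; \(\operatorname{cv}_\alpha(t)\) is the square root of the \(1-\alpha\) quantile of a noncentral \(\chi^2_1\) with noncentrality \(t^2\), and the numerical inputs are hypothetical.

```python
# Minimal sketch: bias-aware fixed-length CI given a linear estimate Lhat,
# its standard deviation, and a bound on worst-case bias.
import numpy as np
from scipy.stats import ncx2, norm

def cv_alpha(t, alpha=0.05):
    """1 - alpha quantile of |N(t, 1)| for worst-case (bias/sd) ratio t >= 0."""
    if t == 0:
        return norm.ppf(1 - alpha / 2)  # usual two-sided normal critical value
    return np.sqrt(ncx2.ppf(1 - alpha, df=1, nc=t**2))

def bias_aware_ci(Lhat, sd, max_bias, alpha=0.05):
    """Fixed-length CI: Lhat +/- cv_alpha(max_bias / sd) * sd."""
    hw = cv_alpha(max_bias / sd, alpha) * sd
    return Lhat - hw, Lhat + hw

# Hypothetical numbers: estimate 1.0, sd 0.2, worst-case bias 0.1
print(cv_alpha(0.0))                 # ~1.96 when there is no bias
print(cv_alpha(0.5))                 # > 1.96, inflated to account for bias
print(bias_aware_ci(1.0, 0.2, 0.1))
```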

Feasible, heteroskedasticity robust CI: replace \(\operatorname{sd}(\hat L)\) with a robust standard error

Optimal weights \(w\) depend on the criterion. For CI, can optimize length: \[ 2\cdot \operatorname{cv}_\alpha\left(\frac{\overline{\operatorname{bias}}_{\mathcal{F}}(w)}{\operatorname{sd}(\hat L)}\right)\cdot \operatorname{sd}(\hat L) \] (optimized \(w\) still leads to valid CI, since it doesn’t depend on \(Y\)).

For estimation, consider worst-case mean squared error (MSE): \[ \sup_{f\in\mathcal{F}} E_f\left(\hat L-Lf\right)^2 =\overline{\operatorname{bias}}_\mathcal{F}(w)^2 + \operatorname{var}(\hat L) \] or absolute error, etc.
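
One line of algebra behind this display, using that the variance of a linear estimator does not depend on \(f\): \[ E_f\left(\hat L-Lf\right)^2 = \operatorname{bias}_f(w)^2 + \operatorname{var}(\hat L) \quad\Longrightarrow\quad \sup_{f\in\mathcal{F}} E_f\left(\hat L-Lf\right)^2 = \sup_{f\in\mathcal{F}}\operatorname{bias}_f(w)^2 + \operatorname{var}(\hat L) = \overline{\operatorname{bias}}_\mathcal{F}(w)^2+\operatorname{var}(\hat L) \]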

If \(Lf\) gives net welfare from implementing a policy, can optimize worst-case welfare regret as advocated by Manski (see Yata (2021), Ishihara and Kitagawa (2021))

All of these criteria are increasing in \(\overline{\operatorname{bias}}_\mathcal{F}(w)^2\) and \(\operatorname{var}(\hat L)\). Thus, to find optimal weights \(w\):

Step 1: For each \(B\ge 0\), solve \[ \min \operatorname{var}(\hat L) \quad\text{s.t.}\quad \overline{\operatorname{bias}}_\mathcal{F}(w) \le B \] where \[ \operatorname{var}(\hat L) = \sigma^2 w'w, \quad \overline{\operatorname{bias}}_\mathcal{F}(w) = \max_{f\in\mathcal{F}} \left| \sum_{i=1}^n w_i f(x_i) - Lf \right| \]

Step 2: Find the \(B\) that minimizes the given criterion (CI length, MSE, welfare regret, etc.)

The resulting estimator is a *minimax linear estimator*: it is the linear estimator that minimizes worst-case performance.

There is a general formulation of Step 1 as a convex optimization problem (see Donoho (1994), Armstrong and Kolesár (2018)).

Interested in \(Lf=f(x_0)\) for some given \(x_0\). We’ll normalize \(x_0=0\) for notational convenience.

Assume that \(p\)th derivative of \(f(x)\) is bounded by \(C\).

When \(p=1\), this is a *Lipschitz* constraint: \[ \mathcal{F}_{\text{Lip}} = \{f: |f'(x)|\le C \text{ a.e.}\} \]

For simplicity, we’ll consider a relaxed version of this constraint used by Sacks and Ylvisaker (1978): \[ \mathcal{F}_{\text{SY},1} = \{f: |f(x)-f(0)|\le C |x| \text{ all }x\in\mathbb{R}\} \] (turns out the results are the same in this case).

Bias of linear estimator \(\hat L=w'Y\): \[ \operatorname{bias}_f(w) = \sum_{i=1}^n w_i f(x_i) - f(0) \]

First, note that we must have \(\sum_{i=1}^n w_i=1\) to have \(\overline{\operatorname{bias}}_{\mathcal{F}_{\text{SY},1}}(w)<\infty\). Otherwise, bias can be arbitrarily large when \(f(x)=c\) for large \(c\).

Incorporating this constraint, we can write the bias as \(\sum_{i=1}^n w_i [f(x_i) - f(0)]\). Worst-case bias: \[\begin{align*} \overline{\operatorname{bias}}_{\mathcal{F}_{\text{SY},1}}(w) &= \sup \sum_{i=1}^n w_i [f(x_i) - f(0)] \quad\text{s.t.}\quad |f(x) - f(0)|\le C|x| \text{ all }x \\ &= C \sum_{i=1}^n |w_i|\cdot |x_i| \end{align*}\] Last step follows by noting that \(f(x_i)=C |x_i|\cdot \operatorname{sign}(w_i)\) maximizes each term in the sum.

Variance is \(\sigma^2 w'w=\sigma^2\sum_{i=1}^n w_i^2\). Incorporating constraint \(\sum_{i=1}^n w_i=1\) leads to optimization problem for optimal weights: \[ \min \sum_{i=1}^n w_i^2 \quad\text{s.t.}\quad C \sum_{i=1}^n |w_i|\cdot |x_i|\le B, \, \sum_{i=1}^n w_i=1. \]

Lagrangian: \[ \mathcal{L}=\frac{1}{2}\sum_{i=1}^n w_i^2 +\lambda \sum_{i=1}^n |w_i|\cdot |x_i| +\mu \left( 1-\sum_{i=1}^n w_i \right) \]

FOC for \(w_i\): \[\begin{align*} &0 = \frac{d \mathcal{L}}{d w_i} = w_i + \lambda |x_i|\cdot \text{sign}(w_i) - \mu \\ &\Longrightarrow w_i = \mu - \lambda |x_i|\cdot \text{sign}(w_i) \end{align*}\] where we use the convention that \(\operatorname{sign}(w_i)\) can be anything in \([-1,1]\) when \(w_i=0\).

Three possibilities:

- \(w_i>0 \Longrightarrow w_i = \mu - \lambda |x_i|\)
- \(w_i=0 \Longrightarrow\) there exists \(c\in [-1,1]\) such that \(0 = \mu - \lambda |x_i|\cdot c\) \(\Longrightarrow\) \(\mu-\lambda |x_i|\le 0\)
- \(w_i<0\) cannot hold for optimal \(w_i\) (can make objective smaller by setting negative \(w_i\) to 0 and decreasing one of the positive weights)

It follows that optimal weights \(w_i^*\) satisfy \(w_i^*=\max\{\mu-\lambda |x_i|,0\}\).

Incorporating constraint \(\sum_{i=1}^n w_i=1\) yields optimal weights \[ w_i^* = \frac{\max\{\mu-\lambda |x_i|,0\}}{\sum_{j=1}^n\max\{\mu-\lambda |x_j|,0\}} \]

Resulting estimator is a *Nadaraya-Watson* estimator: \[ \sum_{i=1}^n w_i^* Y_i = \frac{\sum_{i=1}^n Y_i k_{\operatorname{tri}}(x_i/h)}{\sum_{i=1}^n k_{\operatorname{tri}}(x_i/h)} \] where \(k_{\operatorname{tri}}(t)=\max\left\{1-|t|,0\right\}\) is the *triangular kernel* and \(h=\mu/\lambda\) is the *bandwidth*.

Optimal bandwidth depends on constraint \(B\) on bias through Lagrange multipliers \(\mu\) and \(\lambda\).
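
A minimal numerical sketch (assuming the class \(\mathcal{F}_{\text{SY},1}\) with known \(C\) and \(\sigma\), target \(Lf=f(0)\), and a purely hypothetical design): since the minimax weights are triangular-kernel Nadaraya-Watson weights, Steps 1 and 2 reduce to searching over the bandwidth \(h\) for the value that minimizes worst-case MSE or bias-aware CI length.

```python
# Minimal sketch: trace out triangular-kernel weights over a bandwidth grid and
# pick the bandwidth minimizing worst-case MSE or bias-aware CI length.
import numpy as np
from scipy.stats import ncx2, norm

def cv_alpha(t, alpha=0.05):  # as in the earlier sketch
    return norm.ppf(1 - alpha / 2) if t == 0 else np.sqrt(ncx2.ppf(1 - alpha, df=1, nc=t**2))

def choose_bandwidth(x, sigma, C, hs, alpha=0.05, criterion="mse"):
    best = None
    for h in hs:
        k = np.maximum(1.0 - np.abs(x) / h, 0.0)       # triangular kernel at x_i / h
        if k.sum() == 0:                               # no observations in the window
            continue
        w = k / k.sum()                                # Nadaraya-Watson weights, sum to 1
        max_bias = C * np.sum(np.abs(w) * np.abs(x))   # worst-case bias over F_{SY,1}
        sd = sigma * np.sqrt(np.sum(w**2))
        if criterion == "mse":
            val = max_bias**2 + sd**2                        # worst-case MSE
        else:
            val = 2.0 * cv_alpha(max_bias / sd, alpha) * sd  # bias-aware CI length
        if best is None or val < best[0]:
            best = (val, h, w, max_bias, sd)
    return best  # (criterion value, bandwidth, weights, worst-case bias, sd)

# Hypothetical design: 200 points on [-1, 1], sigma = 1, bound C = 2
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1.0, 1.0, 200))
val, h, w, max_bias, sd = choose_bandwidth(x, sigma=1.0, C=2.0, hs=np.linspace(0.05, 1.0, 96))
print(h, max_bias, sd)
```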

To allow for non-normal, heteroskedastic errors (and serial/cluster correlation, etc.), replace \(\operatorname{sd}(\hat L)\) with a robust standard error for \(\hat L=\sum_{i=1}^n w_iY_i\): \[ \hat L \pm \operatorname{cv}_\alpha\left(\frac{\overline{\operatorname{bias}}_{\mathcal{F}}(w)}{\operatorname{se}_{\operatorname{rob}}}\right)\cdot \operatorname{se}_{\operatorname{rob}} \]

For example, if we assume independence but allow heteroskedasticity, we can use \(\operatorname{se}_{\operatorname{rob}} = \sqrt{\sum_{i=1}^n w_i^2 \hat u_i^2}\) where \(\hat u_i\) are residuals.
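
Sketch of the feasible robust interval; how the residuals \(\hat u_i\) are formed (e.g. from some pilot fit of \(f\)) is an assumption left unspecified here.

```python
# Minimal sketch: heteroskedasticity-robust bias-aware CI from estimation
# weights w, outcomes Y, residuals uhat, and a worst-case bias bound.
import numpy as np
from scipy.stats import ncx2, norm

def cv_alpha(t, alpha=0.05):  # as in the first sketch
    return norm.ppf(1 - alpha / 2) if t == 0 else np.sqrt(ncx2.ppf(1 - alpha, df=1, nc=t**2))

def robust_bias_aware_ci(w, Y, uhat, max_bias, alpha=0.05):
    Lhat = float(w @ Y)
    se_rob = np.sqrt(np.sum(w**2 * uhat**2))          # robust SE for sum_i w_i Y_i
    hw = cv_alpha(max_bias / se_rob, alpha) * se_rob
    return Lhat - hw, Lhat + hw
```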

Note that the bound \(\overline{\operatorname{bias}}_{\mathcal{F}}(w)\) on bias is valid in finite samples, so we avoid asymptotic approximations that are taboo in the “low dimensional culture”

This feasible CI is *honest* in the sense of Li (1989): it is asymptotically valid uniformly over \(f\in\mathcal{F}\) (and over sufficiently regular distributions for \(u_i\))

- This honesty property is sometimes called “uniformity in the underlying distribution”

To show this, it suffices to verify the Lindeberg condition for the weights \(w_i\), which gives a central limit theorem for \(\hat L=\sum_{i=1}^n w_iY_i\).

The Lindeberg condition holds for the optimal weights under some general conditions in the nonparametric setting (see Armstrong and Kolesár (2018), Armstrong and Kolesár (2021b)).

Intuition: if one \(w_i\) dominates, one can reduce variance with little change in bias by making it closer to weights for nearby \(x_i\)’s

In some cases, exact optimal weights are complicated

For example, consider Hölder class with bound on second derivative (popular in RD). No closed form for optimal weights in finite samples. There are asymptotically optimal kernels, but they are also a bit cumbersome (Zhao 1997; Gao 2018).

Often, some simple class of weights is near-optimal. For example, local linear estimators with triangular (or even uniform) kernel are asymptotically near-optimal under second derivative bound (see Cheng, Fan, and Marron (1997), Armstrong and Kolesár (2020)), and there is a simple formula for worst-case bias in finite samples (see Armstrong and Kolesár (2020)).

The `RDHonest` software package uses this approach to compute bias-aware CIs for regression discontinuity in R and Stata.

One can also check this in finite samples in a particular application (see Armstrong and Kolesár (2018) for an example).

Optimal kernels for relaxed version of Hölder class: Sacks and Ylvisaker (1978), Armstrong and Kolesár (2018)

Optimal weights in Lipschitz/Hölder class using convex optimization software: Armstrong and Kolesár (2021a) (for ATE), Imbens and Wager (2019) (for RD)

Asymptotically optimal kernels: Zhao (1997), Gao (2018), references in Lepski and Tsybakov (2000)

High dimensional/semiparametric controls: Armstrong, Kolesár, and Kwon (2020) (resulting estimator similar to Robinson (1988))

Sparsity in high dimensions: nonconvex, but Javanmard and Montanari (2014) use minimax linear estimation based on convex bound on initial estimation error

Misspecified GMM (Armstrong and Kolesár 2021b)

Density estimation (covered in Armstrong and Kolesár (2020))

Deconvolution (Ignatiadis and Wager 2021)

Diff-in-diff (Rambachan and Roth 2019)

Weak IV robust fuzzy RD (Noack and Rothe 2019)

We have restricted attention to linear procedures, but it can be shown that the minimax linear procedure is (near) minimax among *all* procedures (Donoho (1994), Armstrong and Kolesár (2018), Yata (2021))

Optimal estimator and CI require knowledge of bound \(C\) on derivative to choose weights and form CI.

Can we avoid this?

For inference, *no* (see Low (1997), Armstrong and Kolesár (2018))

For estimation, typically lose a \(\log n\) term (see Lepskii (1991))

Intuition: can’t estimate bias without auxiliary information

Auxiliary assumptions (need to lead to nonconvex or asymmetric parameter space):

- “Self-similarity” assumptions proposed by Giné and Nickl (2010) (although this introduces additional adaptation issues; see Armstrong (2021))
- Global polynomial “rule of thumb” (Armstrong and Kolesár 2020)

Asymmetric parameter space (can allow for some degree of adaptation depending on the setting)

With many parameters: average coverage

What if \(\mathcal{F}\) is convex but asymmetric (e.g. monotonicity, sign constraints)? Consider minimax *affine* estimator \(\hat L=w_0+w'Y\).

Convex optimization problem similar to symmetric case. Set \(w_0\) to equalize worst-case upward and downward bias.

Minimax affine estimators still near minimax among all estimators (Donoho 1994).

In contrast to centrosymmetric case, *some degree of adaptation for CIs is possible*. See, among others, Cai and Low (2004), Armstrong and Kolesár (2018), Armstrong (2015), Kwon and Kwon (2020a), Kwon and Kwon (2020b)

Intuition: use shape constraints to bound bias of affine estimators.

Biased estimates \(\hat \theta_j\) of \(\theta_j\) for \(j=1,\ldots,J\): \[ E\hat\theta_j-\theta_j = \operatorname{bias}_j, \quad \operatorname{var}\left(\hat\theta_j\right) = \operatorname{se}_j^2 \]

Consider *compound MSE*: \[ E\left[ \frac{1}{J}\sum_{j=1}^J (\hat\theta_j-\theta_j)^2 \right] = \frac{1}{J} \sum_{j=1}^J \left(\operatorname{bias}_j^2 + \operatorname{se}_j^2\right) \]

In many cases, compound loss *makes adaptation possible*

Why? Often possible to consistently estimate \(\frac{1}{J}\sum_{j=1}^J \operatorname{bias}_j^2\) as \(J\to\infty\) (using, e.g. cross-validation), but not when \(J=1\).

What is compound loss for CIs? Consider *average coverage* criterion at level \(\alpha\): \[ \frac{1}{J}\sum_{j=1}^J P(\theta_j\notin \operatorname{CI}_j)\le \alpha \]

Weaker than *classical coverage*, which requires \[ \max_{1\le j \le J} P(\theta_j\notin \operatorname{CI}_j)\le \alpha \]

We can “break” impossibility of adaptation by only requiring average coverage.

Unbiased estimate \(Y_j\) of parameter \(\theta_j\) for \(j=1,\ldots,J\) (e.g. fixed effects)

To reduce variance, regress \(Y_j\) on covariates \(X_j\) and “shrink” towards regression estimate \(X_j'\hat\beta\): \[ \hat\theta_j = w Y_j + (1-w) X_j'\hat\beta \]

Shrinkage weight \(w\) trades off bias vs variance:

- \(w=1\Longrightarrow\) zero bias, large variance
- \(w=0\Longrightarrow\) large bias, zero variance

For simplicity, assume \(X_j'\hat\beta=0\) and that \(Y_j\) is exactly normal with same variance \(\sigma^2\) for all \(j\): \[ Y_j \sim N(\theta_j,\sigma^2) \] with \(\sigma^2\) known

Shrinkage estimator \(\hat\theta_j= w Y_j\sim N(\theta_j+\operatorname{bias}_j,\operatorname{se}^2)\) where \(\operatorname{bias}_j=w\theta_j-\theta_j=(w-1)\theta_j\) and \(\operatorname{se}^2=w^2\sigma^2\).

Compound MSE: \[ E\left[ \frac{1}{J}\sum_{j=1}^J (\hat\theta_j-\theta_j)^2 \right] = \frac{1}{J} \sum_{j=1}^J \left(\operatorname{bias}_j^2 + \operatorname{se}^2\right) = (w-1)^2 \frac{1}{J} \sum_{j=1}^J \theta_j^2 + w^2\sigma^2 \]

Compound MSE is \((w-1)^2 \mu_{2,J} + w^2\sigma^2\) where \(\mu_{2,J}=\frac{1}{J} \sum_{j=1}^J \theta_j^2\).

If \(\mu_{2,J}\) is known, can minimize compound MSE over \(w\) to get \[ w_{\operatorname{opt}}=\frac{\mu_{2,J}}{\mu_{2,J}+\sigma^2} \]
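
This follows from the first-order condition of the quadratic in \(w\): \[ \frac{d}{dw}\left[(w-1)^2\mu_{2,J}+w^2\sigma^2\right] = 2(w-1)\mu_{2,J}+2w\sigma^2 = 0 \quad\Longrightarrow\quad w_{\operatorname{opt}}=\frac{\mu_{2,J}}{\mu_{2,J}+\sigma^2} \]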

Estimate of \(\mu_{2,J}\): \[ \hat\mu_{2,J} = \frac{1}{J}\sum_{j=1}^J Y_j^2 - \sigma^2. \] (consistent as \(J\to\infty\)).
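
Why this works: \(E Y_j^2=\theta_j^2+\sigma^2\), so the estimate is unbiased for \(\mu_{2,J}\): \[ E\hat\mu_{2,J} = \frac{1}{J}\sum_{j=1}^J \left(\theta_j^2+\sigma^2\right)-\sigma^2 = \mu_{2,J} \]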

Plugging in \(\hat\mu_{2,J}\) (with possible modifications for finite sample performance) gives a *Stein estimator*, an example of an *empirical Bayes* estimator
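
A minimal sketch of this plug-in estimator under the simplification above (known \(\sigma^2\), shrinkage toward 0); truncating \(\hat\mu_{2,J}\) at zero is one example of a finite-sample modification, and the simulated numbers are purely illustrative.

```python
# Minimal sketch: plug-in (Stein-type / empirical Bayes) shrinkage with
# Y_j ~ N(theta_j, sigma^2) independently and sigma^2 known.
import numpy as np

def eb_shrinkage(Y, sigma2):
    mu2_hat = max(np.mean(Y**2) - sigma2, 0.0)   # estimate of mu_{2,J}, truncated at 0
    w = mu2_hat / (mu2_hat + sigma2)             # plug-in optimal shrinkage weight
    return w * Y, w, mu2_hat

# Hypothetical simulation: J = 500 effects with mu_{2,J} around 0.5, sigma^2 = 1
rng = np.random.default_rng(1)
theta = rng.normal(0.0, np.sqrt(0.5), 500)
Y = theta + rng.normal(0.0, 1.0, 500)
theta_hat, w, mu2_hat = eb_shrinkage(Y, sigma2=1.0)
print(w)                                  # roughly 0.5 / (0.5 + 1) = 1/3
print(np.mean((theta_hat - theta)**2))    # compound MSE of shrinkage estimates
print(np.mean((Y - theta)**2))            # compound MSE of unshrunk estimates (~1)
```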

The Stein estimator is *adaptive* to \(\mu_{2,J}\): it achieves the same compound MSE (as \(J\to\infty\)) as an estimator that uses prior knowledge of \(\mu_{2,J}\)

Empirical Bayes motivation: \(\theta_j\)’s drawn from same prior \(N(0,\mu_2)\), \(\hat\mu_{2,J}\) is an estimate of \(\mu_2\). The empirical Bayes estimate is adaptive to the unknown prior variance \(\mu_2\).

- This requires slightly different notation from what we’re using here (to get our current notation from a Bayesian setup, we’d condition on \(\theta\))

How can we use this estimator to get an adaptive CI (or average coverage interval)? We will describe an approach from Armstrong, Kolesár, and Plagborg-Møller (2022).

Consider CIs based on shrinkage estimator \(\hat\theta_j=wY_j\): \[ \operatorname{CI}_j=\left\{\hat\theta_j \pm \chi \cdot \operatorname{se}\right\} \] where \(\chi\) is some critical value.

Basic idea: use \(\hat\mu_{2,J}\) to estimate a bound on average non-coverage.

First look at individual non-coverage. Note that \((\hat\theta_j-\theta_j)/\operatorname{se}\stackrel{d}{=}Z+\text{bias}_j/\operatorname{se}\) where \(Z\sim N(0,1)\). Thus, letting \(b_j=\text{bias}_j/\operatorname{se}\), we have \[\begin{align*} &P\left(\theta_j\notin \operatorname{CI}_j\right) =P\left(\left|\frac{\hat\theta_j-\theta_j}{\operatorname{se}}\right|\ge \chi\right) = P\left(\left|Z+b_j\right|\ge \chi\right) \\ &= \Phi\left(-\chi-b_j\right) + \Phi\left(-\chi+b_j\right) =: r(b_j,\chi). \end{align*}\]

Thus, average non-coverage is \[ \frac{1}{J}\sum_{j=1}^J P\left(\theta_j\notin \operatorname{CI}_j\right) =\frac{1}{J}\sum_{j=1}^J r(b_j,\chi) =E_{b\sim F_J} r(b,\chi) \] where \(F_J\) is the empirical distribution of \(b_j=\operatorname{bias}_j/\operatorname{se}=[(w-1)\theta_j]/(w\sigma)\) over \(j=1,\ldots, J\).

How can we bound this? Note that we can use \(\mu_{2,J}\) to compute \(E_{b\sim F_J} b^2\): \[ E_{b\sim F_J} b^2 = \frac{1}{J}\sum_{j=1}^J \frac{(w-1)^2}{w^2\sigma^2}\theta_j^2 = \frac{(w-1)^2}{w^2\sigma^2}\mu_{2,J} \]

Thus, average non-coverage is bounded by \(\rho\left(\frac{(w-1)^2}{w^2\sigma^2}\mu_{2,J},\chi\right)\) where \[ \rho\left(t,\chi\right) :=\sup_{F} E_{b\sim F} r(b,\chi) \quad\text{s.t.}\quad E_F b^2 \le t. \]

Let \(\operatorname{cva}_\alpha(t)=\rho^{-1}(t,\alpha)\) where the inverse is with respect to the second argument. Setting \(\chi=\operatorname{cva}_\alpha\left(\frac{(w-1)^2}{w^2\sigma^2}\mu_{2,J}\right)\) gives the interval \[ \operatorname{CI}_j=\hat\theta_j \pm \operatorname{cva}_\alpha\left(\frac{(w-1)^2}{w^2\sigma^2}\mu_{2,J}\right) \operatorname{se} \]

Feasible average coverage interval: plug in consistent estimate \(\hat\mu_{2,J}\) of \(\mu_{2,J}\).

To compute \(\operatorname{cva}_\alpha(t)\) we need to compute \(\rho(t,\chi)\). This is a convex program in a one-dimensional probability distribution \(F\), which can be solved by gridding over \(F\). Actually, it turns out that there is a closed-form solution (see Armstrong, Kolesár, and Plagborg-Møller (2022)).
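
A rough sketch of the gridding approach (an illustration, not the `ebci` package or the closed-form solution): solve an LP over a grid of \(|b|\) values to get \(\rho(t,\chi)\), invert over \(\chi\) to get \(\operatorname{cva}_\alpha(t)\), and plug in \(\hat\mu_{2,J}\). The grid range, resolution, and bracketing rule below are arbitrary choices made here.

```python
# Minimal sketch: robust EBCI critical value by gridding over F, then the
# feasible average-coverage intervals around the plug-in shrinkage estimates.
import numpy as np
from scipy.stats import norm
from scipy.optimize import linprog, brentq

def r(b, chi):
    """Non-coverage of [theta_hat_j +/- chi * se] when the bias/se ratio is b."""
    return norm.cdf(-chi - b) + norm.cdf(-chi + b)

def rho(t, chi, ngrid=400):
    """sup_F E_F r(b, chi) s.t. E_F b^2 <= t, as an LP over a grid of |b| values."""
    b = np.linspace(0.0, chi + 4.0 + 2.0 * np.sqrt(t), ngrid)
    res = linprog(c=-r(b, chi),                       # maximize sum_i p_i * r(b_i, chi)
                  A_ub=[b**2], b_ub=[t],              # second-moment constraint
                  A_eq=[np.ones(ngrid)], b_eq=[1.0])  # probabilities sum to one
    return -res.fun

def cva(t, alpha=0.05):
    """Critical value making worst-case average non-coverage equal to alpha."""
    hi = norm.ppf(1 - alpha / 2) + np.sqrt(t) + 1.0
    while rho(t, hi) >= alpha:                        # crude bracketing of the root
        hi *= 2.0
    return brentq(lambda chi: rho(t, chi) - alpha, 0.0, hi)

def robust_ebci(Y, sigma2, alpha=0.05):
    """Average-coverage intervals around the plug-in shrinkage estimates w * Y_j."""
    mu2_hat = max(np.mean(Y**2) - sigma2, 1e-8)       # floored; degenerate case not handled
    w = mu2_hat / (mu2_hat + sigma2)
    t = (w - 1.0)**2 * mu2_hat / (w**2 * sigma2)      # plug-in bound on average b^2
    hw = cva(t, alpha) * w * np.sqrt(sigma2)          # half-width, se = w * sigma
    return w * Y - hw, w * Y + hw

print(cva(0.0))   # ~1.96: with no bias this is the usual normal critical value
print(cva(1.0))   # larger, accounting for an average squared bias/se ratio of 1

# Hypothetical simulation: check average coverage across J = 500 effects
rng = np.random.default_rng(1)
theta = rng.normal(0.0, np.sqrt(0.5), 500)
Y = theta + rng.normal(0.0, 1.0, 500)
lo_ci, hi_ci = robust_ebci(Y, sigma2=1.0)
print(np.mean((lo_ci <= theta) & (theta <= hi_ci)))   # average coverage, about 0.95 or more
```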

Implemented in R, Matlab and Stata packages.

The shrinkage estimator \(\hat\theta_j=wY_j\) with the plug-in optimal weight \(w=\hat\mu_{2,J}/(\hat\mu_{2,J}+\sigma^2)\) has an empirical Bayes interpretation:

- Assume \(\theta_j\)’s drawn from common prior \(N(0,\mu_2)\) with unknown \(\mu_2\).
- Estimate \(\mu_2\) and use this to form optimal Bayesian estimate (posterior mean)

What about the average coverage intervals we constructed? They are *empirical Bayes confidence intervals* (EBCIs) in the sense of Morris (1983): they have ex-ante coverage at least \(1-\alpha\) over the prior distribution of \(\theta_j\).

If the \(N(0,\mu_2)\) prior is correct, one can use a *parametric EBCI* by plugging \(\hat\mu_{2,J}\) into the Bayesian credible interval (see Morris (1983)).

In contrast, the intervals we constructed are *robust* to failure of the \(N(0,\mu_2)\) prior.

Average coverage (and related notions of coverage) originated in nonparametric regression literature.

Ideas described above extend to other settings where estimate of average squared bias-sd ratio is available

Multiple testing: certain multiple testing adjustments still yield false discovery rate (FDR) control when applied to average coverage controlling CIs/tests (Armstrong (2022))

Thus, we can “break” impossibility of adaptation results by requiring FDR control instead of classical size control for each test

- `RDHonest` (package for R and Stata): bias-aware CIs in regression discontinuity
- `ATEHonest` (package for R): average treatment effects under unconfoundedness
- `ebci`/`ebreg` (package for R, Stata and Matlab): empirical Bayes confidence intervals

Armstrong, Timothy B. 2015. “Adaptive Testing on a Regression
Function at a Point.” *The Annals of Statistics* 43 (5):
2086–2101. https://doi.org/10.1214/15-AOS1342.

———. 2021. “Adaptation Bounds for Confidence Bands Under
Self-Similarity.” *Bernoulli* 27 (2): 1348–70. https://doi.org/10.3150/20-BEJ1277.

———. 2022. “False Discovery Rate
Adjustments for Average
Significance Level Controlling
Tests.” arXiv. https://doi.org/10.48550/arXiv.2209.13686.

Armstrong, Timothy B., and Michal Kolesár. 2018. “Optimal
Inference in a Class of
Regression Models.”
*Econometrica* 86 (2): 655–83. https://doi.org/10.3982/ECTA14434.

———. 2020. “Simple and Honest Confidence Intervals in
Nonparametric Regression.” *Quantitative Economics* 11
(1): 1–39. https://doi.org/10.3982/QE1199.

———. 2021a. “Finite-Sample Optimal
Estimation and Inference on
Average Treatment Effects
Under Unconfoundedness.”
*Econometrica* 89 (3): 1141–77. https://doi.org/10.3982/ECTA16907.

———. 2021b. “Sensitivity Analysis Using Approximate Moment
Condition Models.” *Quantitative Economics* 12 (1):
77–108. https://doi.org/10.3982/TE1559.

Armstrong, Timothy B., Michal Kolesár, and Soonwoo Kwon. 2020.
“Bias-Aware Inference in
Regularized Regression
Models.” *arXiv:2012.14823 [Econ, Stat]*,
December. http://arxiv.org/abs/2012.14823.

Armstrong, Timothy B., Michal Kolesár, and Mikkel Plagborg-Møller. 2022.
“Robust Empirical Bayes
Confidence Intervals.”
*Econometrica* 90 (6): 2567–2602. https://doi.org/10.3982/ECTA18597.

Breiman, Leo. 2001. “Statistical Modeling:
The Two Cultures (with Comments
and a Rejoinder by the Author).” *Statistical Science* 16
(3): 199–231. https://doi.org/10.1214/ss/1009213726.

Cai, T. Tony, and Mark G. Low. 2004. “An Adaptation
Theory for Nonparametric
Confidence Intervals.” *The Annals
of Statistics* 32 (5): 1805–40. http://www.jstor.org/stable/3448556.

Cheng, Ming-Yen, Jianqing Fan, and J. S. Marron. 1997. “On
Automatic Boundary Corrections.” *The Annals of
Statistics* 25 (4): 1691–1708. https://doi.org/10.1214/aos/1031594737.

Donoho, David L. 1994. “Statistical Estimation and
Optimal Recovery.” *The Annals of
Statistics* 22 (1): 238–70. https://doi.org/10.1214/aos/1176325367.

Gao, Wayne Yuan. 2018. “Minimax Linear Estimation at a Boundary
Point.” *Journal of Multivariate Analysis* 165 (May):
262–69. https://doi.org/10.1016/j.jmva.2018.01.001.

Giné, Evarist, and Richard Nickl. 2010. “Confidence Bands in
Density Estimation.” *The Annals of Statistics* 38 (2):
1122–70. https://doi.org/10.1214/09-AOS738.

Ignatiadis, Nikolaos, and Stefan Wager. 2021. “Confidence
Intervals for Nonparametric
Empirical Bayes Analysis.”
*Journal of the American Statistical Association* 0 (0): 1–18. https://doi.org/10.1080/01621459.2021.2008403.

Imbens, Guido, and Stefan Wager. 2019. “Optimized
Regression Discontinuity
Designs.” *The Review of Economics and
Statistics* 101 (2): 264–78. https://doi.org/10.1162/rest_a_00793.

Ishihara, Takuya, and Toru Kitagawa. 2021. “Evidence
Aggregation for Treatment
Choice.”

Javanmard, Adel, and Andrea Montanari. 2014. “Confidence
Intervals and Hypothesis Testing
for High-Dimensional
Regression.” *Journal of Machine Learning
Research* 15 (82): 2869–909. http://jmlr.org/papers/v15/javanmard14a.html.

Kwon, Koohyun, and Soonwoo Kwon. 2020a. “Adaptive
Inference in Multivariate
Nonparametric Regression Models
Under Monotonicity.”
*arXiv:2011.14219 [Econ, Math, Stat]*, November. http://arxiv.org/abs/2011.14219.

———. 2020b. “Inference in Regression
Discontinuity Designs Under
Monotonicity.” *arXiv:2011.14216 [Econ,
Stat]*, November. http://arxiv.org/abs/2011.14216.

Lepski, O. V., and A. B. Tsybakov. 2000. “Asymptotically Exact
Nonparametric Hypothesis Testing in Sup-Norm and at a Fixed
Point.” *Probability Theory and Related Fields* 117 (1):
17–48. https://doi.org/10.1007/s004400050265.

Lepskii, O. 1991. “On a Problem of
Adaptive Estimation in Gaussian
White Noise.” *Theory of Probability
& Its Applications* 35 (3): 454–66. https://doi.org/10.1137/1135065.

Li, Ker-Chau. 1989. “Honest Confidence
Regions for Nonparametric
Regression.” *The Annals of Statistics* 17
(3): 1001–8. https://doi.org/10.1214/aos/1176347253.

Low, Mark G. 1997. “On Nonparametric Confidence Intervals.”
*The Annals of Statistics* 25 (6): 2547–54. https://doi.org/10.1214/aos/1030741084.

Morris, Carl N. 1983. “Parametric Empirical
Bayes Confidence
Intervals.” In *Scientific
Inference, Data Analysis, and
Robustness*, edited by G. E. P. Box, Tom Leonard, and
Chien-Fu Wu, 25–50. Academic Press. https://doi.org/10.1016/B978-0-12-121160-8.50008-9.

Noack, Claudia, and Christoph Rothe. 2019.
“Bias-Aware Inference in
Fuzzy Regression Discontinuity
Designs.” *arXiv:1906.04631 [Econ, Stat]*,
June. http://arxiv.org/abs/1906.04631.

Rambachan, Ashesh, and Jonathan Roth. 2019. “An
Honest Approach to Parallel
Trends.”

Robinson, P. M. 1988. “Root-N-Consistent
Semiparametric Regression.”
*Econometrica* 56 (4): 931–54. https://doi.org/10.2307/1912705.

Sacks, Jerome, and Donald Ylvisaker. 1978. “Linear
Estimation for Approximately
Linear Models.” *The Annals of
Statistics* 6 (5): 1122–37. https://doi.org/10.1214/aos/1176344315.

Yata, Kohei. 2021. “Optimal Decision
Rules Under Partial
Identification.” arXiv. https://doi.org/10.48550/arXiv.2111.04926.

Zhao, Linda H. 1997. “Minimax Linear Estimation in a White Noise
Problem.” *The Annals of Statistics* 25 (2): 745–55. https://doi.org/10.1214/aos/1031833671.