Tim Armstrong

April 2023

High dimensional problem \(\Rightarrow\) regularized estimator \(\Rightarrow\) bias-variance tradeoff

Want to choose tuning parameter to optimize bias-variance tradeoff (for MSE, CI length, etc.)

We’ll present a general theory for doing this *in finite samples* using *minimax linear (or affine) estimators* and *bias-aware fixed-length CIs*

Gives (near) optimal estimates and CIs when the parameter space is *convex* and *centrosymmetric* (some results extend to asymmetry)

Finite sample results use idealized setting (normal errors, etc.). Can use this to show an asymptotic property called *honesty* when normality is relaxed.

Bias-variance tradeoff depends on “regularity parameter” (bound on derivative or norm/number of regressors, etc.)

Problem of *adaptation*: can we solve this problem automatically without prior knowledge of the regularity parameter?

For inference on a single parameter: *no*

For inference on many parameters: sometimes *yes*, but with a *different notion of coverage*

- High dimensional culture
  - dominates nonparametric/semiparametric/regularized regression literatures
  - use asymptotics to get optimal rates/constants
  - if a problem is too complicated for exact results, settle for asymptotic results
- Low dimensional culture
  - dominates weak IV/moment inequality/“nonstandard inference” literatures
  - suspicion of asymptotics without finite sample results in idealized setting
  - if a problem is too complicated for exact results, solve a different problem (e.g. assume all covariates are discrete)
- Both are subcultures of “data modeling culture” critiqued by Breiman (2001) (although many of his critiques are aimed specifically at low dimensional data modeling)

As a graduate student, I derived optimality results for conditional moment inequalities using tools from the “high dimensional culture” (bias-variance tradeoffs, optimal rates)

People working in the moment inequality literature were resistant: they were used to the low dimensional paradigm, and didn’t think that ideas from the high dimensional paradigm could be relevant.

Some of the critiques I heard were tantamount to rejecting high dimensional and nonparametric statistics entirely (e.g. “I don’t think bias-variance tradeoffs from nonparametric statistics are relevant in finite samples”)

I started working on finite sample nonparametrics in order to address these critiques. I discovered that there was indeed some work on this that had mostly been forgotten or ignored.

- Estimating a single parameter
  - minimax linear estimators
  - (near) optimality; impossibility of adaptation
- Estimating many parameters
  - adaptive estimation
  - average coverage intervals

Observe \(\{(x_i,Y_i)\}_{i=1}^n\) with \[ Y_i = f(x_i) + u_i,\quad E(u_i)=0 \]

Treat \(x_1,\ldots,x_n\) as fixed (if they are random, replace \(E(\cdot)\) with \(E(\cdot |x_1,\ldots,x_n)\)).

Subscript to denote dependence on \(f\) when needed: e.g. \(E_f Y_i=f(x_i)\).

Specify a (possibly infinite dimensional) parameter space \(\mathcal{F}\) such that \(f\) is assumed to be in \(\mathcal{F}\).

Interested in a *scalar* parameter \(Lf\) that is a *linear functional* of \(f\).

All other aspects of the regression function \(f\) other than \(Lf\) are *nuisance parameters*

Consider case where \(x_i\in\mathbb{R}\) for simplicity

No parametric assumptions on \(f\), just assume that it’s “smooth”

To formalize “smooth”, specify some order \(p\) of the derivative and a bound \(C\) such that the \(p\)th derivative is bounded by \(C\): \[ \mathcal{F}=\{f: |f^{(p)}(x)|\le C \text{ a.e.}\} \] This is called a *Hölder regularity class*.

Interested in \(Lf=f(x_0)\) (or \(Lf=f'(x_0)\), etc.) where \(x_0\) is some prespecified point

Popular in *regression discontinuity* (difference of \(f\) at two boundary points)

Linear model \[ f(w_i,z_i) = w_i\beta + z_i'\gamma \] where \(x_i=(w_i,z_i')'\), \(w_i\in \mathbb{R}\), \(z_i\in\mathbb{R}^k\)

Interested in \(\beta\) (“treatment effect of \(w_i\) after controlling for \(z_i\)”)

Assume \(\|\gamma\|\le C\) for some norm \(\|\cdot\|\) and constant \(C\) (e.g. \(\|\gamma\|=\sum_{j=1}^k |\gamma_j|\)).

Falls into our framework with \(Lf=\beta=\frac{d}{dw}f(w,z)\) and \[ \mathcal{F}=\{f(w,z)=w\beta+z'\gamma: \|\gamma\|\le C\} \]

Common motivation for assumption \(\|\gamma\|\le C\): if \(k+1>n\) (more regressors than observations), we can’t do anything without some additional assumptions (OLS not even defined)

Thus the heading “high dimensional controls”

But what if \(k\ll n\) (say \(k=2\) or \(k=3\))?

We might still gain efficiency by assuming a bound \(\|\gamma\|\le C\). How to do so optimally?

Same as “high dimensional controls”, but replace \(z_i'\gamma\) with \(g(z_i)\): \[ f(w_i,z_i)=w_i\beta + g(z_i) \]

Still interested in \(Lf=\beta=\frac{d}{dw}f(w,z)\).

Specify a smoothness class for \(g(\cdot)\) as in nonparametric regression (\(p\)th derivative bounded by \(C\)). Falls into our framework with \[ \mathcal{F}=\{f: f(w,z)=w\beta+g(z),\, |g^{(p)}(z)|\le C \text{ a.e.}\} \]

Main takeaway: suppose

- \(Lf\) is a *linear* functional (usual definition of linear transformation: \(L(a f+g)=a Lf+Lg\))
- \(\mathcal{F}\) is a *convex* set (usual definition: \(f,g\in\mathcal{F}\), \(\lambda\in[0,1]\) implies \(\lambda f+(1-\lambda)g\in\mathcal{F}\))
- \(\mathcal{F}\) is *centrosymmetric* (\(f\in\mathcal{F}\) implies \(-f\in\mathcal{F}\))

These conditions hold in all of the above examples

Then *minimax linear estimators* and *bias-aware fixed-length CIs* are (near) optimal

We now describe these procedures (skipping optimality theory), using nonparametric regression as an example.

A *linear estimator* \(\hat L\) is a linear function of \(Y=(Y_1,\ldots,Y_n)'\): \[ \hat L = w'Y = \sum_{i=1}^n w_i Y_i \] where \(w=w(X)\) is an \(n\times 1\) vector of *estimation weights* that can depend on the covariates \(X=(x_1,\ldots,x_n)'\).

Example: linear regression model \(f(x_i)=x_i'\beta\), estimate \(Lf=\beta_1\) (the first coefficient) using OLS: \[ \hat L = e_1'(X'X)^{-1}X'Y \] where \(e_1=(1,0,\ldots,0)\).

This is a linear estimator with \(w'=e_1'(X'X)^{-1}X'\).

Recall our model: interested in \(Lf\) where \[ Y_i = f(x_i) + u_i, \quad E(u_i)=0 \] \(x_i\)s are nonrandom.

Use linearity to get bias of \(\hat L=\sum_{i=1}^n w_iY_i\): \[ E_f \hat L - Lf = \sum_{i=1}^n w_iE_fY_i - Lf = \sum_{i=1}^n w_i f(x_i) - Lf =: \operatorname{bias}_f(w) \]

For finite sample results, assume \(u_i\stackrel{iid}{\sim} N(0,\sigma^2)\). Then \(\operatorname{var}(\hat L) = \sigma^2 w'w\) and \[ \hat L - Lf \sim N\left(\operatorname{bias}_f(w), \operatorname{var}(\hat L)\right). \]

Can’t compute bias since \(f\) is unknown. However, under our assumption \(f\in\mathcal{F}\), we can compute the *worst-case bias*: \[ \overline{\operatorname{bias}}_{\mathcal{F}}(w) = \sup_{f\in\mathcal{F}} |\operatorname{bias}_f(w)| \]

Consider the \(z\)-statistic \[ \frac{\hat L - Lf}{\operatorname{sd}(\hat L)} \sim N(t,1) \quad \text{where} \quad |t|\le \frac{\overline{\operatorname{bias}}_{\mathcal{F}}(w)}{\operatorname{sd}(\hat L)} \]

Let \(\operatorname{cv}_\alpha(t)= \left\{ 1-\alpha \text{ quantile of } |N(t,1)| \right\}\). Leads to bias-aware fixed length CI: \[ \hat L \pm \operatorname{cv}_\alpha\left(\frac{\overline{\operatorname{bias}}_{\mathcal{F}}(w)}{\operatorname{sd}(\hat L)}\right)\cdot \operatorname{sd}(\hat L) \]
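
As a rough sketch (an illustration under the idealized normal model above, not the `RDHonest`/`ebci` implementations), the critical value and interval can be computed as follows; \(\operatorname{cv}_\alpha(t)\) is the square root of the \(1-\alpha\) quantile of a noncentral \(\chi^2_1\) with noncentrality \(t^2\), and the numerical inputs are hypothetical.

```python
# Minimal sketch: bias-aware fixed-length CI given a linear estimate Lhat,
# its standard deviation, and a bound on worst-case bias.
import numpy as np
from scipy.stats import ncx2, norm

def cv_alpha(t, alpha=0.05):
    """1 - alpha quantile of |N(t, 1)| for worst-case (bias/sd) ratio t >= 0."""
    if t == 0:
        return norm.ppf(1 - alpha / 2)  # usual two-sided normal critical value
    return np.sqrt(ncx2.ppf(1 - alpha, df=1, nc=t**2))

def bias_aware_ci(Lhat, sd, max_bias, alpha=0.05):
    """Fixed-length CI: Lhat +/- cv_alpha(max_bias / sd) * sd."""
    hw = cv_alpha(max_bias / sd, alpha) * sd
    return Lhat - hw, Lhat + hw

# Hypothetical numbers: estimate 1.0, sd 0.2, worst-case bias 0.1
print(cv_alpha(0.0))                 # ~1.96 when there is no bias
print(cv_alpha(0.5))                 # > 1.96, inflated to account for bias
print(bias_aware_ci(1.0, 0.2, 0.1))
```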

Feasible, heteroskedasticity robust CI: replace \(\operatorname{sd}(\hat L)\) with a robust standard error

Optimal weights \(w\) depend on the criterion. For CI, can optimize length: \[ 2\cdot \operatorname{cv}_\alpha\left(\frac{\overline{\operatorname{bias}}_{\mathcal{F}}(w)}{\operatorname{sd}(\hat L)}\right)\cdot \operatorname{sd}(\hat L) \] (optimized \(w\) still leads to valid CI, since it doesn’t depend on \(Y\)).

For estimation, consider worst-case mean squared error (MSE): \[ \sup_{f\in\mathcal{F}} E_f\left(\hat L-Lf\right)^2 =\overline{\operatorname{bias}}_\mathcal{F}(w)^2 + \operatorname{var}(\hat L) \] or absolute error, etc.
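
One line of algebra behind this display, using that the variance of a linear estimator does not depend on \(f\): \[ E_f\left(\hat L-Lf\right)^2 = \operatorname{bias}_f(w)^2 + \operatorname{var}(\hat L) \quad\Longrightarrow\quad \sup_{f\in\mathcal{F}} E_f\left(\hat L-Lf\right)^2 = \sup_{f\in\mathcal{F}}\operatorname{bias}_f(w)^2 + \operatorname{var}(\hat L) = \overline{\operatorname{bias}}_\mathcal{F}(w)^2+\operatorname{var}(\hat L) \]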

If \(Lf\) gives net welfare from implementing a policy, can optimize worst-case welfare regret as advocated by Manski (see Yata (2021), Ishihara and Kitagawa (2021))

All of these criteria are increasing in \(\overline{\operatorname{bias}}_\mathcal{F}(w)^2\) and \(\operatorname{var}(\hat L)\). Thus, to find optimal weights \(w\):

Step 1: For each \(B\ge 0\), solve \[ \min \operatorname{var}(\hat L) \quad\text{s.t.}\quad \overline{\operatorname{bias}}_\mathcal{F}(w) \le B \] where \[ \operatorname{var}(\hat L) = \sigma^2 w'w, \quad \overline{\operatorname{bias}}_\mathcal{F}(w) = \max_{f\in\mathcal{F}} \left| \sum_{i=1}^n w_i f(x_i) - Lf \right| \]

Step 2: Find the \(B\) that minimizes the given criterion (CI length, MSE, welfare regret, etc.)

The resulting estimator is a *minimax linear estimator*: it is the linear estimator that minimizes worst-case performance.

There is a general formulation of Step 1 as a convex optimization problem (see Donoho (1994), Armstrong and Kolesár (2018)).

Interested in \(Lf=f(x_0)\) for some given \(x_0\). We’ll normalize \(x_0=0\) for notational convenience.

Assume that \(p\)th derivative of \(f(x)\) is bounded by \(C\).

When \(p=1\), this is a *Lipschitz* constraint: \[ \mathcal{F}_{\text{Lip}} = \{f: |f'(x)|\le C \text{ a.e.}\} \]

For simplicity, we’ll consider a relaxed version of this constraint used by Sacks and Ylvisaker (1978): \[ \mathcal{F}_{\text{SY},1} = \{f: |f(x)-f(0)|\le C |x| \text{ all }x\in\mathbb{R}\} \] (turns out the results are the same in this case).

Bias of linear estimator \(\hat L=w'Y\): \[ \operatorname{bias}_f(w) = \sum_{i=1}^n w_i f(x_i) - f(0) \]

First, note that we must have \(\sum_{i=1}^n w_i=1\) to have \(\overline{\operatorname{bias}}_{\mathcal{F}_{\text{SY},1}}(w)<\infty\). Otherwise, bias can be arbitrarily large when \(f(x)=c\) for large \(c\).

Incorporating this constraint, we can write the bias as \(\sum_{i=1}^n w_i [f(x_i) - f(0)]\). Worst-case bias: \[\begin{align*} \overline{\operatorname{bias}}_{\mathcal{F}_{\text{SY},1}}(w) &= \sup \sum_{i=1}^n w_i [f(x_i) - f(0)] \quad\text{s.t.}\quad |f(x) - f(0)|\le C|x| \text{ all }x \\ &= C \sum_{i=1}^n |w_i|\cdot |x_i| \end{align*}\] Last step follows by noting that \(f(x_i)=C |x_i|\cdot \operatorname{sign}(w_i)\) maximizes each term in the sum.

Variance is \(\sigma^2 w'w=\sigma^2\sum_{i=1}^n w_i^2\). Incorporating constraint \(\sum_{i=1}^n w_i=1\) leads to optimization problem for optimal weights: \[ \min \sum_{i=1}^n w_i^2 \quad\text{s.t.}\quad C \sum_{i=1}^n |w_i|\cdot |x_i|\le B, \, \sum_{i=1}^n w_i=1. \]

Lagrangian: \[ \mathcal{L}=\frac{1}{2}\sum_{i=1}^n w_i^2 +\lambda \sum_{i=1}^n |w_i|\cdot |x_i| +\mu \left( 1-\sum_{i=1}^n w_i \right) \]

FOC for \(w_i\): \[\begin{align*} &0 = \frac{d \mathcal{L}}{d w_i} = w_i + \lambda |x_i|\cdot \text{sign}(w_i) - \mu \\ &\Longrightarrow w_i = \mu - \lambda |x_i|\cdot \text{sign}(w_i) \end{align*}\] where we use the convention that \(\operatorname{sign}(w_i)\) can be anything in \([-1,1]\) when \(w_i=0\).

Three possibilities:

- \(w_i>0 \Longrightarrow w_i = \mu - \lambda |x_i|\)
- \(w_i=0 \Longrightarrow\) there exists \(c\in [-1,1]\) such that \(0 = \mu - \lambda |x_i|\cdot c\) \(\Longrightarrow\) \(\mu-\lambda |x_i|\le 0\)
- \(w_i<0\) cannot hold for optimal \(w_i\) (can make objective smaller by setting negative \(w_i\) to 0 and decreasing one of the positive weights)

It follows that optimal weights \(w_i^*\) satisfy \(w_i^*=\max\{\mu-\lambda |x_i|,0\}\).

Incorporating constraint \(\sum_{i=1}^n w_i=1\) yields optimal weights \[ w_i^* = \frac{\max\{\mu-\lambda |x_i|,0\}}{\sum_{j=1}^n\max\{\mu-\lambda |x_j|,0\}} \]

Resulting estimator is a *Nadaraya-Watson* estimator: \[ \sum_{i=1}^n w_i^* Y_i = \frac{\sum_{i=1}^n Y_i k_{\operatorname{tri}}(x_i/h)}{\sum_{i=1}^n k_{\operatorname{tri}}(x_i/h)} \] where \(k_{\operatorname{tri}}(t)=\max\left\{1-|t|,0\right\}\) is the *triangular kernel* and \(h=\mu/\lambda\) is the *bandwidth*.

Optimal bandwidth depends on constraint \(B\) on bias through Lagrange multipliers \(\mu\) and \(\lambda\).
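
A minimal numerical sketch (assuming the class \(\mathcal{F}_{\text{SY},1}\) with known \(C\) and \(\sigma\), target \(Lf=f(0)\), and a purely hypothetical design): since the minimax weights are triangular-kernel Nadaraya-Watson weights, Steps 1 and 2 reduce to searching over the bandwidth \(h\) for the value that minimizes worst-case MSE or bias-aware CI length.

```python
# Minimal sketch: trace out triangular-kernel weights over a bandwidth grid and
# pick the bandwidth minimizing worst-case MSE or bias-aware CI length.
import numpy as np
from scipy.stats import ncx2, norm

def cv_alpha(t, alpha=0.05):  # as in the earlier sketch
    return norm.ppf(1 - alpha / 2) if t == 0 else np.sqrt(ncx2.ppf(1 - alpha, df=1, nc=t**2))

def choose_bandwidth(x, sigma, C, hs, alpha=0.05, criterion="mse"):
    best = None
    for h in hs:
        k = np.maximum(1.0 - np.abs(x) / h, 0.0)       # triangular kernel at x_i / h
        if k.sum() == 0:                               # no observations in the window
            continue
        w = k / k.sum()                                # Nadaraya-Watson weights, sum to 1
        max_bias = C * np.sum(np.abs(w) * np.abs(x))   # worst-case bias over F_{SY,1}
        sd = sigma * np.sqrt(np.sum(w**2))
        if criterion == "mse":
            val = max_bias**2 + sd**2                        # worst-case MSE
        else:
            val = 2.0 * cv_alpha(max_bias / sd, alpha) * sd  # bias-aware CI length
        if best is None or val < best[0]:
            best = (val, h, w, max_bias, sd)
    return best  # (criterion value, bandwidth, weights, worst-case bias, sd)

# Hypothetical design: 200 points on [-1, 1], sigma = 1, bound C = 2
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1.0, 1.0, 200))
val, h, w, max_bias, sd = choose_bandwidth(x, sigma=1.0, C=2.0, hs=np.linspace(0.05, 1.0, 96))
print(h, max_bias, sd)
```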

To allow for non-normal, heteroskedastic errors (and serial/cluster correlation, etc.), replace \(\operatorname{sd}(\hat L)\) with a robust standard error for \(\hat L=\sum_{i=1}^n w_iY_i\): \[ \hat L \pm \operatorname{cv}_\alpha\left(\frac{\overline{\operatorname{bias}}_{\mathcal{F}}(w)}{\operatorname{se}_{\operatorname{rob}}}\right)\cdot \operatorname{se}_{\operatorname{rob}} \]

For example, if we assume independence but allow heteroskedasticity, we can use \(\operatorname{se}_{\operatorname{rob}} = \sqrt{\sum_{i=1}^n w_i^2 \hat u_i^2}\) where \(\hat u_i\) are residuals.
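
Sketch of the feasible robust interval; how the residuals \(\hat u_i\) are formed (e.g. from some pilot fit of \(f\)) is an assumption left unspecified here.

```python
# Minimal sketch: heteroskedasticity-robust bias-aware CI from estimation
# weights w, outcomes Y, residuals uhat, and a worst-case bias bound.
import numpy as np
from scipy.stats import ncx2, norm

def cv_alpha(t, alpha=0.05):  # as in the first sketch
    return norm.ppf(1 - alpha / 2) if t == 0 else np.sqrt(ncx2.ppf(1 - alpha, df=1, nc=t**2))

def robust_bias_aware_ci(w, Y, uhat, max_bias, alpha=0.05):
    Lhat = float(w @ Y)
    se_rob = np.sqrt(np.sum(w**2 * uhat**2))          # robust SE for sum_i w_i Y_i
    hw = cv_alpha(max_bias / se_rob, alpha) * se_rob
    return Lhat - hw, Lhat + hw
```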

Note that the bound \(\overline{\operatorname{bias}}_{\mathcal{F}}(w)\) on bias is valid in finite samples, so we avoid asymptotic approximations that are taboo in the “low dimensional culture”

This feasible CI is *honest* in the sense of Li (1989): it is asymptotically valid uniformly over \(f\in\mathcal{F}\) (and over sufficiently regular distributions for \(u_i\))

- This honesty property is sometimes called “uniformity in the underlying distribution”

To show this, it suffices to verify the Lindeberg condition for the weights \(w_i\), which gives a central limit theorem for \(\hat L=\sum_{i=1}^n w_iY_i\).

The Lindeberg condition holds for the optimal weights under some general conditions in the nonparametric setting (see Armstrong and Kolesár (2018), Armstrong and Kolesár (2021b)).

Intuition: if one \(w_i\) dominates, one can reduce variance with little change in bias by making it closer to weights for nearby \(x_i\)’s

In some cases, exact optimal weights are complicated

For example, consider Hölder class with bound on second derivative (popular in RD). No closed form for optimal weights in finite samples. There are asymptotically optimal kernels, but they are also a bit cumbersome (Zhao 1997; Gao 2018).

Often, some simple class of weights is near-optimal. For example, local linear estimators with triangular (or even uniform) kernel are asymptotically near-optimal under second derivative bound (see Cheng, Fan, and Marron (1997), Armstrong and Kolesár (2020)), and there is a simple formula for worst-case bias in finite samples (see Armstrong and Kolesár (2020)).

The `RDHonest` software package uses this approach to compute bias-aware CIs for regression discontinuity in R and Stata.

One can also check this in finite samples in a particular application (see Armstrong and Kolesár (2018) for an example).

Optimal kernels for relaxed version of Hölder class: Sacks and Ylvisaker (1978), Armstrong and Kolesár (2018)

Optimal weights in Lipschitz/Hölder class using convex optimization software: Armstrong and Kolesár (2021a) (for ATE), Imbens and Wager (2019) (for RD)

Asymptotically optimal kernels: Zhao (1997), Gao (2018), references in Lepski and Tsybakov (2000)

High dimensional/semiparametric controls: Armstrong, Kolesár, and Kwon (2020) (resulting estimator similar to Robinson (1988))

Sparsity in high dimensions: nonconvex, but Javanmard and Montanari (2014) use minimax linear estimation based on convex bound on initial estimation error

Misspecified GMM (Armstrong and Kolesár 2021b)

Density estimation (covered in Armstrong and Kolesár (2020))

Deconvolution (Ignatiadis and Wager 2021)

Diff-in-diff (Rambachan and Roth 2019)

Weak IV robust fuzzy RD (Noack and Rothe 2019)

We have restricted attention to linear procedures, but it can be shown that the minimax linear procedure is (near) minimax among *all* procedures (Donoho (1994), Armstrong and Kolesár (2018), Yata (2021))

Optimal estimator and CI require knowledge of bound \(C\) on derivative to choose weights and form CI.

Can we avoid this?

For inference, *no* (see Low (1997), Armstrong and Kolesár (2018))

For estimation, typically lose a \(\log n\) term (see Lepskii (1991))

Intuition: can’t estimate bias without auxiliary information

Auxiliary assumptions (need to lead to nonconvex or asymmetric parameter space):

- “Self-similarity” assumptions proposed by Giné and Nickl (2010) (although this introduces additional adaptation issues; see Armstrong (2021))
- Global polynomial “rule of thumb” (Armstrong and Kolesár 2020)

Asymmetric parameter space (can allow for some degree of adaptation depending on the setting)

With many parameters: average coverage

What if \(\mathcal{F}\) is convex but asymmetric (e.g. monotonicity, sign constraints)? Consider minimax *affine* estimator \(\hat L=w_0+w'Y\).

Convex optimization problem similar to symmetric case. Set \(w_0\) to equalize worst-case upward and downward bias.

Minimax affine estimators still near minimax among all estimators (Donoho 1994).

In contrast to centrosymmetric case, *some degree of adaptation for CIs is possible*. See, among others, Cai and Low (2004), Armstrong and Kolesár (2018), Armstrong (2015), Kwon and Kwon (2020a), Kwon and Kwon (2020b)

Intuition: use shape constraints to bound bias of affine estimators.

Biased estimates \(\hat \theta_j\) of \(\theta_j\) for \(j=1,\ldots,J\): \[ E\hat\theta_j-\theta_j = \operatorname{bias}_j, \quad \operatorname{var}\left(\hat\theta_j\right) = \operatorname{se}_j^2 \]

Consider *compound MSE*: \[ E\left[ \frac{1}{J}\sum_{j=1}^J (\hat\theta_j-\theta_j)^2 \right] = \frac{1}{J} \sum_{j=1}^J \left(\operatorname{bias}_j^2 + \operatorname{se}_j^2\right) \]

In many cases, compound loss *makes adaptation possible*

Why? Often possible to consistently estimate \(\frac{1}{J}\sum_{j=1}^J \operatorname{bias}_j^2\) as \(J\to\infty\) (using, e.g. cross-validation), but not when \(J=1\).

What is compound loss for CIs? Consider *average coverage* criterion at level \(\alpha\): \[ \frac{1}{J}\sum_{j=1}^J P(\theta_j\notin \operatorname{CI}_j)\le \alpha \]

Weaker than *classical coverage*, which requires \[ \max_{1\le j \le J} P(\theta_j\notin \operatorname{CI}_j)\le \alpha \]

We can “break” impossibility of adaptation by only requiring average coverage.

Unbiased estimate \(Y_j\) of parameter \(\theta_j\) for \(j=1,\ldots,J\) (e.g. fixed effects)

To reduce variance, regress \(Y_j\) on covariates \(X_j\) and “shrink” towards regression estimate \(X_j'\hat\beta\): \[ \hat\theta_j = w Y_j + (1-w) X_j'\hat\beta \]

Shrinkage weight \(w\) trades off bias vs variance:

- \(w=1\Longrightarrow\) zero bias, large variance
- \(w=0\Longrightarrow\) large bias, zero variance

For simplicity, assume \(X_j'\hat\beta=0\) and that \(Y_j\) is exactly normal with same variance \(\sigma^2\) for all \(j\): \[ Y_j \sim N(\theta_j,\sigma^2) \] with \(\sigma^2\) known

Shrinkage estimator \(\hat\theta_j= w Y_j\sim N(\theta_j+\operatorname{bias}_j,\operatorname{se}^2)\) where \(\operatorname{bias}_j=w\theta_j-\theta_j=(w-1)\theta_j\) and \(\operatorname{se}^2=w^2\sigma^2\).

Compound MSE: \[ E\left[ \frac{1}{J}\sum_{j=1}^J (\hat\theta_j-\theta_j)^2 \right] = \frac{1}{J} \sum_{j=1}^J \left(\operatorname{bias}_j^2 + \operatorname{se}^2\right) = (w-1)^2 \frac{1}{J} \sum_{j=1}^J \theta_j^2 + w^2\sigma^2 \]

Compound MSE is \((w-1)^2 \mu_{2,J} + w^2\sigma^2\) where \(\mu_{2,J}=\frac{1}{J} \sum_{j=1}^J \theta_j^2\).

If \(\mu_{2,J}\) is known, can minimize compound MSE over \(w\) to get \[ w_{\operatorname{opt}}=\frac{\mu_{2,J}}{\mu_{2,J}+\sigma^2} \]
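
This follows from the first-order condition of the quadratic in \(w\): \[ \frac{d}{dw}\left[(w-1)^2\mu_{2,J}+w^2\sigma^2\right] = 2(w-1)\mu_{2,J}+2w\sigma^2 = 0 \quad\Longrightarrow\quad w_{\operatorname{opt}}=\frac{\mu_{2,J}}{\mu_{2,J}+\sigma^2} \]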

Estimate of \(\mu_{2,J}\): \[ \hat\mu_{2,J} = \frac{1}{J}\sum_{j=1}^J Y_j^2 - \sigma^2. \] (consistent as \(J\to\infty\)).
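
Why this works: \(E Y_j^2=\theta_j^2+\sigma^2\), so the estimate is unbiased for \(\mu_{2,J}\): \[ E\hat\mu_{2,J} = \frac{1}{J}\sum_{j=1}^J \left(\theta_j^2+\sigma^2\right)-\sigma^2 = \mu_{2,J} \]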

Plugging in \(\hat\mu_{2,J}\) (with possible modifications for finite sample performance) gives a *Stein estimator*, an example of an *empirical Bayes* estimator
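
A minimal sketch of this plug-in estimator under the simplification above (known \(\sigma^2\), shrinkage toward 0); truncating \(\hat\mu_{2,J}\) at zero is one example of a finite-sample modification, and the simulated numbers are purely illustrative.

```python
# Minimal sketch: plug-in (Stein-type / empirical Bayes) shrinkage with
# Y_j ~ N(theta_j, sigma^2) independently and sigma^2 known.
import numpy as np

def eb_shrinkage(Y, sigma2):
    mu2_hat = max(np.mean(Y**2) - sigma2, 0.0)   # estimate of mu_{2,J}, truncated at 0
    w = mu2_hat / (mu2_hat + sigma2)             # plug-in optimal shrinkage weight
    return w * Y, w, mu2_hat

# Hypothetical simulation: J = 500 effects with mu_{2,J} around 0.5, sigma^2 = 1
rng = np.random.default_rng(1)
theta = rng.normal(0.0, np.sqrt(0.5), 500)
Y = theta + rng.normal(0.0, 1.0, 500)
theta_hat, w, mu2_hat = eb_shrinkage(Y, sigma2=1.0)
print(w)                                  # roughly 0.5 / (0.5 + 1) = 1/3
print(np.mean((theta_hat - theta)**2))    # compound MSE of shrinkage estimates
print(np.mean((Y - theta)**2))            # compound MSE of unshrunk estimates (~1)
```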

The Stein estimator is *adaptive* to \(\mu_{2,J}\): it achieves the same compound MSE (as \(J\to\infty\)) as an estimator that uses prior knowledge of \(\mu_{2,J}\)

Empirical Bayes motivation: \(\theta_j\)’s drawn from same prior \(N(0,\mu_2)\), \(\hat\mu_{2,J}\) is an estimate of \(\mu_2\). The empirical Bayes estimate is adaptive to the unknown prior variance \(\mu_2\).

- This requires slightly different notation from what we’re using here (to get our current notation from a Bayesian setup, we’d condition on \(\theta\))

How can we use this estimator to get an adaptive CI (or average coverage interval)? We will describe an approach from Armstrong, Kolesár, and Plagborg-Møller (2022).

Consider CIs based on shrinkage estimator \(\hat\theta_j=wY_j\): \[ \operatorname{CI}_j=\left\{\hat\theta_j \pm \chi \cdot \operatorname{se}\right\} \] where \(\chi\) is some critical value.

Basic idea: use \(\hat\mu_{2,J}\) to estimate a bound on average non-coverage.

First look at individual non-coverage. Note that \((\hat\theta_j-\theta_j)/\operatorname{se}\stackrel{d}{=}Z+\text{bias}_j/\operatorname{se}\) where \(Z\sim N(0,1)\). Thus, letting \(b_j=\text{bias}_j/\operatorname{se}\), we have \[\begin{align*} &P\left(\theta_j\notin \operatorname{CI}_j\right) =P\left(\left|\frac{\hat\theta_j-\theta_j}{\operatorname{se}}\right|\ge \chi\right) = P\left(\left|Z+b_j\right|\ge \chi\right) \\ &= \Phi\left(-\chi-b_j\right) + \Phi\left(-\chi+b_j\right) =: r(b_j,\chi). \end{align*}\]

Thus, average non-coverage is \[ \frac{1}{J}\sum_{j=1}^J P\left(\theta_j\notin \operatorname{CI}_j\right) =\frac{1}{J}\sum_{j=1}^J r(b_j,\chi) =E_{b\sim F_J} r(b,\chi) \] where \(F_J\) is the empirical distribution of \(b_j=\operatorname{bias}_j/\operatorname{se}=[(w-1)\theta_j]/(w\sigma)\) over \(j=1,\ldots, J\).

How can we bound this? Note that we can use \(\mu_{2,J}\) to compute \(E_{b\sim F_J} b^2\): \[ E_{b\sim F_J} b^2 = \frac{1}{J}\sum_{j=1}^J \frac{(w-1)^2}{w^2\sigma^2}\theta_j^2 = \frac{(w-1)^2}{w^2\sigma^2}\mu_{2,J} \]

Thus, average non-coverage is bounded by \(\rho\left(\frac{(w-1)^2}{w^2\sigma^2}\mu_{2,J},\chi\right)\) where \[ \rho\left(t,\chi\right) :=\sup_{F} E_{b\sim F} r(b,\chi) \quad\text{s.t.}\quad E_F b^2 \le t. \]

Let \(\operatorname{cva}_\alpha(t)=\rho^{-1}(t,\alpha)\) where the inverse is with respect to the second argument. Setting \(\chi=\operatorname{cva}_\alpha\left(\frac{(w-1)^2}{w^2\sigma^2}\mu_{2,J}\right)\) gives the interval \[ \operatorname{CI}_j=\hat\theta_j \pm \operatorname{cva}_\alpha\left(\frac{(w-1)^2}{w^2\sigma^2}\mu_{2,J}\right) \operatorname{se} \]

Feasible average coverage interval: plug in consistent estimate \(\hat\mu_{2,J}\) of \(\mu_{2,J}\).

To compute \(\operatorname{cva}_\alpha(t)\) we need to compute \(\rho(t,\chi)\). This is a convex program in a one-dimensional probability distribution \(F\), which can be solved by gridding over \(F\). Actually, it turns out that there is a closed-form solution (see Armstrong, Kolesár, and Plagborg-Møller (2022)).
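
A rough sketch of the gridding approach (an illustration, not the `ebci` package or the closed-form solution): solve an LP over a grid of \(|b|\) values to get \(\rho(t,\chi)\), invert over \(\chi\) to get \(\operatorname{cva}_\alpha(t)\), and plug in \(\hat\mu_{2,J}\). The grid range, resolution, and bracketing rule below are arbitrary choices made here.

```python
# Minimal sketch: robust EBCI critical value by gridding over F, then the
# feasible average-coverage intervals around the plug-in shrinkage estimates.
import numpy as np
from scipy.stats import norm
from scipy.optimize import linprog, brentq

def r(b, chi):
    """Non-coverage of [theta_hat_j +/- chi * se] when the bias/se ratio is b."""
    return norm.cdf(-chi - b) + norm.cdf(-chi + b)

def rho(t, chi, ngrid=400):
    """sup_F E_F r(b, chi) s.t. E_F b^2 <= t, as an LP over a grid of |b| values."""
    b = np.linspace(0.0, chi + 4.0 + 2.0 * np.sqrt(t), ngrid)
    res = linprog(c=-r(b, chi),                       # maximize sum_i p_i * r(b_i, chi)
                  A_ub=[b**2], b_ub=[t],              # second-moment constraint
                  A_eq=[np.ones(ngrid)], b_eq=[1.0])  # probabilities sum to one
    return -res.fun

def cva(t, alpha=0.05):
    """Critical value making worst-case average non-coverage equal to alpha."""
    hi = norm.ppf(1 - alpha / 2) + np.sqrt(t) + 1.0
    while rho(t, hi) >= alpha:                        # crude bracketing of the root
        hi *= 2.0
    return brentq(lambda chi: rho(t, chi) - alpha, 0.0, hi)

def robust_ebci(Y, sigma2, alpha=0.05):
    """Average-coverage intervals around the plug-in shrinkage estimates w * Y_j."""
    mu2_hat = max(np.mean(Y**2) - sigma2, 1e-8)       # floored; degenerate case not handled
    w = mu2_hat / (mu2_hat + sigma2)
    t = (w - 1.0)**2 * mu2_hat / (w**2 * sigma2)      # plug-in bound on average b^2
    hw = cva(t, alpha) * w * np.sqrt(sigma2)          # half-width, se = w * sigma
    return w * Y - hw, w * Y + hw

print(cva(0.0))   # ~1.96: with no bias this is the usual normal critical value
print(cva(1.0))   # larger, accounting for an average squared bias/se ratio of 1

# Hypothetical simulation: check average coverage across J = 500 effects
rng = np.random.default_rng(1)
theta = rng.normal(0.0, np.sqrt(0.5), 500)
Y = theta + rng.normal(0.0, 1.0, 500)
lo_ci, hi_ci = robust_ebci(Y, sigma2=1.0)
print(np.mean((lo_ci <= theta) & (theta <= hi_ci)))   # average coverage, about 0.95 or more
```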

Implemented in R, Matlab and Stata packages.

The shrinkage estimator \(\hat\theta_j=wY_j\) with the plug-in optimal weight \(w=\hat\mu_{2,J}/(\hat\mu_{2,J}+\sigma^2)\) has an empirical Bayes interpretation:

- Assume \(\theta_j\)’s drawn from common prior \(N(0,\mu_2)\) with unknown \(\mu_2\).
- Estimate \(\mu_2\) and use this to form optimal Bayesian estimate (posterior mean)

What about the average coverage intervals we constructed? They are *empirical Bayes confidence intervals* (EBCIs) in the sense of Morris (1983): they have ex-ante coverage at least \(1-\alpha\) over the prior distribution of \(\theta_j\).

If the \(N(0,\mu_2)\) prior is correct, one can use a *parametric EBCI* by plugging \(\hat\mu_{2,J}\) into the Bayesian credible interval (see Morris (1983)).

In contrast, the intervals we constructed are *robust* to failure of the \(N(0,\mu_2)\) prior.

Average coverage (and related notions of coverage) originated in nonparametric regression literature.

Ideas described above extend to other settings where estimate of average squared bias-sd ratio is available

Multiple testing: certain multiple testing adjustments still yield false discovery rate (FDR) control when applied to average coverage controlling CIs/tests (Armstrong (2022))

Thus, we can “break” impossibility of adaptation results by requiring FDR control instead of classical size control for each test

- `RDHonest` (package for R and Stata): bias-aware CIs in regression discontinuity
- `ATEHonest` (package for R): average treatment effects under unconfoundedness
- `ebci`/`ebreg` (package for R, Stata and Matlab): empirical Bayes confidence intervals

Armstrong, Timothy B. 2015. “Adaptive Testing on a Regression
Function at a Point.” *The Annals of Statistics* 43 (5):
2086–2101. https://doi.org/10.1214/15-AOS1342.

———. 2021. “Adaptation Bounds for Confidence Bands Under
Self-Similarity.” *Bernoulli* 27 (2): 1348–70. https://doi.org/10.3150/20-BEJ1277.

———. 2022. “False Discovery Rate
Adjustments for Average
Significance Level Controlling
Tests.” arXiv. https://doi.org/10.48550/arXiv.2209.13686.

Armstrong, Timothy B., and Michal Kolesár. 2018. “Optimal
Inference in a Class of
Regression Models.”
*Econometrica* 86 (2): 655–83. https://doi.org/10.3982/ECTA14434.

———. 2020. “Simple and Honest Confidence Intervals in
Nonparametric Regression.” *Quantitative Economics* 11
(1): 1–39. https://doi.org/10.3982/QE1199.

———. 2021a. “Finite-Sample Optimal
Estimation and Inference on
Average Treatment Effects
Under Unconfoundedness.”
*Econometrica* 89 (3): 1141–77. https://doi.org/10.3982/ECTA16907.

———. 2021b. “Sensitivity Analysis Using Approximate Moment
Condition Models.” *Quantitative Economics* 12 (1):
77–108. https://doi.org/10.3982/TE1559.

Armstrong, Timothy B., Michal Kolesár, and Soonwoo Kwon. 2020.
“Bias-Aware Inference in
Regularized Regression
Models.” *arXiv:2012.14823 [Econ, Stat]*,
December. http://arxiv.org/abs/2012.14823.

Armstrong, Timothy B., Michal Kolesár, and Mikkel Plagborg-Møller. 2022.
“Robust Empirical Bayes
Confidence Intervals.”
*Econometrica* 90 (6): 2567–2602. https://doi.org/10.3982/ECTA18597.

Breiman, Leo. 2001. “Statistical Modeling:
The Two Cultures (with Comments
and a Rejoinder by the Author).” *Statistical Science* 16
(3): 199–231. https://doi.org/10.1214/ss/1009213726.

Cai, T. Tony, and Mark G. Low. 2004. “An Adaptation
Theory for Nonparametric
Confidence Intervals.” *The Annals
of Statistics* 32 (5): 1805–40. http://www.jstor.org/stable/3448556.

Cheng, Ming-Yen, Jianqing Fan, and J. S. Marron. 1997. “On
Automatic Boundary Corrections.” *The Annals of
Statistics* 25 (4): 1691–1708. https://doi.org/10.1214/aos/1031594737.

Donoho, David L. 1994. “Statistical Estimation and
Optimal Recovery.” *The Annals of
Statistics* 22 (1): 238–70. https://doi.org/10.1214/aos/1176325367.

Gao, Wayne Yuan. 2018. “Minimax Linear Estimation at a Boundary
Point.” *Journal of Multivariate Analysis* 165 (May):
262–69. https://doi.org/10.1016/j.jmva.2018.01.001.

Giné, Evarist, and Richard Nickl. 2010. “Confidence Bands in
Density Estimation.” *The Annals of Statistics* 38 (2):
1122–70. https://doi.org/10.1214/09-AOS738.

Ignatiadis, Nikolaos, and Stefan Wager. 2021. “Confidence
Intervals for Nonparametric
Empirical Bayes Analysis.”
*Journal of the American Statistical Association* 0 (0): 1–18. https://doi.org/10.1080/01621459.2021.2008403.

Imbens, Guido, and Stefan Wager. 2019. “Optimized
Regression Discontinuity
Designs.” *The Review of Economics and
Statistics* 101 (2): 264–78. https://doi.org/10.1162/rest_a_00793.

Ishihara, Takuya, and Toru Kitagawa. 2021. “Evidence
Aggregation for Treatment
Choice.”

Javanmard, Adel, and Andrea Montanari. 2014. “Confidence
Intervals and Hypothesis Testing
for High-Dimensional
Regression.” *Journal of Machine Learning
Research* 15 (82): 2869–909. http://jmlr.org/papers/v15/javanmard14a.html.

Kwon, Koohyun, and Soonwoo Kwon. 2020a. “Adaptive
Inference in Multivariate
Nonparametric Regression Models
Under Monotonicity.”
*arXiv:2011.14219 [Econ, Math, Stat]*, November. http://arxiv.org/abs/2011.14219.

———. 2020b. “Inference in Regression
Discontinuity Designs Under
Monotonicity.” *arXiv:2011.14216 [Econ,
Stat]*, November. http://arxiv.org/abs/2011.14216.

Lepski, O. V., and A. B. Tsybakov. 2000. “Asymptotically Exact
Nonparametric Hypothesis Testing in Sup-Norm and at a Fixed
Point.” *Probability Theory and Related Fields* 117 (1):
17–48. https://doi.org/10.1007/s004400050265.

Lepskii, O. 1991. “On a Problem of
Adaptive Estimation in Gaussian
White Noise.” *Theory of Probability
& Its Applications* 35 (3): 454–66. https://doi.org/10.1137/1135065.

Li, Ker-Chau. 1989. “Honest Confidence
Regions for Nonparametric
Regression.” *The Annals of Statistics* 17
(3): 1001–8. https://doi.org/10.1214/aos/1176347253.

Low, Mark G. 1997. “On Nonparametric Confidence Intervals.”
*The Annals of Statistics* 25 (6): 2547–54. https://doi.org/10.1214/aos/1030741084.

Morris, Carl N. 1983. “Parametric Empirical
Bayes Confidence
Intervals.” In *Scientific
Inference, Data Analysis, and
Robustness*, edited by G. E. P. Box, Tom Leonard, and
Chien-Fu Wu, 25–50. Academic Press. https://doi.org/10.1016/B978-0-12-121160-8.50008-9.

Noack, Claudia, and Christoph Rothe. 2019.
“Bias-Aware Inference in
Fuzzy Regression Discontinuity
Designs.” *arXiv:1906.04631 [Econ, Stat]*,
June. http://arxiv.org/abs/1906.04631.

Rambachan, Ashesh, and Jonathan Roth. 2019. “An
Honest Approach to Parallel
Trends.”

Robinson, P. M. 1988. “Root-N-Consistent
Semiparametric Regression.”
*Econometrica* 56 (4): 931–54. https://doi.org/10.2307/1912705.

Sacks, Jerome, and Donald Ylvisaker. 1978. “Linear
Estimation for Approximately
Linear Models.” *The Annals of
Statistics* 6 (5): 1122–37. https://doi.org/10.1214/aos/1176344315.

Yata, Kohei. 2021. “Optimal Decision
Rules Under Partial
Identification.” arXiv. https://doi.org/10.48550/arXiv.2111.04926.

Zhao, Linda H. 1997. “Minimax Linear Estimation in a White Noise
Problem.” *The Annals of Statistics* 25 (2): 745–55. https://doi.org/10.1214/aos/1031833671.