Tim Armstrong
April 2023
High dimensional problem \(\Rightarrow\) regularized estimator \(\Rightarrow\) bias-variance tradeoff
Want to choose tuning parameter to optimize bias-variance tradeoff (for MSE, CI length, etc.)
We’ll present a general theory for doing this in finite samples using minimax linear (or affine) estimators and bias-aware fixed-length CIs
Gives (near) optimal estimates and CIs when the parameter space is convex and centrosymmetric (some results extend to asymmetry)
Finite sample results use idealized setting (normal errors, etc.). Can use this to show an asymptotic property called honesty when normality is relaxed.
Bias-variance tradeoff depends on “regularity parameter” (bound on derivative or norm/number of regressors, etc.)
Problem of adaptation: can we solve this problem automatically without prior knowledge of the regularity parameter?
For inference on a single parameter: no
For inference on many parameters: sometimes yes, but with a different notion of coverage
As a graduate student, I derived optimality results for conditional moment inequalities using tools from the “high dimensional culture” (bias-variance tradeoffs, optimal rates)
People working in the moment inequality literature were resistant: they were used to the low dimensional paradigm, and didn’t think that ideas from the high dimensional paradigm could be relevant.
Some of the critiques I heard were tantamount to rejecting high dimensional and nonparametric statistics entirely (e.g. “I don’t think bias-variance tradeoffs from nonparametric statistics are relevant in finite samples”)
I started working on finite sample nonparametrics in order to address these critiques. I discovered that there was indeed some work on this that had mostly been forgotten or ignored.
Observe \(\{(x_i,Y_i)\}_{i=1}^n\) with \[ Y_i = f(x_i) + u_i,\quad E(u_i)=0 \]
Treat \(x_1,\ldots,x_n\) as fixed (if they are random, replace \(E(\cdot)\) with \(E(\cdot |x_1,\ldots,x_n)\)).
Subscript to denote dependence on \(f\) when needed: e.g. \(E_f Y_i=f(x_i)\).
Specify a (possibly infinite dimensional) parameter space \(\mathcal{F}\) such that \(f\) is assumed to be in \(\mathcal{F}\).
Interested in a scalar parameter \(Lf\) that is a linear functional of \(f\).
All aspects of the regression function \(f\) other than \(Lf\) are nuisance parameters
Consider case where \(x_i\in\mathbb{R}\) for simplicity
No parametric assumptions on \(f\), just assume that it’s “smooth”
To formalize “smooth”, specify some order \(p\) of the derivative and a bound \(C\) such that the \(p\)th derivative is bounded by \(C\): \[ \mathcal{F}=\{f: |f^{(p)}(x)|\le C \text{ a.e.}\} \] This is called a Hölder regularity class.
Interested in \(Lf=f(x_0)\) (or \(Lf=f'(x_0)\), etc.) where \(x_0\) is some prespecified point
Popular in regression discontinuity (difference of \(f\) at two boundary points)
Linear model \[ f(w_i,z_i) = w_i\beta + z_i'\gamma \] where \(x_i=(w_i,z_i')'\), \(w_i\in \mathbb{R}\), \(z_i\in\mathbb{R}^k\)
Interested in \(\beta\) (“treatment effect of \(w_i\) after controlling for \(z_i\)”)
Assume \(\|\gamma\|\le C\) for some norm \(\|\cdot\|\) and constant \(C\) (e.g. \(\|\gamma\|=\sum_{j=1}^k |\gamma_j|\)).
Falls into our framework with \(Lf=\beta=\frac{d}{dw}f(w,z)\) and \[ \mathcal{F}=\{f(w,z)=w\beta+z'\gamma: \|\gamma\|\le C\} \]
Common motivation for assumption \(\|\gamma\|\le C\): if \(k+1>n\) (more regressors than observations), we can’t do anything without some additional assumptions (OLS not even defined)
Thus the heading “high dimensional controls”
But what if \(k\ll n\) (say \(k=2\) or \(k=3\))?
We might still gain efficiency by assuming a bound \(\|\gamma\|\le C\). How to do so optimally?
Same as “high dimensional controls”, but replace \(z_i'\gamma\) with \(g(z_i)\): \[ f(w_i,z_i)=w_i\beta + g(z_i) \]
Still interested in \(Lf=\beta=\frac{d}{dw}f(w,z)\).
Specify a smoothness class for \(g(\cdot)\) as in nonparametric regression (\(p\)th derivative bounded by \(C\)). Falls into our framework with \[ \mathcal{F}=\{f: f(w,z)=w\beta+g(z),\, |g^{(p)}(z)|\le C \text{ a.e.}\} \]
Main takeaway: suppose the data are normally distributed with mean linear in \(f\), the parameter space \(\mathcal{F}\) is convex and centrosymmetric, and \(Lf\) is a linear functional of \(f\)
These conditions hold in all of the above examples
Then minimax linear estimators and bias-aware fixed-length CIs are (near) optimal
We now describe these procedures (skipping optimality theory), using nonparametric regression as an example.
A linear estimator \(\hat L\) is a linear function of \(Y=(Y_1,\ldots,Y_n)'\): \[ \hat L = w'Y = \sum_{i=1}^n w_i Y_i \] where \(w=w(X)\) is an \(n\times 1\) vector of estimation weights that can depend on the covariates \(X=(x_1,\ldots,x_n)'\).
Example: linear regression model \(f(x_i)=x_i'\beta\), estimate \(Lf=\beta_1\) (the first coefficient) using OLS: \[ \hat L = e_1'(X'X)^{-1}X'Y \] where \(e_1=(1,0,\ldots,0)'\).
This is a linear estimator with \(w'=e_1'(X'X)^{-1}X'\).
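As a quick numerical illustration (a minimal sketch with simulated data; the design and coefficient values are arbitrary), the OLS coefficient on the first regressor coincides with \(w'Y\) for \(w'=e_1'(X'X)^{-1}X'\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
X = rng.normal(size=(n, k))                # design matrix with rows x_i'
beta = np.array([1.0, -0.5, 2.0])
Y = X @ beta + rng.normal(size=n)          # Y_i = x_i'beta + u_i

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # OLS coefficients

# the first OLS coefficient written as a linear estimator w'Y
e1 = np.zeros(k)
e1[0] = 1.0
w = np.linalg.solve(X.T @ X, e1) @ X.T         # w' = e_1'(X'X)^{-1} X'
print(np.isclose(w @ Y, beta_hat[0]))          # True
```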
Recall our model: interested in \(Lf\) where \[ Y_i = f(x_i) + u_i, \quad E(u_i)=0 \] \(x_i\)s are nonrandom.
Use linearity to get bias of \(\hat L=\sum_{i=1}^n w_iY_i\): \[ E_f \hat L - Lf = \sum_{i=1}^n w_iE_fY_i - Lf = \sum_{i=1}^n w_i f(x_i) - Lf =: \operatorname{bias}_f(w) \]
For finite sample results, assume \(u_i\stackrel{iid}{\sim} N(0,\sigma^2)\). Then \(\operatorname{var}(\hat L) = \sigma^2 w'w\) and \[ \hat L - Lf \sim N\left(\operatorname{bias}_f(w), \operatorname{var}(\hat L)\right). \]
Can’t compute bias since \(f\) is unknown. However, under our assumption \(f\in\mathcal{F}\), we can compute worst-case bias: \[ \overline{\operatorname{bias}}_{\mathcal{F}}(w) = \sup_{f\in\mathcal{F}} |\operatorname{bias}_f(w)| \]
Consider \(z\)-statistic \[ \frac{\hat L - Lf}{\operatorname{sd}(\hat L)} \sim N(t,1) \quad \text{where} \quad |t|\le \frac{\overline{\operatorname{bias}}_{\mathcal{F}}(w)}{\operatorname{sd}(\hat L)} \]
Let \(\operatorname{cv}_\alpha(t)\) denote the \(1-\alpha\) quantile of \(|N(t,1)|\). Leads to bias-aware fixed-length CI: \[ \hat L \pm \operatorname{cv}_\alpha\left(\frac{\overline{\operatorname{bias}}_{\mathcal{F}}(w)}{\operatorname{sd}(\hat L)}\right)\cdot \operatorname{sd}(\hat L) \]
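A minimal sketch of how \(\operatorname{cv}_\alpha(t)\) and the resulting CI can be computed by root-finding (function names are illustrative; \(\overline{\operatorname{bias}}_{\mathcal{F}}(w)\) and \(\operatorname{sd}(\hat L)\) are taken as given):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def cv(t, alpha=0.05):
    """1-alpha quantile of |N(t,1)|: solve Phi(c - t) - Phi(-c - t) = 1 - alpha for c."""
    return brentq(lambda c: norm.cdf(c - t) - norm.cdf(-c - t) - (1 - alpha),
                  0.0, abs(t) + 10.0)

def bias_aware_ci(Lhat, max_bias, sd, alpha=0.05):
    """Bias-aware fixed-length CI: Lhat +/- cv(max_bias/sd, alpha) * sd."""
    half = cv(max_bias / sd, alpha) * sd
    return Lhat - half, Lhat + half

print(cv(0.0))   # ~1.96: the usual critical value when there is no bias
print(cv(1.0))   # ~2.65: wider interval, to cover the worst-case bias
```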
Feasible, heteroskedasticity robust CI: replace \(\operatorname{sd}(\hat L)\) with a robust standard error
Optimal weights \(w\) depend on the criterion. For CI, can optimize length: \[ 2\cdot \operatorname{cv}_\alpha\left(\frac{\overline{\operatorname{bias}}_{\mathcal{F}}(w)}{\operatorname{sd}(\hat L)}\right)\cdot \operatorname{sd}(\hat L) \] (optimized \(w\) still leads to valid CI, since it doesn’t depend on \(Y\)).
For estimation, consider worst-case mean squared error (MSE): \[ \sup_{f\in\mathcal{F}} E_f\left(\hat L-Lf\right)^2 =\overline{\operatorname{bias}}_\mathcal{F}(w)^2 + \operatorname{var}(\hat L) \] or absolute error, etc.
If \(Lf\) gives net welfare from implementing a policy, can optimize worst-case welfare regret as advocated by Manski (see Yata (2021), Ishihara and Kitagawa (2021))
All of these criteria are increasing in \(\overline{\operatorname{bias}}_\mathcal{F}(w)^2\) and \(\operatorname{var}(\hat L)\). Thus, to find optimal weights \(w\):
For each \(B\ge 0\), solve \[ \min \operatorname{var}(\hat L) \quad\text{s.t.}\quad \overline{\operatorname{bias}}_\mathcal{F}(w) \le B \] where \[ \operatorname{var}(\hat L) = \sigma^2 w'w, \quad \overline{\operatorname{bias}}_\mathcal{F}(w) = \max_{f\in\mathcal{F}} \left| \sum_{i=1}^n w_i f(x_i) - Lf \right| \]
Find the \(B\) that minimizes the given criterion (CI length, MSE, welfare regret, etc.)
The resulting estimator is a minimax linear estimator: it is the linear estimator that minimizes worst-case performance.
There is a general formulation of Step 1 as a convex optimization problem (see Donoho (1994), Armstrong and Kolesár (2018)).
Interested in \(Lf=f(x_0)\) for some given \(x_0\). We’ll normalize \(x_0=0\) for notational convenience.
Assume that \(p\)th derivative of \(f(x)\) is bounded by \(C\).
When \(p=1\), this is a Lipschitz constraint: \[ \mathcal{F}_{\text{Lip}} = \{f: |f'(x)|\le C \text{ a.e.}\} \]
For simplicity, we’ll consider a relaxed version of this constraint used by Sacks and Ylvisaker (1978): \[ \mathcal{F}_{\text{SY},1} = \{f: |f(x)-f(0)|\le C |x| \text{ all }x\in\mathbb{R}\} \] (turns out the results are the same in this case).
Bias of linear estimator \(\hat L=w'Y\): \[ \operatorname{bias}_f(w) = \sum_{i=1}^n w_i f(x_i) - f(0) \]
First, note that we must have \(\sum_{i=1}^n w_i=1\) for \(\overline{\operatorname{bias}}_{\mathcal{F}_{\text{SY},1}}(w)<\infty\). Otherwise, the bias can be made arbitrarily large by taking \(f(x)=c\) for large \(c\).
Incorporating this constraint, we can write the bias as \(\sum_{i=1}^n w_i [f(x_i) - f(0)]\). Worst-case bias: \[\begin{align*} \overline{\operatorname{bias}}_{\mathcal{F}_{\text{SY},1}}(w) &= \sup \sum_{i=1}^n w_i [f(x_i) - f(0)] \quad\text{s.t.}\quad |f(x) - f(0)|\le C|x| \text{ all }x \\ &= C \sum_{i=1}^n |w_i|\cdot |x_i| \end{align*}\] The last step follows by noting that \(f(x_i)-f(0)=C |x_i|\cdot \operatorname{sign}(w_i)\) maximizes each term in the sum.
Variance is \(\sigma^2 w'w=\sigma^2\sum_{i=1}^n w_i^2\). Incorporating constraint \(\sum_{i=1}^n w_i=1\) leads to optimization problem for optimal weights: \[ \min \sum_{i=1}^n w_i^2 \quad\text{s.t.}\quad C \sum_{i=1}^n |w_i|\cdot |x_i|\le B, \, \sum_{i=1}^n w_i=1. \]
Lagrangian: \[ \mathcal{L}=\frac{1}{2}\sum_{i=1}^n w_i^2 +\lambda \sum_{i=1}^n |w_i|\cdot |x_i| +\mu \left( 1-\sum_{i=1}^n w_i \right) \]
FOC for \(w_i\): \[\begin{align*} &0 = \frac{d \mathcal{L}}{d w_i} = w_i + \lambda |x_i|\cdot \text{sign}(w_i) - \mu \\ &\Longrightarrow w_i = \mu - \lambda |x_i|\cdot \text{sign}(w_i) \end{align*}\] where we use the convention that \(\operatorname{sign}(w_i)\) can be anything in \([-1,1]\) when \(w_i=0\).
Three possibilities: if \(w_i>0\), then \(w_i=\mu-\lambda|x_i|\); if \(w_i<0\), then \(w_i=\mu+\lambda|x_i|\), which is ruled out since \(\lambda\ge 0\) and \(\mu>0\) at the optimum; if \(w_i=0\), then \(\mu\le\lambda|x_i|\).
It follows that optimal weights \(w_i^*\) satisfy \(w_i^*=\max\{\mu-\lambda |x_i|,0\}\).
Incorporating constraint \(\sum_{i=1}^n w_i=1\) yields optimal weights \[ w_i^* = \frac{\max\{\mu-\lambda |x_i|,0\}}{\sum_{j=1}^n\max\{\mu-\lambda |x_j|,0\}} \]
Resulting estimator is a Nadaraya-Watson estimator: \[ \sum_{i=1}^n w_i^* Y_i = \frac{\sum_{i=1}^n Y_i k_{\operatorname{tri}}(x_i/h)}{\sum_{i=1}^n k_{\operatorname{tri}}(x_i/h)} \] where \(k_{\operatorname{tri}}(t)=\max\left\{1-|t|,0\right\}\) is the triangular kernel and \(h=\mu/\lambda\) is the bandwidth.
Optimal bandwidth depends on constraint \(B\) on bias through Lagrange multipliers \(\mu\) and \(\lambda\).
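Putting the pieces together for this example, here is a minimal sketch (the design points, \(C\), \(\sigma\), and the bandwidth grid are all illustrative assumptions) that parametrizes the weights by the bandwidth \(h\), computes worst-case bias and variance in closed form, and picks \(h\) to minimize CI length; `cv()` is the critical-value function from the earlier sketch:

```python
import numpy as np

def tri_weights(x, h):
    """Nadaraya-Watson weights with triangular kernel, evaluation point x0 = 0."""
    k = np.maximum(1.0 - np.abs(x) / h, 0.0)
    return k / k.sum()

def ci_halflength(x, h, C, sigma, alpha=0.05):
    w = tri_weights(x, h)
    max_bias = C * np.sum(np.abs(w) * np.abs(x))   # worst-case bias over F_{SY,1}
    sd = sigma * np.sqrt(np.sum(w ** 2))           # sd of w'Y with iid N(0, sigma^2) errors
    return cv(max_bias / sd, alpha) * sd

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)   # design points (illustrative)
C, sigma = 1.0, 0.5                    # assumed derivative bound and error sd

hgrid = np.linspace(0.05, 1.0, 200)
lengths = [2 * ci_halflength(x, h, C, sigma) for h in hgrid]
print("CI-length-optimal bandwidth:", hgrid[int(np.argmin(lengths))])
```

Swapping the criterion (e.g. worst-case MSE \(\overline{\operatorname{bias}}_{\mathcal{F}}(w)^2+\operatorname{var}(\hat L)\)) only changes the final minimization over \(h\).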
To allow for non-normal, heteroskedastic errors (and serial/cluster correlation, etc.), replace \(\operatorname{sd}(\hat L)\) with a robust standard error for \(\hat L=\sum_{i=1}^n w_iY_i\): \[ \hat L \pm \operatorname{cv}_\alpha\left(\frac{\overline{\operatorname{bias}}_{\mathcal{F}}(w)}{\operatorname{se}_{\operatorname{rob}}}\right)\cdot \operatorname{se}_{\operatorname{rob}} \]
For example, if we assume independence but allow heteroskedasticity, we can use \(\operatorname{se}_{\operatorname{rob}} = \left(\sum_{i=1}^n w_i^2 \hat u_i^2\right)^{1/2}\) where \(\hat u_i\) are residuals.
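A minimal sketch of the feasible CI (illustrative; the residuals \(\hat u_i\) are assumed to come from some preliminary fit of \(f\), and `cv()` is the function from the earlier sketch):

```python
import numpy as np

def robust_se(w, uhat):
    """Heteroskedasticity-robust standard error for Lhat = w'Y."""
    return np.sqrt(np.sum(w ** 2 * uhat ** 2))

def feasible_ci(Y, w, uhat, max_bias, alpha=0.05):
    Lhat = w @ Y
    se = robust_se(w, uhat)
    half = cv(max_bias / se, alpha) * se
    return Lhat - half, Lhat + half
```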
Note that the bound \(\overline{\operatorname{bias}}_{\mathcal{F}}(w)\) on bias is valid in finite samples, so we avoid asymptotic approximations that are taboo in the “low dimensional culture”
This feasible CI is honest in the sense of Li (1989): it is asymptotically valid uniformly over \(f\in\mathcal{F}\) (and over sufficiently regular distributions for \(u_i\))
To show this, suffices to show Lindeberg condition for weights \(w_i\). This gives central limit theorem for \(\hat L=\sum_{i=1}^n w_iY_i\).
Lindeberg condition holds for the optimal weights under some general conditions in the nonparametric setting (see Armstrong and Kolesár (2018), Armstrong and Kolesár (2021b)).
Intuition: if one \(w_i\) dominates, one can reduce variance with little change in bias by making it closer to weights for nearby \(x_i\)’s
In some cases, exact optimal weights are complicated
For example, consider Hölder class with bound on second derivative (popular in RD). No closed form for optimal weights in finite samples. There are asymptotically optimal kernels, but they are also a bit cumbersome (Zhao 1997; Gao 2018).
Often, some simple class of weights is near-optimal. For example, local linear estimators with triangular (or even uniform) kernel are asymptotically near-optimal under second derivative bound (see Cheng, Fan, and Marron (1997), Armstrong and Kolesár (2020)), and there is a simple formula for worst-case bias in finite samples (see Armstrong and Kolesár (2020)).
The RDHonest software package uses this approach to compute bias-aware CIs for regression discontinuity in R and Stata.
One can also check this in finite samples in a particular application (see Armstrong and Kolesár (2018) for an example).
Optimal kernels for relaxed version of Hölder class: Sacks and Ylvisaker (1978), Armstrong and Kolesár (2018)
Optimal weights in Lipschitz/Hölder class using convex optimization software: Armstrong and Kolesár (2021a) (for ATE), Imbens and Wager (2019) (for RD)
Asymptotically optimal kernels: Zhao (1997), Gao (2018), references in Lepski and Tsybakov (2000)
High dimensional/semiparametric controls: Armstrong, Kolesár, and Kwon (2020) (resulting estimator similar to Robinson (1988))
Sparsity in high dimensions: nonconvex, but Javanmard and Montanari (2014) use minimax linear estimation based on convex bound on initial estimation error
Misspecified GMM (Armstrong and Kolesár 2021b)
Density estimation (covered in Armstrong and Kolesár (2020))
Deconvolution (Ignatiadis and Wager 2021)
Diff-in-diff (Rambachan and Roth 2019)
Weak IV robust fuzzy RD (Noack and Rothe 2019)
We have restricted attention to linear procedures, but it can be shown that the minimax linear procedure is (near) minimax among all procedures (Donoho (1994), Armstrong and Kolesár (2018), Yata (2021))
Optimal estimator and CI require knowledge of bound \(C\) on derivative to choose weights and form CI.
Can we avoid this?
For inference, no (see Low (1997), Armstrong and Kolesár (2018))
For estimation, typically lose \(\log n\) term (see Lepskii (1991))
Intuition: can’t estimate bias without auxiliary information
Auxiliary assumptions (need to lead to nonconvex or asymmetric parameter space):
Asymmetric parameter space (can allow for some degree of adaptation depending on the setting)
With many parameters: average coverage
What if \(\mathcal{F}\) is convex but asymmetric (e.g. monotonicity, sign constraints)? Consider minimax affine estimator \(\hat L=w_0+w'Y\).
Convex optimization problem similar to symmetric case. Set \(w_0\) to equalize worst-case upward and downward bias.
Minimax affine estimators still near minimax among all estimators (Donoho 1994).
In contrast to centrosymmetric case, some degree of adaptation for CIs is possible. See, among others, Cai and Low (2004), Armstrong and Kolesár (2018), Armstrong (2015), Kwon and Kwon (2020a), Kwon and Kwon (2020b)
Intuition: use shape constraints to bound bias of affine estimators.
Biased estimates \(\hat \theta_j\) of \(\theta_j\) for \(j=1,\ldots,J\): \[ E\hat\theta_j-\theta_j = \operatorname{bias}_j, \quad \operatorname{var}\left(\hat\theta_j\right) = \operatorname{se}_j^2 \]
Consider compound MSE: \[ E\left[ \frac{1}{J}\sum_{j=1}^J (\hat\theta_j-\theta_j)^2 \right] = \frac{1}{J} \sum_{j=1}^J \left(\operatorname{bias}_j^2 + \operatorname{se}_j^2\right) \]
In many cases, compound loss makes adaptation possible
Why? Often possible to consistently estimate \(\frac{1}{J}\sum_{j=1}^J \operatorname{bias}_j^2\) as \(J\to\infty\) (using, e.g., cross-validation), but not when \(J=1\).
What is compound loss for CIs? Consider average coverage criterion at level \(\alpha\): \[ \frac{1}{J}\sum_{j=1}^J P(\theta_j\notin \operatorname{CI}_j)\le \alpha \]
Weaker than classical coverage, which requires \[ \max_{1\le j \le J} P(\theta_j\notin \operatorname{CI}_j)\le \alpha \]
We can “break” impossibility of adaptation by only requiring average coverage.
Unbiased estimate \(Y_j\) of parameter \(\theta_j\) for \(j=1,\ldots,J\) (e.g. fixed effects)
To reduce variance, regress \(Y_j\) on covariates \(X_j\) and “shrink” towards regression estimate \(X_j'\hat\beta\): \[ \hat\theta_j = w Y_j + (1-w) X_j'\hat\beta \]
Shrinkage weight \(w\) trades off bias vs. variance: \(w=1\) gives the unbiased estimate \(Y_j\), while \(w<1\) lowers variance at the cost of bias toward \(X_j'\hat\beta\)
For simplicity, assume \(X_j'\hat\beta=0\) and that \(Y_j\) is exactly normal with same variance \(\sigma^2\) for all \(j\): \[ Y_j \sim N(\theta_j,\sigma^2) \] with \(\sigma^2\) known
Shrinkage estimator \(\hat\theta_j= w Y_j\sim N(\theta_j+\operatorname{bias}_j,\operatorname{se}^2)\) where \(\operatorname{bias}_j=w\theta_j-\theta_j=(w-1)\theta_j\) and \(\operatorname{se}^2=w^2\sigma^2\).
Compound MSE: \[ E\left[ \frac{1}{J}\sum_{j=1}^J (\hat\theta_j-\theta_j)^2 \right] = \frac{1}{J} \sum_{j=1}^J \left(\operatorname{bias}_j^2 + \operatorname{se}^2\right) = (w-1)^2 \frac{1}{J} \sum_{j=1}^J \theta_j^2 + w^2\sigma^2 \]
Compound MSE is \((w-1)^2 \mu_{2,J} + w^2\sigma^2\) where \(\mu_{2,J}=\frac{1}{J} \sum_{j=1}^J \theta_j^2\).
If \(\mu_{2,J}\) is known, can minimize compound MSE over \(w\) to get \[ w_{\operatorname{opt}}=\frac{\mu_{2,J}}{\mu_{2,J}+\sigma^2} \]
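For completeness, the first-order condition behind this formula: \[ \frac{d}{dw}\left[(w-1)^2\mu_{2,J}+w^2\sigma^2\right] = 2(w-1)\mu_{2,J}+2w\sigma^2 = 0 \quad\Longrightarrow\quad w_{\operatorname{opt}}=\frac{\mu_{2,J}}{\mu_{2,J}+\sigma^2}. \]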
Estimate of \(\mu_{2,J}\): \[ \hat\mu_{2,J} = \frac{1}{J}\sum_{j=1}^J Y_j^2 - \sigma^2. \] (consistent as \(J\to\infty\)).
Plugging in \(\hat\mu_{2,J}\) (with possible modifications for finite sample performance) gives a Stein estimator, an example of an empirical Bayes estimator
The Stein estimator is adaptive to \(\mu_{2,J}\): it achieves the same compound MSE (as \(J\to\infty\)) as an estimator that uses prior knowledge of \(\mu_{2,J}\)
Empirical Bayes motivation: \(\theta_j\)’s drawn from same prior \(N(0,\mu_2)\), \(\hat\mu_{2,J}\) is an estimate of \(\mu_2\). The empirical Bayes estimate is adaptive to the unknown prior variance \(\mu_2\).
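A minimal simulation sketch of the plug-in (Stein) estimator and its compound MSE (the data-generating values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
J, sigma = 5000, 1.0
theta = rng.normal(0.0, 2.0, size=J)        # true parameters (illustrative DGP)
Y = theta + sigma * rng.normal(size=J)      # unbiased estimates, Y_j ~ N(theta_j, sigma^2)

mu2_hat = max(np.mean(Y ** 2) - sigma ** 2, 0.0)   # estimate of (1/J) sum theta_j^2
w_hat = mu2_hat / (mu2_hat + sigma ** 2)           # plug-in shrinkage weight

mu2 = np.mean(theta ** 2)
w_oracle = mu2 / (mu2 + sigma ** 2)                # weight using the true mu_{2,J}

mse = lambda est: np.mean((est - theta) ** 2)
print("compound MSE, plug-in :", mse(w_hat * Y))
print("compound MSE, oracle  :", mse(w_oracle * Y))
print("compound MSE, w = 1   :", mse(Y))           # no shrinkage
```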
How can we use this estimator to get an adaptive CI (or average coverage interval)? We will describe an approach from Armstrong, Kolesár, and Plagborg-Møller (2022).
Consider CIs based on shrinkage estimator \(\hat\theta_j=wY_j\): \[ \operatorname{CI}_j=\left\{\hat\theta_j \pm \chi \cdot \operatorname{se}\right\} \] where \(\chi\) is some critical value.
Basic idea: use \(\hat\mu_{2,J}\) to estimate a bound on average non-coverage.
First look at individual non-coverage. Note that \((\hat\theta_j-\theta_j)/\operatorname{se}\stackrel{d}{=}Z+\text{bias}_j/\operatorname{se}\) where \(Z\sim N(0,1)\). Thus, letting \(b_j=\text{bias}_j/\operatorname{se}\), we have \[\begin{align*} &P\left(\theta_j\notin \operatorname{CI}_j\right) =P\left(\left|\frac{\hat\theta_j-\theta_j}{\operatorname{se}}\right|\ge \chi\right) = P\left(\left|Z+b_j\right|\ge \chi\right) \\ &= \Phi\left(-\chi-b_j\right) + \Phi\left(-\chi+b_j\right) =: r(b_j,\chi). \end{align*}\]
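A quick Monte Carlo check of the non-coverage formula \(r(b,\chi)\) (the values of \(b\) and \(\chi\) are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
b, chi = 1.5, 1.96
z = rng.normal(size=1_000_000)
print(np.mean(np.abs(z + b) >= chi))            # simulated P(|Z + b| >= chi)
print(norm.cdf(-chi - b) + norm.cdf(-chi + b))  # r(b, chi); should agree closely
```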
Thus, average non-coverage is \[ \frac{1}{J}\sum_{j=1}^J P\left(\theta_j\notin \operatorname{CI}_j\right) =\frac{1}{J}\sum_{j=1}^J r(b_j,\chi) =E_{b\sim F_J} r(b,\chi) \] where \(F_J\) is the empirical distribution of \(b_j=\operatorname{bias}_j/\operatorname{se}=[(w-1)\theta_j]/(w\sigma)\) over \(j=1,\ldots, J\).
How can we bound this? Note that we can use \(\mu_{2,J}\) to compute \(E_{b\sim F_J} b^2\): \[ E_{b\sim F_J} b^2 = \frac{1}{J}\sum_{j=1}^J \frac{(w-1)^2}{w^2\sigma^2}\theta_j^2 = \frac{(w-1)^2}{w^2\sigma^2}\mu_{2,J} \]
Thus, average non-coverage is bounded by \(\rho\left(\frac{(w-1)^2}{w^2\sigma^2}\mu_{2,J},\chi\right)\) where \[ \rho\left(t,\chi\right) :=\sup_{F} E_{b\sim F}\, r(b,\chi) \quad\text{s.t.}\quad E_F b^2 \le t. \]
Let \(\operatorname{cva}_\alpha(t)=\rho^{-1}(t,\alpha)\) where the inverse is with respect to the second argument. Setting \(\chi=\operatorname{cva}_\alpha\left(\frac{(w-1)^2}{w^2\sigma^2}\mu_{2,J}\right)\) gives the interval \[ \operatorname{CI}_j=\hat\theta_j \pm \operatorname{cva}_\alpha\left(\frac{(w-1)^2}{w^2\sigma^2}\mu_{2,J}\right) \operatorname{se} \]
Feasible average coverage interval: plug in consistent estimate \(\hat\mu_{2,J}\) of \(\mu_{2,J}\).
To compute \(\operatorname{cva}_\alpha(t)\), we need to evaluate \(\rho(t,\chi)\). This is a convex program in a one-dimensional probability distribution \(F\), which can be solved by gridding over \(F\). Actually, it turns out that there is a closed-form solution (see Armstrong, Kolesár, and Plagborg-Møller (2022)).
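To illustrate, here is a minimal sketch of the gridding approach (the grid size and support bound are illustrative approximations; the closed form in the paper avoids this numerical step):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import linprog, brentq

def r(b, chi):
    """Non-coverage P(|Z + b| >= chi) for Z ~ N(0,1)."""
    return norm.cdf(-chi - b) + norm.cdf(-chi + b)

def rho(t, chi, bmax=10.0, ngrid=400):
    """sup_F E_F r(b, chi) s.t. E_F b^2 <= t, approximated by gridding F on [0, bmax]."""
    b = np.linspace(0.0, bmax, ngrid)
    res = linprog(c=-r(b, chi),                          # maximize sum_i p_i r(b_i, chi)
                  A_ub=[b ** 2], b_ub=[t],               # E_F b^2 <= t
                  A_eq=[np.ones(ngrid)], b_eq=[1.0],     # probabilities sum to one
                  bounds=(0, None), method="highs")
    return -res.fun

def cva(t, alpha=0.05, bmax=10.0):
    """Critical value for average coverage: chi solving rho(t, chi) = alpha."""
    chi_hi = norm.ppf(1 - alpha / 2) + bmax
    return brentq(lambda chi: rho(t, chi, bmax) - alpha, 0.0, chi_hi)

print(cva(0.0))   # ~1.96: no bias
print(cva(1.0))   # larger, reflecting the least favorable distribution of b with E b^2 <= 1
```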
Implemented in R, Matlab and Stata packages.
The shrinkage estimator \(\hat\theta_j=wY_j\) with the plug-in optimal weight \(w=\hat\mu_{2,J}/(\hat\mu_{2,J}+\sigma^2)\) has an empirical Bayes interpretation: it is the posterior mean under the prior \(\theta_j\sim N(0,\mu_2)\), with the unknown prior variance \(\mu_2\) replaced by the estimate \(\hat\mu_{2,J}\).
What about the average coverage intervals we constructed? They are empirical Bayes confidence intervals (EBCIs) in the sense of Morris (1983): they have ex-ante coverage at least \(1-\alpha\) over the prior distribution of \(\theta_j\).
If the \(N(0,\mu_2)\) prior is correct, one can use a parametric EBCI by plugging \(\hat\mu_{2,J}\) into Bayesian credible interval (see Morris (1983)).
In contrast, the intervals we constructed are robust to failure of \(N(0,\mu_2)\) prior.
Average coverage (and related notions of coverage) originated in nonparametric regression literature.
Ideas described above extend to other settings where estimate of average squared bias-sd ratio is available
Multiple testing: certain multiple testing adjustments still yield false discovery rate (FDR) control when applied to average coverage controlling CIs/tests (Armstrong (2022))
Thus, we can “break” impossibility of adaptation results by requiring FDR control instead of classical size control for each test
RDHonest (package for R and Stata): bias-aware CIs in regression discontinuity
ATEHonest (package for R): average treatment effects under unconfoundedness
ebci/ebreg (packages for R, Stata and Matlab): empirical Bayes confidence intervals