Confidence intervals for raw residuals
Checking the normality and equal variance of standardized residuals is one of the most common routines when fitting linear models. Although some statisticians argue these are the least important assumptions to worry about unless the focus is on prediction, readily available software prompts anyone to produce all kinds of graphical inspections and null hypothesis tests.
Consider fitting the model \(\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\) to \(n\) observations with \(p-1\) predictors. Assuming the error terms follow a multivariate normal distribution with mean vector \(\mathbf{0}\) and covariance matrix \(\sigma^2 \mathbf{I}\), we already know the covariance matrix of the residual vector \(\mathbf{e}\) is \(\sigma^2(\mathbf{I} - \mathbf{H})\), where \(\mathbf{H}\) is the influence (hat) matrix. If \(\sigma^2\) were known, confidence intervals could be constructed directly because \(\mathbf{e}\) would also follow a multivariate normal distribution. This situation is utopian, however, because we actually need to estimate the error variance. In that case, we might think that the ith studentized residual, \(r_i = \frac{e_i}{\widehat\sigma \sqrt{1 - h_{ii}}}\), follows a t-distribution with \(n-p-1\) degrees of freedom, but \(e_i\) and \(\widehat{\sigma}\) are not independent. At this point, we may appeal to asymptotic approximations and simply use the t-distribution, or compute the externally studentized residuals, which employ the error variance estimate that excludes the ith case. If we nevertheless want to use the whole data, Ellenberg (1973) derived the joint distribution of a kind of standardized residuals. We will follow his derivation to obtain instead the joint density of the internally studentized residuals, and will supply R code to plot confidence intervals for raw residuals. Because the improvement over the externally studentized residuals may not be huge, this derivation should be taken as a matter of curiosity rather than of practical value.
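For reference, base R already computes both flavors of studentized residuals for an lm fit; here is a minimal sketch with made-up data (not part of the derivation that follows):
# Internally vs externally studentized residuals in base R (illustrative sketch).
set.seed(1)
x <- rnorm(30)
y <- 1 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)
head(rstandard(fit))  # internally studentized: sigma estimated from all cases
head(rstudent(fit))   # externally studentized: sigma estimated excluding case i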
Let \(\mathbf{M} = \mathbf{I} - \mathbf{H}\) so that \(\mathbf{e} = \mathbf{M} \boldsymbol{\epsilon}\). Consider a \(k\)-subset of residuals. The sum of squared residuals obtained by removing that subset from the regression is \[ S_k^2 = S^2 - \mathbf{e}_k^\top \mathbf{M}_{k\times k}^{-1} \mathbf{e}_k, \] where \(S^2\) is the sum of squared residuals from the model fitted to the full data, \(\mathbf{e}_k\) is the vector containing the subset of \(k\) residuals and \(\mathbf{M}_{k \times k}\) is the corresponding \(k \times k\) submatrix of \(\mathbf{M}\). We can rewrite it as \[ \begin{aligned} S_k^2 &= \boldsymbol{\epsilon}^\top \mathbf{M} \boldsymbol{\epsilon} - \boldsymbol{\epsilon}^\top (\mathbf{MM^*M}) \boldsymbol{\epsilon} \\ &= \boldsymbol{\epsilon}^\top (\mathbf{M} - \mathbf{MM^*M}) \boldsymbol{\epsilon}, \end{aligned} \] where \(\mathbf{M^*}\) is the \(n \times n\) matrix that contains \(\mathbf{M}_{k \times k}^{-1}\) in the block corresponding to the \(k\)-subset and zeros everywhere else. This is a quadratic form where \(\mathbf{M}\) is idempotent and the rank of \((\mathbf{M} - \mathbf{MM^*M})\) is \(n-p-k\), so \[ \frac{S_k^2}{\sigma^2} \sim \chi^2 \Big(n-p-k \Big). \]
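As a quick sanity check of the deletion identity for \(k = 1\) (a sketch with arbitrary simulated data, not part of Ellenberg's derivation), an explicit refit without case 1 should give the same residual sum of squares:
# Numerical check of S_1^2 = S^2 - e_1^2 / m_11, with m_11 = 1 - h_11.
set.seed(1)
n <- 25
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)
e <- residuals(fit)
S2 <- sum(e^2)
m11 <- 1 - hatvalues(fit)[1]
S2 - e[1]^2 / m11                    # via the identity
sum(residuals(lm(y[-1] ~ x[-1]))^2)  # via an explicit refit without case 1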
\(\mathbf{e}_k\) is independent of \(S_k^2\), so their joint probability density is \[ f(\mathbf{e}_k, S_k^2) = {\sf MVN} \Big(\mathbf{e}_k; \ \mathbf{0}, \sigma^2 \boldsymbol{\Sigma} \Big) \ \chi^2 \Big(S_k^2/\sigma^2; \ n-p-k \Big) / \sigma^2, \] where \(\boldsymbol{\Sigma} = \mathbf{M}_{k \times k}\). Let's focus on residual \(i\) from now on. Considering \(\widehat{\sigma} = \sqrt{S^2 / (n-p)}\), we apply the change of variables
\[ \begin{aligned} r_i &= \frac{e_i}{\widehat{\sigma} \sqrt{m_{ii}}}, \\ \widehat{\sigma}^2 &= \frac{S_i^2 + \widehat{\sigma}^2 r_i^2}{n-p}, \end{aligned} \] whose inverse is \(e_i = r_i \widehat{\sigma} \sqrt{m_{ii}}\) and \(S_i^2 = \widehat{\sigma}^2 (n - p - r_i^2)\).
Notice that \(r_i\) does not depend on the scale of the errors, so we can set \(\sigma^2 = 1\) without loss of generality. The joint probability density now looks like \[ f(e_i, S_i^2) = \mathcal{N} \Big(r_i \widehat{\sigma} \sqrt{m_{ii}}; \ 0, m_{ii} \Big) \ \chi^2 \left(\widehat{\sigma}^2 (n-p-r_i^2); \ n-p-1 \right). \]
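A minimal sketch of this scale invariance (made-up data, not from the original derivation), using the internally studentized residuals that base R computes:
# Rescaling the errors (and hence y) by a constant leaves r_i unchanged.
set.seed(1)
n <- 12
x <- rnorm(n)
eps <- rnorm(n)
y1 <- 1 + 2 * x + eps       # sigma = 1
y2 <- 1 + 2 * x + 10 * eps  # sigma = 10, same error draw
all.equal(rstandard(lm(y1 ~ x)), rstandard(lm(y2 ~ x)))  # TRUE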
Denoting \(v = n-p\), the Jacobian of the transformation is \[ \begin{vmatrix} \widehat{\sigma} \sqrt{m_{ii}} & r_i \sqrt{m_{ii}} \\ -2 \widehat{\sigma}^2 r_i & 2 \widehat{\sigma} v - 2\widehat{\sigma} r_i^2 \end{vmatrix} = 2\widehat{\sigma}^2 v \sqrt{m_{ii}}. \]
Hence, \[ f(r_i, \widehat{\sigma}) = 2\widehat{\sigma}^2v \sqrt{m_{ii}} \ \mathcal{N} \Big(r_i \widehat{\sigma} \sqrt{m_{ii}}; \ 0, m_{ii} \Big) \ \chi^2 \Big(\widehat{\sigma}^2 (v-r_i^2); \ v-1 \Big). \]
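A quick finite-difference check of the Jacobian determinant (a sketch with arbitrary values, not part of Ellenberg's argument):
# Map (r_i, sigma_hat) -> (e_i, S_i^2) and compare the numerical determinant
# with the closed form 2 * sigma_hat^2 * v * sqrt(m_ii).
v <- 10; m_ii <- 0.8; r_i <- 1.2; sigma_hat <- 2
g <- function(x) c(x[1] * x[2] * sqrt(m_ii),   # e_i   = r_i * sigma_hat * sqrt(m_ii)
                   x[2]^2 * (v - x[1]^2))      # S_i^2 = sigma_hat^2 * (v - r_i^2)
h <- 1e-6
J <- sapply(1:2, function(j) {
  x <- c(r_i, sigma_hat); xh <- x; xh[j] <- xh[j] + h
  (g(xh) - g(x)) / h
})
det(J)                            # numerical determinant
2 * sigma_hat^2 * v * sqrt(m_ii)  # closed form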
Finally, simplifying and integrating over \(\widehat{\sigma}\) yields \[ f(r_i) = \frac{v^{1-v/2} (v-r_i^2)^{(v-3)/2} \Gamma \left(\frac{v}{2} \right)}{\Gamma \left(\frac{v-1}{2} \right) \sqrt{\pi}}, \quad \mid r_i \mid < \sqrt{v}. \]
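As a quick numerical check of this density (a sketch, not part of the post's derivation), it should integrate to 1 and match the Monte Carlo distribution of internally studentized residuals under the model assumptions:
# f(r_i) as derived above, valid for |r| < sqrt(v).
fr <- function(r, v) {
  v^(1 - v/2) * (v - r^2)^((v - 3)/2) * gamma(v/2) / (gamma((v - 1)/2) * sqrt(pi))
}
v <- 7
integrate(fr, -sqrt(v), sqrt(v), v = v)$value   # should be ~1
# Monte Carlo comparison: P(r_1 <= 1) with n = 10, p = 3 (so v = 7).
set.seed(1)
sim <- replicate(5000, {
  x <- rnorm(10)
  y <- rnorm(10)
  rstandard(lm(y ~ x + I(x^2)))[1]
})
mean(sim <= 1)                                  # empirical
integrate(fr, -sqrt(v), 1, v = v)$value         # analytic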
Using \(\tt Mathematica\) we find \[ F(r_i) = \frac{v^{1-\frac{v}{2}} \Gamma \left(\frac{v}{2} \right)}{\Gamma \left(\frac{v-1}{2} \right) \sqrt{\pi}} r_i \left(1 - \frac{r_i^2}{v} \right)^{\frac{3-v}{2}} \Big(v-r_i^2 \Big)^{\frac{v-3}{2}} \\ _2F_1 \left( \frac{1}{2}, \frac{3-v}{2}, \frac{3}{2}, \frac{r_i^2}{v} \right) + C, \] where \(_2F_1\) is the hypergeometric function. Because the density is symmetric about zero and the first term vanishes at \(r_i = 0\), we must have \(F(0) = 1/2\) and therefore \(C = \frac{1}{2}\). The calculation of p-values and confidence intervals is then straightforward: we find the \(1-\alpha/2\) quantile, multiply it by \(\widehat{\sigma}\sqrt{m_{ii}}\), and add and subtract the result from \(e_i\).
The R function resid_ci takes an object of class lm and does all the work.
# Distribution function of r:
pr <- function(r, v) {
  # Outside the support: the CDF is 0 below -sqrt(v) and 1 above sqrt(v).
  if (r <= -sqrt(v)) return(0)
  if (r >= sqrt(v)) return(1)
  v^(1 - v/2) * gamma(v/2) / (gamma((v - 1)/2) * sqrt(pi)) *
    r * (1 - r^2/v)^((3 - v)/2) * (v - r^2)^((v - 3)/2) *
    Re(hypergeo::hypergeo(1/2, (3 - v)/2, 3/2, r^2/v)) + 1/2
}
# Notice you need the 'hypergeo' package.
# Function to get and plot the confidence intervals:
resid_ci <- function(model, confidence, plot = NULL) {
  alpha <- 1 - confidence
  hat_y <- model$fitted.values
  e <- model$residuals
  v <- model$df.residual
  hat_sigma2 <- sum(e^2) / v
  m <- 1 - lm.influence(model)$hat  # m_ii = 1 - h_ii
  se <- sqrt(hat_sigma2 * m)
  r <- e / se                       # internally studentized residuals
  # Bounds for the domain of r:
  u <- sqrt(v) - 0.01
  l <- -u
  critical_value <- uniroot(f = function(r, v, alpha) pr(r, v) - (1 - alpha/2),
                            v = v, alpha = alpha, lower = l, upper = u)$root
  CI <- sapply(c(-1, 1), FUN = function(x) e + x * critical_value * se)
  colnames(CI) <- c("lower limit", "upper limit")
  if (!is.null(plot)) {
    plot(hat_y[plot], e[plot], ylim = c(min(CI[plot, ]), max(CI[plot, ])),
         ylab = "Residuals", xlab = "Fitted values")
    sapply(plot, FUN = function(x) segments(x0 = hat_y[x], x1 = hat_y[x],
                                            y0 = CI[x, 1], y1 = CI[x, 2]))
    segments(x0 = min(hat_y[plot]), x1 = max(hat_y[plot]), y0 = 0, y1 = 0, lty = 2)
  }
  return(CI)
}
The plot argument can be used to plot a subset of residuals by providing their indices.
set.seed(25022020)
# Set-up:
n <- 20
p <- 4
v <- n-p
X <- cbind(1, MASS::mvrnorm(n, mu = rep(0, p-1), Sigma = diag(p-1)))
beta <- c(1, 0.2, 0.5, 0.8)
sigma <- rchisq(1, 5)
# Simulation:
epsilon <- rnorm(n, 0, sigma)
y <- X %*% beta + epsilon
# Model:
modelo <- lm(y ~ 0 + X)
# Plotting the confidence intervals with alpha = .05:
resid_ci(modelo, confidence = .95, plot = 1:n)
## lower limit upper limit
## 1 -19.464750 0.7951401
## 2 -17.696498 1.6501012
## 3 -5.550288 13.2235236
## 4 -15.831542 4.7842290
## 5 -6.212853 15.2809272
## 6 -10.152657 11.7342894
## 7 -14.961725 6.2331182
## 8 -5.813860 14.1868663
## 9 -16.738508 3.2498009
## 10 -13.733734 7.9669237
## 11 -8.529167 13.4177885
## 12 -10.779609 9.1548674
## 13 -12.121683 4.7811506
## 14 -6.199955 14.0267908
## 15 3.315743 23.4428353
## 16 -5.336643 15.7564900
## 17 -12.674718 8.8032118
## 18 -9.720376 9.6704875
## 19 -9.200522 12.9166644
## 20 -7.601659 13.9297983
As we would expect in the long run under the homoscedasticity and normality assumptions, only one of the \(20\) intervals (roughly \(5\%\)) excludes zero, flagging a single residual that deviates from the model.
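To back up this long-run claim, a rough Monte Carlo sketch (reusing the n, X, beta and sigma defined above, and assuming pr and resid_ci are already in the workspace) estimates how often a 95% interval excludes zero:
# Across many simulated datasets, about 5% of the intervals should exclude zero.
set.seed(1)
prop_excluding_zero <- mean(replicate(200, {
  epsilon <- rnorm(n, 0, sigma)
  y <- X %*% beta + epsilon
  CI <- resid_ci(lm(y ~ 0 + X), confidence = .95)
  mean(CI[, 1] > 0 | CI[, 2] < 0)
}))
prop_excluding_zero  # should be close to 0.05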
Reference
Ellenberg, J. H. (1973). The Joint Distribution of the Standardized Least Squares Residuals from a General Linear Regression. Journal of the American Statistical Association, 68(344), 941–943. https://doi.org/10.1080/01621459.1973.10481450