MoE_mahala function

Mahalanobis Distance Outlier Detection for Multivariate Response

Mahalanobis Distance Outlier Detection for Multivariate Response

Computes the Mahalanobis distance between the response variable(s) and the fitted values of linear regression models with multivariate or univariate responses.

MoE_mahala(fit, resids, squared = FALSE, identity = NULL)

Arguments

  • fit: A fitted lm model, inheriting either the "mlm" or "lm" class.
  • resids: The residuals. Can be residuals for observations included in the model, or residuals arising from predictions on unseen data. Must be coercible to a matrix with the number of columns being the number of response variables. Missing values are not allowed.
  • squared: A logical. By default (FALSE), the generalized interpoint distance is computed. Set this flag to TRUE for the squared value.
  • identity: A logical indicating whether the identity matrix is used in place of the precision matrix in the Mahalanobis distance calculation. Defaults to FALSE for multivariate response data but defaults to TRUE for univariate response data, where TRUE corresponds to the use of the Euclidean distance. Setting identity=TRUE with multivariate data may be advisable when the dimensions of the data are such that the covariance matrix cannot be inverted (otherwise, the pseudo-inverse is used under the FALSE default).

Returns

A vector giving the Mahalanobis distance (or squared Mahalanobis distance) between response(s) and fitted values for each observation.

Examples

data(ais) hema <- as.matrix(ais[,3:7]) mod <- lm(hema ~ sex + BMI, data=ais) res <- hema - predict(mod) MoE_mahala(mod, res, squared=TRUE) data(CO2data) CO2 <- CO2data$CO2 GNP <- CO2data$GNP mod2 <- lm(CO2 ~ GNP, data=CO2data) pred <- predict(mod2) res2 <- CO2 - pred maha <- MoE_mahala(mod2, res2) # Highlight outlying observations plot(GNP, CO2, type="n", ylab=expression('CO'[2])) lines(GNP, pred, col="red") points(GNP, CO2, cex=maha, lwd=2) text(GNP, CO2, col="blue", labels=replace(as.character(CO2data$country), maha < 1, "")) # Replicate initialisation strategy using 2 randomly chosen components # Repeat the random initialisation if necessary # (until 'crit' at convergence is minimised) G <- 3L z <- sample(seq_len(G), nrow(CO2data), replace=TRUE) old <- Inf crit <- .Machine$double.xmax while(crit < old) { Sys.sleep(1) old <- crit maha <- NULL plot(GNP, CO2, type="n", ylab=expression('CO'[2])) for(g in seq_len(G)) { ind <- which(z == g) mod <- lm(CO2 ~ GNP, data=CO2data, sub=ind) pred <- predict(mod, newdata=CO2data[,"CO2", drop=FALSE]) maha <- cbind(maha, MoE_mahala(mod, CO2 - pred)) lines(GNP, pred, col=g + 1L) } min.M <- rowMins(maha) crit <- sum(min.M) z <- max.col(maha == min.M) points(GNP, CO2, cex=min.M, lwd=2, col=z + 1L) text(GNP, CO2, col=z + 1L, labels=replace(as.character(CO2data$country), which(min.M <= 1), "")) } crit

References

Murphy, K. and Murphy, T. B. (2020). Gaussian parsimonious clustering models with covariates and a noise component. Advances in Data Analysis and Classification, 14(2): 293-325. <tools:::Rd_expr_doi("10.1007/s11634-019-00373-8") >.

Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Sciences, India, 2(1): 49-55.

Author(s)

Keefe Murphy - <keefe.murphy@mu.ie >

  • Maintainer: Keefe Murphy
  • License: GPL (>= 3)
  • Last published: 2025-03-05