mvBACON function

BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators

This function applies an outlier identification algorithm to the data in the x array [n x p], following the lines described by Hadi et al. for their BACON outlier procedure.

mvBACON(x, collect = 4, m = min(collect * p, n * 0.5), alpha = 0.05, init.sel = c("Mahalanobis", "dUniMedian", "random", "manual", "V2"), man.sel, maxsteps = 100, allowSingular = FALSE, verbose = TRUE)

Arguments

  • x: numeric matrix (of dimension [n x p]), which must not contain missing values.

  • collect: a multiplication factor c, used when init.sel is not "manual", to define m, the size of the initial basic subset, as c * p; in practice, m <- min(p * collect, n/2).

  • m: integer in 1:n specifying the size of the initial basic subset; used only when init.sel is not "manual".

  • alpha: determines the cutoff value for the Mahalanobis distances (see details).

  • init.sel: character string, specifying the initial selection mode; implemented modes are:

    • "Mahalanobis": based on Mahalanobis distances (default); version V_1 of the reference; affine invariant but not robust.
    • "dUniMedian": based on the distances from the univariate medians; similar to version V_2 of the reference; robust but not affine invariant.
    • "random": based on a random selection, i.e., reproducible only via set.seed().
    • "manual": based on manual selection; in this case, a vector man.sel containing the indices of the selected observations must be specified.
    • "V2": based on the Euclidean norm from the univariate medians; this is version V_2 of the reference; robust but not affine invariant.

    "Mahalanobis" and "V2" were proposed by Hadi and the other authors in the reference as versions V_1 and V_2, as well as "manual", while "random" is provided in order to study the behaviour of BACON. Option "dUniMedian" is similar to "V2" and is due to U. Oetliker.

  • man.sel: only when init.sel == "manual", the indices of observations determining the initial basic subset (and m <- length(man.sel)).

  • maxsteps: maximal number of iteration steps.

  • allowSingular: logical indicating whether a solution should be sought even when no matrix of rank p is found.

  • verbose: logical indicating if messages are printed which trace progress of the algorithm.
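The rule for the size of the initial basic subset can be illustrated in base R. The following is a minimal sketch, not the package's internal code: it assumes simulated data in a hypothetical matrix x, computes m as described for collect, and then picks the m observations closest (in Euclidean norm) to the coordinatewise medians, which is the idea behind the "V2" mode.

```r
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)      # hypothetical data: n = 100, p = 2
n <- nrow(x)
p <- ncol(x)

collect <- 4
m <- min(collect * p, n * 0.5)         # size of the initial basic subset

# Euclidean distance of every observation from the univariate medians
med <- apply(x, 2, median)
d   <- sqrt(rowSums(sweep(x, 2, med)^2))

# indices of the m observations closest to the medians (assumed "V2"-like start)
init <- order(d)[seq_len(m)]
```

A vector of indices such as init is also the kind of value that man.sel expects when init.sel = "manual".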

Details

Remarks on the tuning parameter alpha: Let $\chi^2_p$ be a chi-square distributed random variable with $p$ degrees of freedom ($p$ is the number of variables; $n$ is the number of observations). Denote the $(1-\alpha)$ quantile by $\chi^2_p(\alpha)$, e.g., $\chi^2_p(0.05)$ is the 0.95 quantile. Following Billor et al. (2000), the cutoff value for the Mahalanobis distances is defined as $\chi_p(\alpha/n)$ (the square root of $\chi^2_p(\alpha/n)$) times a correction factor $c(n,p)$ depending on $n$ and $p$, and they use $\alpha = 0.05$.
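The chi-square part of this cutoff is easy to compute in base R. The sketch below uses a hypothetical helper, bacon_cutoff(), that is not part of the package. The formula for the correction factor c(n, p) is not given here, so it is left as a user-supplied argument corr with a placeholder default of 1.

```r
# Square root of the (1 - alpha/n) chi-square quantile with p degrees of
# freedom, multiplied by a user-supplied correction factor.
# 'corr' stands in for c(n, p); the default of 1 is only a placeholder.
bacon_cutoff <- function(n, p, alpha = 0.05, corr = 1) {
  # upper-tail form avoids computing 1 - alpha/n explicitly
  corr * sqrt(qchisq(alpha / n, df = p, lower.tail = FALSE))
}

cut_default <- bacon_cutoff(n = 47, p = 2)               # alpha = 0.05
cut_strict  <- bacon_cutoff(n = 47, p = 2, alpha = 0.01) # smaller alpha
```

A smaller alpha gives a larger cutoff, so fewer observations are nominated as outliers.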

Returns

a list with components

  • subset: logical vector of length n where the i-th entry is TRUE if and only if the i-th observation is part of the final selection.

  • dis: numeric vector of length n with the (Mahalanobis) distances.

  • cov: p x p matrix, the corresponding robust estimate of covariance.
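To make the relation between these components concrete, here is a base R sketch under an explicit assumption: that the distances and the covariance are computed from the observations in the final subset. The data, the hand-made subset vector and the names center and cov.sub are hypothetical and only serve the illustration.

```r
set.seed(42)
x <- rbind(matrix(rnorm(100), ncol = 2),          # 50 regular points
           matrix(rnorm(6, mean = 8), ncol = 2))  # 3 shifted points

# Hypothetical final selection: the last 3 rows are treated as nominated outliers
subset <- c(rep(TRUE, 50), rep(FALSE, 3))

# Location and covariance from the selected observations only (assumption)
center  <- colMeans(x[subset, , drop = FALSE])
cov.sub <- cov(x[subset, , drop = FALSE])

# Mahalanobis distances of all observations; mahalanobis() returns squared values
dis <- sqrt(mahalanobis(x, center, cov.sub))

outliers <- which(!subset)   # indices of the nominated outliers
```

With a real result, which(!res$subset) applied to the returned list gives the nominated outliers in the same way.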

References

Billor, N., Hadi, A. S., and Velleman, P. F. (2000). BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators; Computational Statistics and Data Analysis 34, 279--298. doi:10.1016/S0167-9473(99)00101-2

Author(s)

Ueli Oetliker, Swiss Federal Statistical Office, for S-plus 5.1. Port to R, testing etc., by Martin Maechler; init selection "V2" and correction of default alpha from 0.95 to 0.05, by Tobias Schoch, FHNW Olten, Switzerland.

See Also

covMcd for a high-breakdown (but more computer-intensive) method; BACON for a generalization, notably to regression.

Examples

    require(robustbase) # for example data and covMcd():
    ## simple 2D example :
    plot(starsCYG, main = "starsCYG data (n=47)")
    B.st <- mvBACON(starsCYG)
    points(starsCYG[ ! B.st$subset,], pch = 4, col = 2, cex = 1.5)
    stopifnot(identical(which(!B.st$subset), c(7L,11L,20L,30L,34L)))
    ## finds the 4 clear outliers (and 1 "borderline");
    ## it does not find obs. 14 which is an outlier according to covMcd(.)

    iniS <- setNames(, eval(formals(mvBACON)$init.sel)) # all initialization methods, incl "random"
    set.seed(123)
    Bs.st <- lapply(iniS[iniS != "manual"], function(s)
                    mvBACON(as.matrix(starsCYG), init.sel = s, verbose=FALSE))
    ii <- - match("steps", names(Bs.st[[1]]))
    Bs.s1 <- lapply(Bs.st, `[`, ii)
    stopifnot(exprs = {
        length(Bs.s1) >= 4
        length(unique(Bs.s1)) == 1 # all 4 methods give the same
    })

    ## Example where "dUniMedian" and "V2" differ :
    data(pulpfiber, package="robustbase")
    dU.plp <- mvBACON(as.matrix(pulpfiber), init.sel = "dUniMedian")
    V2.plp <- mvBACON(as.matrix(pulpfiber), init.sel = "V2")
    (oU <- which(! dU.plp$subset))
    (o2 <- which(! V2.plp$subset))
    stopifnot(setdiff(o2, oU) %in% c(57L,58L,59L,62L))
    ## and 57, 58, 59, and 62 *are* outliers according to covMcd(.)

    ## 'coleman' from pkg 'robustbase'
    coleman.x <- data.matrix(coleman[, 1:6])
    Cc <- covMcd (coleman.x) # truly robust
    summary(Cc) # -> 6 outliers (1,3,10,12,17,18)
    Cb1 <- mvBACON(coleman.x) ##-> subset is all TRUE hmm??
    Cb2 <- mvBACON(coleman.x, init.sel = "dUniMedian")
    stopifnot(all.equal(Cb1, Cb2))
    ## try 20 different random starts:
    Cb.r <- lapply(1:20, function(i) { set.seed(i)
                         mvBACON(coleman.x, init.sel="random", verbose=FALSE) })
    nm <- names(Cb.r[[1]]); nm <- nm[nm != "steps"]
    all(eqC <- sapply(Cb.r[-1], function(CC) all.equal(CC[nm], Cb.r[[1]][nm]))) # TRUE
    ## --> BACON always breaks down, i.e., does not see the outliers here
    ## breaks down even when manually starting with all the non-outliers:
    Cb.man <- mvBACON(coleman.x, init.sel = "manual",
                      man.sel = setdiff(1:20, c(1,3,10,12,17,18)))
    which( ! Cb.man$subset) # the outliers according to mvBACON : _none_

  • Maintainer: Martin Maechler
  • License: GPL (>= 2)
  • Last published: 2023-06-16
