This function applies an outlier identification algorithm to the data in the x array [n x p] and y vector [n], following the approach described by Hadi et al. for their BACON outlier procedure.
x: numeric matrix (of dimension [n x p]), which must not contain missing values.
collect: a multiplication factor c used, when init.sel is not "manual", to define m, the size of the initial basic subset, as c * p; in practice, m <- min(p * collect, n/2).
m: integer in 1:n specifying the size of the initial basic subset; used only when init.sel is not "manual".
alpha: determines the cutoff value for the Mahalanobis distances (see details).
init.sel: character string, specifying the initial selection mode; implemented modes are:
"Mahalanobis": based on Mahalanobis distances (default); the version V1 of the reference; affine invariant but not robust.
"dUniMedian": based on the distances from the univariate medians; similar to the version V2 of the reference; robust but not affine invariant.
"random": based on a random selection, i.e., reproducible only via set.seed().
"manual": based on manual selection; in this case, a vector man.sel containing the indices of the selected observations must be specified.
"V2": based on the Euclidean norm from the univariate medians; this is the version V2 of the reference; robust but not affine invariant.
"Mahalanobis" and "V2" were proposed by Hadi and the other authors in the reference as versions V_1 and V_2, as well as "manual", while "random" is provided in order to study the behaviour of BACON. Option "dUniMedian" is similar to "V2" and is due to U. Oetliker.
man.sel: only when init.sel == "manual", the indices of observations determining the initial basic subset (and m <- length(man.sel)).
maxsteps: maximal number of iteration steps.
allowSingular: logical indicating whether a solution should be sought even when no matrix of rank p is found.
verbose: logical indicating if messages are printed which trace progress of the algorithm.
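The size of the initial basic subset and a "V2"-style start can be sketched in base R as follows. This is an illustrative sketch, not the package's internal code: the toy data `x`, the value `collect <- 4`, and the names `med`, `d`, and `init_subset` are assumptions made for the example.

```r
# Sketch only: base R, not the package's internal implementation.
set.seed(1)                        # toy data are random; fix the seed
n <- 50; p <- 3
x <- matrix(rnorm(n * p), n, p)    # hypothetical data matrix [n x p]

collect <- 4                       # assumed value of the multiplication factor c
m <- min(p * collect, n / 2)       # size of the initial basic subset

# "V2"-style start: Euclidean norm of each row's deviation from the
# coordinate-wise (univariate) medians
med <- apply(x, 2, median)
d   <- sqrt(rowSums(sweep(x, 2, med)^2))

# indices of the m observations closest to the medians
init_subset <- order(d)[seq_len(m)]
```

A vector such as `init_subset` is also the kind of index vector that man.sel expects when init.sel is "manual".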
Details
Remarks on the tuning parameter alpha: Let $\chi^2_p$ be a chi-square distributed random variable with p degrees of freedom (p is the number of variables; n is the number of observations). Denote the $(1-\alpha)$ quantile by $\chi^2_p(\alpha)$; e.g., $\chi^2_p(0.05)$ is the 0.95 quantile. Following Billor et al. (2000), the cutoff value for the Mahalanobis distances is defined as $\chi_p(\alpha/n)$ (the square root of $\chi^2_p(\alpha/n)$) times a correction factor $c(n,p)$ depending on n and p, and they use $\alpha = 0.05$.
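As a rough illustration of the quantile part of this cutoff, the following base R sketch computes $\chi_p(\alpha/n)$ for n = 47 and p = 2 (the dimensions of the starsCYG data used in the examples). The correction factor $c(n,p)$ is not specified in this page, so it is deliberately left out; the variable names are assumptions made for the example.

```r
# Sketch only: uncorrected cutoff, i.e. without the factor c(n, p).
n     <- 47      # number of observations
p     <- 2       # number of variables
alpha <- 0.05    # value used by Billor et al. (2000)

# chi^2_p(alpha/n) denotes the (1 - alpha/n) quantile of the
# chi-square distribution with p degrees of freedom
chi2_quantile <- qchisq(1 - alpha / n, df = p)

# chi_p(alpha/n) is its square root
cutoff_uncorrected <- sqrt(chi2_quantile)
cutoff_uncorrected
```

Dividing alpha by n makes the cutoff grow with the sample size, so that fewer regular observations are flagged by chance in large data sets.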
Returns
a list with components
subset: logical vector of length n where the i-th entry is true iff the i-th observation is part of the final selection.
dis: numeric vector of length n with the (Mahalanobis) distances.
cov: p x p matrix, the corresponding robust estimate of covariance.
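How the three components relate to one another can be sketched in base R. The sketch assumes that the center and covariance are estimated from the selected observations and that the distances are then computed for all n observations; this page does not spell that out, so treat it as an assumption. The toy data, the hand-made `subset`, and the names `center` and `covmat` are hypothetical.

```r
# Sketch only: mimics the shape of the returned components in base R.
set.seed(2)
n <- 30; p <- 2
x <- matrix(rnorm(n * p), n, p)
x[1:3, ] <- x[1:3, ] + 10           # three planted, hypothetical outliers

# hypothetical final selection: TRUE for observations kept in the subset
subset <- rep(TRUE, n)
subset[1:3] <- FALSE

# location and covariance estimated from the selected observations only
center <- colMeans(x[subset, , drop = FALSE])
covmat <- cov(x[subset, , drop = FALSE])      # p x p matrix

# Mahalanobis distances of all n observations (mahalanobis() returns squares)
dis <- sqrt(mahalanobis(x, center, covmat))

which(!subset)                       # observations outside the final selection
```

With a real fit, `which(!fit$subset)` lists the nominated outliers in the same way, as done in the examples.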
References
Billor, N., Hadi, A. S., and Velleman, P. F. (2000). BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators; Computational Statistics and Data Analysis 34, 279--298. doi:10.1016/S0167-9473(99)00101-2
Author(s)
Ueli Oetliker, Swiss Federal Statistical Office, for S-plus 5.1. Port to R, testing etc., by Martin Maechler; Init selection "V2" and correction of default alpha from 0.95 to 0.05, by Tobias Schoch, FHNW Olten, Switzerland.
See Also
covMcd for a high-breakdown (but more computer intensive) method; BACON for a generalization, notably to regression.
Examples
require(robustbase) # for example data and covMcd():

## simple 2D example :
plot(starsCYG, main = "starsCYG data (n=47)")
B.st <- mvBACON(starsCYG)
points(starsCYG[! B.st$subset, ], pch = 4, col = 2, cex = 1.5)
stopifnot(identical(which(!B.st$subset), c(7L, 11L, 20L, 30L, 34L)))
## finds the 4 clear outliers (and 1 "borderline");
## it does not find obs. 14 which is an outlier according to covMcd(.)

iniS <- setNames(, eval(formals(mvBACON)$init.sel)) # all initialization methods, incl "random"
set.seed(123)
Bs.st <- lapply(iniS[iniS != "manual"], function(s)
    mvBACON(as.matrix(starsCYG), init.sel = s, verbose = FALSE))
ii <- - match("steps", names(Bs.st[[1]])) # negative index: drop the "steps" component
Bs.s1 <- lapply(Bs.st, `[`, ii)
stopifnot(exprs = {
    length(Bs.s1) >= 4
    length(unique(Bs.s1)) == 1 # all 4 methods give the same
})

## Example where "dUniMedian" and "V2" differ :
data(pulpfiber, package = "robustbase")
dU.plp <- mvBACON(as.matrix(pulpfiber), init.sel = "dUniMedian")
V2.plp <- mvBACON(as.matrix(pulpfiber), init.sel = "V2")
(oU <- which(! dU.plp$subset))
(o2 <- which(! V2.plp$subset))
stopifnot(setdiff(o2, oU) %in% c(57L, 58L, 59L, 62L))
## and 57, 58, 59, and 62 *are* outliers according to covMcd(.)

## 'coleman' from pkg 'robustbase'
coleman.x <- data.matrix(coleman[, 1:6])
Cc <- covMcd(coleman.x) # truly robust
summary(Cc) # -> 6 outliers (1,3,10,12,17,18)
Cb1 <- mvBACON(coleman.x) ##-> subset is all TRUE hmm??
Cb2 <- mvBACON(coleman.x, init.sel = "dUniMedian")
stopifnot(all.equal(Cb1, Cb2))

## try 20 different random starts:
Cb.r <- lapply(1:20, function(i) {
    set.seed(i)
    mvBACON(coleman.x, init.sel = "random", verbose = FALSE)
})
nm <- names(Cb.r[[1]]); nm <- nm[nm != "steps"]
all(eqC <- sapply(Cb.r[-1], function(CC) all.equal(CC[nm], Cb.r[[1]][nm]))) # TRUE
## --> BACON always breaks down, i.e., does not see the outliers here

## breaks down even when manually starting with all the non-outliers:
Cb.man <- mvBACON(coleman.x, init.sel = "manual",
                  man.sel = setdiff(1:20, c(1, 3, 10, 12, 17, 18)))
which(! Cb.man$subset) # the outliers according to mvBACON : _none_