BEM function

BACON-EEM Algorithm for multivariate outlier detection in incomplete multivariate survey data

BEM starts from a set of uncontaminated data with possible missing values and applies a version of the EM-algorithm to estimate the center and scatter of the good data. It then adds to the good data those observations whose Mahalanobis distance is below a threshold and deletes those whose distance is not. This process iterates until the good data remain stable. Observations not among the good data are outliers.
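The iteration described above can be sketched in a few lines of base R. This is a minimal illustration under simplifying assumptions, not the code used by BEM: it handles complete data only (so there is no EM step), ignores weights, and uses a plain, unstandardised chi-square cutoff. The function name bacon_sketch and its max.iter argument are hypothetical.

```r
# Simplified BACON-style loop: complete data, equal weights, no EM step.
# bacon_sketch and max.iter are hypothetical names used for illustration.
bacon_sketch <- function(x, c0 = 3, alpha = 0.01, max.iter = 50) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  # Starting subset: the c0 * p points closest (Euclidean distance) to the
  # componentwise median.
  med <- apply(x, 2, median)
  d0 <- sqrt(rowSums(sweep(x, 2, med)^2))
  good <- rank(d0, ties.method = "first") <= min(n, c0 * p)
  cutoff <- qchisq(1 - alpha, df = p)  # plain quantile, no standardisation
  for (iter in seq_len(max.iter)) {
    # Estimate center and scatter from the current good subset.
    center <- colMeans(x[good, , drop = FALSE])
    scatter <- cov(x[good, , drop = FALSE])
    md2 <- mahalanobis(x, center, scatter)  # squared Mahalanobis distances
    new.good <- md2 < cutoff
    if (all(new.good == good)) break        # good subset is stable
    good <- new.good
  }
  list(outind = !good, dist = sqrt(md2), iterations = iter)
}

# Simulated data: 100 standard normal points plus two planted outliers.
set.seed(1)
x <- rbind(matrix(rnorm(200), ncol = 2), c(10, 10), c(12, -11))
res <- bacon_sketch(x)
which(res$outind)
```

The sketch recomputes the whole good subset at each step, so observations can both enter and leave it, which mirrors the "adds (or deletes)" behaviour described above.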

BEM( data, weights, v = 2, c0 = 3, alpha = 0.01, md.type = "m", em.steps.start = 10, em.steps.loop = 5, better.estimation = FALSE, monitor = FALSE )

Arguments

  • data: a matrix or data frame. As usual, rows are observations and columns are variables.

  • weights: a non-negative and non-zero vector of weights for each observation. Its length must equal the number of rows of the data. Default is rep(1, nrow(data)).

  • v: an integer indicating the distance for the definition of the starting good subset: v = 1 uses the Mahalanobis distance based on the weighted mean and covariance, v = 2 uses the Euclidean distance from the componentwise median.

  • c0: the size of the initial subset is c0 * ncol(data).

  • alpha: a small probability indicating the level (1 - alpha) of the cutoff quantile for good observations.

  • md.type: type of Mahalanobis distance: "m" marginal, "c" conditional.

  • em.steps.start: number of iterations of the EM-algorithm for the starting good subset.

  • em.steps.loop: number of iterations of the EM-algorithm for the good subset.

  • better.estimation: if TRUE, the EM-algorithm for the final good subset runs for em.steps.start additional iterations.

  • monitor: if TRUE, verbose output.
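To make the roles of v and c0 concrete, the following base R sketch selects a starting subset under both distance definitions. It is an illustration only and assumes complete data. The helper name start_subset is hypothetical, and using the unweighted median for v = 2 is a simplifying assumption rather than a statement about what BEM does internally.

```r
# Hypothetical helper: pick the c0 * ncol(x) observations closest to the
# center under the two distance definitions described for v.
start_subset <- function(x, weights = rep(1, nrow(x)), v = 2, c0 = 3) {
  x <- as.matrix(x)
  m <- min(nrow(x), c0 * ncol(x))  # size of the starting subset
  if (v == 1) {
    # Squared Mahalanobis distance from the weighted mean and covariance.
    est <- cov.wt(x, wt = weights / sum(weights))
    d <- mahalanobis(x, est$center, est$cov)
  } else {
    # Squared Euclidean distance from the componentwise median
    # (unweighted here, which is an assumption of this sketch).
    med <- apply(x, 2, median)
    d <- rowSums(sweep(x, 2, med)^2)
  }
  order(d)[seq_len(m)]  # row indices of the m closest observations
}

set.seed(2)
x <- matrix(rnorm(60), ncol = 3)  # 20 observations, 3 variables
idx1 <- start_subset(x, v = 1)
idx2 <- start_subset(x, v = 2)
length(idx1)  # c0 * ncol(x) = 9
```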

Returns

BEM returns a list whose first component output is a sublist with the following components:

  • sample.size: Number of observations
  • discarded.observations: Number of discarded observations
  • number.of.variables: Number of variables
  • significance.level: The probability used for the cutpoint, i.e. alpha
  • initial.basic.subset.size: Size of initial good subset
  • final.basic.subset.size: Size of final good subset
  • number.of.iterations: Number of iterations of the BACON step
  • computation.time: Elapsed computation time
  • center: Final estimate of the center
  • scatter: Final estimate of the covariance matrix
  • cutpoint: The threshold MD-value for the cut-off of outliers

The further components returned by BEM are:

  • outind: Indicator of outliers
  • dist: Final Mahalanobis distances

Details

The BACON algorithm with v = 1 is not robust but affine equivariant, while v = 2 is robust but not affine equivariant. The threshold for the (squared) Mahalanobis distances, beyond which an observation is an outlier, is a standardised chi-square quantile at (1 - alpha). For large data sets it may be better to choose alpha / n instead.

The internal function EM.normal is usually called from BEM. EM.normal implements the EM-algorithm in such a way that part of the calculations can be saved and reused in the BEM algorithm. EM.normal does not compute the observed sufficient statistics; these are computed in the main program of BEM and passed to EM.normal as parameters, together with the statistics on the missingness patterns.
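The effect of choosing alpha / n instead of alpha can be seen by comparing the corresponding chi-square quantiles. The values of p and n below are hypothetical, and the standardisation that BEM applies to the quantile is not reproduced here, so the numbers only indicate how the raw cutoff moves.

```r
p <- 5        # number of variables (hypothetical)
n <- 1000     # number of observations (hypothetical)
alpha <- 0.01

# Raw chi-square cutoffs for squared Mahalanobis distances.
cut.alpha   <- qchisq(1 - alpha, df = p)      # level 1 - alpha
cut.alpha.n <- qchisq(1 - alpha / n, df = p)  # level 1 - alpha / n

# Under an exact chi-square model, the expected number of clean
# observations that exceed each cutoff by chance.
exp.false.alpha   <- n * alpha        # about 10
exp.false.alpha.n <- n * (alpha / n)  # about 0.01

c(cut.alpha = cut.alpha, cut.alpha.n = cut.alpha.n)
```

With level 1 - alpha, the number of good observations flagged by chance grows in proportion to n, which is why a stricter level can be preferable for large data sets.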

Note

BEM uses an adapted version of the EM-algorithm in function EM.normal.

Examples

# Bushfire data set with 20% MCAR
data(bushfirem, bushfire.weights)
bem.res <- BEM(bushfirem, bushfire.weights, alpha = (1 - 0.01 / nrow(bushfirem)))
print(bem.res$output)

References

Béguin, C. and Hulliger, B. (2008). The BACON-EEM Algorithm for Multivariate Outlier Detection in Incomplete Survey Data. Survey Methodology, 34(1), 91-103.

Billor, N., Hadi, A.S. and Velleman, P.F. (2000). BACON: Blocked Adaptive Computationally-efficient Outlier Nominators. Computational Statistics and Data Analysis, 34(3), 279-298.

Schafer, J.L. (2000). Analysis of Incomplete Multivariate Data. Monographs on Statistics and Applied Probability 72, Chapman & Hall.

Author(s)

Beat Hulliger

  • Maintainer: Beat Hulliger
  • License: MIT + file LICENSE
  • Last published: 2023-03-14