VarSelCluster function

Variable selection and clustering.

This function performs model selection and maximum likelihood estimation. It can be used for clustering only (i.e., all the variables are assumed to be discriminative): specify the data to cluster (arg. x), the number of clusters (arg. gvals) and set the option vbleSelec to FALSE. It can also be used for variable selection in clustering: specify the data to analyse (arg. x), the number of clusters (arg. gvals) and set the option vbleSelec to TRUE. Variable selection can be done with the BIC, MICL or AIC criterion.
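A minimal sketch of the two usage modes, assuming the heart data set shipped with VarSelLCM (the object names are illustrative):

```r
# Load the package and the example data it ships with
library(VarSelLCM)
data(heart)
x <- heart[, -13]   # observed variables (drop the known class column)

# Clustering only: all variables treated as discriminative
res_clust <- VarSelCluster(x, gvals = 2, vbleSelec = FALSE)

# Clustering with variable selection, model chosen by BIC
res_sel <- VarSelCluster(x, gvals = 2, vbleSelec = TRUE, crit.varsel = "BIC")
```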

VarSelCluster(x, gvals, vbleSelec = TRUE, crit.varsel = "BIC", initModel = 50, nbcores = 1, discrim = rep(1, ncol(x)), nbSmall = 250, iterSmall = 20, nbKeep = 50, iterKeep = 1000, tolKeep = 10^(-6))

Arguments

  • x: data.frame/matrix. Rows correspond to observations and columns correspond to variables. Continuous variables must be "numeric", count variables must be "integer" and categorical variables must be "factor"
  • gvals: numeric. It defines the numbers of components to consider.
  • vbleSelec: logical. It indicates if a variable selection is done
  • crit.varsel: character. It defines the information criterion used for model selection. Without variable selection, you can use one of three criteria: "AIC", "BIC" and "ICL". With variable selection, you can use "AIC", "BIC" and "MICL".
  • initModel: numeric. It gives the number of initializations of the alternated algorithm maximizing the MICL criterion (only used if crit.varsel="MICL")
  • nbcores: numeric. It defines the number of cores used by the algorithm
  • discrim: numeric. It indicates if each variable is discriminative (1) or irrelevant (0) (only used if vbleSelec=FALSE)
  • nbSmall: numeric. It indicates the number of SmallEM algorithms performed for the ML inference
  • iterSmall: numeric. It indicates the number of iterations for each SmallEM algorithm
  • nbKeep: numeric. It indicates the number of chains used for the final EM algorithm
  • iterKeep: numeric. It indicates the maximal number of iterations for each EM algorithm
  • tolKeep: numeric. It defines the stopping tolerance: the EM algorithm stops when the gap between two successive iterations falls below this value
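The estimation-related arguments can be tuned jointly to trade accuracy for run time; a sketch (the values below are illustrative, not recommendations):

```r
library(VarSelLCM)
data(heart)

# More small-EM initializations and a stricter stopping tolerance:
# slower, but less likely to stop in a poor local optimum
res <- VarSelCluster(heart[, -13], gvals = 2,
                     nbSmall = 500, iterSmall = 30,
                     nbKeep = 100, iterKeep = 2000,
                     tolKeep = 1e-8)
```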

Returns

Returns an instance of VSLCMresults.

Examples

## Not run:
# Package loading
require(VarSelLCM)

# Data loading:
# x contains the observed variables
# z the known status (i.e., 1: absence and 2: presence of heart disease)
data(heart)
ztrue <- heart[,"Class"]
x <- heart[,-13]

# Cluster analysis without variable selection
res_without <- VarSelCluster(x, 2, vbleSelec = FALSE, crit.varsel = "BIC")

# Cluster analysis with variable selection (with parallelisation)
res_with <- VarSelCluster(x, 2, nbcores = 2, initModel = 40, crit.varsel = "BIC")

# Comparison of the BIC for both models:
# variable selection improves the BIC
BIC(res_without)
BIC(res_with)

# Confusion matrices and ARI (only possible because the "true" partition is known).
# ARI is computed between the true partition (ztrue) and its estimators.
# ARI is an index between 0 (partitions are independent) and 1 (partitions are equal).
# Variable selection improves the ARI.
# Note that ARI cannot be used for model selection in clustering,
# because there is no true partition.
# Variable selection decreases the misclassification error rate.
table(ztrue, fitted(res_without))
table(ztrue, fitted(res_with))
ARI(ztrue, fitted(res_without))
ARI(ztrue, fitted(res_with))

# Estimated partition
fitted(res_with)

# Estimated probabilities of classification
head(fitted(res_with, type="probability"))

# Summary of the probabilities of misclassification
plot(res_with, type="probs-class")

# Summary of the best model
summary(res_with)

# Discriminative power of the variables
# (here, the most discriminative variable is MaxHeartRate)
plot(res_with)

# More detailed output
print(res_with)

# Print model parameters
coef(res_with)

# Boxplot for the continuous variable MaxHeartRate
plot(x=res_with, y="MaxHeartRate")

# Empirical and theoretical distributions of the most discriminative variable
# (to check that the distribution is well-fitted)
plot(res_with, y="MaxHeartRate", type="cdf")

# Summary of a categorical variable
plot(res_with, y="Sex")

# Probabilities of classification for new observations
predict(res_with, newdata = x[1:3,])

# Imputation by posterior mean for the first observation
not.imputed <- x[1,]
imputed <- VarSelImputation(res_with, x[1,], method = "sampling")
rbind(not.imputed, imputed)

# Opening the Shiny application to easily see the results
VarSelShiny(res_with)

## End(Not run)

References

Marbac, M. and Sedki, M. (2017). Variable selection for model-based clustering using the integrated completed-data likelihood. Statistics and Computing, 27 (4), 1049-1063.

Marbac, M., Patin, E. and Sedki, M. (2018). Variable selection for mixed data clustering: Application in human population genomics. Journal of Classification, to appear.