This function performs model selection and maximum-likelihood estimation. It can be used for clustering only (i.e., all variables are assumed to be discriminative): specify the data to cluster (arg. x) and the number of clusters (arg. gvals), and set vbleSelec to FALSE. It can also be used for variable selection in clustering: specify the data to analyse (arg. x) and the number of clusters (arg. gvals), and set vbleSelec to TRUE. Variable selection can be done with the BIC, MICL or AIC criterion.
x: data.frame/matrix. Rows correspond to observations and columns correspond to variables. Continuous variables must be "numeric", count variables must be "integer" and categorical variables must be "factor".
gvals: numeric. It defines the numbers of components (clusters) to consider.
vbleSelec: logical. It indicates whether variable selection is performed.
crit.varsel: character. It defines the information criterion used for model selection. Without variable selection, you can use one of the three criteria: "AIC", "BIC" and "ICL". With variable selection, you can use "AIC", "BIC" and "MICL".
initModel: numeric. It gives the number of initializations of the alternated algorithm maximizing the MICL criterion (only used if crit.varsel="MICL")
nbcores: numeric. It defines the number of cores used by the algorithm.
discrim: numeric. It indicates if each variable is discriminative (1) or irrelevant (0) (only used if vbleSelec=FALSE).
nbSmall: numeric. It indicates the number of SmallEM algorithms performed for the ML inference
iterSmall: numeric. It indicates the number of iterations for each SmallEM algorithm
nbKeep: numeric. It indicates the number of chains used for the final EM algorithm
iterKeep: numeric. It indicates the maximal number of iterations for each EM algorithm
tolKeep: numeric. It defines the stopping tolerance: the EM algorithm stops when the gap between two successive iterations falls below tolKeep.
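The two usage modes described above can be sketched as follows. This is a minimal illustration using the heart data shipped with the package; the argument values (gvals, initModel, nbcores) are illustrative choices, not recommendations:

```r
library(VarSelLCM)

# Heart data: drop the known class label, keep the observed variables
data(heart)
x <- heart[, -13]

# Mode 1 - clustering only: all variables are kept as discriminative
res_clust <- VarSelCluster(x, gvals = 2, vbleSelec = FALSE, crit.varsel = "BIC")

# Mode 2 - clustering with variable selection, model chosen by MICL
res_sel <- VarSelCluster(x, gvals = 2, vbleSelec = TRUE, crit.varsel = "MICL",
                         initModel = 50, nbcores = 2)
```

Passing a vector to gvals (e.g. gvals = 1:4) makes the function compare several numbers of components under the chosen criterion.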
Returns
Returns an instance of VSLCMresults.
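A VSLCMresults object is typically inspected through the generic methods used in the Examples below (a brief sketch; res stands for any object returned by VarSelCluster):

```r
# Inspecting a fitted VSLCMresults object via its generic methods
summary(res)                              # summary of the best model
fitted(res)                               # estimated partition
head(fitted(res, type = "probability"))   # classification probabilities
coef(res)                                 # model parameters
BIC(res)                                  # BIC of the selected model
```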
Examples
## Not run:
# Package loading
require(VarSelLCM)

# Data loading:
# x contains the observed variables
# z the known status (i.e., 1: absence and 2: presence of heart disease)
data(heart)
ztrue <- heart[,"Class"]
x <- heart[,-13]

# Cluster analysis without variable selection
res_without <- VarSelCluster(x, 2, vbleSelec = FALSE, crit.varsel = "BIC")

# Cluster analysis with variable selection (with parallelisation)
res_with <- VarSelCluster(x, 2, nbcores = 2, initModel = 40, crit.varsel = "BIC")

# Comparison of the BIC for both models:
# variable selection permits to improve the BIC
BIC(res_without)
BIC(res_with)

# Confusion matrices and ARI (only possible because the "true" partition is known).
# ARI is computed between the true partition (ztrue) and its estimators
# ARI is an index between 0 (partitions are independent) and 1 (partitions are equal)
# variable selection permits to improve the ARI
# Note that ARI cannot be used for model selection in clustering, because there is no true partition
# variable selection decreases the misclassification error rate
table(ztrue, fitted(res_without))
table(ztrue, fitted(res_with))
ARI(ztrue, fitted(res_without))
ARI(ztrue, fitted(res_with))

# Estimated partition
fitted(res_with)

# Estimated probabilities of classification
head(fitted(res_with, type = "probability"))

# Summary of the probabilities of misclassification
plot(res_with, type = "probs-class")

# Summary of the best model
summary(res_with)

# Discriminative power of the variables (here, the most discriminative variable is MaxHeartRate)
plot(res_with)

# More detailed output
print(res_with)

# Print model parameters
coef(res_with)

# Boxplot for the continuous variable MaxHeartRate
plot(x = res_with, y = "MaxHeartRate")

# Empirical and theoretical distributions of the most discriminative variable
# (to check that the distribution is well-fitted)
plot(res_with, y = "MaxHeartRate", type = "cdf")

# Summary of categorical variable
plot(res_with, y = "Sex")

# Probabilities of classification for new observations
predict(res_with, newdata = x[1:3,])

# Imputation by posterior mean for the first observation
not.imputed <- x[1,]
imputed <- VarSelImputation(res_with, x[1,], method = "sampling")
rbind(not.imputed, imputed)

# Opening Shiny application to easily see the results
VarSelShiny(res_with)

## End(Not run)
References
Marbac, M. and Sedki, M. (2017). Variable selection for model-based clustering using the integrated completed-data likelihood. Statistics and Computing, 27 (4), 1049-1063.
Marbac, M., Patin, E. and Sedki, M. (2018). Variable selection for mixed data clustering: Application in human population genomics. Journal of Classification, to appear.