Generalised Pairs Plots for MoEClust Mixture Models
Produces a matrix of plots showing pairwise relationships between continuous response variables and continuous/categorical/logical/ordinal associated covariates, as well as the clustering achieved, according to fitted MoEClust mixture models.
res: An object of class "MoEClust" generated by MoE_clust, or an object of class "MoECompare" generated by MoE_compare. Models with a noise component are facilitated here too.
response.type: The type of plot desired for the scatterplots comparing continuous response variables. Defaults to "points". See scatter.pars below.
Points can also be sized according to their associated clustering uncertainty with the option "uncertainty". In doing so, the transparency of the points will also be proportional to their clustering uncertainty, provided the device supports transparency. See also MoE_Uncertainty for an alternative means of visualising observation-specific cluster uncertainties (especially for univariate data). See scatter.pars below, and note that models fitted via the "CEM" algorithm will have no associated clustering uncertainty.
Alternatively, the bivariate parametric "density" contours can be displayed (see density.pars), provided there is at least one Gaussian component in the model. Caution is advised when producing density plots for models with covariates in the expert network; the required number of evaluations of the (multivariate) Gaussian density for each panel (res$G * prod(density.pars$grid.size)) increases by a factor of res$n, thus plotting may be slow (particularly for large data sets). However, this is offset somewhat by using pre-calculated densities from the corresponding upper-triangular panels when producing the lower-triangular panels. See density.pars below.
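To get a feel for the cost described above, the evaluation counts can be computed directly. The following is a minimal arithmetic sketch only; the values of G, n, and grid_size are illustrative assumptions and are not taken from any fitted model.

```r
# Illustrative (assumed) values: 2 components, 202 observations, 100 x 100 grid.
G         <- 2
n         <- 202
grid_size <- c(100, 100)

# Density evaluations per panel without expert network covariates.
evals_no_expert <- G * prod(grid_size)

# With expert network covariates, the count increases by a factor of n.
evals_expert <- evals_no_expert * n

format(c(no_expert = evals_no_expert, expert = evals_expert), big.mark = ",")
```

Reducing density.pars$grid.size is therefore the most direct way to speed up such plots, at the cost of coarser contours.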
subset: A list giving named arguments for producing only a subset of panels:
show.map: Logical indicating whether to show panels involving the MAP classification. Defaults to TRUE, unless there is only one component, in which case the MAP classification is never plotted.
data.ind: For subsetting response variables: a vector of column indices corresponding to the variables in the columns of res$data which should be shown. Defaults to all. Can be 0, in order to suppress plotting the response variables. Alternatively, character strings matching the column names of res$data can be supplied here.
cov.ind: For subsetting covariates: a vector of column indices corresponding to the covariates in the columns res$net.covs which should be shown. Defaults to all. Can be 0, in order to suppress plotting the covariates. Alternatively, character strings matching the column names of res$net.covs can be supplied here.
submat: Can take the values "all" (default), "upper", "lower", or "diagonal", for displaying all panels or only the upper/lower triangular panels or diagonal (marginal) panels of the plot matrix.
The results of the subsetting must ensure that at least one panel of some sort can be plotted. The arguments data.ind and cov.ind can also be used to simply reorder the panels, without actually subsetting. Diagonal panels are always drawn, regardless of the value of submat (but can be somewhat suppressed using diag.pars$show.hist=FALSE and diag.pars$show.dens=FALSE; see diag.pars below). When diag.pars$diagonal=TRUE (the default), the triangular portions are the "upper"-right and "lower"-left, whereas they are the "upper"-left and "lower"-right when diag.pars$diagonal=FALSE. Generally, submat="upper" should be preferable to submat="lower", as it ensures that response variables and covariates are displayed as appropriate on the y-axes and x-axes, respectively.
scatter.type: A vector of length 2 (or 1) giving the plot type for the upper and lower triangular portions of the plot, respectively, pertaining to the associated covariates. Defaults to "lm" for covariate vs. response panels and "points" otherwise. Only relevant for models with continuous covariates in the gating &/or expert network. "ci" and "lm" type plots are only produced for plots pairing covariates with response, and never response vs. response or covariate vs. covariate. Note that lines &/or confidence intervals will only be drawn for continuous covariates included in the expert network; to include covariates included only in the gating network also, the options "lm2" or "ci2" can be used but this is not generally advisable. See scatter.pars below.
conditional: A vector of length 2 (or 1) giving the plot type for the upper and lower triangular portions of the plot, respectively, for plots involving a mix of categorical and continuous variables. Defaults to "stripplot" in the upper triangle and "boxplot" in the lower triangle (see panel.stripplot and panel.bwplot). "violin" and "barcode" plots can also be produced. Only relevant for models with categorical covariates in the gating &/or expert network, unless subset$show.map is TRUE. Comparisons of two categorical variables (which can only ever be covariates or the MAP classification) are always displayed via mosaic plots (see strucplot).
All conditional panel types can be customised further; see stripplot.pars, boxplot.pars (for both "boxplot" and "violin" plots), barcode.pars, and mosaic.pars below. Note that when conditional is of length 1, that plot type will be used in both the upper and lower triangular portions of the plot, where relevant.
addEllipses: Controls whether to add MVN ellipses with axes corresponding to the within-cluster covariances for the response data. The options "inner" and "outer" (the default) will colour the axes or the perimeter of those ellipses, respectively, according to the cluster they represent (according to scatter.pars$eci.col). The option "both" will obviously colour both the axes and the perimeter. The "yes" or "no" options merely govern whether the ellipses are drawn, i.e. "yes" draws ellipses without any colouring. Ellipses are only ever drawn for multivariate data, and only when response.type is "points" or "uncertainty".
Ellipses are centered on the posterior mean of the fitted values when there are expert network covariates, otherwise on the posterior mean of the response variables. In the presence of expert network covariates, the component-specific covariance matrices are also (by default, via the argument expert.covar below) modified for plotting purposes via the function expert_covar, in order to account for the extra variability of the means, usually resulting in bigger shapes & sizes for the MVN ellipses.
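The geometry of such an ellipse, with a perimeter and two principal axes derived from a covariance matrix, can be sketched in base R as follows. This is only an illustration: mu and Sigma are hypothetical values, and the one-standard-deviation scaling is an assumption rather than the scaling used by the package.

```r
# Hypothetical component mean and covariance matrix for two response variables.
mu    <- c(1, 2)
Sigma <- matrix(c(2, 0.8, 0.8, 1), nrow = 2)

# Map points on the unit circle through the eigen-decomposition of Sigma.
eig    <- eigen(Sigma, symmetric = TRUE)
theta  <- seq(0, 2 * pi, length.out = 200)
circle <- rbind(cos(theta), sin(theta))
A      <- eig$vectors %*% diag(sqrt(eig$values))
perimeter <- t(mu + A %*% circle)

# End points of the two principal axes of the ellipse.
axes <- lapply(1:2, function(j) rbind(mu - A[, j], mu + A[, j]))

# Solid perimeter (cf. "outer") and dashed axes (cf. "inner").
plot(perimeter, type = "l", asp = 1, xlab = "x", ylab = "y")
for (ax in axes) lines(ax, lty = 2)
points(mu[1], mu[2], pch = 3)
```

Every point on the perimeter lies at a squared Mahalanobis distance of 1 from mu, which is what makes the shape reflect the within-cluster covariance.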
expert.covar: Logical (defaults to TRUE) governing whether the extra variability in the component means is added to the MVN ellipses corresponding to the component covariance matrices in the presence of expert network covariates. See the function expert_covar. Only relevant when response.type is "points" or "uncertainty" when addEllipses is invoked accordingly, and only relevant for models with expert network covariates and multivariate responses.
border.col: A vector of length 5 (or 1) containing border colours for plots against the MAP classification, response vs. response, covariate vs. response, response vs. covariate, and covariate vs. covariate panels, respectively.
Defaults to c("purple", "black", "brown", "brown", "navy").
bg.col: A vector of length 5 (or 1) containing background colours for plots against the MAP classification, response vs. response, covariate vs. response, response vs. covariate, and covariate vs. covariate panels, respectively.
Defaults to c("cornsilk", "white", "palegoldenrod", "palegoldenrod", "cornsilk").
outer.margins: A list of length 4 with units as components named bottom, left, top, and right, giving the outer margins; the default uses two lines of text. A vector of length 4 with units (ordered properly) will work, as will a vector of length 4 with numeric variables (interpreted as lines). May need to be increased to accommodate outer labels in some cases.
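The three accepted forms of outer.margins can be constructed as below. This is a small sketch using the grid package; the object names and the choice of three lines for the bottom margin are purely illustrative.

```r
library(grid)

# Named list of units, here widening only the bottom margin.
margins_list <- list(bottom = unit(3, "lines"), left  = unit(2, "lines"),
                     top    = unit(2, "lines"), right = unit(2, "lines"))

# Unit vector of length 4, ordered bottom, left, top, right.
margins_unit <- unit(c(3, 2, 2, 2), "lines")

# Plain numeric vector of length 4, interpreted as lines of text.
margins_num <- c(3, 2, 2, 2)
```

Any of these objects could then be supplied via the outer.margins argument.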
outer.labels: The default is typically NULL, for alternating labels around the perimeter. If "all", all labels are printed, and if "none", no labels are printed. If subset$submat="upper" or subset$submat="lower", outer.labels instead defaults to "all".
Note that axis labels always correspond to the range of the depicted variable, and thus should not be interpreted as indicating counts or densities for the diagonal panels when diag.pars$show.hist=TRUE &/or diag.pars$show.dens=TRUE.
outer.rot: A 2-vector (x, y) rotating the top/bottom outer labels x degrees and the left/right outer labels y degrees. Only works for categorical labels of boxplot, mosaic, strip plot, and violin plot panels. Defaults to c(0, 90). Reordering via data.ind or cov.ind may improve appearance of outer labels in some cases.
gap: The gap between the tiles; defaulting to 0.05 of the width of a tile.
buffer: The fraction by which to expand the range of quantitative variables to provide plots that will not truncate plotting symbols. Defaults to 0.025, i.e. 2.5 percent of the range. Particularly useful when ellipses are drawn (see addEllipses) to ensure ellipses are visible in full.
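The effect of buffer on an axis range can be illustrated with base R. The helper name expand_range is hypothetical, and applying the fraction at each end of the range is an assumption, since the description above does not say whether the expansion is per end or in total.

```r
# Hypothetical helper: widen the range of x by a fraction `buffer` at each end.
expand_range <- function(x, buffer = 0.025) {
  grDevices::extendrange(x, f = buffer)
}

# A variable spanning 0 to 10 gains a small margin on both sides.
expand_range(c(0, 10))
```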
uncert.cov: A logical indicating whether the expansion factor for points on plots involving covariates should also be modified when response.type="uncertainty". Defaults to FALSE, and only relevant for scatterplot and strip plot panels. When TRUE, scatter.pars$uncert.pch is invoked as the plotting symbols for covariate-related scatterplot and strip plot panels, otherwise scatter.pars$scat.pch and stripplot.pars$strip.pch are invoked for such panels.
scatter.pars: A list supplying select parameters for the continuous vs. continuous scatterplots.
where scat.pch, scat.col, and scat.size give the plotting symbols, colours, and sizes of the points in scatterplot panels, respectively. Note that eci.col gives both a) the colour of the fitted lines &/or confidence intervals for expert-related panels when scatter.type is one of "ci" or "lm" and b) the colour of the ellipses (if any) when addEllipses is one of "outer", "inner", or "both" and the response data is multivariate. Note that eci.col will inherit a suitable default from scat.col instead if the latter is supplied but the former is not.
Note also that scat.size will be modified on an observation-by-observation level when response.type is "uncertainty". Furthermore, note that the behaviour for plotting symbols when response.type="uncertainty" changes compared to response.type="points" depending on the value of the uncert.cov argument above. uncert.pch gives the plotting symbol used for all scatterplot (and strip plot) panels when response.type="uncertainty" and uncert.cov is TRUE. However, when uncert.cov is FALSE, scat.pch is invoked for scatterplots involving covariates and uncert.pch is used for panels involving only response variables. Finally, noise.size can be used to modify scat.size for observations assigned to the noise component (if any), but only when response.type="points".
density.pars: A list supplying select parameters for visualising the bivariate parametric density contours, only when response.type is "density".
where grid.size is a vector of length two giving the number of points in the x & y directions of the grid over which the density is evaluated, respectively (though density.pars$grid.size can also be supplied as a scalar, which will be automatically recycled to a vector of length 2), and dcol is either a single colour or a vector of length nlevels colours. dens.points governs whether points should be overlaid when response.type="density" (in other words, dens.points=TRUE is akin to specifying response.type="points" and response.type="density" simultaneously) and show.labels governs whether the density contours should be labelled. Note that contours are not labelled when dens.points=TRUE by default. Finally, label.style can take the values "mixed", "flat", or "align".
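The role of grid.size can be illustrated by evaluating a bivariate Gaussian mixture density over a grid and passing the result to contour(). This is a base R sketch under stated assumptions: the mixture parameters, the grid limits, and the helper dmvn2 are all hypothetical and do not reproduce the package's internals.

```r
# Hypothetical two-component bivariate Gaussian mixture.
tau   <- c(0.6, 0.4)
mu    <- list(c(0, 0), c(3, 2))
Sigma <- list(diag(2), matrix(c(1, 0.5, 0.5, 2), nrow = 2))

# A scalar grid size is recycled to a vector of length 2.
grid_size <- rep_len(50, 2)
x  <- seq(-3, 6, length.out = grid_size[1])
y  <- seq(-3, 6, length.out = grid_size[2])
xy <- as.matrix(expand.grid(x = x, y = y))

# Bivariate Gaussian density via the squared Mahalanobis distance.
dmvn2 <- function(xy, mu, Sigma) {
  exp(-0.5 * mahalanobis(xy, mu, Sigma)) / (2 * pi * sqrt(det(Sigma)))
}

# Mixture density at every grid point, reshaped to a matrix for contour().
dens <- Reduce(`+`, Map(function(p, m, S) p * dmvn2(xy, m, S), tau, mu, Sigma))
z    <- matrix(dens, nrow = grid_size[1], ncol = grid_size[2])

contour(x, y, z, nlevels = 11, drawlabels = FALSE)
```

The number of density evaluations here is length(tau) * prod(grid_size), which is why larger grids slow plotting.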
diag.pars: A list supplying select parameters for panels along the diagonal.
where hist.color is a vector of length 4, giving the colours for the response variables, gating covariates, expert covariates, and covariates entering both networks, respectively. By default, diagonal panels for response variables are ifelse(diag.pars$show.dens, "white", "black") and covariates of any kind are "dimgrey". hist.color also governs the outer colour for mosaic panels and the fill colour for boxplot and violin panels (except for those involving the MAP classification; see boxplot.pars below). However, in the case of response vs. (categorical) covariate boxplots and violin plots, the fill colour is always "white". The MAP classification is always coloured by cluster membership, by default. The argument show.counts is only relevant for categorical variables.
The argument show.dens toggles whether parametric density estimates are drawn over the diagonal panels for each response variable. When show.dens=TRUE, the component densities are shown via thin lines, with colours given by scatter.pars$scat.col, while a thick "black" line is used for the overall mixture density. This argument can be used with or without show.hist also being TRUE. Finally, the grid size when show.dens=TRUE is given by diag.grid=100 by default. As per response.type="density", plotting is liable to be a little slower when show.dens=TRUE for models with expert network covariates. For this reason show.dens is FALSE by default, though setting it to TRUE is otherwise recommended.
When diagonal=TRUE (the default), the diagonal from the top left to the bottom right is used for displaying the marginal distributions of variables (via histograms, with or without overlaid density estimates, or barplots, as appropriate). Specifying diagonal=FALSE will place the diagonal running from the top right down to the bottom left (with subset$submat accounted for accordingly).
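The kind of display produced on the diagonal when show.dens=TRUE, with thin component curves and a thick black mixture curve, can be mimicked in base R. All parameter values and colours below are hypothetical, and weighting each component curve by its mixing proportion is an assumption about how the curves are scaled.

```r
# Hypothetical univariate margin of a two-component Gaussian mixture.
tau   <- c(0.6, 0.4)
mu    <- c(0, 3)
sigma <- c(1, 1.5)

# Evaluation grid; 100 points mirrors the stated diag.grid default.
diag_grid <- 100
x <- seq(-4, 8, length.out = diag_grid)

# Weighted component densities (one column each) and the overall mixture.
comp <- sapply(seq_along(tau), function(g) tau[g] * dnorm(x, mu[g], sigma[g]))
mix  <- rowSums(comp)

# Thin coloured lines for components, thick black line for the mixture.
matplot(x, comp, type = "l", lty = 1, lwd = 1, col = c("blue", "red"),
        ylim = c(0, max(mix)), xlab = "response", ylab = "density")
lines(x, mix, lwd = 3, col = "black")
```

A smaller grid gives visibly more angular curves, while a larger one is smoother but takes longer to evaluate.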
stripplot.pars: A list supplying select parameters for continuous vs. categorical panels when one or both of the entries of conditional is "stripplot".
where strip.size and size.noise retain the definitions for the similar arguments under scatter.pars above. However, stripplot.pars$size.noise is invoked regardless of the response.type (in contrast to scatter.pars$noise.size). Notably, strip.col will inherit a suitable default from scatter.pars$scat.col if the latter is supplied but the former is not. Note also that the strip.pch default is modified to scatter.pars$uncert.pch if uncert.cov is TRUE.
boxplot.pars: A list supplying select parameters for continuous vs. categorical panels when one or both of the entries of conditional is "boxplot" or "violin".
All of the above are relevant for "boxplot" panels, are passed to panel.bwplot when producing boxplots, and retain the same definitions as the similarly named arguments therein. However, only box.col, varwidth, and box.fill are relevant for "violin" panels, and in both cases box.fill is only invoked for panels where the categorical variable is the MAP classification (i.e. when subset$show.map=TRUE). See diag.pars$hist.color for controlling the colours of non-MAP-related boxplot/violin panels. Notably, box.fill will inherit a suitable default from scatter.pars$scat.col if the latter is supplied but the former is not.
barcode.pars: A list supplying select parameters for continuous vs. categorical panels when one or both of the entries of conditional is "barcode".
where bar.col will inherit a suitable default from scatter.pars$scat.col if the latter is supplied but the former is not. See the help file for barcode::barcode for details on the remaining arguments. Note that the arguments ptsize and ptpch, which are only relevant when use.points=TRUE, are given by the corresponding scatter.pars$scat.size/scatter.pars$noise.size and scatter.pars$scat.pch arguments, by default.
mosaic.pars: A list supplying select parameters for categorical vs. categorical panels (if any).
The current default arguments and values thereof are passed through to strucplot for producing mosaic tiles. When shade is not FALSE, mfill is a logical which governs the colouring scheme for panels (if any) involving the MAP classification. When mfill is TRUE (the default), gp is invoked here in such a way that tiles will inherit appropriate interior colours via gp$fill from mcol and a "black" outer colour via gp$col. When mfill is FALSE, or the panel involves two categorical covariates, the outer colours are inherited from mcol and the interior fill colour is inherited from bg.col. See diag.pars$hist.color for controlling the interior fill colour of non-MAP-related mosaic panels. Notably, mcol will inherit a suitable default from scatter.pars$scat.col if the latter is supplied but the former is not.
axis.pars: A list supplying select parameters for controlling the axes.
NULL is equivalent to:
list(n.ticks=5, axis.fontsize=9).
The argument n.ticks will be overwritten for categorical variables with fewer than 5 levels.
...: Catches unused arguments. Alternatively, named arguments can be passed directly here to any/all of scatter.pars, density.pars, diag.pars, stripplot.pars, boxplot.pars, barcode.pars, mosaic.pars, and axis.pars.
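The idea that a parameter can be set either inside its list or directly via ... can be sketched with a small helper. The function merge_pars is hypothetical and is not the package's implementation; it merely shows one way that both routes could lead to the same set of parameters, using the axis.pars defaults quoted above.

```r
# Defaults for one parameter list (values as quoted for axis.pars above).
axis_defaults <- list(n.ticks = 5, axis.fontsize = 9)

# Hypothetical helper: merge a user-supplied list and any matching arguments
# from `...` into the defaults; arguments with unknown names are ignored.
merge_pars <- function(defaults, pars = NULL, ...) {
  dots <- list(...)
  dots <- dots[names(dots) %in% names(defaults)]
  out  <- utils::modifyList(defaults, if (is.null(pars)) list() else pars)
  utils::modifyList(out, dots)
}

# Setting n.ticks via the list and via `...` gives the same result;
# mfill is not an axis parameter, so this helper leaves it out.
via_list <- merge_pars(axis_defaults, pars = list(n.ticks = 3))
via_dots <- merge_pars(axis_defaults, n.ticks = 3, mfill = FALSE)
identical(via_list, via_dots)
```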
Returns
A generalised pairs plot showing all pairwise relationships between clustered response variables and associated gating &/or expert network continuous &/or categorical variables, coloured according to the MAP classification, with the marginal distributions of each variable along the diagonal.
Note
plot.MoEClust is a wrapper to MoE_gpairs which accepts the default arguments, and also produces other types of plots. Caution is advised when producing generalised pairs plots when the dimension of the data is large.
Note that all colour-related defaults in scatter.pars, stripplot.pars, barcode.pars, and mosaic.pars above assume a specific colour-palette (see mclust.options("classPlotColors")). Thus, for instance, specifying scatter.pars$scat.col=res$classification will produce different results compared to leaving this argument unspecified. This is especially true for models with a noise component, for which the default is handled quite differently (for one thing, res$G is the number of non-noise components). Similarly, all pch-related defaults in scatter.pars and stripplot.pars above assume a specific set of plotting symbols also (see mclust.options("classPlotSymbols")). Generally, all colour and symbol related arguments are strongly recommended to be left at their default values, unless being supplied as a single character string, e.g. "black" for colours. To help in this regard, colour-related arguments sensibly inherit their defaults from scatter.pars$scat.col if that is supplied and the argument in question is not.
Warning
For MoEClust models with more than one expert network covariate, fitted lines produced in continuous covariate vs. continuous response scatterplots via scatter.type="lm" or scatter.type="ci" will NOT correspond to the coefficients in the expert network (res$expert).
Caution is advised when producing "barcode" plots for the conditional panels. In some cases, resizing the graphics device after the production of the plot will result in distortion because of the way the rotation of non-horizontal barcodes is performed. Thus, when any(conditional == "barcode"), it is advisable to ensure the dimensions of the overall plot are square. Furthermore, such plots may not display correctly anyway in RStudio's "Plots" pane and so a different graphics device may need to be used (but not subsequently resized).
Caution is also advised when producing generalised pairs plots when the dimension of the data is large.
Examples
data(ais)
res <- MoE_clust(ais[,3:7], G=2, gating=~ BMI, expert=~ sex,
                 network.data=ais, modelNames="EVE")
MoE_gpairs(res)

# Produce the same plot, but with a violin plot in the lower triangle.
# Colour the outline of the mosaic tiles rather than the interior using mfill.
# Size points in the response vs. response panels by their clustering uncertainty.
MoE_gpairs(res, conditional=c("stripplot","violin"),
           mfill=FALSE, response.type="uncertainty")

# Instead show the bivariate density contours of the response variables (without labels).
# (Plotting may be slow when response.type="density" for models with expert covariates.)
# Use different colours for histograms of covariates in the gating/expert/both networks.
# Also use different colours for response vs. covariate & covariate vs. response panels.
MoE_gpairs(res, response.type="density", show.labels=FALSE, dens.points=TRUE,
           hist.color=c("black","cyan","hotpink","chartreuse"),
           bg.col=c("whitesmoke","white","mintcream","mintcream","floralwhite"))

# Examine effect of expert.covar & diag.grid in conjunction with show.dens & show.hist
MoE_gpairs(res, show.dens=TRUE, expert.covar=FALSE, show.hist=FALSE, diag.grid=20)
MoE_gpairs(res, show.dens=TRUE, expert.covar=TRUE, show.hist=TRUE, diag.grid=200)

# Explore various options to subset and rearrange the panels
MoE_gpairs(res, data.ind=5:1, cov.ind=0, show.map=FALSE,
           show.hist=FALSE, submat="upper", diagonal=FALSE)

# Produce a generalised pairs plot for a model with a noise component.
# Reorder the covariates and omit the variables "Hc" and "Hg".
# Use barcode plots for the categorical/continuous pairs.
# Magnify the size of scatter points assigned to the noise component.
resN <- MoE_clust(ais[,3:7], G=2, gating=~ SSF + Ht, expert=~ sex,
                  network.data=ais, modelNames="EEE", tau0=0.1, noise.gate=FALSE)

# Note that non-horizontal barcode panels may not display correctly in RStudio's "Plots" pane;
# it may be necessary to first open a new device:
# dev.new()
MoE_gpairs(resN, data.ind=c(1,2,5), cov.ind=c(3,1,2), use.points=TRUE,
           conditional="barcode", noise.size=grid::unit(0.5,"char"))

# Plots can be modified to show only a single (diagonal) panel of interest
MoE_gpairs(resN, data.ind=0, cov.ind=0)
MoE_gpairs(resN, data.ind=0, cov.ind="sex", show.map=FALSE)
MoE_gpairs(resN, data.ind="RCC", cov.ind=0, show.map=FALSE, show.dens=TRUE)
References
Murphy, K. and Murphy, T. B. (2020). Gaussian parsimonious clustering models with covariates and a noise component. Advances in Data Analysis and Classification, 14(2): 293-325. doi:10.1007/s11634-019-00373-8.
Emerson, J. W., Green, W. A., Schloerke, B., Crowley, J., Cook, D., Hofmann, H. and Wickham, H. (2013). The generalized pairs plot. Journal of Computational and Graphical Statistics, 22(1): 79-91.