ann: Subjects' annotation data. An incidence matrix assigning subjects to classes of biological relevance. Used to tune cluster assignment via the Biological Homogeneity Index (BHI). If ann=NULL, the number of clusters is tuned with the Silhouette index instead of BHI. Defaults to NULL.
labels: Character vector with labels describing subjects. Meant to assign aesthetics to the visual display of clusters.
aest: Data frame containing point shapes and colors. Defaults to NULL.
eps_res: Number of eps values to explore within the range given by 'eps_range'.
eps_range: Vector containing the minimum and maximum eps values to be explored. Defaults to c(0, 4).
min.clus.size: Minimum size for a cluster to appear in the visual display. Defaults to 10.
group.names: The title for the legend's key if 'aest' is specified.
xlab: Label for the x-axis. Defaults to "x: tSNE(X)".
ylab: Label for the y-axis. Defaults to "y: tSNE(X)".
clus: Should clustering be performed? Defaults to TRUE. If FALSE, only point aesthetics are applied.
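As an illustration of the 'ann' and 'eps_range'/'eps_res' arguments described above, the following base R sketch builds an incidence matrix from a factor and a grid of candidate eps values. The equally spaced grid and the value eps_res = 100 are assumptions made for illustration; the function may construct its grid differently.

```r
# Base R only; 'iris' ships with R.
data(iris)

# Incidence matrix: one 0/1 column per class, one row per subject.
# The '-1' drops the intercept so that every class gets its own column.
ann <- model.matrix(~ -1 + iris[, 5])
colnames(ann) <- levels(iris[, 5])

# Assumed grid: 'eps_res' equally spaced values spanning 'eps_range'.
eps_range <- c(0, 4)
eps_res <- 100
eps_grid <- seq(eps_range[1], eps_range[2], length.out = eps_res)

dim(ann)          # subjects by classes
range(eps_grid)   # endpoints of the explored range
```

Each row of 'ann' has a single 1 here because every flower belongs to exactly one species; an incidence matrix may also hold several 1s per row when a subject belongs to more than one class.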
Returns
A list with the results of the DBSCAN clustering and (if argument 'plot'=TRUE) the corresponding graphical displays.
dbscan.res: a list with the results of the DBSCAN clustering, containing:
cluster: Cluster partition.
eps: Optimal eps according to the Silhouette or Biological Homogeneity index criterion.
SIL: Maximum peak in the trajectory of the Silhouette index.
BHI: Maximum peak in the trajectory of the Biological Homogeneity index.
clusters.plot: A ggplot object with the clusters' graphical display.
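To give a sense of what the BHI value summarizes, here is a minimal base R sketch of the index as defined by Datta and Datta (2006): for each cluster, the proportion of pairs of annotated subjects that share at least one class, averaged over clusters. The function name bhi_sketch is hypothetical, and treating label 0 as noise and scoring clusters with fewer than two annotated subjects as 0 are assumptions; this is not the package's own implementation.

```r
# Hypothetical helper: BHI for a cluster partition and an incidence matrix.
bhi_sketch <- function(cluster, ann) {
  ann <- as.matrix(ann) > 0
  annotated <- rowSums(ann) > 0
  # Assumption: label 0 marks noise points (DBSCAN convention) and is skipped.
  ids <- setdiff(unique(cluster), 0)
  scores <- vapply(ids, function(k) {
    idx <- which(cluster == k & annotated)
    n <- length(idx)
    if (n < 2) return(0)  # assumption: too few annotated subjects scores 0
    # shared[i, j] is TRUE when subjects i and j have a class in common.
    shared <- tcrossprod(ann[idx, , drop = FALSE] * 1) > 0
    # Remove the n diagonal entries, then divide by the number of ordered pairs.
    (sum(shared) - n) / (n * (n - 1))
  }, numeric(1))
  mean(scores)
}

# Toy example: cluster 1 is homogeneous, cluster 2 mixes classes A and B.
cluster <- c(1, 1, 1, 2, 2, 2)
ann <- cbind(A = c(1, 1, 1, 0, 0, 1), B = c(0, 0, 0, 1, 1, 0))
bhi <- bhi_sketch(cluster, ann)
bhi
```

Values close to 1 indicate clusters whose members mostly share a class; values close to 0 indicate clusters that mix classes.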
Details
The function takes the outcome of pca2tsne (or a list containing any two-column matrix) and finds clusters via DBSCAN. It extends code from the MEREDITH (Taskesen et al. 2016) and clValid (Datta & Datta, 2018) R packages to tune DBSCAN parameters with Silhouette or Biological Homogeneity indexes.
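The tuning idea described above can be sketched outside the package with the dbscan and cluster packages: run DBSCAN over a grid of eps values, score each partition with the mean Silhouette width, and keep the eps at the peak. The toy matrix Z, the grid, minPts = 5, and the exclusion of noise points from the score are all assumptions made for illustration; the function's internal procedure may differ.

```r
library(dbscan)   # dbscan()
library(cluster)  # silhouette()

set.seed(42)
# Toy two-column matrix standing in for a tSNE map (not pca2tsne output).
Z <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

# Assumed grid of candidate eps values.
eps_grid <- seq(0.1, 4, length.out = 40)

sil <- vapply(eps_grid, function(e) {
  cl <- dbscan(Z, eps = e, minPts = 5)$cluster
  keep <- cl > 0  # DBSCAN labels noise points as 0; they are dropped here
  # The Silhouette index is undefined with fewer than two clusters.
  if (length(unique(cl[keep])) < 2) return(NA_real_)
  s <- silhouette(cl[keep], dist(Z[keep, , drop = FALSE]))
  mean(s[, "sil_width"])
}, numeric(1))

# Peak of the Silhouette trajectory and the eps that attains it.
best <- which.max(sil)
eps_grid[best]
sil[best]
```

Large eps values merge everything into one cluster, which is why those grid points are recorded as NA rather than scored.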
Examples
library(MOSS)
library(viridis)
library(cluster)
library(annotate)

# Using the 'iris' data to show cluster definition via the BHI criterion.
set.seed(42)
data(iris)

# Scaling columns.
X <- scale(iris[, -5])

# Calling pca2tsne to map the four variables onto a 2-D map.
Z <- pca2tsne(X, perp = 30, n.samples = 1, n.iter = 1000)

# Using 'species' as prior knowledge to identify clusters.
ann <- model.matrix(~ -1 + iris[, 5])

# Getting clusters.
tsne2clus(Z, ann = ann, labels = iris[, 5], aest = aest.f(iris[, 5]), group.names = "Species", eps_range = c(0, 3))

# Example of usage within moss.
set.seed(43)
sim_blocks <- simulate_data()$sim_blocks
out <- moss(sim_blocks[-4], tSNE = TRUE, cluster = list(eps_range = c(0, 4), eps_res = 100, min_clus_size = 1), plot = TRUE)
out$clus_plot
out$clusters_vs_PCs
References
Ester, Martin, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. 1996. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," 226-231.
Hahsler, Michael, and Matthew Piekenbrock. 2017. "Dbscan: Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms." https://cran.r-project.org/package=dbscan.
Datta, Susmita, and Somnath Datta. 2006. "Methods for Evaluating Clustering Algorithms for Gene Expression Data Using a Reference Set of Functional Classes." BMC Bioinformatics 7 (1). BioMed Central: 397.
Taskesen, Erdogan, Sjoerd M. H. Huisman, Ahmed Mahfouz, Jesse H. Krijthe, Jeroen de Ridder, Anja van de Stolpe, Erik van den Akker, Wim Verhaegh, and Marcel J. T. Reinders. 2016. "Pan-Cancer Subtyping in a 2D-Map Shows Substructures That Are Driven by Specific Combinations of Molecular Characteristics." Scientific Reports 6 (1): 24949.