X: an n×k matrix where columns are k objects to be clustered, and each object contains n observations (objects could be a set of time series).
method: the clustering method to be used, currently either TRUST (Ciampi et al. 2010) or DBSCAN (Ester et al. 1996). If the method is DBSCAN, then set minPts, and the optimal ϵ is selected using DR. If the method is TRUST, then set theta, and the optimal δ is selected using DR.
minPts: the minimum number of samples in the ϵ-neighborhood of a point for that point to be considered a core point. minPts is used only with the DBSCAN method. The default value is 3.
theta: connectivity parameter θ∈(0,1), which is to be used only with the TRUST method. The default value is 0.9.
B: number of random splits in calculating the Average Cluster Deviation (ACD). The default value is 500.
lb, ub: endpoints of the search range for the optimal parameter.
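The core-point condition behind minPts can be illustrated with a minimal base R sketch. The points in pts and the values of eps and minPts are made up for illustration, and counting a point as a member of its own neighborhood is an assumption (it follows the usual DBSCAN convention, not a statement about the internals of DR).

```r
# Hypothetical toy data: four tightly grouped points and one distant point
pts <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(0.1, 0.1), c(5, 5))

eps <- 0.5     # illustrative neighborhood radius
minPts <- 3    # illustrative threshold

# Pairwise Euclidean distances as a full symmetric matrix
d <- as.matrix(dist(pts))

# Number of points within eps of each point
# (assumption: the point itself is counted, since the diagonal is 0)
n_neighbors <- rowSums(d <= eps)

# A point is a core point if its eps-neighborhood holds at least minPts points
is_core <- n_neighbors >= minPts
print(is_core)
```

Under these assumptions the four grouped points qualify as core points and the distant one does not. A larger minPts or a smaller eps makes the condition harder to satisfy.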
Returns
A list containing the following components:
P_opt: the value of the optimal parameter. If the method is DBSCAN, then P_opt is optimal ϵ. If the method is TRUST, then P_opt is optimal δ.
ACD_matrix: a matrix containing the ACD for different values of the tuning parameter. If the method is DBSCAN, then the tuning parameter is ϵ. If the method is TRUST, then the tuning parameter is δ.
Details
Parameters lb, ub are endpoints for the search for the optimal parameter. The parameter candidates are calculated in a way such that P := 1.1^x, x ∈ {lb, lb+0.5, lb+1.0, ..., ub}. Although the default range of search is sufficiently wide, in some cases lb, ub can be further extended if a warning message is given.
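The candidate grid described above can be reproduced with a short sketch. The endpoints used here are illustrative values, not the defaults of DR, and the variable names x and P are chosen only to mirror the notation in the text.

```r
# Illustrative endpoints only; these are not claimed to be the defaults of DR()
lb <- -2
ub <- 2

# Exponents x = lb, lb + 0.5, lb + 1.0, ..., ub
x <- seq(lb, ub, by = 0.5)

# Candidate values of the tuning parameter
# (epsilon for DBSCAN, delta for TRUST)
P <- 1.1^x

length(P)   # number of candidates in the grid
range(P)    # smallest and largest candidate
```

Because the candidates are spaced evenly in the exponent, widening lb or ub extends the grid geometrically: each step of 0.5 multiplies the candidate by the same factor, 1.1^0.5.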
For more discussion on properties of the considered clustering algorithms and the DR procedure see Huang et al. (2016) and Huang et al. (2018).
Examples
## Not run:
## example 1
## use iris data to test DR procedure

data(iris)
require(clue)           # calculate NMI to compare the clustering result with the ground truth
require(scatterplot3d)
require(dbscan)         # provides dbscan(), which is used below

Data <- scale(iris[, -5])
ground_truth_label <- iris[, 5]

# perform DR procedure to select optimal eps for DBSCAN
# and save it in variable eps_opt
eps_opt <- DR(t(Data), method = "DBSCAN", minPts = 5)$P_opt

# apply DBSCAN with the optimal eps on iris data
# and save the clustering result in variable res
res <- dbscan(Data, eps = eps_opt, minPts = 5)$cluster

# calculate NMI to compare the clustering result with the ground truth label
clue::cl_agreement(as.cl_partition(ground_truth_label),
                   as.cl_partition(as.numeric(res)), method = "NMI")

# visualize the clustering result and compare it with the ground truth result
# 3D visualization of clustering result using variables Sepal.Width, Sepal.Length,
# and Petal.Length
scatterplot3d(Data[, -4], color = res)

# 3D visualization of ground truth result using variables Sepal.Width, Sepal.Length,
# and Petal.Length
scatterplot3d(Data[, -4], color = as.numeric(ground_truth_label))

## example 2
## use synthetic time series data to test DR procedure

require(funtimes)
require(clue)
require(zoo)

# simulate 12 time series for 3 clusters, each cluster contains 4 time series
set.seed(114)
samp_Ind <- sample(12, replace = FALSE)
time_points <- 30
X <- matrix(0, nrow = time_points, ncol = 12)
cluster1 <- sapply(1:4, function(x) arima.sim(list(order = c(1, 0, 0), ar = c(0.2)),
                                              n = time_points, mean = 0, sd = 1))
cluster2 <- sapply(1:4, function(x) arima.sim(list(order = c(2, 0, 0), ar = c(0.1, -0.2)),
                                              n = time_points, mean = 2, sd = 1))
cluster3 <- sapply(1:4, function(x) arima.sim(list(order = c(1, 0, 1), ar = c(0.3), ma = c(0.1)),
                                              n = time_points, mean = 6, sd = 1))

# each simulated cluster is a time_points x 4 matrix (one series per column),
# so it is assigned without transposing to keep every series intact
X[, samp_Ind[1:4]] <- round(cluster1, 4)
X[, samp_Ind[5:8]] <- round(cluster2, 4)
X[, samp_Ind[9:12]] <- round(cluster3, 4)

# create ground truth label of the synthetic data
ground_truth_label <- matrix(1, nrow = 12, ncol = 1)
for (k in 1:3) {
    ground_truth_label[samp_Ind[(4 * k - 4 + 1):(4 * k)]] <- k
}

# perform DR procedure to select optimal delta for TRUST
# and save it in variable delta_opt
delta_opt <- DR(X, method = "TRUST")$P_opt

# apply TRUST with the optimal delta on the synthetic data
# and save the clustering result in variable res
res <- CSlideCluster(X, Delta = delta_opt, Theta = 0.9)

# calculate NMI to compare the clustering result with the ground truth label
clue::cl_agreement(as.cl_partition(as.numeric(ground_truth_label)),
                   as.cl_partition(as.numeric(res)), method = "NMI")

# visualize the clustering result and compare it with the ground truth result
# visualization of the clustering result obtained by TRUST
plot.zoo(X, type = "l", plot.type = "single", col = res, xlab = "Time index", ylab = "")

# visualization of the ground truth result
plot.zoo(X, type = "l", plot.type = "single", col = ground_truth_label,
         xlab = "Time index", ylab = "")

## End(Not run)