suggest_k function

Suggest k

Tool to help decide how many clusters to use for the partitioning around medoids (PAM) algorithm.

suggest_k(
  data,
  range = 3:10,
  samples_col = "Sample",
  abundance_col = "Abundance",
  index = "Average Silhouette Score",
  detailed = FALSE,
  ...
)

Arguments

  • data: A data.frame with, at least, the classification, abundance and sample information for each phylogenetic unit.
  • range: The range of values of k to test. Default is 3:10.
  • samples_col: String with the name of the column with sample names. Default is "Sample".
  • abundance_col: String with the name of the column with abundance values. Default is "Abundance".
  • index: Index used to select the best k. One of "Average Silhouette Score" (default), "Davies-Bouldin" or "Calinski-Harabasz".
  • detailed: If FALSE (default), returns an integer with the best overall k. If TRUE, returns a list with full details.
  • ...: Extra arguments.

Returns

An integer indicating the best k according to the selected index. If detailed = TRUE, a list with details is returned instead.

Details

The best k is selected for each sample, based on the selected index. If different samples yield different values of k (which is likely), the mean value of k is calculated and returned as an integer. Alternatively, a more detailed result can be returned in the form of a list.
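The aggregation step described above can be sketched in base R. This is only an illustration: the per-sample values and sample names are hypothetical, and the use of round() is an assumption, since this page does not state how the mean is converted to an integer (truncating with as.integer() alone would give a different result for the same input).

```r
# Hypothetical best k for each of four samples (illustrative values only)
best_k_per_sample <- c(S1 = 3, S2 = 4, S3 = 3, S4 = 5)

# Mean of the per-sample values: 15 / 4 = 3.75
mean_k <- mean(best_k_per_sample)

# Assumption: convert the mean to an integer by rounding to the nearest value
overall_k <- as.integer(round(mean_k))
overall_k
# 4
```

With truncation instead of rounding, the same input would yield 3, so the conversion rule matters when the per-sample values disagree.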

Note: this function is used within define_rb(), with default parameters, for the optional automatic selection of k.

Detailed option

If detailed = TRUE, then the output is a list with information to help decide on k. More specifically, the list will include:

  • A data.frame summarizing what information each index provides and how to interpret the value.
  • A brief summary indicating the number of samples in the dataset and the range of k values used.
  • A data.frame with the best k for each sample, based on each index.

Automatic k selection

If detailed = FALSE, this function will provide a single integer with the best k. The default decision is based on the maximum average Silhouette score obtained for the values of k between 3 and 10. To better understand why the average Silhouette score and this range of k's were selected, see Pascoal et al., 2025 and vignette("explore-classifications").

Alternatively, this function can also provide the best k, as an integer, based on another index (Davies-Bouldin or Calinski-Harabasz) and can compare the entire range of possible k's.
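To illustrate the default criterion, the sketch below picks the k with the maximum average Silhouette score for a single sample, using cluster::pam() directly. It is not necessarily how suggest_k() computes the score internally. The abundance values are simulated stand-ins, and the names abundance, k_range, avg_sil and best_k are hypothetical.

```r
library(cluster)

set.seed(1)
# Simulated abundance values for one sample (hypothetical, not real data):
# many low-abundance units, some intermediate, a few highly abundant
abundance <- c(rlnorm(40, meanlog = 1), rlnorm(15, meanlog = 4), rlnorm(5, meanlog = 7))
x <- matrix(abundance, ncol = 1)

# Candidate values of k, matching the default range
k_range <- 3:10

# Average Silhouette width of the PAM clustering for each candidate k
avg_sil <- vapply(
  k_range,
  function(k) pam(x, k = k)$silinfo$avg.width,
  numeric(1)
)

# The k with the highest average Silhouette width
best_k <- k_range[which.max(avg_sil)]
best_k
```

The same loop could be repeated per sample to obtain one best k for each, which is the input to the averaging step described under Details.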

Examples

# Get the best k with default parameters
suggest_k(nice_tidy)

# Get detailed results to decide for yourself
suggest_k(nice_tidy, detailed = TRUE, range = 2:7)

# Get best k, based on Davies-Bouldin index
suggest_k(nice_tidy, detailed = FALSE, index = "Davies-Bouldin")

See Also

evaluate_k(), evaluate_sample_k(), check_DB(), check_CH(), check_avgSil(), cluster::pam()

  • Maintainer: Francisco Pascoal
  • License: GPL (>= 3)
  • Last published: 2025-04-07