stat_filter function

Univariate filter for binary classification with mixed predictor datatypes

Univariate statistic filter for dataframes of predictors with mixed numeric and categorical datatypes. Different statistical tests are used depending on the data type of response vector and predictors:

  • Binary class response: bin_stat_filter(): t-test for continuous data, chi-squared test for categorical data
  • Multiclass response: class_stat_filter(): one-way ANOVA for continuous data, chi-squared test for categorical data
  • Continuous response: cor_stat_filter(): correlation (or linear regression) for continuous data and binary data, one-way ANOVA for categorical data
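As a rough illustration of the binary-response case listed above, the per-predictor choice of test could be sketched in base R as follows. The helper name univariate_pvals is hypothetical and this is not the package's implementation; it only shows a t-test being applied to numeric columns and a chi-squared test to categorical ones, under the assumption that y has exactly two classes.

```r
# Hypothetical helper (not part of the package): one p-value per predictor
# for a two-level response y.
univariate_pvals <- function(y, x) {
  y <- factor(y)
  stopifnot(nlevels(y) == 2)
  vapply(x, function(col) {
    if (is.numeric(col)) {
      # continuous predictor: two-sample t-test between the two classes
      t.test(col ~ y)$p.value
    } else {
      # categorical predictor: chi-squared test on the contingency table
      suppressWarnings(chisq.test(table(col, y))$p.value)
    }
  }, numeric(1))
}

# Two-class subset of iris with one added categorical predictor
dat <- droplevels(subset(iris, Species != "setosa"))
x <- dat[, 1:4]
x$wide_petal <- factor(dat$Petal.Width > 1.6)

pv <- univariate_pvals(dat$Species, x)
sort(pv)  # predictors ranked by p-value, smallest first
```

Sorting the p-values gives the kind of ranking that the filter functions use when selecting predictors.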

Usage

stat_filter(y, x, ...)

bin_stat_filter(
  y,
  x,
  force_vars = NULL,
  nfilter = NULL,
  p_cutoff = 0.05,
  rsq_cutoff = NULL,
  type = c("index", "names", "full", "list"),
  ...
)

class_stat_filter(
  y,
  x,
  force_vars = NULL,
  nfilter = NULL,
  p_cutoff = 0.05,
  rsq_cutoff = NULL,
  type = c("index", "names", "full", "list"),
  ...
)

cor_stat_filter(
  y,
  x,
  cor_method = c("pearson", "spearman", "lm"),
  force_vars = NULL,
  nfilter = NULL,
  p_cutoff = 0.05,
  rsq_cutoff = NULL,
  rsq_method = "pearson",
  type = c("index", "names", "full", "list"),
  ...
)

Arguments

  • y: Response vector
  • x: Matrix or dataframe of predictors
  • ...: optional arguments, e.g. rsq_method: see collinear().
  • force_vars: Vector of column names within x which are always retained in the model (i.e. not filtered). Default NULL means no predictors are forced, so every predictor is subject to filtering.
  • nfilter: Number of predictors to return. If NULL, all predictors with p-values < p_cutoff are returned.
  • p_cutoff: p-value cutoff. Predictors with p-values below this threshold are retained.
  • rsq_cutoff: r^2 cutoff for removing predictors due to collinearity. Default NULL means no collinearity filtering. Predictors are ranked based on t-test. If 2 or more predictors are collinear, the first ranked predictor by t-test is retained, while the other collinear predictors are removed. See collinear().
  • type: Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns a dataframe of statistics, "list" returns a list of 2 matrices of statistics, one for continuous predictors, one for categorical predictors.
  • cor_method: For cor_stat_filter() only, either "pearson", "spearman" or "lm" controlling whether continuous predictors are filtered by correlation (faster) or regression (slower but allows inclusion of covariates via force_vars).
  • rsq_method: character string indicating which correlation coefficient is to be computed. One of "pearson" (default), "kendall", or "spearman". See collinear().
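The collinearity rule described for rsq_cutoff (keep the first-ranked predictor, drop later ones that are collinear with it) can be sketched in base R. This is a minimal sketch under stated assumptions, not the code behind collinear(): the function prune_collinear is hypothetical, columns are assumed numeric and already ordered from best to worst ranked, and a pair counts as collinear when its squared correlation exceeds the cutoff.

```r
# Hypothetical sketch of greedy collinearity pruning. Columns of x are assumed
# to be numeric and already ordered from best to worst ranked predictor.
prune_collinear <- function(x, rsq_cutoff = 0.9, rsq_method = "pearson") {
  rsq <- cor(x, method = rsq_method)^2
  keep <- integer(0)
  for (j in seq_len(ncol(x))) {
    # retain column j only if it is not collinear with any column already kept
    if (all(rsq[j, keep] <= rsq_cutoff)) keep <- c(keep, j)
  }
  colnames(x)[keep]
}

set.seed(1)
a <- rnorm(50)
x <- cbind(a = a, b = a + rnorm(50, sd = 0.01), c = rnorm(50))

# b is almost a copy of a, so it is dropped in favour of the higher-ranked a
kept <- prune_collinear(x, rsq_cutoff = 0.9)
kept
```

Because the loop walks the columns in rank order, the result depends on how the predictors were ranked beforehand.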

Returns

Integer vector of indices (type = "index") or character vector of names (type = "names") of the predictors that pass the filter, in order of test p-value. If type is "full", a dataframe of statistical results is returned. If type is "list", the output is a list of 2 matrices of statistical results, one for continuous predictors and one for categorical predictors.

Details

stat_filter() is a wrapper which calls bin_stat_filter(), class_stat_filter() or cor_stat_filter() depending on whether y is binary, multiclass or continuous respectively. Ordered factors are converted to numeric (integer) levels and analysed as if continuous.
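The dispatch described above might look roughly like the following base R sketch. The function response_kind is hypothetical, and the rule it uses (factor or character with two levels is binary, more levels is multiclass, anything else is continuous) is an assumption; the wrapper's actual test may differ. The last lines show the integer conversion of an ordered factor mentioned above.

```r
# Hypothetical classifier of the response type; the package's own rule for
# choosing between the three filters may differ.
response_kind <- function(y) {
  if (is.factor(y) || is.character(y)) {
    if (nlevels(factor(y)) == 2) "binary" else "multiclass"
  } else {
    "continuous"
  }
}

two_class <- droplevels(iris$Species[iris$Species != "setosa"])

response_kind(two_class)          # would route to bin_stat_filter()
response_kind(iris$Species)       # would route to class_stat_filter()
response_kind(iris$Sepal.Length)  # would route to cor_stat_filter()

# Ordered factors become their integer level codes and are then treated
# as continuous
grade <- factor(c("low", "high", "mid"),
                levels = c("low", "mid", "high"), ordered = TRUE)
grade_codes <- as.integer(grade)
```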

Examples

library(mlbench)
data(BostonHousing2)
dat <- BostonHousing2
y <- dat$cmedv  ## continuous outcome
x <- subset(dat, select = -c(cmedv, medv, town))

stat_filter(y, x, type = "full")
stat_filter(y, x, nfilter = 5, type = "names")
stat_filter(y, x)

data(iris)
y <- iris$Species  ## 3 class outcome
x <- subset(iris, select = -Species)
stat_filter(y, x, type = "full")

  • Maintainer: Myles Lewis
  • License: MIT + file LICENSE
  • Last published: 2025-03-10