Random oversampling of the minority group(s) or undersampling of the majority group to compensate for class imbalance in datasets.
randomsample(y, x, minor = NULL, major = 1, yminor = NULL)
Arguments
y: Vector of response outcome as a factor
x: Matrix of predictors
minor: Amount of oversampling of the minority class. If set to NULL, all classes are oversampled up to the number of samples in the majority class. To turn off oversampling, set minor = 1.
major: Amount of undersampling of the majority class
yminor: Optional character value specifying the level in y which is to be oversampled. If NULL, this is set automatically to the class with the smallest sample size.
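As a brief illustration of how these arguments combine, the calls below use a small three-class toy dataset. The data, the level names "a", "b", "c" and the chosen values of minor and major are assumptions made for this sketch only, and it assumes the package providing randomsample() is already loaded. No resulting class counts are shown, since they depend on how the function interprets the amounts.

```r
# Toy data: three classes of unequal size (values chosen for illustration only)
set.seed(1)
y <- factor(rep(c("a", "b", "c"), times = c(25, 10, 5)))
x <- matrix(rnorm(40 * 5), nrow = 40, ncol = 5)

# Default: oversample every class up to the size of the majority class
out_default <- randomsample(y, x)
table(out_default$y)

# Oversample only the level named in yminor, leaving the majority class alone
out_b <- randomsample(y, x, minor = 2, yminor = "b")
table(out_b$y)

# Undersample the majority class only (oversampling switched off with minor = 1)
out_under <- randomsample(y, x, minor = 1, major = 0.5)
table(out_under$y)
```

In each case the returned x and y stay row-aligned, so they can be passed together to a downstream fitting function.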
Returns
List containing extended matrix x of synthesised data and extended response vector y
Details
minor < 1 and major > 1 are ignored.
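The general technique can be sketched in base R as follows. This is a minimal illustration of random over- and undersampling, not the package's own code: sketch_resample is a hypothetical helper, and it assumes that major is the fraction of the majority class to keep and that minor is a multiplier applied to the size of the smallest class.

```r
# Illustrative base-R sketch only; `sketch_resample` is a hypothetical helper,
# not the implementation behind randomsample().
sketch_resample <- function(y, x, minor = NULL, major = 1) {
  stopifnot(is.factor(y), nrow(x) == length(y))
  counts <- table(y)
  counts <- counts[counts > 0]                 # skip empty factor levels
  major_level <- names(counts)[which.max(counts)]
  minor_level <- names(counts)[which.min(counts)]
  # Assumption: `major` is the fraction of the majority class to keep;
  # values above 1 are ignored.
  n_major <- round(max(counts) * min(major, 1))

  keep <- unlist(lapply(names(counts), function(lev) {
    idx <- which(y == lev)
    if (lev == major_level) {
      # Undersample the majority class without replacement
      return(idx[sample.int(length(idx), n_major)])
    }
    if (is.null(minor)) {
      # Top up every other class to the (possibly reduced) majority size
      target <- max(n_major, length(idx))
    } else if (lev == minor_level && minor > 1) {
      # Assumption: `minor` multiplies the size of the smallest class
      target <- round(length(idx) * minor)
    } else {
      target <- length(idx)                    # minor <= 1: no oversampling
    }
    extra <- target - length(idx)
    # Oversample by drawing existing rows with replacement
    c(idx, idx[sample.int(length(idx), extra, replace = TRUE)])
  }))
  list(y = y[keep], x = x[keep, , drop = FALSE])
}

set.seed(42)
y <- factor(rep(c("0", "1"), times = c(8, 2)))
x <- matrix(seq_len(20), nrow = 10, ncol = 2)

balanced <- sketch_resample(y, x)                        # oversample class "1"
table(balanced$y)
under <- sketch_resample(y, x, minor = 1, major = 0.5)   # undersample class "0"
table(under$y)
```

Rows are indexed through sample.int() rather than sample() so that a class containing a single observation is not misread as a range to sample from.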
Examples
library(nestedcv)  # provides randomsample(), smote() and nestcv.glmnet()

## Imbalanced dataset
set.seed(1, "L'Ecuyer-CMRG")
x <- matrix(rnorm(150 * 2e+04), 150, 2e+04)  # predictors
y <- factor(rbinom(150, 1, 0.2))  # imbalanced binary response
table(y)

## first 30 parameters are weak predictors
x[, 1:30] <- rnorm(150 * 30, 0, 1) + as.numeric(y) * 0.5

## Balance x & y outside of CV loop by random oversampling minority group
out <- randomsample(y, x)
y2 <- out$y
x2 <- out$x
table(y2)

## Nested CV glmnet with unnested balancing by random oversampling on
## whole dataset
fit1 <- nestcv.glmnet(y2, x2, family = "binomial", alphaSet = 1,
                      cv.cores = 2, filterFUN = ttest_filter)
fit1$summary

## Balance x & y outside of CV loop by random undersampling majority group
out <- randomsample(y, x, minor = 1, major = 0.4)
y2 <- out$y
x2 <- out$x
table(y2)

## Nested CV glmnet with unnested balancing by random undersampling on
## whole dataset
fit1b <- nestcv.glmnet(y2, x2, family = "binomial", alphaSet = 1,
                       cv.cores = 2, filterFUN = ttest_filter)
fit1b$summary

## Balance x & y outside of CV loop by SMOTE
out <- smote(y, x)
y2 <- out$y
x2 <- out$x
table(y2)

## Nested CV glmnet with unnested balancing by SMOTE on whole dataset
fit2 <- nestcv.glmnet(y2, x2, family = "binomial", alphaSet = 1,
                      cv.cores = 2, filterFUN = ttest_filter)
fit2$summary

## Nested CV glmnet with nested balancing by random oversampling
fit3 <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1,
                      cv.cores = 2, balance = "randomsample",
                      filterFUN = ttest_filter)
fit3$summary
class_balance(fit3)

## Plot ROC curves
plot(fit1$roc, col = 'green')
lines(fit1b$roc, col = 'red')
lines(fit2$roc, col = 'blue')
lines(fit3$roc)
legend('bottomright',
       legend = c("Unnested random oversampling", "Unnested SMOTE",
                  "Unnested random undersampling", "Nested balancing"),
       col = c("green", "blue", "red", "black"), lty = 1, lwd = 2)