REplication of a Synthetic Minority Oversampling TEchnique for highly imbalanced datasets
REplication of a Synthetic Minority Oversampling TEchnique for highly imbalanced datasets
Produce new samples of the minority class by creating synthetic data of the original ones by randomization. After that, the new dataset contains all original data + synthetic data labelled with the minority class. Main argument is derived from Breiman's ideas on the construction of synthetic data for the unsupervised mode of random (Uniform) Forests. The new dataset preserves the distribution of the original dataset covariates(at least with default 'samplingMethod' option).
Y: labels (vector or factor) of the training sample.
majorityClass: label(s) of the majority class(es).
increaseMinorityClassTo: proportion of minority class observations that one wants to reach.
conditionalTo: the targeted variable (usually the response) that one needs the new (artificial) observations to be sampled. More precisely, synthetic dataset will be generated conditionally to each target value.
samplingFromGaussian: not used for now.
samplingMethod: which method to use for sampling from data? Note that one may use any of the two first methods with the 'with bootstrap' option, for example 'samplingMethod = c("uniform univariate sampling", "with bootstrap")' or 'samplingMethod = "uniform univariate sampling"'.
maxClasses: To be documented.
seed: seed to use when sampling distribution.
Details
The main purpose of the reSMOTE( ) function is to increase AUC. The function is specifically designed for random Uniform Forests with possible conjunction of options 'classwt', 'rebalancedsampling' and/or 'oversampling', also designed to treat imbalanced classes. Since ensemble models tend to choose majority class too quickly, reSMOTE introduces a perturbation that forces the algorithm to learn data more slowly. Note that the wanted proportion of minority class ('increaseMinorityClassTo' option) plays a key role and is probably model and data dependent. Note also that reSMOTE() function is its in early phase and is mainly designed for very imbalanced datasets and when others options find their limits.
Returns
A matrix or a data frame of the new training sample embedding original training sample and synthetic training sample. Labels are put in the last column, with the column name noted "Class" (if Y is a factor) or "Response" (if not).