Post-processing uses OOB votes and predicted values to build more accurate estimates of the response values. Note that post-processing cannot guarantee that the new estimates will have a lower error; it works in many cases, but not all.
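In its simplest form, one fits a model that retains OOB data and passes it to postProcessingVotes(). A minimal sketch (commented, since it assumes the train and test objects built in the Examples below):
# rUF.model <- randomUniformForest(Xtrain, Ytrain, xtest = Xtest, ytest = Ytest, ntree = 100)
# newEstimate <- postProcessingVotes(rUF.model)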
object: a randomUniformForest object with OOB data.
nbModels: how many models to build for new estimates. Usually one is enough.
idx: how many values to choose in the OOB model for each new predicted value. Usually one is enough.
granularity: the degree of precision needed for each old estimated value. Usually one is enough.
predObject: if the current model is built with the full sample, using an old model 'predObject' (a randomUniformForest object) that has OOB data can help reduce the error. Must be used with 'swapPredictions = TRUE'.
swapPredictions: set it to TRUE if two models, the current one without OOB data and an old one with OOB data, have to be used to try to reduce the prediction error (see the commented part of the Examples).
X: not currently used.
Xtest: test data, in the case of regression, for a friendlier output of the model.
imbalanced: if TRUE, may improve metrics in the case of imbalanced datasets.
OOB: if FALSE, does not use OOB information.
method: for classification, if the expected bias is high enough, one may use 'bias' as a method to improve the AUC. Otherwise, use the default one, 'cutoff'. Both tend to give the same results, with only a few tests so far, but the 'bias' method seems more robust (see the classification sketch after this argument list). 'residuals' is used in regression only; it is a powerful but computationally intensive method that replaces the default internal one.
keep2ndModel: if TRUE, and for regression, keep the model based on the residuals for further modelling and predictions (a sketch appears at the end of the Examples).
largeData: if TRUE, and for regression, use rUniformForest.big to compute the model for the residuals.
...: arguments to use for the computation of the model from the residuals.
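For classification, the following sketch shows how 'imbalanced' and 'method' combine. It is an assumed workflow mirroring the regression Examples below, not a verified run from the package manual:
## sketch only: an imbalanced two-class problem
# n = 500; p = 10
# X <- fillVariablesNames(simulationData(n, p))
# Y <- as.factor(ifelse(X[,1] + X[,2] + rnorm(n) > 4, 1, 0))  # class '1' is rare
# half <- cut(sample(1:n, n), 2, labels = FALSE)
# rUF.class <- randomUniformForest(X[half == 1, ], Y[half == 1],
#     xtest = X[half == 2, ], ytest = Y[half == 2], ntree = 100, threads = 2)
## post-process the votes; 'bias' may improve the AUC when the expected bias is high
# newVotes <- postProcessingVotes(rUF.class, imbalanced = TRUE, method = "bias")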
References
Xu, Ruo (2013). Improvements to random forest methodology. Graduate Theses and Dissertations, Paper 13052.
# Note that post-processing works better with enough trees (at least 100) and enough data
n = 200; p = 20
# Simulate 'p' Gaussian vectors with random parameters between -10 and 10
X <- simulationData(n, p)
# give names to features
X <- fillVariablesNames(X)
# Make a rule to create the response vector
epsilon1 = runif(n, -1, 1)
epsilon2 = runif(n, -1, 1)
# a rule with much noise (only four variables are significant)
Y = 2*(X[,1]*X[,2] + X[,3]*X[,4]) + epsilon1*X[,5] + epsilon2*X[,6]
# randomize, then make train and test samples
twoSamples <- cut(sample(1:n, n), 2, labels = FALSE)
Xtrain = X[which(twoSamples == 1), ]
Ytrain = Y[which(twoSamples == 1)]
Xtest = X[which(twoSamples == 2), ]
Ytest = Y[which(twoSamples == 2)]
# compute an accurate model (in this case bagging and log data work best) and predict
rUF.model <- randomUniformForest(Xtrain, Ytrain, xtest = Xtest, ytest = Ytest,
    bagging = TRUE, logX = TRUE, ntree = 60, threads = 2)
# get mean squared error
rUF.model
# post-process
newEstimate <- postProcessingVotes(rUF.model)
# get mean squared error
sum((newEstimate - Ytest)^2)/length(Ytest)

## regression does not use all the data, but sub-samples.
## Comparing, when using the full sample (but then, we do not have OOB data)
# rUF.model.fullsample <- randomUniformForest(Xtrain, Ytrain, xtest = Xtest, ytest = Ytest,
#     subsamplerate = 1, bagging = TRUE, logX = TRUE, ntree = 60, threads = 2)
# rUF.model.fullsample

## Nevertheless, we can use the old model, which has OOB data, to fit a new estimate
# newEstimate.fullsample <- postProcessingVotes(rUF.model.fullsample,
#     predObject = rUF.model, swapPredictions = TRUE)

## get mean squared error
# sum((newEstimate.fullsample - Ytest)^2)/length(Ytest)
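For regression, the 'residuals' method and 'keep2ndModel' can be sketched as a continuation of the example above; the argument combination is an assumption drawn from the argument list, not a verified run:
## sketch only: replace the default internal method by the residuals-based one,
## fitting a second model to the residuals and keeping it for later predictions
# newEstimate.residuals <- postProcessingVotes(rUF.model, Xtest = Xtest,
#     method = "residuals", keep2ndModel = TRUE)
## get mean squared error
# sum((newEstimate.residuals - Ytest)^2)/length(Ytest)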