ntree: The number of trees to grow in the forest. The default value is 500.
replace: An indicator of whether the training data is sampled with replacement. The default value is TRUE.
sampsize: The total number of samples to draw for the training data. If sampling with replacement, the default value is the length of the training data. If sampling without replacement, the default value is two-thirds of the length of the training data.
sample.fraction: If given, then sampsize is ignored and the sample size is set to round(length(y) * sample.fraction). It must be a real number between 0 and 1.
mtry: The number of variables randomly selected at each split point. The default value is one-third of the total number of features of the training data.
nodesizeSpl: Minimum number of observations contained in terminal nodes. The default value is 3.
nodesizeAvg: Minimum size of terminal nodes for the averaging dataset. The default value is 3.
nodesizeStrictSpl: Minimum number of observations in terminal nodes, enforced strictly. The default value is 1.
nodesizeStrictAvg: Minimum size of terminal nodes for the averaging dataset, enforced strictly. The default value is 1.
minSplitGain: Minimum loss reduction to split a node further in a tree.
maxDepth: Maximum depth of a tree. The default value is 99.
interactionDepth: All splits at or above the interaction depth must be on variables that are not weighting variables (as provided by the interactionVariables argument).
interactionVariables: Indices of weighting variables.
featureWeights: (optional) Vector of sampling probabilities/weights for each feature, used when subsampling mtry features at each node at or above interactionDepth. The default is to use uniform probabilities.
deepFeatureWeights: Used in place of featureWeights for splits below interactionDepth.
observationWeights: Weights for each training observation that determine how likely the observation is to be selected in each bootstrap sample. This option is not allowed when sampling is done without replacement.
splitratio: Proportion of the training data used as the splitting dataset. It is a ratio between 0 and 1. If the ratio is 1, the splitting dataset is the entire sampled set and the averaging dataset is empty. If the ratio is 0, the splitting dataset is empty and all the data is used for the averaging dataset (this is not a useful setting, however, since no data would be available for splitting).
seed: Random seed.
verbose: Whether to run the training process in verbose mode.
nthread: Number of threads used to train and predict the forest. The default is 0, which uses all available cores.
splitrule: Specifies the loss function according to which the splits of the random forest should be made. Only variance is implemented at this point.
middleSplit: Whether the split value is the average of the two adjacent feature values. If FALSE, the split point is drawn from a uniform distribution between the two feature values. (Default = FALSE)
maxObs: The maximum number of observations to split on.
linear: Whether to fit the model with a ridge regression.
linFeats: Which features to split linearly on when using linear (defaults to all numerical features).
monotonicConstraints: Specifies monotonic relationships between the continuous features and the outcome. Supplied as a vector of length p with entries in {1, 0, -1}, where 1 indicates an increasing monotonic relationship, -1 indicates a decreasing monotonic relationship, and 0 indicates no relationship. Constraints supplied for categorical features will be ignored.
overfitPenalty: Value determining how much to penalize the magnitude of coefficients in the ridge regression.
doubleTree: Whether to double the number of trees by exchanging the averaging and splitting datasets to create decorrelated trees. (Default = FALSE)
reuseforestry: Pass in a forestry object, which will recycle the dataframe created by the old object. This saves some space when working on the same dataset.
savable: If TRUE, then the RF is created in such a way that it can be saved and loaded using save(...) and load(...). Setting it to TRUE (the default) will, however, take longer and use more memory. When training many RFs, it makes sense to set this to FALSE to save time and memory.
saveable: Deprecated. Do not use.
Returns
A forestry object.
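Example
A minimal usage sketch (illustrative only; it assumes the CRAN package name Rforestry and the convention that x is the feature data frame and y is the outcome vector, alongside the arguments documented above):

library(Rforestry)

# Illustrative data: predict the first column of iris from the rest.
x <- iris[, -1]
y <- iris[, 1]

# Train a forest using several of the arguments documented above.
rf <- forestry(x = x, y = y,
               ntree = 500,       # default number of trees
               replace = TRUE,    # bootstrap sampling (default)
               mtry = 2,          # variables tried at each split
               splitratio = 0.7,  # 70% splitting / 30% averaging data
               seed = 42)

# Predict on the training features.
preds <- predict(rf, x)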
Note
Treatment of missing data
When training the forest, if a splitting feature is missing for an observation, we assign that observation to the child node whose average y is closer to the observed y of that observation, and we record how many observations with missingness went to each child.
At predict time, if there were missing observations in a node at training time, we randomly assign an observation with a missing feature to a child node with probability proportional to the number of observations with a missing splitting variable that went to each child at training time. If there was no missingness at training time, we assign to the child nodes with probability proportional to the number of observations in each child node.
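As an illustration, the predict-time routing rule can be sketched as follows (a standalone sketch with hypothetical names, not the package's internal code):

# Standalone sketch of the predict-time routing rule described above;
# the function and argument names are hypothetical, not package internals.
route_missing <- function(n_missing_left, n_missing_right, n_left, n_right) {
  if (n_missing_left + n_missing_right > 0) {
    # Missingness was seen at this node during training: route in
    # proportion to where the missing observations went.
    p_left <- n_missing_left / (n_missing_left + n_missing_right)
  } else {
    # No missingness at training time: route in proportion to the
    # number of training observations in each child node.
    p_left <- n_left / (n_left + n_right)
  }
  sample(c("left", "right"), size = 1, prob = c(p_left, 1 - p_left))
}

# Example: 3 missing observations went left and 1 went right at training,
# so a missing observation is routed left with probability 3/4.
set.seed(1)
route_missing(n_missing_left = 3, n_missing_right = 1, n_left = 40, n_right = 60)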
This procedure is a generalization of the usual recommended approach to missingness for forests---i.e., at each split add a decision to send the NAs to the left, to the right, or to split on NA versus no NA. That recommendation is heuristically equivalent to adding a missingness indicator for each feature plus a recoding of each missing variable in which the missing values are set first to the maximum and then to the minimum observed value. Because of the indicator, however, that approach allows the method to pick up effects tied to when variables happen to be missing. We therefore do not allow splitting on NAs. This should increase MSE in training but hopefully allows for better learning of universal relationships. Importantly, it is straightforward to show that our approach weakly dominates the always-left or always-right approach in expected MSE. We should also note that almost no software package actually implements even the usual recommended approach---e.g., ranger does not.
In version 0.8.2.09, the procedure for identifying the best variable to split on when there is missing training data was modified. Previously, candidate variables were evaluated by computing the MSE over all observations, including those for which the splitting variable was missing. In the current implementation, we use only observations for which the splitting variable is not missing. The previous approach was biased towards splitting on variables with missingness, because observations with a missing splitting variable are assigned to the leaf that minimizes the MSE.
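A toy sketch of the current evaluation rule (the helper below is hypothetical, not the package's internals): the MSE of a candidate split is computed only over observations for which the splitting variable is observed.

# Toy sketch of the split-evaluation change described above: the MSE of a
# candidate split is computed only over observations for which the
# splitting variable is not missing. Hypothetical helper, not package code.
split_mse <- function(x, y, cutpoint) {
  observed <- !is.na(x)                 # drop rows missing the split variable
  x <- x[observed]
  y <- y[observed]
  left  <- y[x <  cutpoint]
  right <- y[x >= cutpoint]
  # Sum of squared deviations from each child's mean, averaged over the
  # non-missing observations only.
  (sum((left - mean(left))^2) + sum((right - mean(right))^2)) / length(y)
}

# Example with missingness in the splitting variable:
x <- c(1, 2, NA, 4, 5, NA, 7)
y <- c(1.1, 1.9, 3.2, 4.1, 5.2, 5.9, 7.1)
split_mse(x, y, cutpoint = 4)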