Predicting new observations with a previously fitted BART model
BART is a Bayesian sum-of-trees model.
For a numeric response y, we have y = f(x) + e, where e ~ N(0, sigma^2).
f is the sum of many tree models. The goal is to have very flexible inference for the unknown function f.
In the spirit of ensemble models, each tree is constrained by a prior to be a weak learner so that it contributes only a small amount to the overall fit.
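For reference, the sum-of-trees form can be written out explicitly; the notation below follows the usual BART literature rather than anything defined on this page:

f(x) = g(x; T_1, M_1) + g(x; T_2, M_2) + ... + g(x; T_m, M_m)

where T_j denotes the j-th binary regression tree, M_j its set of terminal-node values, and m the number of trees in the sum.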
Usage
## S3 method for class 'recurbart'
predict(object, newdata, mc.cores=1, openmp=(mc.cores.openmp()>0), ...)
Arguments
object: object returned from previous BART fit with recur.bart or mc.recur.bart.
newdata: Matrix of covariates to predict the distribution of t.
mc.cores: Number of threads to utilize.
openmp: Logical value dictating whether OpenMP is used for parallel processing. Of course, this depends on whether OpenMP is available on your system, which, by default, is verified with mc.cores.openmp.
...: Other arguments which will be passed on to pwbart.
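A minimal sketch of the calling pattern (illustrative only: here post stands for a fit previously returned by recur.bart or mc.recur.bart, and x.test for a covariate matrix laid out like the training matrix; neither is defined at this point on the page):

## 'post' and 'x.test' are placeholders for a previous fit and a test matrix
pred <- predict(post, newdata=x.test, mc.cores=2)    ## threaded if OpenMP is available
pred <- predict(post, newdata=x.test, openmp=FALSE)  ## force serial prediction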
Details
BART is a Bayesian MCMC method. At each MCMC iteration, we produce a draw from the joint posterior (f, sigma) | (x, y) in the numeric y case and just f in the binary y case.
Thus, unlike many other modelling methods in R, we do not produce a single model object from which fits and summaries may be extracted. The output consists of values f*(x) (and sigma* in the numeric case), where * denotes a particular draw. The x is either a row from the training data (x.train) or the test data (x.test).
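Because the output is a collection of posterior draws rather than a single fitted object, summaries are computed by the user, typically column-wise over the draw matrices. A minimal sketch, assuming pred was produced by predict() as in the example below (so that pred$haz.test has one row per MCMC draw and one column per row of newdata):

haz.mean <- apply(pred$haz.test, 2, mean)                           ## posterior mean per row of newdata
haz.ci <- apply(pred$haz.test, 2, quantile, probs=c(0.025, 0.975))  ## pointwise 95% credible intervals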
Returns
Returns an object of type recurbart with predictions corresponding to newdata.
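An illustrative way to inspect what the returned object contains (to be run after the predict() call in the example below; the full component list is not enumerated on this page):

names(pred)          ## components of the recurbart prediction object
dim(pred$haz.test)   ## MCMC draws by rows of newdata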
Examples

## load 20 percent random sample
data(xdm20.train)
data(xdm20.test)
data(ydm20.train)

## test BART with token run to ensure installation works
## with current technology even a token run will violate CRAN policy
## set.seed(99)
## post <- recur.bart(x.train=xdm20.train, y.train=ydm20.train,
##                    nskip=1, ndpost=1, keepevery=1)

## Not run:
set.seed(99)
post <- recur.bart(x.train=xdm20.train, y.train=ydm20.train)
## larger data sets can take some time so, if parallel processing
## is available, submit this statement instead
## post <- mc.recur.bart(x.train=xdm20.train, y.train=ydm20.train,
##                       mc.cores=8, seed=99)

require(rpart)
require(rpart.plot)

dss <- rpart(post$yhat.train.mean~xdm20.train)
rpart.plot(dss)
## for the 20 percent sample, notice that the top splits
## involve cci_pvd and n
## for the full data set, notice that all splits
## involve ca, cci_pud, cci_pvd, ins270 and n
## (except one at the bottom involving a small group)

## compare patients treated with insulin (ins270=1) vs
## not treated with insulin (ins270=0)
N.train <- 50
N.test <- 50
K <- post$K ## 798 unique time points

## only testing set, i.e., remove training set
xdm20.test. <- xdm20.test[N.train*K+(1:(N.test*K)), ]
xdm20.test. <- rbind(xdm20.test., xdm20.test.)
xdm20.test.[ , 'ins270'] <- rep(0:1, each=N.test*K)

## multiple threads will be utilized if available
pred <- predict(post, xdm20.test., mc.cores=8)

## create Friedman's partial dependence function for the
## intensity/hazard by time and ins270
NK.test <- N.test*K
M <- nrow(pred$haz.test) ## number of MCMC samples, typically 1000

RI <- matrix(0, M, K)
for(i in 1:N.test)
    RI <- RI+(pred$haz.test[ , (N.test+i-1)*K+1:K]/
              pred$haz.test[ , (i-1)*K+1:K])/N.test

RI.lo <- apply(RI, 2, quantile, probs=0.025)
RI.mu <- apply(RI, 2, mean)
RI.hi <- apply(RI, 2, quantile, probs=0.975)

plot(post$times, RI.hi, type='l', lty=2, log='y',
     ylim=c(min(RI.lo, 1/RI.hi), max(1/RI.lo, RI.hi)),
     xlab='t', ylab='RI(t, x)',
     sub='insulin(ins270=1) vs. no insulin(ins270=0)',
     main='Relative intensity of hospital admissions for diabetics')
lines(post$times, RI.mu)
lines(post$times, RI.lo, lty=2)
lines(post$times, rep(1, K), col='darkgray')

## RI for insulin therapy seems fairly constant with time
mean(RI.mu)

## End(Not run)
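For reference, the loop above implements Friedman's partial dependence function for the relative intensity at each of the K time points: in the notation of this page, RI(t) = (1/N.test) * sum over i of haz(t, x_i, ins270=1) / haz(t, x_i, ins270=0), i.e., the posterior draws of the hazard ratio averaged over the N.test test patients. RI.lo, RI.mu and RI.hi are then its pointwise 2.5%, mean and 97.5% posterior summaries, plotted against post$times.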