This function implements Q-learning for estimating general K-stage DTRs. A lasso penalty can be applied for variable selection at each stage.
ql(H, AA, RR, K, pi='estimated', lasso=TRUE, m=4)
Arguments
H: subject history information before treatment for all subjects at the K stages. It can be a vector or a matrix when only baseline information is used in estimating the DTR; otherwise, it should be a list of length K, one element per stage. Please standardize all the variables in H to have mean 0 and standard deviation 1 before using H as the input. See Details, and the sketch following this argument list, for how to construct H.
AA: observed treatment assignments for all subjects at the K stages. It is a vector if K=1, or a list of K vectors corresponding to the K stages.
RR: observed reward outcomes for all subjects at the K stages. It is a vector if K=1, or a list of K vectors corresponding to the K stages.
K: number of stages
pi: treatment assignment probabilities of the observed treatments for all subjects at the K stages. It is a vector if K=1, or a list of K vectors corresponding to the K stages. It can be a user-specified input if the treatment assignment probabilities are known. The default is pi="estimated", that is, the treatment assignment probabilities are estimated by lasso-penalized logistic regressions with Hk as the predictors at each stage k.
lasso: specifies whether to add lasso penalty at each stage when fitting the model. The default is lasso=TRUE.
m: number of folds in the m-fold cross validation. It is used when lasso=TRUE is specified. The default is m=4.
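As a minimal sketch of how these inputs could be assembled for a 2-stage analysis (the package name DTRlearn2, the -1/1 treatment coding, and the simulated data are assumptions for illustration only):
# Sketch: assembling inputs for a 2-stage call to ql().
# Assumes ql() comes from the DTRlearn2 package and treatments are coded -1/1.
library(DTRlearn2)
set.seed(1)
n <- 200
# Stage-specific covariates, standardized to mean 0 and sd 1
X1 <- scale(matrix(rnorm(n * 3), n, 3))
X2 <- scale(matrix(rnorm(n * 2), n, 2))
# Observed treatments and rewards at the two stages
A1 <- sample(c(-1, 1), n, replace = TRUE)
A2 <- sample(c(-1, 1), n, replace = TRUE)
R1 <- rnorm(n)
R2 <- rnorm(n)
# Histories: H1 is baseline only; H2 adds the stage-1 treatment and reward
H1 <- X1
H2 <- scale(cbind(X1, A1, R1, X2))
fit <- ql(H = list(H1, H2), AA = list(A1, A2), RR = list(R1, R2),
          K = 2, pi = "estimated", lasso = TRUE, m = 4)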
Details
A patient's history information prior to the treatment at stage k can be constructed recursively as Hk = (Hk-1, Ak-1, Rk-1, Xk) with H1 = X1, where Xk is the set of subject-specific variables collected at stage k just prior to the treatment, Ak is the treatment assigned at stage k, and Rk is the outcome observed after the treatment at stage k. Higher-order or interaction terms can also be easily incorporated in Hk, e.g., Hk = (Hk-1, Ak-1, Rk-1, Xk, Hk-1*Ak-1, Rk-1*Ak-1, Xk*Ak-1), as in the sketch below.
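For example, the richer stage-2 history with interaction terms could be built as follows (a sketch only; H1 and X2 are placeholder numeric matrices and A1, R1 placeholder numeric vectors from the example above):
# Sketch: stage-2 history with interaction terms,
# H2 = (H1, A1, R1, X2, H1*A1, R1*A1, X2*A1).
# Multiplying a matrix by the length-n vector A1 multiplies each column
# elementwise by A1.
H2 <- cbind(H1, A1, R1, X2,
            H1 * A1, R1 * A1, X2 * A1)
# Standardize every column before passing H2 to ql(), as required above
H2 <- scale(H2)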
Returns
A list of results is returned as an object. It contains the following attributes:
- stage1: a list of stage 1 results
- ...
- stageK: a list of stage K results
- valuefun: overall empirical value function under the estimated DTR
- benefit: overall empirical benefit function under the estimated DTR
- pi: treatment assignment probabilities of the assigned treatments for each subject at the K stages. If pi='estimated' is specified as input, the estimated treatment assignment probabilities from the lasso-penalized logistic regressions are returned.
In each stage's result, a list is returned which consists of:
- co: the estimated coefficients of (1, H, A, H*A), the variables in the model at this stage
- treatment: the estimated optimal treatment at this stage for each subject in the sample. If no tailoring variables are selected under the lasso penalty, treatment is assigned randomly with equal probability.
- Q: the estimated optimal outcome increment from this stage to the end (the estimated optimal Q-function at this stage) for each subject in the sample
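As a sketch of how the returned object might be inspected (the object name fit and $-style access are assumptions carried over from the example above; component names follow the descriptions in this section):
# Sketch: inspecting the object returned by ql()
fit$stage1$co         # estimated coefficients of (1, H, A, H*A) at stage 1
fit$stage1$treatment  # estimated optimal stage-1 treatment per subject
fit$stage1$Q          # estimated optimal Q-function at stage 1 per subject
fit$valuefun          # overall empirical value function under the estimated DTR
fit$benefit           # overall empirical benefit function
fit$pi                # (estimated) treatment assignment probabilities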
References
Watkins, C. J. C. H. (1989). Learning from delayed rewards (Doctoral dissertation, University of Cambridge).
Qian, M., & Murphy, S. A. (2011). Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2), 1180.