Function to generate data with n observations of a primary outcome Y, secondary outcome K, exposure X, and measured as well as unmeasured confounders L and U, where the primary outcome is a quantitative normally-distributed variable (setting = "GLM") or a censored time-to-event outcome under an accelerated failure time (AFT) model (setting = "AFT"). Under the AFT setting, the observed time-to-event variable T=exp(Y) as well as the censoring indicator C are also computed. X is generated as a genetic exposure variable in the form of a single nucleotide variant (SNV) in 0-1-2 additive coding with minor allele frequency maf. X can be generated independently of U (X_orth_U = TRUE) or dependent on U (X_orth_U = FALSE). For more details regarding the underlying model, see the vignette.
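As an illustration of the 0-1-2 additive coding mentioned above, the following R sketch draws each genotype as the number of minor alleles out of two, each carried with probability maf. It assumes Hardy-Weinberg equilibrium and an X that is independent of U. It is not necessarily how generate_data builds X internally.

```r
set.seed(1)          # for reproducibility of the sketch
n   <- 1000          # sample size
maf <- 0.2           # minor allele frequency

# Number of minor alleles per individual: 0, 1 or 2 (additive coding).
# Assumption: Hardy-Weinberg equilibrium, so X ~ Binomial(2, maf).
X <- rbinom(n, size = 2, prob = maf)

table(X)             # genotype counts
mean(X) / 2          # empirical allele frequency, should be close to maf
```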
generate_data(setting = "GLM", n = 1000, maf = 0.2, cens = 0.3, a = NULL, b = NULL,
  aXK = 0.2, aXY = 0.1, aXL = 0, aKY = 0.3, aLK = 0, aLY = 0, aUY = 0, aUL = 0,
  mu_X = NULL, sd_X = NULL, X_orth_U = TRUE, mu_U = 0, sd_U = 1, mu_K = 0, sd_K = 1,
  mu_L = 0, sd_L = 1, mu_Y = 0, sd_Y = 1)
Arguments
setting: String with value "GLM" or "AFT" indicating whether the primary outcome is generated as a normally-distributed quantitative outcome ("GLM") or censored time-to-event outcome ("AFT").
n: Numeric. Sample size.
maf: Numeric. Minor allele frequency of the genetic exposure variable.
cens: Numeric. Desired proportion of censored individuals (for example, 0.3 for 30%); has to be specified under the AFT setting. Note that the actual censoring rate is determined by the parameters a and b, and cens is mainly used to check whether a and b yield the desired censoring rate (otherwise, a warning is issued).
a: Numeric (the example below uses the non-integer value 0.2). Parameter used to generate the desired censoring rate; has to be specified under the AFT setting.
b: Numeric (the example below uses the non-integer value 4.75). Parameter used to generate the desired censoring rate; has to be specified under the AFT setting.
aXK: Numeric. Size of the effect of X on K.
aXY: Numeric. Size of the effect of X on Y.
aXL: Numeric. Size of the effect of X on L.
aKY: Numeric. Size of the effect of K on Y.
aLK: Numeric. Size of the effect of L on K.
aLY: Numeric. Size of the effect of L on Y.
aUY: Numeric. Size of the effect of U on Y.
aUL: Numeric. Size of the effect of U on L.
mu_X: Numeric. Expected value of X.
sd_X: Numeric. Standard deviation of X.
X_orth_U: Logical. Indicator whether X should be generated independently of U (X_orth_U = TRUE) or dependent on U (X_orth_U = FALSE).
mu_U: Numeric. Expected value of U.
sd_U: Numeric. Standard deviation of U.
mu_K: Numeric. Expected value of K.
sd_K: Numeric. Standard deviation of K.
mu_L: Numeric. Expected value of L.
sd_L: Numeric. Standard deviation of L.
mu_Y: Numeric. Expected value of Y.
sd_Y: Numeric. Standard deviation of Y.
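One way to read the effect-size arguments is as path coefficients in a set of linear equations. The sketch below is an assumption made for illustration only (the vignette describes the actual model): each variable is the sum of its parents times the corresponding coefficient plus independent standard normal noise. The mu_ and sd_ arguments are ignored for brevity, and the coefficients are set to the defaults shown in the usage.

```r
set.seed(1)
n <- 1000

# Effect sizes, named as in the arguments above (default values from the usage)
aXK <- 0.2; aXY <- 0.1; aXL <- 0; aKY <- 0.3
aLK <- 0;   aLY <- 0;   aUY <- 0; aUL <- 0

# Assumed data-generating equations (illustrative, not the package source)
X <- rbinom(n, size = 2, prob = 0.2)   # SNV in 0-1-2 coding, maf = 0.2
U <- rnorm(n)                          # unmeasured confounder
L <- aXL * X + aUL * U + rnorm(n)      # measured confounder
K <- aXK * X + aLK * L + rnorm(n)      # secondary outcome
Y <- aXY * X + aKY * K + aLY * L + aUY * U + rnorm(n)  # primary outcome

dat <- data.frame(Y, K, X, L, U)

# Regressing K on X should give an estimate near aXK
est_aXK <- unname(coef(lm(K ~ X, data = dat))["X"])
est_aXK
```

Under these assumptions, X affects Y both directly (aXY) and indirectly through K (aXK times aKY).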
Returns
A data frame containing n observations of the variables Y, K, X, L, U. Under the AFT setting, T=exp(Y) and the censoring indicator C (0 = censored, 1 = uncensored) are also included.
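The relationship between Y, T and C under the AFT setting can be sketched as follows. The uniform censoring-time distribution on [a, b] is an assumption of this sketch, suggested by the description of a and b as controlling the censoring rate. The names T_event, C_time and T_obs are hypothetical, and a standard normal Y stands in for the simulated outcome.

```r
set.seed(1)
n <- 1000
a <- 0.2; b <- 4.75                      # values taken from the AFT example

Y       <- rnorm(n)                      # stand-in for the log event time
T_event <- exp(Y)                        # event time, T = exp(Y)
C_time  <- runif(n, min = a, max = b)    # assumed uniform censoring times

T_obs <- pmin(T_event, C_time)           # follow-up time after censoring
C     <- as.numeric(T_event <= C_time)   # 1 = uncensored, 0 = censored

cens_rate <- mean(C == 0)                # empirical censoring rate
cens_rate
```

Comparing cens_rate with the desired value is the kind of check that the cens argument is described as performing.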
Examples
# Generate data under the GLM setting with default values
dat_GLM <- generate_data()
head(dat_GLM)

# Generate data under the AFT setting with default values
dat_AFT <- generate_data(setting = "AFT", a = 0.2, b = 4.75)
head(dat_AFT)