Fitting semi-parametric shared frailty models with the EM algorithm
Fitting semi-parametric shared frailty models with the EM algorithm
emfrail(formula, data, distribution = emfrail_dist(), control = emfrail_control(), model =FALSE, model.matrix =FALSE,...)
Arguments
formula: A formula that contains on the left hand side an object of the type Surv
and on the right hand side a +cluster(id) statement. Two special statments may also be used: +strata() for specifying a grouping column that will represent different strata and +terminal()
data: A data.frame in which the formula argument can be evaluated
distribution: An object as created by emfrail_dist
control: An object as created by emfrail_control
model: Logical. Should the model frame be returned?
model.matrix: Logical. Should the model matrix be returned?
...: Other arguments, currently used to warn about deprecated argument names
Returns
An object of class emfrail that contains the following fields: - coefficients: A named vector of the estimated regression coefficients
hazard: The breslow estimate of the baseline hazard at each event time point, in chronological order
var: The variance-covariance matrix corresponding to the coefficients and hazard, assuming θ constant
var_adj: The variance-covariance matrx corresponding to the coefficients and hazard, adjusted for the estimation of theta
logtheta: The logarithm of the point estimate of θ. For the gamma and PVF family of distributions, this is the inverse of the estimated frailty variance.
var_logtheta: The variance of the estimated logarithm of θ
ci_logtheta: The likelihood-based 95% confidence interval for the logarithm of θ
frail: The posterior (empirical Bayes) estimates of the frailty for each cluster
residuals: A list with two elements, cluster which is a vector that the sum of the cumulative hazards from each cluster for a frailty value of 1, and individual, which is a vector that contains the cumulative hazard corresponding to each row of the data, multiplied by the corresponding frailty estimate
tev: The time points of the events in the data set, this is the same length as hazard
nevents_id: The number of events for each cluster
loglik: A vector of length two with the log-likelihood of the starting Cox model and the maximized log-likelihood
ca_test: The results of the Commenges-Andersen test for heterogeneity
cens_test: The results of the test for dependence between a recurrent event and a terminal event, if the +terminal() statement is specified and the frailty distribution is gamma
zph: The result of cox.zph called on a model with the estimated log-frailties as offset
formula, distribution, control: The original arguments
nobs, fitted: Number of observations and fitted values (i.e. zexp(βTx))
mf: The model.frame, if model = TRUE
mm: The model.matrix, if model.matrix = TRUE
Details
The emfrail function fits shared frailty models for processes which have intensity
λ(t)=zλ0(t)exp(β′x)
with a non-parametric (Breslow) baseline intensity λ0(t). The outcome (left hand side of the formula) must be a Surv object.
If the object is Surv(tstop, status) then the usual failure time data is represented. Gap-times between recurrent events are represented in the same way. If the left hand side of the formula is created as Surv(tstart, tstop, status), this may represent a number of things: (a) recurrent events episodes in calendar time where a recurrent event episode starts at tstart and ends at tstop
(b) failure time data with time-dependent covariates where tstop is the time of a change in covariates or censoring (status = 0) or an event time (status = 1) or (c) clustered failure time with left truncation, where tstart is the individual's left truncation time. Unlike regular Cox models, a major distinction is that in case (c) the distribution of the frailty must be considered conditional on survival up to the left truncation time.
The +cluster() statement specified the column that determines the grouping (the observations that share the same frailty). The +strata() statement specifies a column that determines different strata, for which different baseline hazards are calculated. The +terminal specifies a column that contains an indicator for dependent censoring, and then performs a score test
The distribution argument must be generated by a call to emfrail_dist. This determines the frailty distribution, which may be one of gamma, positive stable or PVF (power-variance-function), and the starting value for the maximum likelihood estimation. The PVF family also includes a tuning parameter that differentiates between inverse Gaussian and compound Poisson distributions. Note that, with univariate data (at most one event per individual, no clusters), only distributions with finite expectation are identifiable. This means that the positive stable distribution should have a maximum likelihood on the edge of the parameter space (theta=+inf, corresponding to a Cox model for independent observations).
The control argument must be generated by a call to emfrail_control. Several parameters may be adjusted that control the precision of the convergenge criteria or supress the calculation of different quantities.
Note
Several options in the control arguemnt shorten the running time for emfrail significantly. These are disabling the adjustemnt of the standard errors (se_adj = FALSE), disabling the likelihood-based confidence intervals (lik_ci = FALSE) or disabling the score test for heterogeneity (ca_test = FALSE).
The algorithm is detailed in the package vignette. For the gamma frailty, the results should be identical with those from coxph with ties = "breslow".
Examples
m_gamma <- emfrail(formula = Surv(time, status)~ rx + sex + cluster(litter), data = rats)# Inverse Gaussian distributionm_ig <- emfrail(formula = Surv(time, status)~ rx + sex + cluster(litter), data = rats, distribution = emfrail_dist(dist ="pvf"))# for the PVF distribution with m = 0.75m_pvf <- emfrail(formula = Surv(time, status)~ rx + sex + cluster(litter), data = rats, distribution = emfrail_dist(dist ="pvf", pvfm =0.75))# for the positive stable distributionm_ps <- emfrail(formula = Surv(time, status)~ rx + sex + cluster(litter), data = rats, distribution = emfrail_dist(dist ="stable"))## Not run:# Compare marginal log-likelihoodsmodels <- list(m_gamma, m_ig, m_pvf, m_ps)
models
logliks <- lapply(models, logLik)names(logliks)<- lapply(models,function(x) with(x$distribution, ifelse(dist =="pvf", paste(dist,"/", pvfm), dist)))
logliks
## End(Not run)# Stratified analysis## Not run: m_strat <- emfrail(formula = Surv(time, status)~ rx + strata(sex)+ cluster(litter), data = rats)## End(Not run)# Test for conditional proportional hazards (log-frailty as offset)## Not run:m_gamma <- emfrail(formula = Surv(time, status)~ rx + sex + cluster(litter), data = rats, control = emfrail_control(zph =TRUE))par(mfrow = c(1,2))plot(m_gamma$zph)## End(Not run)# Draw the profile log-likelihood## Not run: fr_var <- seq(from =0.01, to =1.4, length.out =20)# For gamma the variance is 1/theta (see parametrizations) pll_gamma <- emfrail_pll(formula = Surv(time, status)~ rx + sex + cluster(litter), data = rats, values =1/fr_var ) plot(fr_var, pll_gamma, type ="l", xlab ="Frailty variance", ylab ="Profile log-likelihood")# Recurrent events mod_rec <- emfrail(Surv(start, stop, status)~ treatment + cluster(id), bladder1)# The warnings appear from the Surv object, they also appear in coxph. plot(mod_rec, type ="hist")## End(Not run)# Left truncation## Not run:# We simulate some data with truncation times set.seed(2018) nclus <-300 nind <-5 x <- sample(c(0,1), nind * nclus,TRUE) u <- rep(rgamma(nclus,1,1), each =3) stime <- rexp(nind * nclus, rate = u * exp(0.5* x)) status <- ifelse(stime >5,0,1) stime[status ==0]<-5# truncate uniform between 0 and 2 ltime <- runif(nind * nclus, min =0, max =2) d <- data.frame(id = rep(1:nclus, each = nind), x = x, stime = stime, u = u, ltime = ltime, status = status) d_left <- d[d$stime > d$ltime,] mod <- emfrail(Surv(stime, status)~ x + cluster(id), d)# This model ignores the left truncation, 0.378 frailty variance: mod_1 <- emfrail(Surv(stime, status)~ x + cluster(id), d_left)# This model takes left truncation into account,# but it considers the distribution of the frailty unconditional on the truncation mod_2 <- emfrail(Surv(ltime, stime, status)~ x + cluster(id), d_left)# This is identical with: mod_cox <- coxph(Surv(ltime, stime, status)~ x + frailty(id), data = d_left)# The correct thing is to consider the distribution of the frailty given the truncation mod_3 <- emfrail(Surv(ltime, stime, status)~ x + cluster(id), d_left, distribution = emfrail_dist(left_truncation =TRUE)) summary(mod_1) summary(mod_2) summary(mod_3)## End(Not run)
References
Balan TA, Putter H (2019) "frailtyEM: An R Package for Estimating Semiparametric Shared Frailty Models", Journal of Statistical Software 90 (7) 1-29. doi:10.18637/jss.v090.i07
See Also
plot.emfrail and autoplot.emfrail for plot functions directly available, emfrail_pll for calculating L(θ) at specific values of θ, summary.emfrail for transforming the emfrail object into a more human-readable format and for visualizing the frailty (empirical Bayes) estimates, predict.emfrail for calculating and visalizing conditional and marginal survival and cumulative hazard curves. residuals.emfrail for extracting martingale residuals and logLik.emfrail for extracting the log-likelihood of the fitted model.