Estimate the connectivity matrix of a causal graph
Estimate the connectivity matrix of a causal graph
Estimates the connectivity matrix of a directed causal graph, using various possible methods. Supported methods at the moment are ARGES, backShift, bivariateANM, bivariateCAM, CAM, FCI, FCI+, GES, GIES, hiddenICP, ICP, LINGAM, MMHC, rankARGES, rankFci, rankGES, rankGIES, rankPC, regression, RFCI and PC.
X: A (n x p)-data matrix with n observations of p variables.
environment: An optional vector of length n, where the entry for observation i is an index for the environment in which observation i took place (Simplest case: entries 1 for observational data and entries 2 for interventional data of unspecified type. Encoding for observational data can be changed with indexObservationalData). Is required for methods ICP, hiddenICP and backShift.
interventions: A optional list of length n. The entry for observation i is a numeric vector that specifies the variables on which interventions happened for observation i (a scalar if an intervention happened on just one variable and integer(0) if no intervention occured for this observation). Is used for methods gies, rankGies and CAM and will generate the vector environment if the latter is set to NULL. (However, this might generate too many different environments for some data sets, so a hand-picked vector environment is preferable). Is also used for ICP and hiddenICP to exclude interventions on the target variable of interest.
parentsOf: The variables for which we would like to estimate the parents. Default are all variables. Currently only used with mode = "raw". Speeds up computation for methods bivariateANM, bivariateCAM, ICP, hiddenICP and regression; for other methods only affects output. Also see variableSelMat for possibly speeding up computational time by restricting the set of potential parents for a variable.
method: A string that specfies the method to use. The methods pc (PC-algorithm), LINGAM (LINGAM), arges (Adaptively restricted greedy equivalence search), ges
and rfci (Really fast causal inference) are imported from the package "pcalg" and are documented there in more detail, including the additional options that can be supplied via setOptions. The "rank versions" of arges, fci, ges, gies and pc are based on [1]. The method CAM (Causal additive models) is documented in the package "CAM" and the methods ICP (Invariant causal prediction), hiddenICP
(Invariant causal prediction with hidden variables) are from the package "InvariantCausalPrediction". The method backShift comes from the package "backShift". The method mmhc comes from the package "bnlearn". Finally, the methods bivariateANM and bivariateCAM are for now implemented internally but will hopefully be part of another package at some point in the near future.
alpha: The level at which tests are done. This leads to confidence intervals for ICP and hiddenICP and is used internally for pc, rankPc, mmhc, fci, rankFci, fciplus
and rfci. For all other methods alpha is not used.
mode: Determines output type - can be "raw" or one of the queries "isParent", "isMaybeParent", "isNoParent", "isAncestor","isMaybeAncestor", "isNoAncestor". If "raw", getParents() returns the connectivity matrix computed by the specified method in sparse matrix format if sparse is set to TRUE; else in dense matrix format (or as list if returnAsList = TRUE). The options directed and pointConf will be ignored for all modes except for "raw" if set to TRUE. The different mode types are explained in the help for getRanking.
variableSelMat: An optional logical matrix of dimension (p x p). An entry TRUE for entry (i,j) says that variable i should be considered as a potential parent for variable j and vice versa for FALSE. If the default value of NULL is used, all variables will be considered, but this can be very slow, especially for methods pc, ges, gies, rfci and CAM. Ignored for methods backShift, fciplus, LINGAM and CAM.
excludeTargetInterventions: When looking for parents of variable k
in 1,...,p, set to TRUE if observations where an intervention on variable k occured should be excluded. Default is TRUE. Used in ICP and hiddenICP.
onlyObservationalData: If set to TRUE, only observational data is used. It will take the index in environment specified by indexObservationalData. If environment is NULL, all observations are used. Default is FALSE.
indexObservationalData: Index in environment that encodes observational data. Default is 1.
returnAsList: If set to TRUE, will return a list, where entry k is a list containing the estimated parents of variable k. Default is FALSE.
sparse: If set to TRUE and returnAsList is FALSE, output matrix will be in sparse matrix format.
directed: If TRUE, an edge will be returned if and only if an edge has been detected to be directed. I.e. entry will be set to 0 for entry (j,k) if both j−>k and k−>j are estimated (ICP, hiddenICP, regression), if j−−k is undirected (in case of CPDAGs) or if the edge type is not of type i−−>j in case of PAGs. If assumeNoSelectionVars = TRUE the edge type i−−oj is also considered 'directed' for methods returning PAGs. Default is FALSE. Only supported in mode "raw".
pointConf: If TRUE, numerical estimates will be returned if possible. For methods ICP and hiddenICP, these are the values in the individual confidence intervals (at chosen level alpha) that are closest to 0; for other methods these are point estimates. Some methods do not return numerical point estimates; for these the output will remain binary 0/1 (no-edge/edge). Default is FALSE. Only supported in mode "raw".
setOptions: A list that can take method-specific options; see the individual documentations of the methods for more options and their possible values.
assumeNoSelectionVars: Set to TRUE is you want to assume the absence of selection variables.
verbose: If TRUE, detailed output is provided.
...: Parameters to be passed to underlying method's function.
Returns
If option returnAsList is FALSE, a sparse matrix, where a 0 entry in position (j,k) corresponds to an estimate of "no edge" j -> k, while an entry 1 corresponds to an estimated egde. If option pointConf is TRUE, the 1 entries will be replaced by numerical values that are either point estimates of the causal coefficients or confidence bounds (see above). If option returnAsList is TRUE, a list will be returned. The k-th entry in the list is the numeric vector with the indices of the estimated parents of node k.
Examples
## load the backShift package for data generation and plotting functionalityif(require(backShift)& require(pcalg)){# Simulate data with connectivity matrix A with assumptions# 1) hidden variables present# 2) precise location of interventions is assumed unknown# 3) different environments can be distinguished## simulate data myseed <-1# sample size n n <-10000# p=3 predictor variables and connectivity matrix A p <-3 labels <- c("1","2","3") A <- diag(p)*0 A[1,2]<-0.8 A[2,3]<-0.8 A[3,1]<--0.4# divide data in 10 different environments G <-10# simulate simResult <- backShift::simulateInterventions(n, p, A, G, intervMultiplier =3, noiseMult =1, nonGauss =TRUE, hiddenVars =TRUE, knownInterventions =FALSE, fracVarInt =NULL, simulateObs =TRUE, seed = myseed) X <- simResult$X
environment <- simResult$environment
## apply all methods given in vector 'methods'## (using all data pooled for pc/LINGAM/rfci/ges -- can be changed with option## 'onlyObservationalData=TRUE') methods <- c("backShift","LINGAM")#c("pc", "rfci", "ges")# select whether you want to run stability selection stability <-FALSE# arrange graphical output into a rectangular grid sq <- ceiling(sqrt(length(methods)+1)) par(mfrow=c(ceiling((length(methods)+1)/sq),sq))## plot and print true graph cat("\n true graph is ------ \n") print(A) plotGraphEdgeAttr(A, plotStabSelec =FALSE, labels = labels, thres.point =0, main ="TRUE GRAPH")## loop over all methods and compute and print/plot estimatefor(method in methods){ cat("\n result for method", method," ------ \n")if(!stability){# Option 1): use this estimator as a point estimate Ahat <- getParents(X, environment, method=method, alpha=0.1, pointConf =TRUE)}else{# Option 2): use a stability selection based estimator# with expected number of false positives bounded by EV=2 Ahat <- getParentsStable(X, environment, EV=2, method=method, alpha=0.1)}# print and plot estimate (point estimate thresholded if numerical estimates# are returned) print(Ahat)if(!stability) plotGraphEdgeAttr(Ahat, plotStabSelec =FALSE, labels = labels, thres.point =0.05, main=paste("POINT ESTIMATE FOR METHOD\n", toupper(method)))else plotGraphEdgeAttr(Ahat, plotStabSelec =TRUE, labels = labels, thres.point =0, main = paste("STABILITY SELECTION
ESTIMATE\n FOR METHOD", toupper(method)))}}else{ cat("\nThe packages 'backShift' and 'pcalg' are needed for the examples to
work. Please install them.")}
References
Naftali Harris and Mathias Drton: PC Algorithm for Nonparanormal Graphical Models. J. Mach. Learn. Res. 14(1) 2013.
See Also
getParentsStable for stability selection-based estimation of the causal graph.