Testing procedures for estimation of mutual information
Diagnostic procedures that quantify the uncertainty of mutual information estimates obtained with the SLEMI approach. Two main procedures are implemented: a bootstrap, which repeats the estimation on a fraction of the data, and an overfitting test, which divides the data into two parts, training and testing. Each procedure is repeated a specified number of times to obtain a distribution of the estimators. It is recommended to call this function from mi_logreg_main.R.
data: must be a data.frame object. Cannot contain NA values.
signal: is a character object with names of columns of data to be treated as the channel's input.
response: is a character vector with names of columns of data to be treated as the channel's output.
side_variables: (optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included
pinput: is a numeric vector with prior probabilities of the input values. Uniform distribution is assumed as default (pinput=NULL).
lr_maxit: is the maximum number of iterations of the fitting algorithm of logistic regression. Default is 1000.
MaxNWts: is the maximum acceptable number of weights in the logistic regression algorithm. Default is 5000.
formula_string: (optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users.
TestingSeed: is the seed for random number generator used in testing procedures
testing_cores: is the number of cores to be used in parallel computing (via doParallel package)
boot_num: is the number of bootstrap tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.
boot_prob: is the proportion of initial size of data to be used in bootstrap
sidevar_num: is the number of re-shuffling tests of side variables to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.
traintest_num: is the number of overfitting tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.
partition_trainfrac: is the fraction of data to be used as a training dataset
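The roles of boot_prob and partition_trainfrac can be illustrated with a minimal base R sketch. It is not the package's internal code: the toy data frame and the index variables are hypothetical, and drawing the bootstrap fraction without replacement is an assumption made only for this illustration.

```r
# Illustrative sketch only; toy data and variable names are hypothetical,
# not SLEMI internals.
set.seed(123)  # plays the role of TestingSeed

toy <- data.frame(
  signal   = rep(c("0", "1"), each = 50),
  response = rnorm(100)
)

boot_prob           <- 0.8  # fraction of rows kept in one bootstrap replicate
partition_trainfrac <- 0.6  # fraction of rows used for training
n <- nrow(toy)

# Bootstrap-style replicate: a random fraction boot_prob of the rows.
# Sampling without replacement is an assumption of this sketch.
boot_idx  <- sample(n, size = round(boot_prob * n))
boot_data <- toy[boot_idx, ]

# Overfitting test: split the rows into disjoint training and testing parts.
train_idx  <- sample(n, size = round(partition_trainfrac * n))
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]
```

Repeating such a draw boot_num or traintest_num times, and estimating mutual information on each replicate, gives the distribution of estimators described above.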
Returns
a list with elements:
output$bootstrap - bootstrap test
output$traintest - overfitting test
output$reshuffling_sideVar - (if side_variables is not NULL) re-shuffling test
output$bootstrap_Reshuffling_sideVar - (if side_variables is not NULL) re-shuffling test with a bootstrap
Each of the above is a list in which every element is the standard output of a single mi_logreg_algorithm run.
Details
If side variables are included in the analysis (side_variables is not NULL), two additional procedures are carried out: a reshuffling test and a reshuffling test with bootstrap, both based on permuting the values of the side variables within the dataset. The additional parameters lr_maxit and MaxNWts have the same meaning as in the multinom function from the nnet package. An alternative model formula (using the formula_string argument) should be provided if the data are not suitable for description by the default logistic regression model (recommended only for advanced users).
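The permutation idea behind the reshuffling test can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the toy data, the column name side, and the choice to apply one common row permutation to all side variables are hypothetical.

```r
# Illustrative sketch only; toy data and the column name "side" are hypothetical.
set.seed(1)

toy <- data.frame(
  signal   = rep(c("a", "b"), each = 5),
  response = rnorm(10),
  side     = seq_len(10)
)
side_variables <- "side"

# Permute the side-variable values across rows while leaving signal and
# response untouched. Using one common permutation for all side variables
# is an assumption of this sketch.
perm     <- sample(nrow(toy))
shuffled <- toy
shuffled[side_variables] <- lapply(toy[side_variables], function(col) col[perm])
```

After the permutation the side variables keep their marginal distribution but lose any association with the signal and the response, which is what makes the reshuffled estimates usable as a reference.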
References
[1] Jetka T, Nienaltowski K, Winarski T, Blonski S, Komorowski M, Information-theoretic analysis of multivariate single-cell signaling responses using SLEMI, PLoS Comput Biol, 15(7): e1007132, 2019, https://doi.org/10.1371/journal.pcbi.1007132.
Examples
## Compute uncertainty of mutual information estimator using 1 core
## Set boot_num and traintest_num with larger numbers for more reliable testing
tempdata <- data_example1
output <- mi_logreg_testing(data = tempdata, signal = "signal", response = "response",
                            testing_cores = 1, boot_num = 1, traintest_num = 1)