Testing procedures for estimation of channel capacity
Diagnostic procedures for quantifying the uncertainty of channel capacity estimates obtained with the SLEMI approach. Two main procedures are implemented: the bootstrap, which repeats the estimation on a fraction of the data, and the overfitting test, which divides the data into two parts, training and testing. Each procedure is repeated a specified number of times to obtain a distribution of the estimators. It is recommended to conduct estimation by calling capacity_logreg_main.R.
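The two resampling schemes described above can be sketched in base R. This is only an illustration of the idea, not the SLEMI implementation: the toy data frame is hypothetical, and drawing the bootstrap fraction without replacement is an assumption of the sketch.

```r
# Illustrative sketch only: base-R resampling, not the SLEMI implementation.
set.seed(11111)  # plays the role of TestingSeed

# Hypothetical toy data with one input column and one output column
toy <- data.frame(
  signal   = rep(c("0", "1"), each = 50),
  response = c(rnorm(50, mean = 0), rnorm(50, mean = 2))
)

boot_prob <- 0.8            # fraction of rows kept in one bootstrap repetition
partition_trainfrac <- 0.6  # fraction of rows used as the training dataset

# One bootstrap repetition: keep a random fraction of the rows
# (sampling without replacement is an assumption of this sketch)
boot_rows <- sample(nrow(toy), size = round(boot_prob * nrow(toy)))
boot_data <- toy[boot_rows, ]

# One overfitting-test repetition: disjoint training and testing parts
train_rows <- sample(nrow(toy), size = round(partition_trainfrac * nrow(toy)))
train_data <- toy[train_rows, ]
test_data  <- toy[-train_rows, ]
```

In a full test, each of these steps would be repeated boot_num or traintest_num times, with the capacity estimated on every resampled dataset.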
data: must be a data.frame object. Cannot contain NA values.
signal: is a character object with names of columns of data to be treated as the channel's input.
response: is a character vector with names of columns of data to be treated as the channel's output.
side_variables: (optional) is a character vector that indicates the side variables' columns of data; if NULL, no side variables are included.
cc_maxit: is the number of iterations of the iterative optimisation algorithm used to estimate channel capacity. Default is 100.
lr_maxit: is the maximum number of iterations of the fitting algorithm of logistic regression. Default is 1000.
MaxNWts: is the maximum acceptable number of weights in the logistic regression algorithm. Default is 5000.
formula_string: (optional) is a character object containing the formula to use in the logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users.
TestingSeed: is the seed for the random number generator used in testing procedures.
testing_cores: is the number of cores to be used in parallel computing (via the doParallel package).
boot_num: is the number of bootstrap tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.
boot_prob: is the proportion of the original data size to be used in each bootstrap test. Default is 0.8.
sidevar_num: is the number of re-shuffling tests of side variables to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.
traintest_num: is the number of overfitting tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.
partition_trainfrac: is the fraction of data to be used as a training dataset. Default is 0.6.
Returns
a list with four elements:
output$bootstrap - results of the bootstrap tests
output$resamplingMorph - results of the reshuffling tests of side variables (only if side_variables is not NULL)
output$traintest - results of the overfitting (training/testing) tests
output$bootResampMorph - results of the reshuffling with bootstrap tests (only if side_variables is not NULL)
Each of the above is a list in which every element is the output of a single repetition of the channel capacity algorithm.
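Because each element holds the output of a single repetition, the spread of the estimates can be summarised by collecting one number per repetition. The sketch below uses a mock list rather than a real result; the field name cc for the capacity estimate is an assumption made for illustration and should be checked against the actual output.

```r
# Mock result: three bootstrap repetitions.
# The field name `cc` is an assumption of this sketch, not confirmed here.
mock_output <- list(
  bootstrap = list(list(cc = 1.02), list(cc = 0.97), list(cc = 1.05))
)

# Collect the capacity estimate from every repetition of one procedure
extract_cc <- function(repetitions) {
  vapply(repetitions, function(rep_out) rep_out$cc, numeric(1))
}

boot_cc <- extract_cc(mock_output$bootstrap)

# Mean and standard deviation across repetitions as a simple uncertainty summary
boot_summary <- c(mean = mean(boot_cc), sd = sd(boot_cc))
```

With only a few repetitions the standard deviation is itself noisy, which is one reason larger values of boot_num and traintest_num are recommended.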
Details
If side variables are included in the analysis (side_variables is not NULL), two additional procedures are carried out: a reshuffling test and a reshuffling with bootstrap test, both based on permuting the values of the side variables within the dataset. The additional parameters lr_maxit and MaxNWts have the same meaning as in the multinom function from the nnet package. An alternative model formula (using the formula_string argument) should be provided if the data are not suitable for description by the default logistic regression model (recommended only for advanced users).
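The permutation step behind the reshuffling test can be illustrated in base R. This is a minimal sketch under stated assumptions: the toy data and the column name side are hypothetical, and permuting all side-variable columns with one shared row permutation is an assumption, not necessarily what the package does.

```r
# Illustrative sketch only: permuting side-variable values within a dataset.
set.seed(11111)

# Hypothetical toy data with one side variable
toy <- data.frame(
  signal   = rep(c("0", "1"), each = 4),
  response = c(0.1, 0.3, 0.2, 0.4, 1.9, 2.1, 2.0, 2.2),
  side     = rep(c("a", "b"), times = 4)
)
side_variables <- "side"

# Draw one row permutation and apply it to the side-variable columns only;
# signal and response stay in their original order
perm <- sample(nrow(toy))
reshuffled <- toy
for (v in side_variables) {
  reshuffled[[v]] <- toy[[v]][perm]
}
```

The permutation breaks any association between the side variables and the remaining columns while leaving the side variables' marginal distribution unchanged, so capacity estimated on reshuffled data serves as a baseline for comparison.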
References
[1] Jetka T, Nienaltowski K, Winarski T, Blonski S, Komorowski M, Information-theoretic analysis of multivariate single-cell signaling responses using SLEMI, PLoS Comput Biol, 15(7): e1007132, 2019, https://doi.org/10.1371/journal.pcbi.1007132.
Examples
## Please set boot_num and traintest_num with larger numbers
## for a more reliable testing
tempdata <- data_example1
outputCLR1_testing <- capacity_logreg_testing(data = tempdata,
                                              signal = "signal", response = "response",
                                              cc_maxit = 10, TestingSeed = 11111,
                                              boot_num = 1, boot_prob = 0.8,
                                              testing_cores = 1,
                                              traintest_num = 1, partition_trainfrac = 0.6)