Given a matrix of real RNA-seq counts, this function will apply a separate, user-provided thinning factor to each sample. This uniformly lowers the counts for all genes in a sample. The thinning factor should be provided on the log2-scale. This is a specific application of the binomial thinning approach in thin_diff. The method is described in detail in Gerard (2020).
thin_lib(mat, thinlog2, relative = FALSE, type = c("thin", "mult"))
Arguments
mat: A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples.
thinlog2: A vector of numerics. Element i is the amount to thin (on the log2-scale) for sample i. For example, a value of 0 means that we do not thin, a value of 1 means that we thin by a factor of 2, a value of 2 means we thin by a factor of 4, etc.
relative: A logical. Should we apply relative thinning (TRUE) or absolute thinning (FALSE)? Only experts should change the default.
type: Should we apply binomial thinning (type = "thin") or just naive multiplication of the counts (type = "mult")? You should always have this set to "thin".
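To make the meaning of thinlog2 concrete, here is a minimal base-R sketch of sample-wise binomial thinning. It is an illustration only, not the package's implementation: the helper name thin_lib_sketch is hypothetical, it covers only the absolute-thinning case, and it assumes non-negative thinning amounts. Each count in sample i is kept with probability 2^(-thinlog2[i]), so the thinned counts remain integers and carry sampling noise, whereas naive multiplication is deterministic and can produce non-integer values.

```r
# Hypothetical helper (not part of seqgendiff): thin every count in
# sample i by drawing from Binomial(count, 2^(-thinlog2[i])).
thin_lib_sketch <- function(mat, thinlog2) {
  stopifnot(is.matrix(mat),
            length(thinlog2) == ncol(mat),
            all(thinlog2 >= 0))
  # Probability of keeping a read in each sample (thinlog2 = 1 keeps about half).
  keep_prob <- 2^(-thinlog2)
  # One probability per matrix entry; matrices are stored column by column,
  # so repeating each sample's probability nrow(mat) times lines up with mat.
  prob_vec <- rep(keep_prob, each = nrow(mat))
  thinned <- rbinom(n = length(mat), size = as.vector(mat), prob = prob_vec)
  matrix(thinned, nrow = nrow(mat), ncol = ncol(mat), dimnames = dimnames(mat))
}

set.seed(1)
# Simulated counts stand in for a real RNA-seq matrix (genes x samples).
mat <- matrix(rpois(500 * 3, lambda = 1000), nrow = 500, ncol = 3)
thinlog2 <- c(0, 1, 2)  # no thinning, thin by 2, thin by 4

thinned <- thin_lib_sketch(mat, thinlog2)

# Naive multiplication, for contrast: no sampling noise, values need not be integers.
scaled <- sweep(mat, 2, 2^(-thinlog2), `*`)

# Realized library-size ratios; these should be close to 2^(-thinlog2).
colSums(thinned) / colSums(mat)
```

The first sample is left unchanged because its keep probability is exactly 1. For the other samples the realized ratio only approximates the target, since each count is a random draw.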
Returns
A list-like S3 object of class ThinData. Components include some or all of the following:
mat: The modified matrix of counts.
designmat: The design matrix of variables used to simulate signal. This is made by column-binding design_fixed and the permuted version of design_perm.
coefmat: A matrix of coefficients corresponding to designmat.
design_obs: Additional variables that should be included in your design matrix in downstream fittings. This is made by column-binding the vector of 1's with design_obs.
sv: A matrix of estimated surrogate variables. In simulation studies you would probably leave this out and estimate your own surrogate variables.
cormat: A matrix of target correlations between the surrogate variables and the permuted variables in the design matrix. This might be different from the target_cor you input because we pass it through fix_cor to ensure positive semi-definiteness of the resulting covariance matrix.
matching_var: A matrix of simulated variables used to permute design_perm if the target_cor is not NULL.
Examples
## Generate count data and thinning factors
## In practice, you would obtain mat from a real dataset, not simulate it.
set.seed(1)
n <- 10
p <- 1000
lambda <- 1000
mat <- matrix(lambda, ncol = n, nrow = p)
thinlog2 <- rexp(n = n, rate = 1)

## Thin library sizes
thout <- thin_lib(mat = mat, thinlog2 = thinlog2)

## Compare empirical thinning proportions to specified thinning proportions
empirical_propvec <- colMeans(thout$mat) / lambda
specified_propvec <- 2^(-thinlog2)
empirical_propvec
specified_propvec
References
Gerard, D. (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics, 21(1), 206. doi:10.1186/s12859-020-3450-9.
See Also
select_counts: For subsampling the rows and columns of your real RNA-seq count matrix prior to applying binomial thinning.
thin_diff: For the more general thinning approach.
thin_gene: For thinning gene-wise instead of sample-wise.
thin_all: For thinning all counts uniformly.
ThinDataToSummarizedExperiment: For converting a ThinData object to a SummarizedExperiment object.
ThinDataToDESeqDataSet: For converting a ThinData object to a DESeqDataSet object.