thin_all function

Binomial thinning for altering read-depth.

Given a matrix of real RNA-seq counts, this function applies a single thinning factor to every count in the matrix, which uniformly lowers the read-depth of the entire dataset. The thinning factor should be provided on the log2-scale. This is a specific application of the binomial thinning approach in thin_diff. This particular form of thinning was used by Robinson and Storey (2014) in the context of deriving read-depth suggestions, and it is described in detail in Gerard (2020).

thin_all(mat, thinlog2, type = c("thin", "mult"))

Arguments

  • mat: A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples.
  • thinlog2: A numeric scalar. This is the amount to shrink each count in mat (on the log2-scale). For example, a value of 0 means that we do not thin, a value of 1 means that we thin by a factor of 2, a value of 2 means we thin by a factor of 4, etc.
  • type: The method used to reduce the counts: binomial thinning (type = "thin") or naive multiplication of the counts (type = "mult"). You should always have this set to "thin".
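
To illustrate why "thin" is recommended over "mult", here is a minimal base R sketch of the two ideas. It is not the package's internal code: the Poisson toy counts and the names counts, keep, thinned, and scaled are assumptions made for this illustration. Binomial thinning keeps each read independently with probability 2^(-thinlog2), so Poisson counts stay Poisson (variance equal to the mean), whereas multiplying and rounding shrinks the variance by the square of the factor.

```r
set.seed(1)

# Hypothetical toy data: 100,000 Poisson counts with mean 1000
# (assumed for illustration; real use would start from observed counts).
lambda   <- 1000
thinlog2 <- 1
counts   <- rpois(n = 1e5, lambda = lambda)
keep     <- 2^(-thinlog2)  # probability of retaining each read

# Binomial thinning: each read is kept independently with probability `keep`.
thinned <- rbinom(n = length(counts), size = counts, prob = keep)

# Naive multiplication: scale each count, then round to an integer.
scaled <- round(counts * keep)

# Both approaches roughly halve the mean, but only binomial thinning
# keeps the variance close to the mean, as expected for Poisson counts.
print(c(mean_thin = mean(thinned), var_thin = var(thinned)))
print(c(mean_mult = mean(scaled),  var_mult = var(scaled)))
```

In this sketch the thinned counts have mean and variance both near lambda * keep, while the multiplied counts have variance near lambda * keep^2, so they are under-dispersed relative to counts from a genuinely shallower sequencing run.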

Returns

A list-like S3 object of class ThinData. Components include some or all of the following:

  • mat: The modified matrix of counts.
  • designmat: The design matrix of variables used to simulate signal. This is made by column-binding design_fixed and the permuted version of design_perm.
  • coefmat: A matrix of coefficients corresponding to designmat.
  • design_obs: Additional variables that should be included in your design matrix in downstream fittings. This is made by column-binding the vector of 1's with design_obs.
  • sv: A matrix of estimated surrogate variables. In simulation studies you would probably leave this out and estimate your own surrogate variables.
  • cormat: A matrix of target correlations between the surrogate variables and the permuted variables in the design matrix. This might be different from the target_cor you input because we pass it through fix_cor to ensure positive semi-definiteness of the resulting covariance matrix.
  • matching_var: A matrix of simulated variables used to permute design_perm if the target_cor is not NULL.
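
As a rough sketch of how the returned object might be inspected, the snippet below thins a small constant matrix and looks at the result. It assumes thin_all is available from the seqgen package (the package name is not stated on this page), and it prints the component names rather than assuming which of the components listed above are present.

```r
# Run only if the (assumed) seqgen package is installed.
if (requireNamespace("seqgen", quietly = TRUE)) {
  # Hypothetical toy matrix: 50 genes (rows) by 4 samples (columns).
  mat <- matrix(1000, nrow = 50, ncol = 4)

  thout <- seqgen::thin_all(mat = mat, thinlog2 = 1)

  print(class(thout))     # class of the returned object
  print(names(thout))     # which components were returned
  print(dim(thout$mat))   # dimensions of the thinned count matrix
}
```

The thinned counts in thout$mat are the piece most downstream analyses need; the remaining components mainly matter for the more general thinning functions.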

Examples

    ## Generate count data and set thinning factor
    ## In practice, you would obtain mat from a real dataset, not simulate it.
    set.seed(1)
    n <- 10
    p <- 1000
    lambda <- 1000
    mat <- matrix(lambda, ncol = n, nrow = p)
    thinlog2 <- 1

    ## Thin read-depths
    thout <- thin_all(mat = mat, thinlog2 = thinlog2)

    ## Compare empirical and theoretical proportions
    mean(thout$mat) / lambda
    2 ^ -thinlog2

References

  • Gerard, D (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi:10.1186/s12859-020-3450-9.
  • Robinson, David G., and John D. Storey. "subSeq: determining appropriate sequencing depth through efficient read subsampling." Bioinformatics 30, no. 23 (2014): 3424-3426. doi:10.1093/bioinformatics/btu552.

See Also

  • select_counts: For subsampling the rows and columns of your real RNA-seq count matrix prior to applying binomial thinning.
  • thin_diff: For the more general thinning approach.
  • thin_lib: For thinning sample-wise.
  • thin_gene: For thinning gene-wise.
  • ThinDataToSummarizedExperiment: For converting a ThinData object to a SummarizedExperiment object.
  • ThinDataToDESeqDataSet: For converting a ThinData object to a DESeqDataSet object.

Author(s)

David Gerard

  • Maintainer: David Gerard
  • License: GPL-3
  • Last published: 2024-05-15