PoisMixClus function

Poisson mixture model estimation and model selection

Poisson mixture model estimation and model selection

These functions implement the EM and CEM algorithms for parameter estimation in a Poisson mixture model for clustering high throughput sequencing observations (e.g., genes) for a single number of clusters (PoisMixClus) or a sequence of cluster numbers (PoisMixClusWrapper). Parameters are initialized using a Small-EM strategy as described in Rau et al. (2011) or the splitting small-EM strategy described in Papastamoulis et al. (2014), and model selection is performed using the ICL criteria. Note that these functions implement the PMM-I and PMM-II models described in Rau et al. (2011).

PoisMixClus(y, g, conds, norm = "TMM", init.type = "small-em", init.runs = 1, init.iter = 10, alg.type = "EM", cutoff = 10e-6, iter = 1000, fixed.lambda = NA, equal.proportions = FALSE, prev.labels = NA, prev.probaPost = NA, verbose = FALSE, interpretation = "sum", EM.verbose = FALSE, wrapper = FALSE, subset.index = NA) PoisMixClusWrapper(y, gmin = 1, gmax, conds, norm = "TMM", gmin.init.type = "small-em", init.runs = 1, init.iter = 10, split.init = TRUE, alg.type = "EM", cutoff = 10e-6, iter = 1000, fixed.lambda = NA, equal.proportions = FALSE, verbose = FALSE, interpretation = "sum", EM.verbose = FALSE, subset.index = NA)

Arguments

  • y: (n x q) matrix of observed counts for n observations and q variables

  • g: Number of clusters (a single value). If fixed.lambda contains a list of lambda values to be fixed, g corresponds to the number of clusters in addition to those fixed.

  • gmin: The minimum number of clusters in a sequence to be tested. In cases where clusters are included with a fixed value of lambda, gmin corresponds to the minimum number of clusters in addition to those that are fixed.

  • gmax: The maximum number of clusters in a sequence to be tested. In cases where clusters are included with a fixed value of lambda, gmax corresponds to the maximum number of clusters in addition to those that are fixed.

  • conds: Vector of length q defining the condition (treatment group) for each variable (column) in y

  • norm: The type of estimator to be used to normalize for differences in library size: (‘TC’ for total count, ‘UQ’ for upper quantile, ‘Med’ for median, ‘DESeq’ for the normalization method in the DESeq package, and ‘TMM’ for the TMM normalization method (Robinson and Oshlack, 2010). Can also be a vector (of length q) containing pre-estimated library size estimates for each sample. Note that if the user provides pre-calculated normalization factors, the package will make use of norm/sum(norm) as normalization factors.

  • init.type: Type of initialization strategy to be used (‘small-em’ for the Small-EM strategy described in Rau et al. (2011), and ‘kmeans’ for a simple K-means initialization)

  • gmin.init.type: Type of initialization strategy to be used for the minimum number of clusters in a sequence (gmin): (‘small-em’ for the Small-EM strategy described in Rau et al. (2011), and ‘kmeans’ for a simple K-means initialization)

  • init.runs: Number of runs to be used for the Small-EM strategy described in Rau et al. (2011), with a default value of 1

  • init.iter: Number of iterations to be used within each run for the Small-EM strategry, with a default value of 10

  • split.init: If TRUE, the splitting initialization strategy of Papastamoulis et al. (2014) will be used for cluster sizes (gmin+1, ..., gmax). If FALSE, the initialization strategy specified in gmin.init.type

    is used for all cluster sizes in the sequence.

  • alg.type: Algorithm to be used for parameter estimation (‘EM’ or ‘CEM’ )

  • cutoff: Cutoff to declare algorithm convergence (in terms of differences in log likelihoods from one iteration to the next)

  • iter: Maximum number of iterations to be run for the chosen algorithm

  • fixed.lambda: If one (or more) clusters with fixed values of lambda is desired, a list containing vectors of length d (the number of conditions). specifying the fixed values of lambda for each fixed cluster.

  • equal.proportions: If TRUE, the cluster proportions are set to be equal for all clusters. Default is FALSE (unequal cluster proportions).

  • prev.labels: A vector of length n of cluster labels obtained from the previous run (g-1 clusters) to be used with the splitting small-EM strategy described in described in Papastamoulis et al. (2014). For other initialization strategies, this parameter takes the value NA

  • prev.probaPost: An n x (g-1) matrix of the conditional probabilities of each observation belonging to each of the g-1 clusters from the previous run, to be used with the splitting small-EM strategy of described in Papastamoulis et al. (2012). For other initialization strategies, this parameter takes the value NA

  • verbose: If TRUE, include verbose output

  • interpretation: If "sum", cluster behavior is interpreted with respect to overall gene expression level (sums per gene), otherwise for "mean", cluster behavior is interpreted with respect to mean gene expression (means per gene).

  • EM.verbose: If TRUE, more informative output is printed about the EM algorithm, including the number of iterations run and the difference between log-likelihoods at the last and penultimate iterations.

  • subset.index: Optional vector providing the indices of a subset of genes that should be used for the co-expression analysis (i.e., row indices of the data matrix y.

  • wrapper: TRUE if the PoisMixClus function is run from within the PoisMixClusWrapper main function, and FALSE

    otherwise. This mainly helps to avoid recalculating parameters several times that are used throughout the algorithm (e.g., library sizes, etc.)

Details

Output of PoisMixClus is an S3 object of class HTSCluster, and output of PoisMixClusWrapper is an S3 object of class HTSClusterWrapper.

In a Poisson mixture model, the data yy are assumed to come from g distinct subpopulations (clusters), each of which is modeled separately; the overall population is thus a mixture of these subpopulations. In the case of a Poisson mixture model with g components, the model may be written as

f(y;g,Ψg)=i=1nk=1gπkj=1dl=1rjP(yijl;θk)f(y;g,ψg)=i=1nk=1gπkj=1dl=1rjP(yijl;θk) f(\mathbf{y};g,\boldsymbol{\Psi}_g) = \prod_{i=1}^n \sum_{k=1}^g \pi_k \prod_{j=1}^{d}\prod_{l=1}^{r_j} P(y_{ijl} ; \boldsymbol{\theta}_k)f(y;g,\psi_g) = \prod_{i=1}^n \sum_{k=1}^g \pi_k \prod_{j=1}^{d}\prod_{l=1}^{r_j} P(y_{ijl} ; \theta_k)

for i=1,,ni = 1, \ldots, n observations in l=1,,rjl = 1, \ldots, r_j replicates of j=1,,dj = 1, \ldots, d conditions (treatment groups), where P()P(\cdot) is the standard Poisson density, ψg=(π1,,πg1,θ)\psi_g = (\pi_1,\ldots,\pi_{g-1}, \theta^\prime), θ\theta^\prime contains all of the parameters in θ1,,θg\theta_1,\ldots,\theta_g assumed to be distinct, and π=(π1,,πg)\pi = (\pi_1,\ldots,\pi_g)^\prime are the mixing proportions such that πk\pi_k is in (0,1) for all k and kπk=1\sum_k \pi_k = 1.

We consider the following parameterization for the mean θ=(muijlk)\theta = (mu_{ijlk}). We consider

μijlk=wisjlλjk \mu_{ijlk} = w_i s_{jl} \lambda_{jk}

where wiw_i corresponds to the expression level of observation i, λk=(λ1k,,λdk)\lambda_k = (\lambda_{1k},\ldots,\lambda_{dk})

corresponds to the clustering parameters that define the profiles of the genes in cluster k across all variables, and sjls_{jl} is the normalized library size (a fixed constant) for replicate l of condition j.

There are two approaches to estimating the parameters of a finite mixture model and obtaining a clustering of the data: the estimation approach (via the EM algorithm) and the clustering approach (via the CEM algorithm). Parameter initialization is done using a Small-EM strategy as described in Rau et al. (2011) via the emInit function. Model selection may be performed using the BIC or ICL criteria, or the slope heuristics.

Returns

  • lambda: (d x g) matrix containing the estimate of λ^\hat{\lambda}

  • pi: Vector of length g containing the estimate of π^\hat{\pi}

  • labels: Vector of length n containing the cluster assignments of the n observations

  • probaPost: Matrix containing the conditional probabilities of belonging to each cluster for all observations

  • log.like: Value of log likelihood

  • BIC: Value of BIC criterion

  • ICL: Value of ICL criterion

  • alg.type: Estimation algorithm used; matches the argument alg.type above)

  • norm: Library size normalization factors used

  • conds: Conditions specified by user

  • iterations: Number of iterations run

  • logLikeDiff: Difference in log-likelihood between the last and penultimate iterations of the algorithm

  • subset.index: If provided by the user, the indices of subset of genes used for co-expression analyses

  • loglike.all: Log likelihoods calculated for each of the fitted models for cluster sizes gmin, ..., gmax

  • capushe: Results of capushe model selection, an object of class "Capushe"

  • ICL.all: ICL values calculated for each of the fitted models for cluster sizes gmin, ..., gmax

  • ICL.results: Object of class HTSCluster giving the results from the model chosen via the ICL criterion

  • BIC.results: Object of class HTSCluster giving the results from the model chosen via the BIC

  • DDSE.results: Object of class HTSCluster giving the results from the model chosen via the DDSE slope heuristics criterion

  • Djump.results: Object of class HTSCluster giving the results from the model chosen via the Djump slope heuristics criterion

  • all.results: List of objects of class HTSCluster giving the results for all models for cluster sizes gmin, ..., gmax

  • model.selection: Type of criteria used for model selection, equal to NA for direct calls to PoisMixClus or "DDSE", "Djump", "BIC", or "ICL" for the respective selected models for calls to PoisMixClusWrapper

References

Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology, 11 (R106), 1-28.

Papastamoulis, P., Martin-Magniette, M.-L., and Maugis-Rabusseau, C. (2014). On the estimation of mixtures of Poisson regression models with large number of components. Computational Statistics and Data Analysis: 3rd special Issue on Advances in Mixture Models, DOI: 10.1016/j.csda.2014.07.005.

Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.

Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.

Note

Note that the fixed.lambda argument is primarily intended to be used in the case when a single cluster is fixed to have equal clustering parameters lambda across all conditions (i.e., λj1=λ1=1\lambda_{j1}=\lambda_{1}=1); this is particularly useful when identifying genes with non-differential expression across all conditions (see the HTSDiff R package for more details). Alternatively, this argument could be used to specify a cluster for which genes are only expressed in a single condition (e.g., λ11=1\lambda_{11} = 1 and λj1=0\lambda_{j1} = 0 for all j>1j > 1). Other possibilities could be considered, but note that the fixed values of lambda must satisfy the constraint jλjksj.=1\sum_j \lambda_{jk}s_{j.} = 1 for all kk

imposed in the model; if this is not the case, a warning message will be printed.

Author(s)

Andrea Rau

See Also

probaPost for the calculation of the conditional probability of belonging to a cluster; PoisMixMean for the calculation of the per-cluster conditional mean of each observation; logLikePoisMixDiff for the calculation of the log likelihood of a Poisson mixture model; emInit and kmeanInit for the Small-EM parameter initialization strategy

Examples

set.seed(12345) ## Simulate data as shown in Rau et al. (2011) ## Library size setting "A", high cluster separation ## n = 200 observations simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high") y <- simulate$y conds <- simulate$conditions ## Run the PMM model for g = 3 ## "TC" library size estimate, EM algorithm run <- PoisMixClus(y, g = 3, conds = conds, norm = "TC") ## Estimates of pi and lambda for the selected model pi.est <- run$pi lambda.est <- run$lambda ## Not run: PMM for 4 total clusters, with one fixed class ## "TC" library size estimate, EM algorithm ## ## run <- PoisMixClus(y, g = 3, norm = "TC", conds = conds, ## fixed.lambda = list(c(1,1,1))) ## ## ## Not run: PMM model for 4 clusters, with equal proportions ## "TC" library size estimate, EM algorithm ## ## run <- PoisMixClus(y, g = 4, norm = "TC", conds = conds, ## equal.proportions = TRUE) ## ## ## Not run: PMM model for g = 1, ..., 10 clusters, Split Small-EM init ## ## run1.10 <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds, ## norm = "TC") ## ## ## Not run: PMM model for g = 1, ..., 10 clusters, Small-EM init ## ## run1.10bis <- <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds, ## norm = "TC", split.init = FALSE) ## ## ## Not run: previous model equivalent to the following ## ## for(K in 1:10) { ## run <- PoisMixClus(y, g = K, conds = conds, norm = "TC") ## }
  • Maintainer: Andrea Rau
  • License: GPL (>= 3)
  • Last published: 2023-09-05

Useful links