ZicoSeq function

A linear Model-based Permutation Test for Differential Abundance Analysis of Microbiome Data and Other Omics Data

A linear Model-based Permutation Test for Differential Abundance Analysis of Microbiome Data and Other Omics Data

ZicoSeq is a permutation test (Smith permutation) for differential abundance analysis of microbiome sequencing data. The input can be a count or a proportion matrix. When a count matrix is provided, it provides an option to draw posterior samples of the underlying proportions to account for the sampling variability during the sequencing process. The test results are aggregated over these posterior samples. For both count and proportion data, a reference-based ratio approach is used to account for compositional effects. As a general methodology, ZicoSeq can also be applied to differential analysis of other omics data. In this case, they are not treated as compositional data.

ZicoSeq( meta.dat, feature.dat, grp.name, adj.name = NULL, feature.dat.type = c('count', 'proportion', 'other'), prev.filter = 0, mean.abund.filter = 0, max.abund.filter = 0, min.prop = 0, is.winsor = TRUE, outlier.pct = 0.03, winsor.end = c('top', 'bottom', 'both'), is.post.sample = TRUE, post.sample.no = 25, link.func = list(function(x) sign(x) * (abs(x))^0.5), stats.combine.func = max, perm.no = 99, strata = NULL, ref.pct = 0.5, stage.no = 6, excl.pct = 0.2, p.max = 500, is.fwer = FALSE, verbose = TRUE, return.feature.dat = TRUE )

Arguments

  • meta.dat: a data frame containing the sample meta data.
  • feature.dat: a matrix of feature data, row - features (OTUs, genes, etc) , column - samples.
  • grp.name: the name for the variable of interest. It could be numeric or categorical; should be in meta.dat.
  • adj.name: the name(s) for the variable(s) to be adjusted. Multiple variables are allowed. They could be numeric or categorical; should be in meta.dat.
  • feature.dat.type: the type of the feature data. It could be "count", "proportion" or "other". For "proportion" data type, posterior sampling will not be performed, but the reference-based ratio approach will still be used to address compositional effects. For "other" data type, neither posterior sampling or reference-base ratio approach will be used.
  • prev.filter: the prevalence (percentage of nonzeros) cutoff, under which the features will be filtered. The default is 0.
  • mean.abund.filter: the mean relative abundance cutoff, under which the features will be filtered. The default is 0.
  • max.abund.filter: the max relative abundance cutoff, under which the features will be filtered. The default is 0.
  • min.prop: proportions less than this value will be replaced with this value. Only relevant when log transformation is used. Default is 0.
  • is.winsor: a logical value indicating whether winsorization should be performed to replace outliers. The default is TRUE.
  • outlier.pct: the expected percentage of outliers. These outliers will be winsorized. The default is 0.03. For count/proportion data, outlier.pct should be less than prev.filter.
  • winsor.end: a character indicating whether the outliers at the "top", "bottom" or "both" will be winsorized. The default is "top". If the feature.dat.type is "other", "both" may be considered.
  • is.post.sample: a logical value indicating whether to perform posterior sampling of the underlying proportions. Only relevant when the feature data are counts.
  • post.sample.no: the number of posterior samples if posterior sampling is used. The default is 25.
  • link.func: a list of transformation functions for the feature data or the ratios. Based on our experience, square-root transformation is a robust choice for many datasets.
  • perm.no: the number of permutations. If the raw p values are of the major interest, set perm.no to at least 999.
  • strata: a factor such as subject IDs indicating the permutation strata or characters indicating the strata variable in meta.dat. Permutation will be confined to each stratum. This can be used for paired or some longitudinal designs.
  • stats.combine.func: function to combine the F-statistic for the omnibus test. The default is max.
  • ref.pct: percentage of reference taxa. The default is 0.5.
  • p.max: the maximum number of (most abundant) taxa to be considered in reference taxa selection; only relevant when the nubmer of taxa is huge. The default is 500, i.e., when the nubmer of taxa is larger than 500, only the 500 most abundant taxa will be used for reference selection. This is to reduce the computational time.
  • stage.no: the number of stages if multiple-stage normalization is used. The default is 6.
  • excl.pct: the maximum percentage of significant features (nominal p-value < 0.05) in the reference set that should be removed. Only relevant when multiple-stage normalization is used.
  • is.fwer: a logical value indicating whether the family-wise error rate control (West-Young) should be performed.
  • verbose: a logical value indicating whether the trace information should be printed out.
  • return.feature.dat: a logical value indicating whether the winsorized, filtered "feature.dat" matrix should be returned.

Returns

A list with the elements - call: the call

  • feature.dat: the winsorized, filtered feature.dat matrix.

  • meta.dat: meta.dat used.

  • grp.name: the name of the variable of interest.

  • filter.features: a vector of the names of the features that are filtered.

  • ref.features: a vector of the names of the reference features. Only relevant when reference approach is used.

  • R2: a matrix of percent explained variance (number of features by number of transformation functions).

  • F0: a matrix of F-statistics (number of features by number of transformation functions).

  • RSS: a matrix of residual sum squares (number of features by number of transformation functions).

  • df.model, df.residual: degrees of freedom for the model and residual space.

  • coef.list: a list of the linear regression coefficients under the specified transformations.

  • p.raw: the raw p-values based on permutations (not accurate if perm.no is small).

  • p.adj.fdr: permutation-based FDR-adjusted p-values.

  • p.adj.fwer: permutation-based FWER-adjusted (West-Young) p-values.

Details

ZicoSeq is a linear model-based permutation test developed for differential abundance analysis of zero-inflated compositional data. Although its development is motivated by zero-inflated microbiome sequence count data, it can be applied to proportion (composition) data and more generally to other types of omics data. Currently, it has the following components: 1. Winsorization to decrease the influence of outliers; 2. Posterior sampling based on a beta mixture prior to address sampling variability and zero inflation; 3. Reference-based multiple-stage normalization to address compositional effects; 4. An omnibus test to address diverse feature-covariate relationships; 5. Permutation-based false discovery rate control / family-wise error rate control for multiple testing correction, which takes into account the correlation structure in the feature data.

References

Yang, L. & Chen, J. 2022. A comprehensive evaluation of differential abundance analysis methods: current status and potential solutions. Microbiome, 10(1), 1-23.

See Also

ZicoSeq.plot

Author(s)

Jun Chen

Examples

data(throat.otu.tab) data(throat.tree) data(throat.meta) comm <- t(throat.otu.tab) meta.dat <- throat.meta set.seed(123) # For count data zico.obj <- ZicoSeq(meta.dat = meta.dat, feature.dat = comm, grp.name = 'SmokingStatus', adj.name = 'Sex', feature.dat.type = "count", # Filter to remove rare taxa prev.filter = 0.2, mean.abund.filter = 0, max.abund.filter = 0.002, min.prop = 0, # Winsorization to replace outliers is.winsor = TRUE, outlier.pct = 0.03, winsor.end = 'top', # Posterior sampling to impute zeros is.post.sample = TRUE, post.sample.no = 25, # Multiple link functions to capture diverse taxon-covariate relation link.func = list(function (x) x^0.25, function (x) x^0.5, function (x) x^0.75), stats.combine.func = max, # Permutation-based multiple testing correction perm.no = 99, strata = NULL, # Reference-based multiple stage normalization ref.pct = 0.5, stage.no = 6, excl.pct = 0.2, # Family-wise error rate control is.fwer = FALSE, verbose = TRUE, return.feature.dat = FALSE) which(zico.obj$p.adj.fdr <= 0.05) # For proportion data comm.p <- t(t(comm) / colSums(comm)) zico.obj <- ZicoSeq(meta.dat = meta.dat, feature.dat = comm.p, grp.name = 'SmokingStatus', adj.name = 'Sex', feature.dat.type = "proportion", # Filter to remove rare taxa prev.filter = 0.2, mean.abund.filter = 0, max.abund.filter = 0.002, min.prop = 0, # Winsorization to replace outliers is.winsor = TRUE, outlier.pct = 0.03, winsor.end = 'top', # Posterior sampling will be automatically disabled is.post.sample = FALSE, post.sample.no = 25, # Use the square-root transformation link.func = list(function (x) x^0.5), stats.combine.func = max, # Permutation-based multiple testing correction perm.no = 99, strata = NULL, # Reference-based multiple stage normalization ref.pct = 0.5, stage.no = 6, excl.pct = 0.2, # Family-wise error rate control is.fwer = FALSE, verbose = TRUE, return.feature.dat = FALSE) which(zico.obj$p.adj.fdr <= 0.05) # For other type of data. The user should be responsible for the filtering. comm.o <- comm[rowMeans(comm != 0) >= 0.2, ] + 1 comm.o <- log(t(t(comm.o) / colSums(comm.o))) zico.obj <- ZicoSeq(meta.dat = meta.dat, feature.dat = comm.o, grp.name = 'SmokingStatus', adj.name = 'Sex', feature.dat.type = "other", # Filter will not be applied prev.filter = 0, mean.abund.filter = 0, max.abund.filter = 0, min.prop = 0, # Winsorization to both ends of the distribution is.winsor = TRUE, outlier.pct = 0.03, winsor.end = 'both', # Posterior sampling will be automatically disabled is.post.sample = FALSE, post.sample.no = 25, # Identity function is used link.func = list(function (x) x), stats.combine.func = max, # Permutation-based multiple testing correction perm.no = 99, strata = NULL, # Reference-based multiple-stage normalization will not be performed ref.pct = 0.5, stage.no = 6, excl.pct = 0.2, # Family-wise error rate control is.fwer = TRUE, verbose = TRUE, return.feature.dat = FALSE) which(zico.obj$p.adj.fdr <= 0.05)
  • Maintainer: Jun Chen
  • License: GPL-3
  • Last published: 2023-09-14

Useful links