assess_reference_loo function

Simulate mixtures and estimate reporting group and collection proportions.

Simulate mixtures and estimate reporting group and collection proportions.

From a reference dataset, this creates a genotype-logL matrix based on simulation-by-individual with randomly drawn population proportions, then uses this in two different estimates of population mixture proportions: maximum likelihood via EM-algorithm and posterior mean from MCMC.

assess_reference_loo( reference, gen_start_col, reps = 50, mixsize = 100, seed = 5, alpha_repunit = 1.5, alpha_collection = 1.5, resampling_unit = "individual", alle_freq_prior = list(const_scaled = 1), printSummary = FALSE, return_indiv_posteriors = FALSE )

Arguments

  • reference: a two-column format genetic dataset, with "repunit", "collection", and "indiv" columns, as well as a "sample_type" column that has some "reference" entries

  • gen_start_col: the first column of genetic data in reference

  • reps: number of reps of mixture simulation and MCMC to do

  • mixsize: the number of individuals in each simulated mixture

  • seed: a random seed for simulations

  • alpha_repunit: If a vector, this is the dirichlet parameter for simulating the proportions of reporting units. Gets recycled to the number of reporting units. Default is 1.5. Otherwise, this could be a two-column data frame. The first column must be named "repunit" and the second one must be one of "dirichlet", "ppn", or "cnt", according to whether you wish to specify dirichlet parameters, or proportions, or exact counts, respectively, for each population. If you want to make multiple simulations, pass in a list of data frames or of individual dirichlet parameters. For examples, see sim_spec_examples.

  • alpha_collection: The dirichlet parameter for simulating proportions of collections within reporting units. Default = 1.5. If this is a data frame then the first column must be "collection" and the second must be one of "dirichlet", "ppn", "cnt", "sub_dirichlet", "sub_ppn". If you want to provide multiple different scenarios. You can pass them in as a list. If alpha_repunit or alpha_collection is a list with length greater than 1, the shorter will be recycled. For examples, see sim_spec_examples.

  • resampling_unit: what unit should be resampled. Currently the choices are "individuals" (the default) and "gene_copies". Using "individuals" preserves missing data patterns available in the reference data set. We also have "gene_copies_with_missing" capability, but it is not yet linked into this function.

  • alle_freq_prior: a one-element named list specifying the prior to be used when generating Dirichlet parameters for genotype likelihood calculations. Valid methods include "const", "scaled_const", and "empirical". See ?list_diploid_params

    for method details.

  • printSummary: if TRUE a summary of the reference samples will be printed to stdout.

  • return_indiv_posteriors: if TRUE, output is a list of 2. The first entry, mixing_proportions, contains the true (simulated) and estimated mixture proportions for each scenario, iteration, and collection. The second, indiv_posteriors, contains the posterior probability of assignment to each collection for each scenario, iteration, and individual. If FALSE, output is a single data frame, mixing_proportions

Examples

# very small number of reps so it is quick enough for example ale_dev <- assess_reference_loo(alewife, 17, reps = 5)
  • Maintainer: Eric C. Anderson
  • License: CC0
  • Last published: 2024-01-24

Useful links