n_proteins: a numeric value that specifies the number of proteins in the synthetic dataset.
frac_change: a numeric value that specifies the fraction of proteins that have a peptide changing in abundance. So far only one peptide per protein is changing.
n_replicates: a numeric value that specifies the number of replicates per condition.
n_conditions: a numeric value that specifies the number of conditions.
method: a character value that specifies the method type for the random sampling of significantly changing peptides. If method = "effect_random", the effect for each condition is randomly sampled and conditions do not depend on each other. If method = "dose_response", the effect is sampled based on a dose response curve and conditions are related to each other depending on the curve shape. In this case the concentrations argument needs to be specified.
concentrations: a numeric vector of length equal to the number of conditions, only needs to be specified if method = "dose_response". This allows equal sampling of peptide intensities. It ensures that the same positions of dose response curves are sampled for each peptide based on the provided concentrations.
median_offset_sd: a numeric value that specifies the standard deviation of the normal distribution that is used for sampling of inter-sample differences. Default is 0.05.
mean_protein_intensity: a numeric value that specifies the mean of the protein intensity distribution. Default: 16.8.
sd_protein_intensity: a numeric value that specifies the standard deviation of the protein intensity distribution. Default: 1.4.
mean_n_peptides: a numeric value that specifies the mean number of peptides per protein. Default: 12.75.
size_n_peptides: a numeric value that specifies the dispersion parameter (the shape parameter of the gamma mixing distribution). In theory it follows from the negative binomial relation variance = mean + mean^2/size, i.e. size = mean^2/(variance - mean). In practice it is better obtained by fitting the negative binomial distribution to real data. This can be done by using the optim function (see Example section). Default: 0.9.
mean_sd_peptides: a numeric value that specifies the mean of peptide intensity standard deviations within a protein. Default: 1.7.
sd_sd_peptides: a numeric value that specifies the standard deviation of peptide intensity standard deviation within a protein. Default: 0.75.
mean_log_replicates, sd_log_replicates: numeric values that specify the meanlog and sdlog of the log-normal distribution of replicate standard deviations. Can be obtained by fitting a log-normal distribution to the distribution of replicate standard deviations from a real dataset. This can be done using the optim function (see Example section). Default: -2.2 and 1.05.
effect_sd: a numeric value that specifies the standard deviation of a normal distribution around mean = 0 that is used to sample the effect of significantly changing peptides. Default: 2.
dropout_curve_inflection: a numeric value that specifies the intensity inflection point of a probabilistic dropout curve that is used to sample intensity dependent missing values. This argument determines how many missing values there are in the dataset. Default: 14.
dropout_curve_sd: a numeric value that specifies the standard deviation of the probabilistic dropout curve. Needs to be negative to sample a dropout towards low intensities. Default: -1.2.
additional_metadata: a logical value that determines if metadata such as protein coverage, missed cleavages and charge state should be sampled and added to the list.
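The effect of dropout_curve_inflection and dropout_curve_sd can be pictured with a small base R sketch. It assumes the dropout curve is the upper tail of a cumulative normal distribution centred on the inflection point. That shape, the helper name dropout_probability and the use of the absolute value of dropout_curve_sd are assumptions made for illustration, not the function's documented implementation.

```r
# Hypothetical helper, not part of the package: probability that a peptide
# with a given intensity is missing, modelled (as an assumption) as the
# upper tail of a normal distribution centred on the inflection point.
dropout_probability <- function(intensity,
                                dropout_curve_inflection = 14,
                                dropout_curve_sd = -1.2) {
  stats::pnorm(
    intensity,
    mean = dropout_curve_inflection,
    sd = abs(dropout_curve_sd), # sign handling is an assumption of this sketch
    lower.tail = FALSE
  )
}

intensities <- c(10, 12, 14, 16, 18)
p_missing <- dropout_probability(intensities)

# Draw missing values: TRUE means the intensity is dropped.
set.seed(123)
is_missing <- stats::runif(length(intensities)) < p_missing
observed <- ifelse(is_missing, NA_real_, intensities)
```

Under this assumption, an intensity equal to the inflection point is missing with probability 0.5, and raising dropout_curve_inflection increases the overall share of missing values.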
Returns
A data frame that contains complete peptide intensities and peptide intensities with missing values that were created based on a probabilistic dropout curve.
Examples
create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.1,
  n_replicates = 3,
  n_conditions = 2
)

# determination of mean_n_peptides and size_n_peptides parameters based on real data (count)
# example peptide count per protein
count <- c(6, 3, 2, 0, 1, 0, 1, 2, 2, 0)
theta <- c(mu = 1, k = 1)
negbinom <- function(theta) {
  -sum(stats::dnbinom(count, mu = theta[1], size = theta[2], log = TRUE))
}
fit <- stats::optim(theta, negbinom)
fit

# determination of mean_log_replicates and sd_log_replicates parameters
# based on real data (standard_deviations)
# example standard deviations of replicates
standard_deviations <- c(0.61, 0.54, 0.2, 1.2, 0.8, 0.3, 0.2, 0.6)
theta2 <- c(meanlog = 1, sdlog = 1)
lognorm <- function(theta2) {
  -sum(stats::dlnorm(standard_deviations, meanlog = theta2[1], sdlog = theta2[2], log = TRUE))
}
fit2 <- stats::optim(theta2, lognorm)
fit2
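As a complement to the likelihood fit, the dispersion parameter can also be estimated from the first two moments, using the negative binomial relation variance = mean + mean^2/size. The sketch below applies this to the same example peptide counts and then uses the moment estimates as starting values for optim. The variable names size_moments and fit_mom are illustrative, and the moment estimate is only defined when the variance exceeds the mean.

```r
# example peptide count per protein (same values as in the example above)
count <- c(6, 3, 2, 0, 1, 0, 1, 2, 2, 0)

m <- mean(count)
v <- stats::var(count)

# Negative binomial: variance = mean + mean^2 / size
# => size = mean^2 / (variance - mean), valid only if variance > mean
size_moments <- m^2 / (v - m)

# Negative log-likelihood of the negative binomial distribution
negbinom <- function(theta) {
  -sum(stats::dnbinom(count, mu = theta[1], size = theta[2], log = TRUE))
}

# Use the moment estimates as starting values for the likelihood fit.
# Warnings can arise if the optimiser probes invalid parameter values.
fit_mom <- suppressWarnings(
  stats::optim(c(mu = m, k = size_moments), negbinom)
)
fit_mom$par
```

With very few proteins the moment estimate can be unstable, which is one reason to prefer fitting on a realistic number of peptide counts.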