prep_dataset function

Prepare kraken report, k-mer statistics, UMI data

Prepare kraken report, k-mer statistics, UMI data

Three elements returned by this function:

  • kreport: Used by slsd().
  • kmer: Used by blsd(). The function count the number of k-mers and unique k-mers assigned to a taxon across barcodes. The cell barcode and unique molecular identifier (UMI) are used to identify unique barcodes and reads. Data is reported for taxa of pre-specified ranks (default genus + species) taking into account all subsequently higher resolution ranks. The output is a table of barcodes, taxonomic IDs, number of k-mers, and number of unique k-mers.
  • umi: Used by taxa_counts().
prep_dataset( fa1, kraken_report, kraken_out, fa2 = NULL, cb_and_umi = function(sequence_id, read1, read2) { list(substring(read1, 1L, 16L), substring(read1, 17L, 28L)) }, ranks = c("G", "S"), kmer_len = 35L, min_frac = 0.5, exclude = "9606", threads = 10L, overwrite = TRUE, odir = NULL ) read_dataset(dir)

Arguments

  • fa1, fa2: The path to microbiome fasta 1 and 2 file (returned by extract_kraken_reads()).
  • kraken_report: The path to kraken report file.
  • kraken_out: The path of microbiome output file. Usually should be filtered with extract_kraken_output().
  • cb_and_umi: A function takes sequence id, read1, read2 and return a list of 2 corresponding to cell barcode and UMI respectively., each should have the same length of the input.
  • ranks: Taxa ranks to analyze.
  • kmer_len: Kraken kmer length. Default: 35L, which is the default kmer size of kraken2.
  • min_frac: Minimum fraction of kmers directly assigned to taxid to use read. Reads with <=min_frac of the k-mers map inside the taxon's lineage are also discarded.
  • exclude: A character of taxid to exclude, for SAHMI, the host taxid. Reads with any k-mers mapped to the exclude are discarded.
  • threads: Number of threads to use.
  • overwrite: A bool indicates whether to overwrite the files in odir.
  • odir: A string of directory to save the results.
  • dir: A string of directory containing the files returned by prep_dataset.

Returns

A list of three polars DataFrame :

  • kreport: Used by slsd().
  • kmer: Used by blsd().
  • umi: Used by taxa_counts().

Examples

# for sequence from `umi-tools`, we can use following function cb_and_umi <- function(sequence_id, read1, read2) { out <- lapply( strsplit(sequence_id, "_", fixed = TRUE), `[`, 2:3 ) lapply(1:2, function(i) { vapply(out, function(o) as.character(.subset2(o, i)), character(1L)) }) } ## Not run: # 1. `fa1` and `fa2` should be the output of `extract_kraken_reads()` # 2. `kraken_report` should be the output of `blit::kraken2()` # 3. `kraken_out` should be the output of `extract_kraken_output()` # 4. `dir`: you may want to specify the output directory since this process # is time-consuming sahmi_dataset <- prep_dataset( fa1 = "kraken_microbiome_reads.fa", # if you have paired sequence, please also specify `fa2`, # !!! Also pay attention to the file name of `fa1` (add suffix `_1`) # if you use paired reads. # fa2 = "kraken_microbiome_reads_2.fa", kraken_report = "kraken_report.txt", kraken_out = "kraken_microbiome_output.txt", odir = NULL ) # you may want to prepare all datasets for subsequent workflows. # `paths` should be the output directory for each sample from # `blit::kraken2()`, `extract_kraken_output()` and `extract_kraken_reads()`. sahmi_datasets <- lapply(paths, function(dir) { prep_dataset( fa1 = file.path(dir, "kraken_microbiome_reads.fa"), # fa2 = file.path(dir, "kraken_microbiome_reads_2.fa"), kraken_report = file.path(dir, "kraken_report.txt"), kraken_out = file.path(dir, "kraken_microbiome_output.txt"), odir = dir ) }) ## End(Not run)

See Also

https://github.com/sjdlabgroup/SAHMI

  • Maintainer: Yun Peng
  • License: MIT + file LICENSE
  • Last published: 2025-03-24