taxonomy function

Extract Data from NCBI Taxonomy Files

Extract Data from NCBI Taxonomy Files

Read data from NCBI taxonomy files, traverse taxonomic ranks, get scientific names of taxonomic nodes. UTF-8

getnodes(taxdir) getrank(id, taxdir, nodes=NULL) parent(id, taxdir, rank=NULL, nodes=NULL) allparents(id, taxdir, nodes=NULL) getnames(taxdir) sciname(id, taxdir, names=NULL)

Arguments

  • taxdir: character, directory where the taxonomy files are kept.
  • id: numeric, taxonomic ID(s) of the nodes of interest.
  • nodes: dataframe, output from getnodes (optional).
  • rank: character, name of the taxonomic rank of interest.
  • names: dataframe, output from getnames (optional).

Details

These functions provide a convenient way to read data from NCBI taxonomy files (i.e., the contents of taxdump.tar.gz, which is available from https://ftp.ncbi.nih.gov/pub/taxonomy/).

The taxdir argument is used to specify the directory where the nodes.dmp and names.dmp files are located. getnodes and getnames read these files into data frames. getrank returns the rank (species, genus, etc) of the node with the given taxonomic id. parent returns the taxonomic ID of the next-lowest node below that specified by the id in the argument, unless rank is supplied, in which case the function descends the tree until a node with that rank is found. allparents returns all the taxonomic IDs of all nodes between that specified by id and the root of the tree, inclusive. sciname returns the scientific name of the node with the given id.

The id argument can be of length greater than 1 except for allparents. If getrank, parent, allparents or sciname need to be called repeatedly, the operation can be hastened by supplying the output of getnodes in the nodes argument and/or the output of getnames in the names argument.

Examples

## Get information about Homo sapiens from the ## packaged taxonomy files taxdir <- system.file("extdata/taxonomy", package = "CHNOSZ") # H. sapiens' taxonomic id id1 <- 9606 # That is a species getrank(id1, taxdir) # The next step up the taxonomy id2 <- parent(id1, taxdir) print(id2) # That is a genus getrank(id2, taxdir) # That genus is "Homo" sciname(id2, taxdir) # We can ask what phylum is it part of? id3 <- parent(id1, taxdir, "phylum") # Answer: "Chordata" sciname(id3, taxdir) # H. sapiens' complete taxonomy id4 <- allparents(id1, taxdir) sciname(id4, taxdir) ## The names of the organisms in the supplied taxonomy files taxdir <- system.file("extdata/taxonomy", package = "CHNOSZ") id5 <- c(83333, 4932, 9606, 186497, 243232) sciname(id5, taxdir) # These are not all species, though # (those with "no rank" are something like strains, # e.g. Escherichia coli K-12) getrank(id5, taxdir) # Find the species for each of these id6 <- sapply(id5, function(x) parent(x, taxdir = taxdir, rank = "species")) unique(getrank(id6, taxdir)) # "species" # Note that the K-12 is dropped sciname(id6, taxdir) ## The complete nodes.dmp and names.dmp files are quite large, ## so it helps to store them in memory when performing multiple queries ## (this doesn't have a noticeable speed-up for the small files in this example) taxdir <- system.file("extdata/taxonomy", package = "CHNOSZ") nodes <- getnodes(taxdir = taxdir) # All of the node ids in this file id7 <- nodes$id # All of the non-leaf nodes id8 <- unique(parent(id7, nodes = nodes)) names <- getnames(taxdir = taxdir) sciname(id8, names = names)
  • Maintainer: Jeffrey Dick
  • License: GPL-3
  • Last published: 2024-02-11