A Pipeline to Define Gene Families in Legumes and Beyond
Compute the total number of accession proteins per species
Extract the accession ids (XP accession) for a given organism
Get accessions and organism for each protein identifier
Get architecture identifiers for the conserved domains
Get the protein identifiers
Filter the protein architectures based on conserved domains
Filter protein architectures based on conserved domains
geneHummus: A pipeline to define gene families in Legumes and beyond
Get the species name from the description sequence
Get the accession ids and the organism for each protein identifier
Get the potential architecture identifiers for the conserved domains
Get the description label for a protein architecture identifier
Get the RefSeq protein identifiers for the given taxonomic species
Get the protein identifiers for a given architecture
Get the electronic architecture for a conserved domain
Get description label for a protein architecture identifier
Get RefSeq protein identifiers for the given taxonomic species
Build a list whose elements are lists of N elements each
A pipeline that extracts proteins from the RefSeq database (National Center for Biotechnology Information) with high specificity and sensitivity. Manual identification of gene families is time-consuming and laborious, requiring an iterative process of manual and computational analysis to identify the members of a given family. The pipeline implements an automatic approach that identifies gene families based on the conserved domains that specifically define each family. See Die et al. (2018) <doi:10.1101/436659> for more information and examples.
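The idea of selecting a family by its defining conserved domains can be illustrated with a small sketch. This is not the package's API or its actual implementation: the function name `filter_architectures`, the architecture ids, and the domain labels below are all hypothetical, and the sketch assumes that each protein architecture can be represented as a set of domain labels.

```python
from typing import Dict, FrozenSet, Iterable, List


def filter_architectures(
    architectures: Dict[str, FrozenSet[str]],
    required: Iterable[str],
    excluded: Iterable[str] = (),
) -> List[str]:
    """Return ids of architectures that contain every required domain
    and none of the excluded ones (hypothetical helper, illustration only)."""
    required_set = frozenset(required)
    excluded_set = frozenset(excluded)
    return sorted(
        arch_id
        for arch_id, domains in architectures.items()
        # Keep the architecture only if all required domains are present
        # and it shares no domain with the excluded set.
        if required_set <= domains and not (excluded_set & domains)
    )


# Hypothetical toy data: architecture id -> set of conserved-domain labels.
toy_architectures = {
    "arch_1": frozenset({"domain_A", "domain_B"}),
    "arch_2": frozenset({"domain_A"}),
    "arch_3": frozenset({"domain_A", "domain_B", "domain_C"}),
}

selected = filter_architectures(
    toy_architectures,
    required=["domain_A", "domain_B"],
    excluded=["domain_C"],
)
print(selected)
```

In this toy setting, only architectures that carry both required domains and lack the excluded one are kept. In a real analysis, the architecture and domain identifiers would come from the database rather than from a hand-written dictionary.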