A Machine-Learning Based Tool to Automate the Identification of Biological Database IDs
Data balance. Most classes adopt random undersampling, while a few cla...
Reshape data and delete meaningless rows.
Perform feature selection automatically based on correlation a...
Compute the confusion matrix for the predicted result.
Get ID attributes from the Biomart database.
Get ID data from the Biomart database using attributes.
Plot the bar plot for feature importance.
Observe the distribution of the incorrect predictions on the test set.
Get max length of ID data.
Plot correlation heatmap.
Plot heatmap for result confusion matrix.
Predict new data with a trained learner.
Compare classification models with small samples.
Cut the string of ID column character by character and divide it into ...
Split the string into individual characters and complete the character...
Convert data to numeric, and for the ID column convert with fixed leve...
Train a three-layer neural network model.
Random Forest Model Training.
Classification tree model training.
XGBoost model training.
Tune the Random Forest model by hyperband.
Tune the Decision Tree model by hyperband.
Tune the XGBoost model by hyperband.
Predict with four models and unify results by the sub-model's specific...
A wrapper function that executes MantaID workflow.
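The preprocessing entries above (get the max length of the ID data, split each ID into individual characters, complete it to a common length, and convert characters to numbers with fixed levels) can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the package's R implementation. The function names, the `*` padding symbol, the alphabet, and the sample IDs are all assumptions made for this example.

```python
import string
from typing import Dict, List

PAD = "*"  # hypothetical padding symbol chosen for this sketch

# Hypothetical fixed alphabet; the real level set used by the package may differ.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + "._-:"


def max_id_length(ids: List[str]) -> int:
    """Return the length of the longest ID string."""
    return max(len(s) for s in ids)


def split_and_pad(ids: List[str], width: int, pad: str = PAD) -> List[List[str]]:
    """Split each ID into characters and pad on the right up to `width`."""
    rows = []
    for s in ids:
        chars = list(s[:width])                 # truncate IDs longer than width
        chars += [pad] * (width - len(chars))   # fill the remaining positions
        rows.append(chars)
    return rows


def build_levels(alphabet: str = ALPHABET, pad: str = PAD) -> Dict[str, int]:
    """Map the pad symbol to 0 and each alphabet character to a fixed integer."""
    levels = {pad: 0}
    for ch in alphabet:
        levels.setdefault(ch, len(levels))
    return levels


def encode(rows: List[List[str]], levels: Dict[str, int]) -> List[List[int]]:
    """Replace characters by their level codes; unknown characters fall back to 0."""
    return [[levels.get(ch, 0) for ch in row] for row in rows]


# Made-up sample input: three ID strings of different lengths.
ids = ["ENSG00000139618", "P38398", "7157"]
width = max_id_length(ids)
rows = split_and_pad(ids, width)
codes = encode(rows, build_levels())
```

Because the levels are fixed in advance rather than derived from each data set, the same character always maps to the same integer, so a model trained on one table can be applied to another.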
The number of biological databases is growing rapidly, but different databases use different IDs to refer to the same biological entity. This inconsistency in IDs impedes the integration of various types of biological data. To resolve the problem, we developed 'MantaID', a data-driven, machine-learning based approach that automates the identification of IDs on a large scale. The 'MantaID' model's prediction accuracy was shown to be 99%, and it correctly and effectively predicted 100,000 ID entries within two minutes. 'MantaID' supports the discovery and exploitation of ID patterns from a large number of databases (e.g., up to 542 biological databases). An easy-to-use, freely available, open-source R package, a user-friendly web application, and an API were also developed for 'MantaID' to improve applicability. To our knowledge, 'MantaID' is the first tool that enables automatic, quick, accurate, and comprehensive identification of large quantities of IDs, and it can therefore be used as a starting point to facilitate the complex assimilation and aggregation of biological data across diverse databases.
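For readers unfamiliar with how a prediction accuracy is read off a confusion matrix, the following minimal Python sketch may help. It is not the package's code. The class labels and the toy predictions are invented for illustration, and they do not reproduce any reported result.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def confusion_matrix(truth: Sequence[str],
                     predicted: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Return sorted labels and a matrix with rows = truth, columns = prediction."""
    labels = sorted(set(truth) | set(predicted))
    counts = Counter(zip(truth, predicted))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy data (hypothetical): six IDs with their true and predicted source databases.
truth = ["ensembl", "ensembl", "uniprot", "uniprot", "entrez", "entrez"]
predicted = ["ensembl", "ensembl", "uniprot", "entrez", "entrez", "entrez"]

labels, matrix = confusion_matrix(truth, predicted)
acc = accuracy(matrix)
```

Off-diagonal cells show which database classes are confused with each other, which is the information a confusion-matrix heatmap visualizes.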