MantaID 1.0.4 package

A Machine-Learning Based Tool to Automate the Identification of Biological Database IDs

mi_balance_data

Data balance. Most classes adopt random undersampling, while a few cla...

mi_clean_data

Reshape data and delete meaningless rows.

mi_filter_feat

Performing feature selection in an automatic way based on correlation a...

mi_get_confusion

Compute the confusion matrix for the predicted result.

mi_get_ID_attr

Get ID attributes from the Biomart database.

mi_get_ID

Get ID data from the Biomart database using attributes.

mi_get_importance

Plot the bar plot for feature importance.

mi_get_miss

Observe the distribution of the false response of the test set.

mi_get_padlen

Get max length of ID data.

mi_plot_cor

Plot correlation heatmap.

mi_plot_heatmap

Plot heatmap for result confusion matrix.

mi_predict_new

Predict new data with a trained learner.

mi_run_bmr

Compare classification models with small samples.

mi_split_col

Cut the string of ID column character by character and divide it into ...

mi_split_str

Split the string into individual characters and complete the character...

mi_to_numer

Convert data to numeric, and for the ID column convert with fixed leve...

mi_train_BP

Train a three-layer neural network model.

mi_train_rg

Random Forest Model Training.

mi_train_rp

Classification tree model training.

mi_train_xgb

Xgboost model training.

mi_tune_rg

Tune the Random Forest model by hyperband.

mi_tune_rp

Tune the Decision Tree model by hyperband.

mi_tune_xgb

Tune the Xgboost model by hyperband.

mi_unify_mod

Predict with four models and unify results by the sub-model's specific...

mi

A wrapper function that executes MantaID workflow.

The number of biological databases is growing rapidly, but different databases use different IDs to refer to the same biological entity. This inconsistency in IDs impedes the integration of various types of biological data. To resolve the problem, we developed 'MantaID', a data-driven, machine-learning-based approach that automates the identification of IDs on a large scale. The 'MantaID' model's prediction accuracy was shown to be 99%, and it correctly and effectively predicted 100,000 ID entries within two minutes. 'MantaID' supports the discovery and exploitation of ID patterns from a large number of databases (e.g., up to 542 biological databases). An easy-to-use, freely available, open-source R package, a user-friendly web application, and an API were also developed for 'MantaID' to improve applicability. To our knowledge, 'MantaID' is the first tool that enables automatic, quick, accurate, and comprehensive identification of large quantities of IDs, and it can therefore be used as a starting point to facilitate the complex assimilation and aggregation of biological data across diverse databases.
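
The index entries for mi_get_padlen, mi_split_str, and mi_to_numer describe turning ID strings into fixed-length numeric features. The following is a minimal sketch of that idea in base R only. It does not call any 'MantaID' function, and the toy IDs, the "*" padding symbol, the level set, and the helper name split_pad are all assumptions made for illustration; the package's own functions may differ in signature and behavior.

```r
# Hypothetical toy IDs; not taken from any real database.
ids <- c("ENSG0001", "P12345", "Q9Y2")

# Step 1: the padding length is the length of the longest ID.
pad_len <- max(nchar(ids))

# Step 2: split an ID into single characters and right-pad it so that
# every ID yields exactly pad_len positions.
# The "*" placeholder is an assumption of this sketch.
split_pad <- function(id, pad_len, pad = "*") {
  chars <- strsplit(id, "")[[1]]
  c(chars, rep(pad, pad_len - length(chars)))
}

# One row per ID, one column per character position.
char_mat <- t(vapply(ids, split_pad, character(pad_len), pad_len = pad_len))

# Step 3: map characters to integers using one fixed level set, so the
# same character always receives the same code in every data set.
# This level set (placeholder, digits, letters) is an assumption.
levels_fixed <- c("*", 0:9, LETTERS, letters)
num_mat <- matrix(
  as.integer(factor(char_mat, levels = levels_fixed)),
  nrow = nrow(char_mat)
)

print(num_mat)
```

Fixing the levels up front, rather than letting factor() infer them from each data set, keeps the encoding stable between training data and new data. Any character outside the level set becomes NA, which makes unexpected symbols easy to spot.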