MN.Fdtf.Feature() R function from [EncDNA]

Sequence encoding with nucleotide frequency difference between two classes of sequence datasets.

In this encoding procedure, the frequency of each nucleotide at each position is first computed separately for the positive and negative class datasets. The frequency matrix of the positive set is then subtracted from that of the negative set, and each sequence is encoded as a numeric vector by replacing the nucleotide at each position with the corresponding entry of this difference matrix. Both positive and negative datasets are therefore required for encoding. This concept was introduced by Huang et al. (2006) and was also used by Pashaei et al. (2016), together with other features, to predict splice sites. It is similar to Bayes kernel encoding (Zhang et al., 2006), in which both frequency matrices, rather than their difference, are used for encoding.
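
The procedure described above can be sketched in base R on a toy example. This is only an illustration under stated assumptions, not the package's implementation: the helper names position_freq and encode_fdtf are hypothetical, sequences are plain character strings of equal length containing only A, C, G and T, frequencies are taken as relative frequencies, and the sign convention (negative minus positive) follows the wording of the description.

```r
# Hypothetical helper: relative frequency of A, C, G, T at each position.
# Assumes equal-length sequences containing only A, C, G, T.
position_freq <- function(seqs) {
  chars <- do.call(rbind, strsplit(seqs, ""))   # sequences x positions
  nucs <- c("A", "C", "G", "T")
  freq <- vapply(nucs, function(b) colMeans(chars == b),
                 numeric(ncol(chars)))          # positions x 4
  t(freq)                                       # 4 x positions, rows named by nucleotide
}

# Hypothetical helper: encode sequences through the difference matrix.
encode_fdtf <- function(pos, neg, test) {
  # Assumption: positive frequencies are subtracted from negative ones
  diff_mat <- position_freq(neg) - position_freq(pos)
  chars <- do.call(rbind, strsplit(test, ""))
  enc <- matrix(NA_real_, nrow(chars), ncol(chars))
  for (j in seq_len(ncol(chars))) {
    # Look up the difference value of the observed nucleotide at position j
    enc[, j] <- diff_mat[chars[, j], j]
  }
  enc
}

# Made-up toy data, for illustration only
pos  <- c("ACGT", "ACGA", "ACTT")
neg  <- c("TGCA", "AGCA", "TGTA")
test <- c("ACGT", "TGCA")

enc <- encode_fdtf(pos, neg, test)
dim(enc)   # one row per test sequence, one column per position
```

Each row of enc corresponds to one test sequence and each column to one position, which matches the shape of the matrix the function is documented to return.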


MN.Fdtf.Feature(positive_class, negative_class, test_seq)

Arguments

positive_class: Sequence dataset of the positive class, must be an object of class DNAStringSet.
negative_class: Sequence dataset of the negative class, must be an object of class DNAStringSet.
test_seq: Sequences to be encoded into numeric vectors, must be an object of class DNAStringSet.

Details

To obtain an object of class DNAStringSet, read the sequence dataset in FASTA format with the function readDNAStringSet, available in the Biostrings package of Bioconductor (https://bioconductor.org/packages/release/bioc/html/Biostrings.html).
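
As a minimal sketch of this step, the following writes a small FASTA file to a temporary location and reads it back with readDNAStringSet. The sequences and the variable names fasta_path and seqs are made up for illustration, and the reading step runs only if the Biostrings package is installed.

```r
# Toy FASTA content (made-up sequences, for illustration only)
fasta_path <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ACGTACGT", ">seq2", "TTGACCAG"), fasta_path)

# Read the file into a DNAStringSet, if Biostrings is available
if (requireNamespace("Biostrings", quietly = TRUE)) {
  seqs <- Biostrings::readDNAStringSet(fasta_path, format = "fasta")
  print(class(seqs))
  print(length(seqs))
}
```

An object read this way can be passed as positive_class, negative_class or test_seq.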

Returns

A numeric matrix of order $m \times n$, where $m$ is the number of sequences in test_seq and $n$ is the sequence length.

References

Zhang, Y., Chu, C., Chen, Y., Zha, H. and Ji, X. (2006). Splice site prediction using support vector machines with a Bayes kernel. Expert Systems with Applications, 30: 73-81.
Huang, J., Li, T., Chen, K. and Wu, J. (2006). An approach of encoding for prediction of splice sites using SVM. Biochimie, 88(7): 923-929.
Pashaei, E., Yilmaz, A., Ozen, M. and Aydin, N. (2016). Prediction of splice site using AdaBoost with a new sequence encoding approach. In Systems, Man, and Cybernetics (SMC), IEEE International Conference, pp 3853-3858.

Author(s)

Prabina Kumar Meher, Indian Agricultural Statistics Research Institute, New Delhi-110012, INDIA

Note

This feature does not take into consideration the dependencies among nucleotides in the sequence.

Examples


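# Load the package and its example dataset
library(EncDNA)
data(droso)
positive <- droso$positive
negative <- droso$negative
test <- droso$test
# Use the first 200 sequences of each class to build the frequency matrices
pos <- positive[1:200]
neg <- negative[1:200]
tst <- test
# Encode the test sequences; enc has one row per sequence in tst
enc <- MN.Fdtf.Feature(positive_class=pos, negative_class=neg, test_seq=tst)
enc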

EncDNA package

Maintainer: Prabina Kumar Meher
License: GPL (>= 2)
Last published: 2019-05-28
