Term extraction tool from textual fields of a manuscript
It extracts terms from a text field (abstract, title, author's keywords, etc.) of a bibliographic data frame.
termExtraction(
  M,
  Field = "TI",
  ngrams = 1,
  stemming = FALSE,
  language = "english",
  remove.numbers = TRUE,
  remove.terms = NULL,
  keep.terms = NULL,
  synonyms = NULL,
  verbose = TRUE
)
Arguments
M: is a data frame obtained with the conversion function convert2df. Its rows correspond to articles and its columns to the field tags of the original WoS or SCOPUS file.
Field: is a character object. It indicates the field tag of the textual data:
"TI": manuscript title
"AB": manuscript abstract
"ID": manuscript Keywords Plus
"DE": manuscript author's keywords
The default is Field = "TI".
ngrams: is an integer. It sets the size of the n-grams to extract from the texts, where an n-gram is a contiguous sequence of n terms. This description gives the range as 1 to 3 but also mentions n-grams composed of 4 terms, so values from 1 to 3 are the ones covered by both statements. The default is ngrams = 1.
stemming: is logical. If TRUE the Porter Stemming algorithm is applied to all extracted terms. The default is stemming = FALSE.
language: is a character string. It is the language of the textual content ("english", "german", "italian", "french", "spanish"). The default is language = "english".
remove.numbers: is logical. If TRUE all numbers are deleted from the documents before term extraction. The default is remove.numbers = TRUE.
remove.terms: is a character vector. It contains a list of additional terms to delete from the corpus after term extraction. The default is remove.terms = NULL.
keep.terms: is a character vector. It contains a list of compound words (formed by two or more terms) to keep in their original form during term extraction. The default is keep.terms = NULL.
synonyms: is a character vector. Each element contains a list of synonyms, separated by ";", that will be merged into a single term (the first term listed in that element). The default is synonyms = NULL.
verbose: is logical. If TRUE the function prints the most frequent terms extracted from documents. The default is verbose=TRUE.
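The ngrams and synonyms arguments are easier to picture with a small base R sketch. The helpers below, make_ngrams() and merge_synonyms(), are hypothetical and are not part of bibliometrix. They only mimic the behaviour described above (contiguous n-term sequences, and replacement of each synonym by the first term of its ";"-separated group) and ignore stemming, number removal and any other cleaning the real function applies.

```r
# Illustrative helpers only; these are NOT bibliometrix functions.

# Build all contiguous n-grams from a single string.
make_ngrams <- function(text, n = 1) {
  # Lower-case, then split on anything that is not a letter, digit or hyphen
  tokens <- strsplit(tolower(text), "[^a-z0-9-]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Replace every synonym with the first term of its ";"-separated group.
merge_synonyms <- function(terms, synonyms) {
  for (group in synonyms) {
    parts <- trimws(strsplit(group, ";", fixed = TRUE)[[1]])
    terms[terms %in% parts[-1]] <- parts[1]
  }
  terms
}

title <- "Bibliographic coupling and citation analysis"
unigrams <- make_ngrams(title, n = 1)
bigrams <- make_ngrams(title, n = 2)

merged <- merge_synonyms(
  c("citation", "citation analysis", "h-index", "impact factor"),
  c("citation; citation analysis", "h-index; index; impact factor")
)

print(bigrams)
print(merged)
```

In this sketch a five-word title yields five unigrams and four bigrams, and every term listed after the first in a synonym group is mapped back to that first term.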
Returns
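The bibliographic data frame with a new column containing the terms extracted from the field indicated by the argument Field. In the examples, this column is named after the field tag followed by _TM (for instance TI_TM or AB_TM).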
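As a rough illustration of how the new column might be used downstream, the sketch below tabulates term frequencies from a toy data frame. The data frame, its TI_TM values, the upper-case terms and the ";" separator between terms are all assumptions made for this example; check the actual output of termExtraction before relying on them.

```r
# Toy data frame standing in for the output of termExtraction().
# The TI_TM values and the ";" separator are assumptions for illustration.
M <- data.frame(
  TI = c("Co-citation analysis of journals", "Citation analysis and journals"),
  TI_TM = c("CO-CITATION ANALYSIS;JOURNALS", "CITATION;ANALYSIS;JOURNALS"),
  stringsAsFactors = FALSE
)

# Split each document's term string into a character vector of terms
terms_per_doc <- strsplit(M$TI_TM, ";", fixed = TRUE)

# Count how often each term occurs across all documents
term_freq <- sort(table(unlist(terms_per_doc)), decreasing = TRUE)

print(term_freq)
```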
Examples
library(bibliometrix)

# Example 1: Term extraction from titles
data(scientometrics, package = "bibliometrixData")

# vector of compound words
keep.terms <- c("co-citation analysis", "bibliographic coupling")

# term extraction
scientometrics <- termExtraction(scientometrics, Field = "TI", ngrams = 1,
  remove.numbers = TRUE, remove.terms = NULL, keep.terms = keep.terms, verbose = TRUE)

# terms extracted from the first 10 titles
scientometrics$TI_TM[1:10]

# Example 2: Term extraction from abstracts
data(scientometrics, package = "bibliometrixData")

# term extraction
scientometrics <- termExtraction(scientometrics, Field = "AB", ngrams = 2, stemming = TRUE,
  language = "english", remove.numbers = TRUE, remove.terms = NULL, keep.terms = NULL, verbose = TRUE)

# terms extracted from the first abstract
scientometrics$AB_TM[1]

# Example 3: Term extraction from keywords with synonyms
data(scientometrics, package = "bibliometrixData")

# vector of synonyms
synonyms <- c("citation; citation analysis", "h-index; index; impact factor")

# term extraction
scientometrics <- termExtraction(scientometrics, Field = "ID", ngrams = 1,
  synonyms = synonyms, verbose = TRUE)
See Also
convert2df to import and convert a WoS or SCOPUS export file into a bibliographic data frame.