weblmBreakIntoWords function

Breaks a string of concatenated words into individual words

This function inserts spaces into a string of words lacking spaces, like a hashtag or part of a URL. Punctuation or exotic characters can prevent a string from being broken, so it's best to limit input strings to lower-case, alpha-numeric characters. The input string must be in ASCII format.
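One way to follow this advice is to normalize the input before calling the function. The sketch below is only an illustration: `sanitizeForBreak` is a hypothetical helper, not part of this package, and it assumes you want to drop every character other than lower-case ASCII letters and digits, including spaces.

```r
# Hypothetical helper (not part of the package): reduce a string to
# lower-case ASCII letters and digits before word breaking.
sanitizeForBreak <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  x <- tolower(x)
  # perl = TRUE keeps [a-z] restricted to the ASCII range regardless of locale;
  # everything else (punctuation, spaces, non-ASCII characters) is removed.
  gsub("[^a-z0-9]", "", x, perl = TRUE)
}

sanitizeForBreak("#TestForWordBreak!")
#> [1] "testforwordbreak"
```

The cleaned string can then be passed as `textToBreak`. If existing spaces should act as hard breaks, keep them by adding a space to the character class.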

Internally, this function invokes the Microsoft Cognitive Services Web Language Model REST API documented at https://www.microsoft.com/cognitive-services/en-us/web-language-model-api/documentation.

You MUST have a valid Microsoft Cognitive Services account and an API key for this function to work properly. See https://www.microsoft.com/cognitive-services/en-us/pricing for details.

Usage

weblmBreakIntoWords(textToBreak, modelToUse = "body", orderOfNgram = 5L, maxNumOfCandidatesReturned = 5L)

Arguments

  • textToBreak: (character) Line of text to break into words. If spaces are present, they will be interpreted as hard breaks and maintained, except for leading or trailing spaces, which will be trimmed. Must be in ASCII format.
  • modelToUse: (character) Which language model to use, supported values: "title", "anchor", "query", or "body" (optional, default: "body")
  • orderOfNgram: (integer) Which order of N-gram to use, supported values: 1L, 2L, 3L, 4L, or 5L (optional, default: 5L)
  • maxNumOfCandidatesReturned: (integer) Maximum number of candidates to return (optional, default: 5L)
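The constraints listed above can be checked locally before a request is sent. The following is a minimal sketch, not the package's own validation: `checkBreakArgs` is a hypothetical helper, and the requirement that `maxNumOfCandidatesReturned` be at least 1 is an assumption, since this page does not state a lower bound.

```r
# Hypothetical pre-flight check (not part of the package) that mirrors the
# documented argument constraints.
checkBreakArgs <- function(textToBreak,
                           modelToUse = "body",
                           orderOfNgram = 5L,
                           maxNumOfCandidatesReturned = 5L) {
  stopifnot(is.character(textToBreak), length(textToBreak) == 1L,
            !is.na(textToBreak))

  # The input must be ASCII: every code point has to be 127 or below.
  if (any(utf8ToInt(enc2utf8(textToBreak)) > 127L)) {
    stop("textToBreak must contain ASCII characters only")
  }

  # match.arg() raises an error for anything outside the supported models.
  modelToUse <- match.arg(modelToUse, c("title", "anchor", "query", "body"))

  stopifnot(is.numeric(orderOfNgram), length(orderOfNgram) == 1L,
            orderOfNgram %in% 1:5)

  # Assumption: at least one candidate must be requested.
  stopifnot(is.numeric(maxNumOfCandidatesReturned),
            length(maxNumOfCandidatesReturned) == 1L,
            maxNumOfCandidatesReturned >= 1)

  invisible(list(
    textToBreak = textToBreak,
    modelToUse = modelToUse,
    orderOfNgram = as.integer(orderOfNgram),
    maxNumOfCandidatesReturned = as.integer(maxNumOfCandidatesReturned)
  ))
}

args <- checkBreakArgs("testforwordbreak", modelToUse = "query", orderOfNgram = 3L)
args$modelToUse
#> [1] "query"
```

Failing early like this avoids spending an API call on a request that the service would reject.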

Returns

An S3 object of class weblm. The results are stored in the results data frame inside this object. Each row of the data frame holds one candidate breakdown (words) and its log-probability (probability).
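A short sketch of how the results data frame might be consumed is given below. It makes no API call: the data frame is a hand-built stand-in for `textWords$results`, using the column names and a few of the values shown in the example output on this page.

```r
# Stand-in for textWords$results (values copied from the example output;
# no request is sent to the service here).
results <- data.frame(
  words = c("test for word break", "test for wordbreak", "testfor word break"),
  probability = c(-13.83, -14.63, -15.94),
  stringsAsFactors = FALSE
)

# Log-probabilities are negative, so the largest value (closest to zero)
# marks the most likely candidate.
best <- results[which.max(results$probability), ]
best$words
#> [1] "test for word break"

# Split the winning candidate into individual words.
strsplit(best$words, " ", fixed = TRUE)[[1]]
#> [1] "test"  "for"   "word"  "break"
```

Using `which.max()` rather than taking the first row avoids relying on the order in which the candidates are returned.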

Examples

## Not run:
tryCatch({

  # Break a sentence into words
  textWords <- weblmBreakIntoWords(
    textToBreak = "testforwordbreak", # ASCII only
    modelToUse = "body",              # "title"|"anchor"|"query"|"body"(default)
    orderOfNgram = 5L,                # 1L|2L|3L|4L|5L(default)
    maxNumOfCandidatesReturned = 5L   # Default: 5L
  )

  # Class and structure of textWords
  class(textWords)
  #> [1] "weblm"

  str(textWords, max.level = 1)
  #> List of 3
  #>  $ results:'data.frame': 5 obs. of 2 variables:
  #>  $ json   : chr "{"candidates":[{"words":"test for word break", __truncated__ }]}
  #>  $ request:List of 7
  #>   ..- attr(*, "class")= chr "request"
  #>  - attr(*, "class")= chr "weblm"

  # Print results (pandoc.table() comes from the pander package)
  pandoc.table(textWords$results)
  #> ---------------------------------
  #>        words          probability
  #> ------------------- -------------
  #> test for word break    -13.83
  #>
  #> test for wordbreak     -14.63
  #>
  #> testfor word break     -15.94
  #>
  #> test forword break     -16.72
  #>
  #> testfor wordbreak      -17.41
  #> ---------------------------------

}, error = function(err) {

  # Print error
  geterrmessage()

})
## End(Not run)

Author(s)

Phil Ferriere pferriere@hotmail.com

  • Maintainer: Phil Ferriere
  • License: MIT + file LICENSE
  • Last published: 2016-06-15