Detects the top topics in a group of text documents.
This function returns the top detected topics for a list of submitted text documents. A topic is identified by a key phrase, which can be one or more related words. At least 100 text documents must be submitted; however, this API is designed to detect topics across hundreds to thousands of documents. For best performance, limit each document to a short, human-written text paragraph such as a review, a conversation, or user feedback.
English is the only language supported at this time.
You can provide a list of stop words to control which words are ignored during topic detection. You can also supply a list of topics to exclude from the response. Finally, you can provide minimum and maximum thresholds on the number of documents a word occurs in, to exclude rare or ubiquitous document topics.
We recommend using the textaDetectTopics function in synchronous mode, in which case it returns only after topic detection has completed. If you decide to call this function in asynchronous mode, you will need to call the textaDetectTopicsStatus function periodically yourself until the Microsoft Cognitive Services servers complete topic detection and results become available.
IMPORTANT NOTE: If you're calling ‘textaDetectTopics’ in synchronous mode within the R console REPL (interactive mode), it will appear as if the console has hung. This is EXPECTED. The function hasn't crashed. It is simply in "sleep mode", activating itself periodically and then going back to sleep, until the results have become available. In sleep mode, even though it appears "stuck", ‘textaDetectTopics’ doesn't use any CPU resources. While the function is operating in sleep mode, you WILL NOT be able to use the console until the function completes. If you need to operate the console while topic detection is being performed by the Microsoft Cognitive Services servers, you should call ‘textaDetectTopics’ in asynchronous mode and then call ‘textaDetectTopicsStatus’ yourself repeatedly afterwards, until results are available.
Note that one transaction is charged per text document submitted.
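The polling behavior described above can be sketched in base R. This is only an illustration of the sleep-and-poll pattern, not the package's implementation: poll_until_done, make_fake_status and the "Running" and "Failed" status strings are hypothetical, and the stub stands in for a real call such as textaDetectTopicsStatus.

```r
# Generic sleep-and-poll loop. `check_status` is a hypothetical stand-in for a
# status call; it must return a list with a `status` field.
poll_until_done <- function(check_status, poll_interval = 30, timeout = 1200,
                            done = c("Succeeded", "Failed")) {
  started <- Sys.time()
  repeat {
    result <- check_status()
    if (result$status %in% done) return(result)
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    if (elapsed >= timeout) return(result)  # give up, return the latest state
    Sys.sleep(poll_interval)                # idle between polls
  }
}

# Stub that simulates a server finishing on the n-th poll (illustration only)
make_fake_status <- function(finish_after = 3L) {
  calls <- 0L
  function() {
    calls <<- calls + 1L
    list(status = if (calls >= finish_after) "Succeeded" else "Running",
         polls = calls)
  }
}

final <- poll_until_done(make_fake_status(3L), poll_interval = 0.01, timeout = 5)
final$status
#> [1] "Succeeded"
```

In asynchronous use you would run a loop like this yourself, or simply check the status by hand from time to time, which keeps the console free between checks.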
documents: (character vector) Vector of sentences or documents on which to perform topic detection. At least 100 text documents must be submitted. English is the only language supported at this time.
stopWords: (character vector) Vector of stop words to ignore while performing topic detection (optional)
topicsToExclude: (character vector) Vector of topics to exclude from the response (optional)
minDocumentsPerWord: (integer) Words that occur in fewer than this many documents are ignored. Use this parameter to help exclude rare document topics. Omit to let the service choose an appropriate value. (optional)
maxDocumentsPerWord: (integer) Words that occur in more than this many documents are ignored. Use this parameter to help exclude ubiquitous document topics. Omit to let the service choose an appropriate value. (optional)
resultsPollInterval: (integer) Interval (in seconds) at which this function will query the Microsoft Cognitive Services servers for results (optional, default: 30L). If set to 0L, this function will return immediately and you will have to call textaDetectTopicsStatus periodically to collect results. If set to a non-zero integer value, this function will only return after all results have been collected. It does so by repeatedly calling textaDetectTopicsStatus on its own until topic detection has completed. In the latter case, you do not need to call textaDetectTopicsStatus.
resultsTimeout: (integer) Time (in seconds) after which this function will give up and stop querying the Microsoft Cognitive Services servers for results (optional, default: 1200L). As soon as all results are available, this function will return them to the caller. If the Microsoft Cognitive Services servers do not complete topic detection within resultsTimeout seconds, this function will stop polling the servers and return the most current results.
verbose: (logical) If set to TRUE, print every poll status to stdout.
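To build intuition for minDocumentsPerWord and maxDocumentsPerWord, the following sketch counts, locally, how many documents each word appears in and keeps only the words inside the given bounds. It is an approximation for illustration: the helper names are hypothetical, and the tokenization used here (lowercasing, splitting on non-letters) is an assumption, not the service's actual behavior.

```r
# For each word, count the number of documents that contain it at least once.
document_frequency <- function(documents) {
  words_per_doc <- lapply(documents, function(doc) {
    tokens <- strsplit(tolower(doc), "[^a-z]+")[[1]]
    unique(tokens[nzchar(tokens)])
  })
  tab <- table(unlist(words_per_doc))
  setNames(as.integer(tab), names(tab))
}

# Keep only words whose document count lies within [min_docs, max_docs].
filter_by_document_frequency <- function(documents, min_docs = 1L,
                                         max_docs = Inf) {
  counts <- document_frequency(documents)
  names(counts)[counts >= min_docs & counts <= max_docs]
}

docs <- c("The noodle soup was great",
          "Great tofu and great noodle soup",
          "The decor was dated",
          "The tofu was bland")

# "the" and "was" occur in 3 documents (too common); "decor" in 1 (too rare)
kept <- filter_by_document_frequency(docs, min_docs = 2L, max_docs = 2L)
sort(kept)
#> [1] "great"  "noodle" "soup"   "tofu"
```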
Returns
An S3 object of the class textatopics. The results are stored in the results dataframes inside this object. See textatopics for details. In the synchronous case (i.e., the function only returns after completion), the dataframes contain the documents, the topics, and which topics are assigned to which documents. In the asynchronous case (i.e., the function returns immediately), the dataframes contain the documents, their unique identifiers, and their current operation status code, but they don't contain the topics or their assignments yet. To get the topics and their assignments, you must call textaDetectTopicsStatus until the Microsoft Cognitive Services servers have completed topic detection.
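Once a synchronous call has completed, the three dataframes can be combined to see which key phrase was assigned to which document. The sketch below uses a small mock list in place of a real textatopics object. The column names (id, text, topicId, documentId, distance) are assumptions made for illustration; only keyPhrase and score appear in the printed output of the example on this page, so check the actual columns with str() before relying on them.

```r
# Mock object shaped like the synchronous result described above.
# Column names other than keyPhrase and score are assumed for illustration.
topics <- list(
  status = "Succeeded",
  documents = data.frame(
    id = c("d1", "d2", "d3"),
    text = c("Great noodle soup", "Tofu was bland", "Noodle soup again"),
    stringsAsFactors = FALSE),
  topics = data.frame(
    id = c("t1", "t2"),
    keyPhrase = c("noodle soup", "tofu"),
    score = c(2L, 1L),
    stringsAsFactors = FALSE),
  topicAssignments = data.frame(
    topicId = c("t1", "t2", "t1"),
    documentId = c("d1", "d2", "d3"),
    distance = c(0.1, 0.2, 0.15),
    stringsAsFactors = FALSE)
)

# Attach each assignment's key phrase, then the text of its document.
assigned <- merge(topics$topicAssignments, topics$topics,
                  by.x = "topicId", by.y = "id")
assigned <- merge(assigned, topics$documents,
                  by.x = "documentId", by.y = "id")

# One row per (document, topic) assignment
assigned[, c("documentId", "keyPhrase", "text")]
```

In the asynchronous case the topics and topicAssignments dataframes are not populated yet, so a join like this only makes sense after the status call reports completion.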
Examples
## Not run: 
load("./data/yelpChineseRestaurantReviews.rda")
set.seed(1234)
documents <- sample(yelpChReviews$text, 1000)

tryCatch({

  # Detect top topics in group of documents
  topics <- textaDetectTopics(
    documents,                  # At least 100 documents (English only)
    stopWords = NULL,           # Stop word list (optional)
    topicsToExclude = NULL,     # Topics to exclude (optional)
    minDocumentsPerWord = NULL, # Threshold to exclude rare topics (optional)
    maxDocumentsPerWord = NULL, # Threshold to exclude ubiquitous topics (optional)
    resultsPollInterval = 30L,  # Poll interval (in s, default: 30s, use 0L for async)
    resultsTimeout = 1200L,     # Give up timeout (in s, default: 1200s = 20mn)
    verbose = TRUE              # If set to TRUE, print every poll status to stdout
  )

  # Class and structure of topics
  class(topics)
  #> [1] "textatopics"

  str(topics, max.level = 1)
  #> List of 8
  #>  $ status          : chr "Succeeded"
  #>  $ operationId     : chr "30334a3e1e28406a80566bb76ff04884"
  #>  $ operationType   : chr "topics"
  #>  $ documents       :'data.frame': 1000 obs. of 2 variables:
  #>  $ topics          :'data.frame': 71 obs. of 3 variables:
  #>  $ topicAssignments:'data.frame': 502 obs. of 3 variables:
  #>  $ json            : chr "{\"status\":\"Succeeded\",\"createdDateTime\": __truncated__ }
  #>  $ request         :List of 7
  #>   ..- attr(*, "class")= chr "request"
  #>  - attr(*, "class")= chr "textatopics"

  # Print results
  topics
  #> textatopics [https://westus.api.cognitive.microsoft.com/text/analytics/ __truncated__ ]
  #> status: Succeeded
  #> operationId: 30334a3e1e28406a80566bb76ff04884
  #> operationType: topics
  #> topics (first 20):
  #> ------------------------
  #>    keyPhrase      score
  #> ---------------- -------
  #> portions           35
  #> noodle soup        30
  #> vegetables         20
  #> tofu               19
  #> garlic             17
  #> Eggplant           15
  #> Pad                15
  #> combo              13
  #> Beef Noodle Soup   13
  #> House              12
  #> entree             12
  #> wontons            12
  #> Pei Wei            12
  #> mongolian beef     11
  #> crab               11
  #> Panda              11
  #> bean               10
  #> dumplings           9
  #> veggies             9
  #> decor               9
  #> ------------------------

}, error = function(err) {

  # Print error
  geterrmessage()

})
## End(Not run)