Big Data Preprocessing Architecture
Class to convert the data field of an Instance to lower case
Class to find and/or replace the abbreviations on the data field of an...
Write messages to the log at a given priority level using the custom b...
Object to handle the keys/attributes/options common to all pipeline fl...
Class to manage the preprocessing of the files throughout the flow of pip...
Class to manage the connections with YouTube
Class to find and/or replace the contractions on the data field of an I...
Class implementing a default pipelining process.
Class implementing a dynamic pipelining process
Class to handle email files with eml extension
Class to handle the creation of Instance types
Class to handle SMS files with tsms extension
Class to handle YouTube comment files with ytbid extension
Class to obtain the source field of an Instance
Class to find and/or replace the emoji on the data field of an Instanc...
Class to find and/or remove the emoticons on the data field of an Inst...
Class to find and/or remove the hashtags on the data field of an Insta...
Class to find and/or remove the URLs on the data field of an Instance
Class to find and/or remove the users on the data field of an Instance
Abstract super class that handles the management of the Pipes
Abstract super class implementing the pipelining process
Class to obtain the date field of an Instance
Class to guess the language of an Instance
Abstract super class that handles the management of the Instances
Class to find and/or remove the interjections on the data field of an ...
Class to obtain the length of the data field of an Instance
bdpar customized forward-pipe operator
Class that handles different types of resources
Initiates the pipelining process
Class to find and/or replace the slangs on the data field of an Instan...
Class to find and/or remove the stop words on the data field of an Ins...
Class to get the file's extension field of an Instance
Class to get the target field of the Instance
Class to handle a CSV with the properties field of the preprocessed In...
Provides a tool to easily build customized data flows to pre-process large volumes of information from different sources. To this end, 'bdpar' allows users to (i) easily use and create new functionalities and (ii) develop new data source extractors according to their needs. Additionally, the package provides a predefined data flow by default to extract and pre-process the most relevant information (tokens, dates, ...) from some textual sources (SMS, email, YouTube comments).
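
The idea of a data flow built from small processing steps chained with a forward-pipe operator can be illustrated with a minimal base R sketch. Everything in it is an assumption made for illustration: `new_instance`, `to_lower_pipe`, `length_pipe` and the `%then%` operator are hypothetical names and are not part of the 'bdpar' API, whose classes and operator should be looked up in the corresponding help topics.

```r
# Hypothetical sketch: none of these names belong to the 'bdpar' API.
# An "instance" is modelled as a plain list with a data field and properties.
new_instance <- function(data) list(data = data, properties = list())

# Each "pipe" is a function that takes an instance and returns a modified one.
to_lower_pipe <- function(instance) {
  instance$data <- tolower(instance$data)
  instance
}

length_pipe <- function(instance) {
  instance$properties$length <- nchar(instance$data)
  instance
}

# A minimal forward-pipe operator: hand the left-hand value to the
# right-hand function. Custom %op% operators are left-associative in R,
# so the steps run from left to right.
`%then%` <- function(lhs, rhs) rhs(lhs)

result <- new_instance("Hello WORLD") %then% to_lower_pipe %then% length_pipe
```

After the flow runs, `result$data` holds the lowercased text and `result$properties$length` holds its character count. Keeping each step as a function from instance to instance is what makes it easy to reorder steps or add new ones.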