tokenize function

Tokenize raw text for training word embeddings.