How to apply NLTK word_tokenize library on a Pandas dataframe for Twitter data?

Solution 1:

In short:

from nltk.tokenize import word_tokenize

df['Text'].apply(word_tokenize)

Or if you want to add another column to store the tokenized list of strings:

df['tokenized_text'] = df['Text'].apply(word_tokenize) 
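
Putting that together as a runnable sketch (the column name Text and the sample tweets are placeholders for your own Twitter data, and this assumes the NLTK punkt model is available):

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # word_tokenize needs the punkt models; download once

# hypothetical sample data; substitute your own Twitter dataframe
df = pd.DataFrame({'Text': ['I love NLTK!', 'Tokenizing tweets with pandas.']})
df['tokenized_text'] = df['Text'].apply(word_tokenize)
print(df['tokenized_text'])
# 0                       [I, love, NLTK, !]
# 1    [Tokenizing, tweets, with, pandas, .]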

There are tokenizers written specifically for Twitter text; see http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual

To use nltk.tokenize.TweetTokenizer:

from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
df['Text'].apply(tt.tokenize)
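
TweetTokenizer keeps Twitter-specific tokens such as @mentions, #hashtags, emoticons, and URLs as single tokens, where word_tokenize would split them apart. A minimal sketch (the sample tweet is made up; reduce_len=True is an optional flag that shortens runs of repeated characters):

import pandas as pd
from nltk.tokenize import TweetTokenizer

# hypothetical sample tweet
df = pd.DataFrame({'Text': ['@user Loooove #NLTK :) http://nltk.org']})

tt = TweetTokenizer(reduce_len=True)  # 'Loooove' -> 'Looove'
df['tokens'] = df['Text'].apply(tt.tokenize)
print(df['tokens'][0])
# ['@user', 'Looove', '#NLTK', ':)', 'http://nltk.org']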

Similar questions:

  • How to apply pos_tag_sents() to pandas dataframe efficiently

  • how to use word_tokenize in data frame

  • Tokenizing words into a new column in a pandas dataframe

  • Run nltk sent_tokenize through Pandas dataframe

  • Python text processing: NLTK and pandas