How to apply NLTK's word_tokenize to a Pandas DataFrame of Twitter data?
Solution 1:
In short:
from nltk.tokenize import word_tokenize
df['Text'].apply(word_tokenize)
Or, if you want to store the tokenized text in a new column as a list of strings:
df['tokenized_text'] = df['Text'].apply(word_tokenize)
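For context, here is a minimal end-to-end sketch of that pattern (the sample data and the 'Text' column name are assumptions based on the question):

import pandas as pd
from nltk.tokenize import word_tokenize

# Hypothetical tweet data; the question assumes a 'Text' column.
df = pd.DataFrame({'Text': ["This is a sample tweet!", "Another tweet here."]})

# word_tokenize needs the 'punkt' tokenizer models; download once if missing:
# import nltk; nltk.download('punkt')
df['tokenized_text'] = df['Text'].apply(word_tokenize)
print(df['tokenized_text'].iloc[0])
# ['This', 'is', 'a', 'sample', 'tweet', '!']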
There are tokenizers written specifically for Twitter text; see http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual
To use nltk.tokenize.TweetTokenizer:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
df['Text'].apply(tt.tokenize)
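Unlike word_tokenize, TweetTokenizer keeps @mentions, #hashtags, and emoticons as single tokens, and its constructor takes optional flags. A minimal sketch (the sample tweet is an assumption):

import pandas as pd
from nltk.tokenize import TweetTokenizer

df = pd.DataFrame({'Text': ["@remy This is waaaaayyyy too much #NLP :)"]})

# strip_handles removes @mentions; reduce_len shortens character runs
# longer than three (e.g. 'waaaaayyyy' -> 'waaayyy'). Both are optional.
tt = TweetTokenizer(strip_handles=True, reduce_len=True)
df['tokenized_text'] = df['Text'].apply(tt.tokenize)
print(df['tokenized_text'].iloc[0])
# ['This', 'is', 'waaayyy', 'too', 'much', '#NLP', ':)']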
Similar to:
How to apply pos_tag_sents() to pandas dataframe efficiently
how to use word_tokenize in data frame
Tokenizing words into a new column in a pandas dataframe
Run nltk sent_tokenize through Pandas dataframe
Python text processing: NLTK and pandas