How to apply NLTK's word_tokenize to a Pandas DataFrame of Twitter data?
Solution 1:
In short:
from nltk.tokenize import word_tokenize
df['Text'].apply(word_tokenize)
Or, if you want to store the tokenized text in a new column as a list of strings:
df['tokenized_text'] = df['Text'].apply(word_tokenize)
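For context, here is a minimal end-to-end sketch of that pattern (the sample data and the 'Text' column name are assumptions based on the question):

import pandas as pd
from nltk.tokenize import word_tokenize

# Hypothetical tweet data; the question assumes a 'Text' column.
df = pd.DataFrame({'Text': ["This is a sample tweet!", "Another tweet here."]})

# word_tokenize needs the 'punkt' tokenizer models; download once if missing:
# import nltk; nltk.download('punkt')
df['tokenized_text'] = df['Text'].apply(word_tokenize)
print(df['tokenized_text'].iloc[0])
# ['This', 'is', 'a', 'sample', 'tweet', '!']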
There are tokenizers written specifically for Twitter text; see http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual
To use nltk.tokenize.TweetTokenizer:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
df['Text'].apply(tt.tokenize)
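Unlike word_tokenize, TweetTokenizer keeps @mentions, #hashtags, and emoticons as single tokens, and its constructor takes optional flags. A minimal sketch (the sample tweet is an assumption):

import pandas as pd
from nltk.tokenize import TweetTokenizer

df = pd.DataFrame({'Text': ["@remy This is waaaaayyyy too much #NLP :)"]})

# strip_handles removes @mentions; reduce_len shortens character runs
# longer than three (e.g. 'waaaaayyyy' -> 'waaayyy'). Both are optional.
tt = TweetTokenizer(strip_handles=True, reduce_len=True)
df['tokenized_text'] = df['Text'].apply(tt.tokenize)
print(df['tokenized_text'].iloc[0])
# ['This', 'is', 'waaayyy', 'too', 'much', '#NLP', ':)']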
Similar to:
How to apply pos_tag_sents() to pandas dataframe efficiently
how to use word_tokenize in data frame
Tokenizing words into a new column in a pandas dataframe
Run nltk sent_tokenize through Pandas dataframe
Python text processing: NLTK and pandas