Count distinct words from a Pandas Data Frame

Use a set to collect the unique elements.

First clean up the column: convert the strings to lower case and split them into words:

df['text'].str.lower().str.split()
Out[43]: 
0             [my, nickname, is, ft.jgt]
1    [someone, is, going, to, my, place]

Each list in this column can be passed to the set.update method to accumulate the unique values. Use apply to do so:

results = set()
df['text'].str.lower().str.split().apply(results.update)
print(results)

set(['someone', 'ft.jgt', 'my', 'is', 'to', 'going', 'place', 'nickname'])

Or, as suggested in the comments, use a Counter() instead of a set to also get the number of occurrences of each word:

from collections import Counter
results = Counter()
df['text'].str.lower().str.split().apply(results.update)
print(results)
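
As an alternative that stays inside pandas, here is a minimal sketch that flattens the word lists with Series.explode. This is not from the answer above. It assumes pandas 0.25 or later (where explode is available), and the sample df is rebuilt here only so the snippet runs on its own:

```python
import pandas as pd

# Sample data matching the question; rebuilt here so the sketch is self-contained.
df = pd.DataFrame({'text': ['My nickname is ft.jgt',
                            'Someone is going to my place']})

# One row per word: lower-case, split on whitespace, then flatten the lists.
words = df['text'].str.lower().str.split().explode()

distinct_count = words.nunique()     # number of distinct words
word_counts = words.value_counts()   # occurrences of each word

print(distinct_count)
print(word_counts)
```

nunique answers the "how many distinct words" question directly, while value_counts plays the same role as the Counter.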

Use collections.Counter:

>>> from collections import Counter
>>> r1=['My nickname is ft.jgt','Someone is going to my place']
>>> Counter(" ".join(r1).split(" ")).items()
[('Someone', 1), ('ft.jgt', 1), ('My', 1), ('is', 2), ('to', 1), ('going', 1), ('place', 1), ('my', 1), ('nickname', 1)]
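
Note that the output above counts 'My' and 'my' as different words, and that split(" ") produces empty strings when words are separated by more than one space. If that is not what you want, a small variation (a sketch, with texts as a stand-in name for your list of strings) lower-cases first and splits on any whitespace:

```python
from collections import Counter

# Stand-in for the list of strings to analyse.
texts = ['My nickname is ft.jgt', 'Someone is going to my place']

# Lower-case each string, split on any run of whitespace, and count every word.
counts = Counter(word for text in texts for word in text.lower().split())

print(len(counts))   # number of distinct words
print(counts)        # occurrences of each word
```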

If you want to do it directly from the DataFrame:

import pandas as pd

r1=['My nickname is ft.jgt','Someone is going to my place']

df=pd.DataFrame(r1,columns=['text'])

# pd.value_counts is deprecated in recent pandas; Series.value_counts is equivalent
df.text.apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis = 0)

My          1
Someone     1
ft.jgt      1
going       1
is          2
my          1
nickname    1
place       1
to          1
dtype: float64
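
The float64 dtype in this output comes from the intermediate DataFrame: each word becomes a column, and rows where a word does not occur hold NaN, which forces floats. A minimal sketch of the same approach with the counts cast back to integers (per_row and totals are names chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'text': ['My nickname is ft.jgt',
                            'Someone is going to my place']})

# Count words per row; apply() aligns the per-row counts into a DataFrame
# with one column per word and NaN where a word is absent from a row.
per_row = df['text'].apply(lambda x: pd.Series(x.split(" ")).value_counts())

# sum() skips the NaN entries but returns floats, so cast back to int.
totals = per_row.sum(axis=0).astype(int)
print(totals)
```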

If you want more flexible tokenization, use nltk and its tokenize module.

Building on @Ofir Israel's answer, specific to Pandas:

from collections import Counter
result = Counter(" ".join(df['text'].values.tolist()).split(" ")).items()
result

This gives you what you want: it converts the values of the text column to a list, joins them into a single string, splits that string on spaces, and counts the occurrences of each word.
