Pandas: How to Create a Frequency Distribution of Elements Occurring in a Dataframe
I have a Pandas dataframe with three columns: sentence, key phrases, and category. The key phrases column contains either an empty list or words/phrases taken from that row's sentence, like so:
Sentence | Key Phrases | Category |
---|---|---|
the red ball | ['red ball'] | object |
a big blue box | ['blue'] | object |
he throws the red ball | ['he throws', 'red ball'] | action |
I want to check the contents of the entire key phrases column and build a frequency dictionary (or whatever is best) for every unique phrase. So in my example I'd have something like: 'red ball': 2, 'blue': 1, 'he throws': 1
Then I want to calculate the frequency distribution of these key phrases across all categories in the dataframe. So in my example, the object category accounts for 100% of 'blue' occurrences but only 50% of 'red ball' occurrences. I am assuming the best way to do this is to start with the frequency dictionary I mentioned above?
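To make that concrete, here is a rough sketch of the intermediate counts I have in mind (just an illustration using collections.Counter and the three example rows above; I don't know if this is the best structure):

```python
from collections import Counter

import pandas as pd

# The example rows from the table above
df = pd.DataFrame({
    'Sentence': ['the red ball', 'a big blue box', 'he throws the red ball'],
    'Key Phrases': [['red ball'], ['blue'], ['he throws', 'red ball']],
    'Category': ['object', 'object', 'action'],
})

# Overall frequency of every key phrase across the whole column
overall = Counter(phrase for phrases in df['Key Phrases'] for phrase in phrases)
# Counter({'red ball': 2, 'blue': 1, 'he throws': 1})

# Frequency of every key phrase within each category
per_category = {
    cat: Counter(phrase for phrases in group for phrase in phrases)
    for cat, group in df.groupby('Category')['Key Phrases']
}
# {'action': Counter({'he throws': 1, 'red ball': 1}),
#  'object': Counter({'red ball': 1, 'blue': 1})}
```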
Finally, I'd like to add another column to the dataframe which will show, for each key phrase in its row, what percentage of that key phrase's occurrences exist within that category.
So the final DF would look something like this, though the aesthetic doesn't matter as long as the information is there:
Sentence | Key Phrases | Category | Key Phrase Occurrences |
---|---|---|---|
the red ball | ['red ball'] | object | red ball: 50% |
a big blue box | ['blue'] | object | blue: 100% |
he throws the red ball | ['he throws', 'red ball'] | action | he throws: 100%, red ball: 50% |
It would also be useful to have something like a dictionary where each key is a category and each value contains all the key phrases occurring within that category and their prevalence, so maybe this would be part of the initial dictionary I'd create?
You can try exploding the Key Phrases column so that each phrase gets its own row, counting each phrase per category with groupby, and dividing by the phrase's overall count to get the percentages. A row-wise df['Key Phrase Occurrences'] = 100 * df.nunique(axis=1) / df.count(axis=1) won't work here, because it only compares the values within each row and never looks inside the phrase lists.
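Here is a minimal sketch of that approach, assuming the example dataframe from the question and a pandas version with DataFrame.explode (0.25 or newer); I've also spelled the new column 'Key Phrase Occurrences', so adjust the name to whatever you actually use:

```python
import pandas as pd

df = pd.DataFrame({
    'Sentence': ['the red ball', 'a big blue box', 'he throws the red ball'],
    'Key Phrases': [['red ball'], ['blue'], ['he throws', 'red ball']],
    'Category': ['object', 'object', 'action'],
})

# One row per (sentence, key phrase); empty lists become NaN and are dropped
exploded = df.explode('Key Phrases').dropna(subset=['Key Phrases'])

# Count each phrase per category, then divide by that phrase's total count
counts = exploded.groupby(['Category', 'Key Phrases']).size()
totals = counts.groupby(level='Key Phrases').transform('sum')
pct = 100 * counts / totals   # MultiIndex (Category, Key Phrases) -> percentage

# Build the display column: "phrase: xx%" for every phrase in the row's list
df['Key Phrase Occurrences'] = df.apply(
    lambda row: ', '.join(
        f"{p}: {pct[(row['Category'], p)]:.0f}%" for p in row['Key Phrases']
    ),
    axis=1,
)

# Optional: category -> {phrase: percentage} dictionary
by_category = {
    cat: grp.droplevel('Category').to_dict()
    for cat, grp in pct.groupby(level='Category')
}

print(df)
print(by_category)
```

On the example data this yields 'red ball: 50%' for the first object row, 'blue: 100%' for the second, and 'he throws: 100%, red ball: 50%' for the action row, matching the expected output; by_category is the category-to-prevalence dictionary mentioned at the end of the question, e.g. {'action': {'he throws': 100.0, 'red ball': 50.0}, 'object': {'blue': 100.0, 'red ball': 50.0}}.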