Pandas: How to Create a Frequency Distribution of Elements Occurring in a Dataframe
I have a Pandas dataframe with three columns: sentence, key phrases, and category. The key phrases column contains either an empty list or words/phrases taken from that row's sentence, like so:
Sentence | Key Phrases | Category |
---|---|---|
the red ball | ['red ball'] | object |
a big blue box | ['blue'] | object |
he throws the red ball | ['he throws', 'red ball'] | action |
I want to check the contents of the entire key phrases column and build a frequency dictionary (or whatever is best) for every unique phrase. So in my example I'd have something like: 'red ball': 2, 'blue': 1, 'he throws': 1
Then I want to calculate the frequency distribution of these key phrases across all categories in the dataframe. So in my example, the object category accounts for 100% of 'blue' occurrences but only 50% of 'red ball' occurrences. I am assuming the best way to do this is to start with the frequency dictionary I mentioned above?
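To make that concrete, here is a rough sketch of the intermediate counts I have in mind (just an illustration using collections.Counter and the three example rows above; I don't know if this is the best structure):

```python
from collections import Counter

import pandas as pd

# The example rows from the table above
df = pd.DataFrame({
    'Sentence': ['the red ball', 'a big blue box', 'he throws the red ball'],
    'Key Phrases': [['red ball'], ['blue'], ['he throws', 'red ball']],
    'Category': ['object', 'object', 'action'],
})

# Overall frequency of every key phrase across the whole column
overall = Counter(phrase for phrases in df['Key Phrases'] for phrase in phrases)
# Counter({'red ball': 2, 'blue': 1, 'he throws': 1})

# Frequency of every key phrase within each category
per_category = {
    cat: Counter(phrase for phrases in group for phrase in phrases)
    for cat, group in df.groupby('Category')['Key Phrases']
}
# {'action': Counter({'he throws': 1, 'red ball': 1}),
#  'object': Counter({'red ball': 1, 'blue': 1})}
```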
Finally, I'd like to add another column to the dataframe which will show, for each key phrase in its row, what percentage of that key phrase's occurrences exist within that category.
So the final DF would look something like this, though the aesthetic doesn't matter as long as the information is there:
Sentence | Key Phrases | Category | Key Phrase Occurrences |
---|---|---|---|
the red ball | ['red ball'] | object | red ball: 50% |
a big blue box | ['blue'] | object | blue: 100% |
he throws the red ball | ['he throws', 'red ball'] | action | he throws: 100%, red ball: 50% |
It would also be useful to have something like a dictionary where each key is a category and each value contains all the key phrases occurring within that category and their prevalence, so maybe this would be part of the initial dictionary I'd create?
You can try exploding the Key Phrases column so that each phrase gets its own row, counting each phrase per category with groupby, and dividing by the phrase's overall count to get the percentages. A row-wise df['Key Phrase Occurrences'] = 100 * df.nunique(axis=1) / df.count(axis=1) won't work here, because it only compares the values within each row and never looks inside the phrase lists.
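Here is a minimal sketch of that approach, assuming the example dataframe from the question and a pandas version with DataFrame.explode (0.25 or newer); I've also spelled the new column 'Key Phrase Occurrences', so adjust the name to whatever you actually use:

```python
import pandas as pd

df = pd.DataFrame({
    'Sentence': ['the red ball', 'a big blue box', 'he throws the red ball'],
    'Key Phrases': [['red ball'], ['blue'], ['he throws', 'red ball']],
    'Category': ['object', 'object', 'action'],
})

# One row per (sentence, key phrase); empty lists become NaN and are dropped
exploded = df.explode('Key Phrases').dropna(subset=['Key Phrases'])

# Count each phrase per category, then divide by that phrase's total count
counts = exploded.groupby(['Category', 'Key Phrases']).size()
totals = counts.groupby(level='Key Phrases').transform('sum')
pct = 100 * counts / totals   # MultiIndex (Category, Key Phrases) -> percentage

# Build the display column: "phrase: xx%" for every phrase in the row's list
df['Key Phrase Occurrences'] = df.apply(
    lambda row: ', '.join(
        f"{p}: {pct[(row['Category'], p)]:.0f}%" for p in row['Key Phrases']
    ),
    axis=1,
)

# Optional: category -> {phrase: percentage} dictionary
by_category = {
    cat: grp.droplevel('Category').to_dict()
    for cat, grp in pct.groupby(level='Category')
}

print(df)
print(by_category)
```

On the example data this yields 'red ball: 50%' for the first object row, 'blue: 100%' for the second, and 'he throws: 100%, red ball: 50%' for the action row, matching the expected output; by_category is the category-to-prevalence dictionary mentioned at the end of the question, e.g. {'action': {'he throws': 100.0, 'red ball': 50.0}, 'object': {'blue': 100.0, 'red ball': 50.0}}.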