Using a Naive Bayes Classifier to classify tweets: some problems
Using, amongst other sources, various posts here on Stack Overflow, I'm trying to implement my own PHP classifier to classify tweets into a positive, a neutral, and a negative class. Before coding, I need to get the process straight. My train of thought and an example are as follows:
Bayes' theorem:

p(class|words) = p(class) * p(words|class) / p(words)

Assuming that p(words) is the same for every class, this reduces to calculating

arg max p(class) * p(words|class)

with

p(words|class) = p(word1|class) * p(word2|class) * ...
p(class)       = #words in class / #words in total
p(word|class)  = p(word, class) / p(class)
               = (#times word occurs in class / #words in total) * (#words in total / #words in class)
               = #times word occurs in class / #words in class
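In PHP, I imagine the blueprint looking roughly like the sketch below (placeholder names, no smoothing yet), where $counts maps each class to its word counts:

<?php
// Sketch of the decision rule above: pick the class maximising
// p(class) * p(word1|class) * p(word2|class) * ...
// $counts is a placeholder structure: class => [word => count].

function classify(array $tweetWords, array $counts, int $totalWords): string
{
    $bestClass = '';
    $bestScore = -INF;
    foreach ($counts as $class => $wordCounts) {
        $wordsInClass = array_sum($wordCounts);          // #words in class
        $score        = $wordsInClass / $totalWords;     // p(class)
        foreach ($tweetWords as $word) {
            $c      = $wordCounts[$word] ?? 0;           // #times word occurs in class
            $score *= $c / $wordsInClass;                // p(word|class), unsmoothed
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestClass = $class;
        }
    }
    return $bestClass;
}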
Example:
------+----------------+-----------------+
class | words          | #words in class |
------+----------------+-----------------+
pos   | happy win nice | 3               |
neu   | neutral middle | 2               |
neg   | sad loose bad  | 3               |
------+----------------+-----------------+
p(pos) = 3/8
p(neu) = 2/8
p(neg) = 3/8
Classify the tweet "sad loose" (arg max over the classes):
p(sad loose|pos) = p(sad|pos) * p(loose|pos) = (0+1)/3 * (0+1)/3 = 1/9
p(sad loose|neu) = p(sad|neu) * p(loose|neu) = (0+1)/3 * (0+1)/3 = 1/9
p(sad loose|neg) = p(sad|neg) * p(loose|neg) = 1/3 * 1/3 = 1/9
p(pos) * p(sad loose|pos) = 3/8 * 1/9 = 0.0416666667
p(neu) * p(sad loose|neu) = 2/8 * 1/9 = 0.0277777778
p(neg) * p(sad loose|neg) = 3/8 * 1/9 = 0.0416666667 <-- should be 100% neg!
As you can see, I have "trained" the classifier with a positive ("happy win nice"), a neutral ("neutral middle") and a negative ("sad loose bad") tweet. To prevent zero probabilities when a word does not occur in a class, I'm using Laplace (or "add one") smoothing, see the "(0+1)" terms.
I basically have two questions:
- Is this a correct blueprint for implementation? Is there room for improvement?
- When classifying a tweet ("sad loose"), it is expected to be 100% in class "neg" because it only contains negative words. The Laplace smoothing, however, complicates things: classes pos and neg end up with equal probability. Is there a workaround for this?
There are two main elements to improve in your reasoning.
First, you should improve your smoothing method:
- When applying Laplace smoothing, it should be applied to all counts, not just to the zero ones.
- In addition, Laplace smoothing for such cases is usually given by (c+1)/(N+V), where c is the number of times the word occurs in the class, N is the total number of words in the class, and V is the vocabulary size (see, e.g., the Wikipedia article on additive smoothing).
Therefore, using the probability function you have defined (which might not be the most suitable, see below):
p(sad loose|pos) = (0+1)/(3+8) * (0+1)/(3+8) = 1/121
p(sad loose|neu) = (0+1)/(2+8) * (0+1)/(2+8) = 1/100
p(sad loose|neg) = (1+1)/(3+8) * (1+1)/(3+8) = 4/121 <-- would become argmax
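As a rough illustration (not a full implementation), the smoothed estimate could be computed in PHP like this, reusing the counts and the vocabulary of 8 distinct words from your example:

<?php
// Sketch of Laplace-smoothed scoring, (c+1)/(N+V), applied to the toy example.
// Training data and the tweet to classify are taken from the question.

$training = [
    'pos' => ['happy', 'win', 'nice'],
    'neu' => ['neutral', 'middle'],
    'neg' => ['sad', 'loose', 'bad'],
];

$vocabulary = array_unique(array_merge(...array_values($training)));
$V          = count($vocabulary);                        // 8 distinct words
$totalWords = array_sum(array_map('count', $training));  // 8 word tokens in total

$tweet = ['sad', 'loose'];
foreach ($training as $class => $classWords) {
    $N     = count($classWords);                         // word tokens in this class
    $score = $N / $totalWords;                           // p(class), as defined in the question
    foreach ($tweet as $word) {
        $c      = count(array_keys($classWords, $word)); // #times word occurs in class
        $score *= ($c + 1) / ($N + $V);                  // Laplace-smoothed p(word|class)
    }
    printf("%s: %.6f\n", $class, $score);
}

With these numbers, neg is the clear arg max, matching the recalculation above.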
In addition, a more common way of calculating the probability in the first place would be:
(number of tweets in class containing term c) / (total number of tweets in class)
For instance, in the limited training set given above, and disregarding smoothing, p(sad|pos) = 0/1 = 0 and p(sad|neg) = 1/1 = 1. As the training set grows, the numbers become more meaningful. For example, if you had 10 tweets for the negative class, with 'sad' appearing in 4 of them, then p(sad|neg) would be 4/10.
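For illustration only, here is a sketch of that per-tweet estimate in PHP; the extra negative tweets are made up purely to show the counting:

<?php
// Sketch of the per-tweet estimate: the fraction of tweets in a class that
// contain the term (no smoothing shown). The extra tweets are invented data.

$negativeTweets = [
    ['sad', 'loose', 'bad'],
    ['sad', 'rainy', 'day'],
    ['bad', 'service'],
    ['so', 'sad'],
];

function pTermGivenClass(string $term, array $tweetsInClass): float
{
    $containing = 0;
    foreach ($tweetsInClass as $tweetWords) {
        if (in_array($term, $tweetWords, true)) {
            $containing++;               // count the tweet once, however often the term occurs
        }
    }
    return $containing / count($tweetsInClass);
}

echo pTermGivenClass('sad', $negativeTweets);  // 0.75: 'sad' occurs in 3 of the 4 tweets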
Regarding the actual numbers output by the Naive Bayes algorithm: you shouldn't expect it to assign a true probability to each class; what matters is the relative order of the classes. Concretely, taking the arg max gives you the algorithm's best guess for the class, but not the probability of that class. Assigning probabilities to Naive Bayes results is another story; see, for example, an article discussing this issue.