Size of the vocabulary in Laplace smoothing for a trigram language model

Solution 1:

V is the size of the vocabulary, i.e. the number of unique unigrams (word types), including the UNK token if you use one.

This is because, when you smooth, your goal is to ensure a non-zero probability for any possible trigram.

Consider a corpus consisting of just one sentence: "I have a cat". You have seen the trigrams "I have a" and "have a cat" (and nothing else).

Without smoothing, each of them gets a conditional probability of 1: P(a | I have) = 1 and P(cat | have a) = 1. With smoothing, however, you want a non-zero probability not just for:

"have a UNK"

but also for "have a have", "have a a", "have a I".

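That's why you add V to the denominator: adding 1 to the count of each of the V possible next words adds V to the total count for that history, so the smoothed probabilities still sum to 1.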
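The argument above can be checked on the toy corpus with a minimal sketch. It assumes the vocabulary is the four observed words plus an UNK token (so V = 5); the names `history_counts` and `laplace_prob` are made up for this illustration.

```python
from collections import Counter

tokens = "I have a cat".split()

# Assumption: UNK is counted as a vocabulary entry, giving V = 5.
vocab = sorted(set(tokens) | {"UNK"})
V = len(vocab)

# Count trigrams, and count how often each bigram occurs as a trigram history.
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
history_counts = Counter()
for (w1, w2, _w3), c in trigram_counts.items():
    history_counts[(w1, w2)] += c

def laplace_prob(w1, w2, w3):
    """Add-one estimate of P(w3 | w1 w2): +1 in the numerator, +V in the denominator."""
    return (trigram_counts[(w1, w2, w3)] + 1) / (history_counts[(w1, w2)] + V)

# Distribution over every possible next word after the seen history "have a".
dist = {w: laplace_prob("have", "a", w) for w in vocab}
for word, p in dist.items():
    print(f"P({word} | have a) = {p:.4f}")
print("sum =", sum(dist.values()))
```

With these assumptions "cat" gets 2/6 and each of the other four words gets 1/6, so the distribution sums to 1. Adding anything other than V to the denominator would break that.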

Consider also the case of an unknown "history" bigram. You want to ensure a non-zero probability for "UNK a cat", for instance, or indeed for any word following the unknown bigram.

You've never seen the bigram "UNK a", so you have a 0 not only in the numerator (the count of "UNK a cat") but also in the denominator (the count of "UNK a"). What probability would you like to get here, intuitively?

Since we have seen neither the trigram nor the bigram in question, we know nothing about the situation. It therefore seems reasonable to distribute the probability equally across all words in the vocabulary: P(cat | UNK a) would be 1/V, and every other word from the vocabulary following this unknown bigram would get the same probability. So add 1 to the numerator and V to the denominator, regardless of the order of the N-gram model.
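As a rough illustration of the unseen-history case, the sketch below builds an add-one model for an arbitrary order n and queries it with a history that never occurred. The helper `laplace_model` is hypothetical, and the vocabulary is again assumed to be the four corpus words plus UNK.

```python
from collections import Counter

def laplace_model(tokens, vocab, n):
    """Return prob(history, word) for an add-one smoothed n-gram model."""
    ngram_counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    # Count of each (n-1)-gram when it appears as the history of an n-gram.
    history_counts = Counter()
    for ngram, c in ngram_counts.items():
        history_counts[ngram[:-1]] += c
    V = len(vocab)

    def prob(history, word):
        history = tuple(history)
        # Counter returns 0 for unseen keys, so an unseen history gives 1 / V.
        return (ngram_counts[history + (word,)] + 1) / (history_counts[history] + V)

    return prob

tokens = "I have a cat".split()
vocab = sorted(set(tokens) | {"UNK"})  # assumption: UNK is part of the vocabulary

trigram_prob = laplace_model(tokens, vocab, n=3)
bigram_prob = laplace_model(tokens, vocab, n=2)

# Unseen histories: "UNK a" for the trigram model, "UNK" for the bigram model.
unseen_tri = {w: trigram_prob(("UNK", "a"), w) for w in vocab}
unseen_bi = {w: bigram_prob(("UNK",), w) for w in vocab}
print(unseen_tri)
print(unseen_bi)
```

Under these assumptions every word gets 1/V = 1/5 after an unseen history, for both the trigram and the bigram model, which matches the intuition that nothing is known about that context.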