You want to know why we bother with smoothing at all in a Naive Bayes classifier (when we can throw away the unknown features instead).
The answer to your question is: a word does not have to be unknown in every class; a feature with a zero count in one class may still be highly informative in another.
Say there are two classes M and N with features A, B and C, as follows:
M: A=3, B=1, C=0
(In the class M, A appears 3 times and B only once)
N: A=0, B=1, C=3
(In the class N, C appears 3 times and B only once)
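For reference, the unsmoothed maximum-likelihood estimates are just the counts divided by each class total:

P(A|M) = 3/4, P(B|M) = 1/4, P(C|M) = 0
P(A|N) = 0, P(B|N) = 1/4, P(C|N) = 3/4

Those two zeroes are where all the trouble comes from.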
Let's see what happens when you throw away features that appear zero times.
A) Throw Away Features That Appear Zero Times In Any Class
If you throw away features A and C because each of them appears zero times in at least one of the classes, then you are left with only feature B to classify documents with.
And losing that information is a bad thing as you will see below!
If you're presented with a test document as follows:
B=1, C=3
(It contains B once and C three times)
Now, since you've discarded the features A and C, you won't be able to tell whether the above document belongs to class M or class N.
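To make that concrete: with A and C gone, the only surviving evidence is B=1, and after renormalizing over the remaining vocabulary P(B|M) = P(B|N) = 1, so both classes score exactly the same on this document.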
So, losing any feature information is a bad thing!
B) Throw Away Features That Appear Zero Times In All Classes
Is it possible to get around this problem by discarding only those features that appear zero times in all of the classes?
No, because that would create its own problems!
The following test document illustrates what would happen if we did that:
A=3, B=1, C=1
(It contains A three times, and B and C once each)
The probabilities of M and N would both become zero, because this rule does not throw away feature A (whose probability in class N is zero) or feature C (whose probability in class M is zero).
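Concretely, using the unsmoothed estimates from above:

P(doc|M) ∝ (3/4)^3 × (1/4) × 0 = 0
P(doc|N) ∝ 0^3 × (1/4) × (3/4) = 0

A single zero factor wipes out everything the non-zero counts are telling you.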
C) Don't Throw Anything Away - Use Smoothing Instead
Smoothing allows you to classify both the above documents correctly because:
- You do not lose count information in classes where such information is available and
- You do not have to contend with zero counts.
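To illustrate, here is a minimal Python sketch of a multinomial Naive Bayes scorer with add-one (Laplace) smoothing, run on the toy counts above. The names (class_counts, log_likelihood, classify) are made up for this example, and uniform class priors are assumed:

```python
from math import log

# Toy per-class feature counts from the example above.
class_counts = {
    "M": {"A": 3, "B": 1, "C": 0},
    "N": {"A": 0, "B": 1, "C": 3},
}
vocab = {"A", "B", "C"}

def log_likelihood(doc, counts, alpha=1.0):
    """Log P(doc | class) with add-alpha smoothing; alpha=1 is Laplace."""
    total = sum(counts.values())
    score = 0.0
    for feature, freq in doc.items():
        # Smoothed estimate: (count + alpha) / (total + alpha * |vocab|).
        # The +alpha keeps this strictly positive, even for zero counts.
        p = (counts.get(feature, 0) + alpha) / (total + alpha * len(vocab))
        score += freq * log(p)
    return score

def classify(doc):
    # Uniform class priors assumed, so the likelihood alone decides.
    return max(class_counts, key=lambda c: log_likelihood(doc, class_counts[c]))

print(classify({"B": 1, "C": 3}))          # -> N
print(classify({"A": 3, "B": 1, "C": 1}))  # -> M
```

Because every smoothed estimate is strictly positive, the C=3 evidence pushes the first document to N and the A=3 evidence pushes the second to M, exactly as you would hope.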
Naive Bayes Classifiers In Practice
The Naive Bayes classifier in NLTK used to throw away features that had zero counts in any of the classes.
This made it perform poorly when trained using a hard EM procedure (where the classifier is bootstrapped from very little training data).