Learning Text from the Web: An Application of Classification Methods

Stella M. Salvatierra

Abstract:

The WWW contains a huge amount of information but understands little of it, and many WWW search mechanisms produce poor results. One approach that appears to retrieve text much more efficiently is RAINBOW, a system created by the CMU CS Text-Learning group. RAINBOW's approach to statistical text classification is based on the naive Bayes classifier, and its results are impressive. This report demonstrates that the "standard" statistical tools of discriminant analysis lead to results that are even better than those obtained from the naive Bayes approach. It also emphasizes that adding "noninformative" words increases the noise in the model, producing poor results.

Keywords: Discriminant analysis, Logistic regression, Bayes classifier, Machine learning, WWW searches

The full text of this technical report is available in PostScript format (784683 bytes).