Department of Statistics
Dietrich College of Humanities and Social Sciences

Learning Text from the Web: An Application of Classification Methods

Publication Date

June, 1999

Publication Type

Tech Report


Stella M. Salvatierra


The WWW contains a huge amount of information, but machines understand little of it, and many WWW search mechanisms produce poor results. One system that appears to retrieve text much more efficiently is RAINBOW, created by the CMU CS Text-Learning group. RAINBOW's approach to statistical text classification is based on the naive Bayes classifier, and its results are impressive. This report demonstrates that the "standard" statistical tools of discriminant analysis lead to results that are even better than those obtained from the naive Bayes approach. It also emphasizes that adding "non-informative" words increases the noise in the model and produces poor results.
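For readers unfamiliar with the naive Bayes approach mentioned above, the following is a minimal sketch of a multinomial naive Bayes text classifier with add-one smoothing. It is not RAINBOW's implementation and is not taken from the report; the function names, the class labels ("course", "faculty"), and the toy documents are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(docs):
    """Count documents per class and words per class.

    docs is a list of (label, text) pairs. Hypothetical helper, for illustration only.
    """
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, text in docs:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(model, text):
    """Return the class with the highest log posterior under naive Bayes."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: fraction of training documents in this class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed word likelihood.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (assumed, not from the report).
TOY_DOCS = [
    ("course", "homework lecture syllabus exam"),
    ("course", "lecture notes exam schedule"),
    ("faculty", "professor research publications"),
    ("faculty", "professor publications teaching research"),
]
MODEL = train_naive_bayes(TOY_DOCS)

print(classify(MODEL, "exam lecture"))
print(classify(MODEL, "research professor"))
```

Each word contributes independently to the class score, which is the "naive" assumption. Words that occur at similar rates in every class add little discriminating signal, which is one way to picture the "non-informative" words the abstract refers to.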