696

**Learning Text from the Web: An Application of
Classification Methods**

**Stella M. Salvatierra**

### Abstract:

The WWW contains a huge amount of information but understands little of it,
and many WWW search mechanisms produce poor results. One system that appears
to retrieve text much more efficiently is RAINBOW, created by the CMU CS
Text-Learning group. RAINBOW's approach to statistical text classification is
based on the naive Bayes classifier, and its results are impressive. This report
demonstrates that the "standard" statistical tools of discriminant analysis lead
to results that are even better than those obtained from the naive Bayes
approach. It also emphasizes that adding "non-informative" words increases the
noise of the model and produces poor results.

*Keywords:* Discriminant analysis, Logistic regression, Bayes
classifier, Machine learning, WWW searches
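
To make the naive Bayes baseline named in the abstract concrete, here is a minimal sketch of a multinomial naive Bayes text classifier with add-one smoothing. It is not RAINBOW's implementation and does not reproduce the report's experiments. The function names, the class labels, and the four toy documents are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on tokenized documents.

    alpha is the additive (Laplace) smoothing constant, an assumption here.
    """
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {"log_prior": {}, "log_likelihood": {}}
    for c, n_c in class_doc_counts.items():
        # Class prior: fraction of training documents in class c.
        model["log_prior"][c] = math.log(n_c / len(docs))
        # Smoothed per-class word probabilities over the shared vocabulary.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_likelihood"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model


def classify(model, tokens):
    """Return the class with the highest posterior log score."""
    scores = {}
    for c, log_prior in model["log_prior"].items():
        ll = model["log_likelihood"][c]
        # Words never seen in training are skipped.
        scores[c] = log_prior + sum(ll[w] for w in tokens if w in ll)
    return max(scores, key=scores.get)


# Hypothetical toy corpus: two kinds of university web pages.
docs = [
    ["course", "homework", "exam"],
    ["course", "lecture", "syllabus"],
    ["faculty", "research", "publications"],
    ["research", "students", "publications"],
]
labels = ["course_page", "course_page", "faculty_page", "faculty_page"]

model = train_naive_bayes(docs, labels)
print(classify(model, ["homework", "lecture"]))
print(classify(model, ["research", "publications"]))
```

Every word in the vocabulary contributes a term to each class score, so words that carry no class information still add to the sum. That is one way to picture the noise the abstract attributes to "non-informative" words.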
