How Big is the World Wide Web?

Adrian Dobra


Considerable effort has been devoted to developing sound procedures for estimating the size of the World Wide Web. The problem is compounded by the fact that sampling directly from the Web is not possible. Several groups of researchers have devised sampling schemes that consist of running a set of queries against several major search engines. Although the resulting datasets are of the highest quality currently attainable, the methods used to analyze them are not satisfactory. In this paper we present new approaches for analyzing datasets collected by query-based sampling, founded on a hierarchical Bayes formulation of the Rasch model. We show that our procedures respect the real-world constraints of the problem and consequently yield more credible inferences.

Keywords: World Wide Web evaluation; Clustering; Contingency tables; Rasch model; Markov chain Monte Carlo methods.
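For orientation, the standard Rasch model underlying the hierarchical Bayes formulation mentioned above can be sketched as follows; the notation here is the conventional item-response form and is illustrative, not taken from the report itself. If $y_{ij}$ indicates whether page $i$ is retrieved by search engine (or query) $j$, the model posits

$$
\Pr(y_{ij} = 1 \mid \theta_i, \beta_j) \;=\; \frac{\exp(\theta_i - \beta_j)}{1 + \exp(\theta_i - \beta_j)},
$$

where $\theta_i$ captures how easily page $i$ is found and $\beta_j$ captures the difficulty of retrieval by source $j$. A hierarchical Bayes treatment places prior distributions on the $\theta_i$ and $\beta_j$, so that inference about the number of never-retrieved pages (and hence the size of the Web) can account for heterogeneous catchability across pages and sources.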
