On March 13, 1999, a highly anticipated prizefight between heavyweight champions Evander Holyfield and Lennox Lewis was ruled a draw by the bout's three official judges. Many observers felt that Lewis had clearly outperformed Holyfield; dissatisfaction with the result, particularly with the pro-Holyfield scorecard of judge Eugenia Williams, fueled speculation that the fight had been fixed and prompted local and state investigations. In this paper, we examine whether the official judges scored the fight significantly differently from other professional observers; we do so by comparing the official scorecards with those returned by other sources, including the fight's broadcast team and other sports media outlets.
We analyze the round-by-round scoring within the framework of inter-rater agreement. The literature on inter-rater agreement typically considers a large number of samples rated by a small number of judges, and relies on asymptotic results for its tests. In our case, the samples are the rounds of a single fight, far too few for asymptotic results to apply. Instead, we investigate a number of techniques that can be applied to small-sample inter-rater agreement problems, including logistic regression, an exact test, and some Bayesian approaches.
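To make the idea of an exact small-sample test concrete, the following is a minimal sketch and not necessarily the test used in this paper. It assumes each round is reduced to a binary outcome (1 if a rater gave the round to the first fighter, 0 otherwise) and asks how often two raters would agree on at least as many rounds as observed if the second rater's rounds were assigned at random, holding fixed the number of rounds that rater awarded to each fighter. The function name `exact_agreement_pvalue` and the toy scorecards are hypothetical and chosen only for illustration; they are not the actual scorecards from the fight.

```python
from itertools import combinations


def exact_agreement_pvalue(rater1, rater2):
    """One-sided exact permutation test for agreement between two raters.

    Each argument is a sequence of 0/1 values, one per round
    (1 = round awarded to the first fighter). Returns the observed
    number of agreeing rounds and the exact probability of seeing at
    least that many agreements when rater2's awards are reassigned to
    rounds uniformly at random, keeping rater2's totals fixed.
    """
    n = len(rater1)
    if n != len(rater2):
        raise ValueError("both raters must score the same number of rounds")

    observed = sum(a == b for a, b in zip(rater1, rater2))
    k = sum(rater2)  # rounds rater2 awarded to the first fighter

    total = 0
    at_least_observed = 0
    # Enumerate every way rater2 could have placed those k rounds.
    for ones in combinations(range(n), k):
        ones = set(ones)
        agree = sum((i in ones) == bool(rater1[i]) for i in range(n))
        total += 1
        if agree >= observed:
            at_least_observed += 1

    return observed, at_least_observed / total


if __name__ == "__main__":
    # Hypothetical toy scorecards, invented for illustration only.
    card_a = [1, 1, 0, 1, 0, 0]
    card_b = [1, 0, 0, 1, 0, 1]
    print(exact_agreement_pvalue(card_a, card_b))
```

Because the enumeration covers every arrangement, the p-value is exact and needs no large-sample approximation; for a twelve-round fight the loop visits at most 924 arrangements.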