JUDGING AN ENGINEERING COMPETITION
----------------------------------

The files "scores.dat" and "matscores.dat" contain the results,
formatted in two different ways, of an engineering undergraduate
research competition.  Each of 15 students wrote a short abstract of
their project and gave an oral presentation to five faculty judges.
Projects were graded by the judges on a scale of 5 (poor) to 35
(excellent) [these scores are the sum of scores on a 5-point scale in
each of seven judging categories].  Projects were judged in two
sessions, projects 1 to 8 in the morning and projects 9 to 15 in the
afternoon.  A different group of five judges graded each session.
Here is "matscores":

> matscores <- read.table("matscores.dat")

> matscores
    j1 j2 j3 j4 j5 j6 j7 j8 j9 j10 
 s1 25 30 26 22 25 NA NA NA NA  NA
 s2 17 17 18 15 20 NA NA NA NA  NA
 s3 28 26 27 26 27 NA NA NA NA  NA
 s4 21 26 23 22 23 NA NA NA NA  NA
 s5 25 24 23 26 24 NA NA NA NA  NA
 s6 23 31 20 25 28 NA NA NA NA  NA
 s7 20 19 24 23 25 NA NA NA NA  NA
 s8 20 16 19 21 25 NA NA NA NA  NA
 s9 NA NA NA NA NA 18 20 20 22  15
s10 NA NA NA NA NA 18 15 18 12  11
s11 NA NA NA NA NA 26 24 25 33  30
s12 NA NA NA NA NA 27 20 25 27  24
s13 NA NA NA NA NA 26 22 20 22  26
s14 NA NA NA NA NA 27 24 20 32  24
s15 NA NA NA NA NA 19 15 22 31  30

so, for example, the "18" in row 2 column 3 means that judge 3 gave
student 2 a score of 18 points.  

The data file "scores.dat" lists the same data in column format, for
use with "lm()", "aov()", etc.  For example, the line "18 s2 j3" in
"scores.dat" also means that judge 3 gave student 2 a score of 18
points, and "NA s11 j4" means judge 4 did not rate student 11.

> scores <- read.table("scores.dat",header=T)
> scores[1:5,]                              
  score student judge 
1    25      s1    j1
2    17      s2    j1
3    28      s3    j1
4    21      s4    j1
5    25      s5    j1
 [ etc.]
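
The relationship between the two layouts can be sketched in a few
lines.  This is only an illustration: it assumes the scores are typed
in by hand rather than read from "matscores.dat", and the names "obs",
"mat" and "long" are made up for this example.

```r
# Observed scores, one row per student (s1..s15), five judges per session.
obs <- matrix(c(
  25,30,26,22,25,  17,17,18,15,20,  28,26,27,26,27,  21,26,23,22,23,
  25,24,23,26,24,  23,31,20,25,28,  20,19,24,23,25,  20,16,19,21,25,
  18,20,20,22,15,  18,15,18,12,11,  26,24,25,33,30,  27,20,25,27,24,
  26,22,20,22,26,  27,24,20,32,24,  19,15,22,31,30),
  ncol = 5, byrow = TRUE)

# Matrix layout: 15 students by 10 judges, NA where a judge did not rate.
mat <- matrix(NA_real_, nrow = 15, ncol = 10,
              dimnames = list(paste("s", 1:15, sep = ""),
                              paste("j", 1:10, sep = "")))
mat[1:8, 1:5]   <- obs[1:8, ]    # morning session, judges j1..j5
mat[9:15, 6:10] <- obs[9:15, ]   # afternoon session, judges j6..j10

# Column layout: stack the matrix column by column, as in "scores.dat".
long <- data.frame(
  score   = as.vector(mat),
  student = rownames(mat)[as.vector(row(mat))],
  judge   = colnames(mat)[as.vector(col(mat))])

print(long[1:5, ])
print(long[long$student == "s2" & long$judge == "j3", ])
```

The last line picks out the same "18" discussed above, now as a single
row of the column layout.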

The goal is to give out first, second and third prizes in the
engineering competition.

IDEA 1:
------

Simply rank the students by their average observed scores.

Here are two ways to do this:

> -sort(-apply(matscores,1,mean,na.rm=T))
  s11   s3   s1   s6  s14  s12   s5  s15  s13 s4   s7   s8 s9   s2  s10 
 27.6 26.8 25.6 25.4 25.4 24.6 24.4 23.4 23.2 23 22.2 20.2 19 17.4 14.8

> -sort(-sapply(split(scores$score,scores$student),mean,na.rm=T))
  s11   s3   s1  s14   s6  s12   s5  s15  s13 s4   s7   s8 s9   s2  s10 
 27.6 26.8 25.6 25.4 25.4 24.6 24.4 23.4 23.2 23 22.2 20.2 19 17.4 14.8

[as an Splus exercise: what is the second one doing?]
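
As a hint for the exercise, here is a small sketch that takes the
second command apart on six values only (the j1, j2 and j6 entries for
students s1 and s2 from the table above).  The names "score",
"student", "groups" and "means" are made up for this illustration.

```r
# Six entries in column layout: s1 and s2 under judges j1, j2 and j6.
score   <- c(25, 17, 30, 17, NA, NA)
student <- c("s1", "s2", "s1", "s2", "s1", "s2")

# split() cuts the score vector into one piece per student.
groups <- split(score, student)
print(groups)

# sapply() applies mean() to each piece; na.rm=TRUE drops the NAs first.
means <- sapply(groups, mean, na.rm = TRUE)

# Negating, sorting and negating again sorts in decreasing order.
print(-sort(-means))
```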

IDEA 2:
------

First, guess ("impute") the missing values in the data set.  One
simple guess is to substitute the grand mean of the observed data for
all the missing data.

> missing <- is.na(scores$score)
> scores$score[missing] <- mean(scores$score,na.rm=T)
> round(matrix(scores$score,ncol=10),1)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] 
 [1,] 25.0 30.0 26.0 22.0 25.0 22.9 22.9 22.9 22.9  22.9
 [2,] 17.0 17.0 18.0 15.0 20.0 22.9 22.9 22.9 22.9  22.9
 [3,] 28.0 26.0 27.0 26.0 27.0 22.9 22.9 22.9 22.9  22.9
 [4,] 21.0 26.0 23.0 22.0 23.0 22.9 22.9 22.9 22.9  22.9
 [5,] 25.0 24.0 23.0 26.0 24.0 22.9 22.9 22.9 22.9  22.9
 [6,] 23.0 31.0 20.0 25.0 28.0 22.9 22.9 22.9 22.9  22.9
 [7,] 20.0 19.0 24.0 23.0 25.0 22.9 22.9 22.9 22.9  22.9
 [8,] 20.0 16.0 19.0 21.0 25.0 22.9 22.9 22.9 22.9  22.9
 [9,] 22.9 22.9 22.9 22.9 22.9 18.0 20.0 20.0 22.0  15.0
[10,] 22.9 22.9 22.9 22.9 22.9 18.0 15.0 18.0 12.0  11.0
[11,] 22.9 22.9 22.9 22.9 22.9 26.0 24.0 25.0 33.0  30.0
[12,] 22.9 22.9 22.9 22.9 22.9 27.0 20.0 25.0 27.0  24.0
[13,] 22.9 22.9 22.9 22.9 22.9 26.0 22.0 20.0 22.0  26.0
[14,] 22.9 22.9 22.9 22.9 22.9 27.0 24.0 20.0 32.0  24.0
[15,] 22.9 22.9 22.9 22.9 22.9 19.0 15.0 22.0 31.0  30.0

and then compute the winners as before:

> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
  s11   s3   s1   s6  s14  s12   s5  s15 s13   s4   s7   s8   s9   s2  s10 
 25.2 24.8 24.2 24.1 24.1 23.7 23.6 23.1  23 22.9 22.5 21.5 20.9 20.1 18.8
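
Why do these means look like squeezed versions of the Idea 1 means?
Each student has five observed scores and five imputed ones, so the
new mean is exactly halfway between the observed mean and the grand
mean, and the ordering of the students cannot change (apart from how
ties are broken).  The sketch below checks this numerically; it
assumes the scores are typed in by hand, and the names "obs", "mat",
"grand", "filled" and "halfway" are made up for this illustration.

```r
# Observed scores, one row per student (s1..s15), five judges per session.
obs <- matrix(c(
  25,30,26,22,25,  17,17,18,15,20,  28,26,27,26,27,  21,26,23,22,23,
  25,24,23,26,24,  23,31,20,25,28,  20,19,24,23,25,  20,16,19,21,25,
  18,20,20,22,15,  18,15,18,12,11,  26,24,25,33,30,  27,20,25,27,24,
  26,22,20,22,26,  27,24,20,32,24,  19,15,22,31,30),
  ncol = 5, byrow = TRUE)
mat <- matrix(NA_real_, nrow = 15, ncol = 10,
              dimnames = list(paste("s", 1:15, sep = ""),
                              paste("j", 1:10, sep = "")))
mat[1:8, 1:5]   <- obs[1:8, ]
mat[9:15, 6:10] <- obs[9:15, ]

grand     <- mean(mat, na.rm = TRUE)       # grand mean of the 75 observed scores
obs_means <- rowMeans(mat, na.rm = TRUE)   # Idea 1: means of observed scores

filled <- mat
filled[is.na(filled)] <- grand             # Idea 2: grand-mean imputation
imp_means <- rowMeans(filled)

# Halfway point between each observed mean and the grand mean.
halfway <- (obs_means + grand) / 2
print(round(cbind(obs_means, imp_means, halfway), 1))
```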

IDEA 3:
------

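Use lm() to improve the imputations: fit an additive model with
student and judge effects to the filled-in data from Idea 2, then
replace each imputed score by its fitted value.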

> fit <- lm(score ~ student + judge,data=scores)
> scores$score[missing] <- predict(fit,scores[missing,])

> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
  s11   s3   s1  s14   s6  s12   s5  s15  s13   s4   s7   s8 s9   s2  s10 
 26.5 25.7 24.8 24.8 24.7 24.2 23.9 23.3 23.2 22.9 22.3 20.8 20 18.7 16.9

What happens if we repeat this?

> fit <- lm(score ~ student + judge,data=scores)
> scores$score[missing] <- predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
  s11   s3  s14   s1 s6  s12   s5  s15  s13   s4   s7   s8   s9 s2  s10 
 27.1 26.2 25.2 25.2 25 24.5 24.1 23.4 23.3 22.9 22.2 20.4 19.6 18 15.9

> fit <- lm(score ~ student + judge,data=scores)
> scores$score[missing] <- predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
  s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10 
 27.4 26.4 25.4 25.3 25.1 24.6 24.2 23.5 23.3 22.9 22.1 20.2 19.4 17.6 15.4

> fit <- lm(score ~ student + judge,data=scores)
> scores$score[missing] <- predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
  s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10 
 27.6 26.5 25.5 25.4 25.2 24.7 24.2 23.5 23.3 22.9 22.1 20.2 19.3 17.4 15.2

> fit <- lm(score ~ student + judge,data=scores)
> scores$score[missing] <- predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
  s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10 
 27.7 26.6 25.5 25.4 25.2 24.7 24.2 23.5 23.3 22.9 22.1 20.1 19.2 17.4 15.1

> fit <- lm(score ~ student + judge,data=scores)
> scores$score[missing] <- predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
  s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2 s10 
 27.7 26.6 25.5 25.4 25.2 24.7 24.3 23.5 23.3 22.9 22.1 20.1 19.2 17.3  15

> fit <- lm(score ~ student + judge,data=scores)
> scores$score[missing] <- predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
  s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2 s10 
 27.7 26.6 25.5 25.5 25.3 24.7 24.3 23.5 23.3 22.9 22.1 20.1 19.2 17.3  15


After seven passes the mean scores have nearly stabilized (the ranks
stopped changing after the second pass)...
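
The repeated fit-and-predict passes above can also be written as a
loop that stops by itself.  This is a minimal sketch, not the code
used for the transcript: it assumes the scores are typed in by hand,
starts from the grand-mean imputations of Idea 2, and uses a made-up
tolerance "tol" and iteration cap "max_iter" (the names "obs", "mat",
"change" and "em_means" are also just for illustration).

```r
# Observed scores, one row per student (s1..s15), five judges per session.
obs <- matrix(c(
  25,30,26,22,25,  17,17,18,15,20,  28,26,27,26,27,  21,26,23,22,23,
  25,24,23,26,24,  23,31,20,25,28,  20,19,24,23,25,  20,16,19,21,25,
  18,20,20,22,15,  18,15,18,12,11,  26,24,25,33,30,  27,20,25,27,24,
  26,22,20,22,26,  27,24,20,32,24,  19,15,22,31,30),
  ncol = 5, byrow = TRUE)
mat <- matrix(NA_real_, nrow = 15, ncol = 10,
              dimnames = list(paste("s", 1:15, sep = ""),
                              paste("j", 1:10, sep = "")))
mat[1:8, 1:5]   <- obs[1:8, ]
mat[9:15, 6:10] <- obs[9:15, ]

# Column layout, stacked column by column as in "scores.dat".
scores <- data.frame(
  score   = as.vector(mat),
  student = factor(rownames(mat)[as.vector(row(mat))], levels = rownames(mat)),
  judge   = factor(colnames(mat)[as.vector(col(mat))], levels = colnames(mat)))

# Starting values: grand-mean imputation, as in Idea 2.
missing <- is.na(scores$score)
scores$score[missing] <- mean(scores$score, na.rm = TRUE)

tol      <- 1e-8   # assumed stopping tolerance
max_iter <- 200    # assumed cap on the number of passes
for (iter in 1:max_iter) {
  fit    <- lm(score ~ student + judge, data = scores)   # refit the model
  new    <- predict(fit, scores[missing, ])              # re-impute
  change <- max(abs(new - scores$score[missing]))
  scores$score[missing] <- new
  if (change < tol) break                                # imputations stopped moving
}

em_means <- sort(tapply(scores$score, scores$student, mean), decreasing = TRUE)
print(iter)
print(round(em_means, 1))
```

Because no judge scored in both sessions, where the loop settles
depends on the starting values, so a different initial imputation may
give somewhat different means.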

Two things to note:

1. This is an example of an E-M algorithm:

   E-step: scores$score[missing] <- predict(fit,scores[missing,])
   M-step: fit <- lm(score ~ student + judge,data=scores)

2. We get different answers depending on the method!

(a) Mean of observed data:

> -sort(-sapply(split(scores$score,scores$student),mean,na.rm=T))
  s11   s3   s1  s14   s6  s12   s5  s15  s13 s4   s7   s8 s9   s2  s10 
 27.6 26.8 25.6 25.4 25.4 24.6 24.4 23.4 23.2 23 22.2 20.2 19 17.4 14.8

(b) Grand mean imputation:

> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
  s11   s3   s1   s6  s14  s12   s5  s15 s13   s4   s7   s8   s9   s2  s10 
 25.2 24.8 24.2 24.1 24.1 23.7 23.6 23.1  23 22.9 22.5 21.5 20.9 20.1 18.8

(c) E-M (one step):

> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
  s11   s3   s1  s14   s6  s12   s5  s15  s13   s4   s7   s8 s9   s2  s10 
 26.5 25.7 24.8 24.8 24.7 24.2 23.9 23.3 23.2 22.9 22.3 20.8 20 18.7 16.9

(d) Converged E-M:

> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
  s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2 s10 
 27.7 26.6 25.5 25.5 25.3 24.7 24.3 23.5 23.3 22.9 22.1 20.1 19.2 17.3  15