JUDGING AN ENGINEERING COMPETITION
----------------------------------

The files "scores.dat" and "matscores.dat" contain the results,
formatted in two different ways, of an engineering undergraduate
research competition.  Each of 15 students wrote a short abstract of
their project and gave an oral presentation to five faculty judges.
Projects were graded by the judges on a scale of 5 (poor) to 35
(excellent) [these scores are the sum of scores on a 5-point scale in
each of seven judging categories].  Projects were judged in two
sessions, projects 1 to 8 in the morning and projects 9 to 15 in the
afternoon.  A different group of five judges graded each session.

Here is "matscores":

> matscores _ read.table("matscores.dat")
> matscores
    j1 j2 j3 j4 j5 j6 j7 j8 j9 j10
 s1 25 30 26 22 25 NA NA NA NA  NA
 s2 17 17 18 15 20 NA NA NA NA  NA
 s3 28 26 27 26 27 NA NA NA NA  NA
 s4 21 26 23 22 23 NA NA NA NA  NA
 s5 25 24 23 26 24 NA NA NA NA  NA
 s6 23 31 20 25 28 NA NA NA NA  NA
 s7 20 19 24 23 25 NA NA NA NA  NA
 s8 20 16 19 21 25 NA NA NA NA  NA
 s9 NA NA NA NA NA 18 20 20 22  15
s10 NA NA NA NA NA 18 15 18 12  11
s11 NA NA NA NA NA 26 24 25 33  30
s12 NA NA NA NA NA 27 20 25 27  24
s13 NA NA NA NA NA 26 22 20 22  26
s14 NA NA NA NA NA 27 24 20 32  24
s15 NA NA NA NA NA 19 15 22 31  30

so, for example, the "18" in row 2 column 3 means that judge 3 gave
student 2 a score of 18 points.

The data file "scores.dat" lists the same data in column format, for
use with "lm()", "aov()", etc. (e.g., the line "18 s2 j3" in
"scores.dat" also means that judge 3 gave student 2 a score of 18
points, and "NA s11 j4" means judge 4 did not rate student 11).

> scores _ read.table("scores.dat",header=T)
> scores[1:5,]
  score student judge
1    25      s1    j1
2    17      s2    j1
3    28      s3    j1
4    21      s4    j1
5    25      s5    j1
[ etc.]

The goal is to give out first, second and third prizes in the
engineering competition.

IDEA 1:
------

Simply rank the students by their average observed scores.
Here are two ways to do this:

> -sort(-apply(matscores,1,mean,na.rm=T))
 s11   s3   s1   s6  s14  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
27.6 26.8 25.6 25.4 25.4 24.6 24.4 23.4 23.2   23 22.2 20.2   19 17.4 14.8

> -sort(-sapply(split(scores$score,scores$student),mean,na.rm=T))
 s11   s3   s1  s14   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
27.6 26.8 25.6 25.4 25.4 24.6 24.4 23.4 23.2   23 22.2 20.2   19 17.4 14.8

[as an Splus exercise, what is the second one doing??]

IDEA 2:
------

First, guess ("impute") the missing values in the data set.  One
simple guess is to substitute the grand mean of the observed data for
all the missing data.

> missing _ is.na(scores$score)
> scores$score[missing] _ mean(scores$score,na.rm=T)
> round(matrix(scores$score,ncol=10),1)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,] 25.0 30.0 26.0 22.0 25.0 22.9 22.9 22.9 22.9  22.9
 [2,] 17.0 17.0 18.0 15.0 20.0 22.9 22.9 22.9 22.9  22.9
 [3,] 28.0 26.0 27.0 26.0 27.0 22.9 22.9 22.9 22.9  22.9
 [4,] 21.0 26.0 23.0 22.0 23.0 22.9 22.9 22.9 22.9  22.9
 [5,] 25.0 24.0 23.0 26.0 24.0 22.9 22.9 22.9 22.9  22.9
 [6,] 23.0 31.0 20.0 25.0 28.0 22.9 22.9 22.9 22.9  22.9
 [7,] 20.0 19.0 24.0 23.0 25.0 22.9 22.9 22.9 22.9  22.9
 [8,] 20.0 16.0 19.0 21.0 25.0 22.9 22.9 22.9 22.9  22.9
 [9,] 22.9 22.9 22.9 22.9 22.9 18.0 20.0 20.0 22.0  15.0
[10,] 22.9 22.9 22.9 22.9 22.9 18.0 15.0 18.0 12.0  11.0
[11,] 22.9 22.9 22.9 22.9 22.9 26.0 24.0 25.0 33.0  30.0
[12,] 22.9 22.9 22.9 22.9 22.9 27.0 20.0 25.0 27.0  24.0
[13,] 22.9 22.9 22.9 22.9 22.9 26.0 22.0 20.0 22.0  26.0
[14,] 22.9 22.9 22.9 22.9 22.9 27.0 24.0 20.0 32.0  24.0
[15,] 22.9 22.9 22.9 22.9 22.9 19.0 15.0 22.0 31.0  30.0

and then compute the winners as before:

> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
 s11   s3   s1   s6  s14  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
25.2 24.8 24.2 24.1 24.1 23.7 23.6 23.1   23 22.9 22.5 21.5 20.9 20.1 18.8

IDEA 3:
------

Use lm() to improve the imputations:

> fit _ lm(score ~ student + judge,data=scores)
> scores$score[missing] _
    predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
 s11   s3   s1  s14   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
26.5 25.7 24.8 24.8 24.7 24.2 23.9 23.3 23.2 22.9 22.3 20.8   20 18.7 16.9

What happens if we repeat this?

> fit _ lm(score ~ student + judge,data=scores)
> scores$score[missing] _ predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
 s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
27.1 26.2 25.2 25.2   25 24.5 24.1 23.4 23.3 22.9 22.2 20.4 19.6   18 15.9

> fit _ lm(score ~ student + judge,data=scores)
> scores$score[missing] _ predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
 s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
27.4 26.4 25.4 25.3 25.1 24.6 24.2 23.5 23.3 22.9 22.1 20.2 19.4 17.6 15.4

> fit _ lm(score ~ student + judge,data=scores)
> scores$score[missing] _ predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
 s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
27.6 26.5 25.5 25.4 25.2 24.7 24.2 23.5 23.3 22.9 22.1 20.2 19.3 17.4 15.2

> fit _ lm(score ~ student + judge,data=scores)
> scores$score[missing] _ predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
 s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
27.7 26.6 25.5 25.4 25.2 24.7 24.2 23.5 23.3 22.9 22.1 20.1 19.2 17.4 15.1

> fit _ lm(score ~ student + judge,data=scores)
> scores$score[missing] _ predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
 s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
27.7 26.6 25.5 25.4 25.2 24.7 24.3 23.5 23.3 22.9 22.1 20.1 19.2 17.3   15

> fit _ lm(score ~ student + judge,data=scores)
> scores$score[missing] _ predict(fit,scores[missing,])
> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
 s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8
  s9   s2  s10
27.7 26.6 25.5 25.5 25.3 24.7 24.3 23.5 23.3 22.9 22.1 20.1 19.2 17.3   15

Finally the mean scores stabilize, changing by no more than 0.1 in the
last step (the ranks had already stabilized)...

Two things to note:

1. This is an example of an E-M algorithm:

   E-step: scores$score[missing] _ predict(fit,scores[missing,])
   M-step: fit _ lm(score ~ student + judge,data=scores)

2. We get different answers depending on the method!

   (a) Mean of observed data:

> -sort(-sapply(split(scores$score,scores$student),mean,na.rm=T))
 s11   s3   s1  s14   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
27.6 26.8 25.6 25.4 25.4 24.6 24.4 23.4 23.2   23 22.2 20.2   19 17.4 14.8

   (b) Grand mean imputation:

> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
 s11   s3   s1   s6  s14  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
25.2 24.8 24.2 24.1 24.1 23.7 23.6 23.1   23 22.9 22.5 21.5 20.9 20.1 18.8

   (c) E-M (one step):

> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
 s11   s3   s1  s14   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
26.5 25.7 24.8 24.8 24.7 24.2 23.9 23.3 23.2 22.9 22.3 20.8   20 18.7 16.9

   (d) Converged E-M:

> round(-sort(-sapply(split(scores$score,scores$student),mean,na.rm=T)),1)
 s11   s3  s14   s1   s6  s12   s5  s15  s13   s4   s7   s8   s9   s2  s10
27.7 26.6 25.5 25.5 25.3 24.7 24.3 23.5 23.3 22.9 22.1 20.1 19.2 17.3   15
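
The E-step and M-step in note 1 were repeated by hand above.  As a
rough sketch only, the same iteration can be wrapped in a loop that
stops when the imputed values stop changing.  The sketch is written in
R syntax ("<-" instead of "_"), types the scores in from the
"matscores" listing instead of reading the data files, and starts from
the grand-mean imputation of Idea 2.  The names em_means, tol and
max_iter are made up for this illustration and are not part of the
original session.

```r
# Scores typed in from the "matscores" listing (rows: students, columns: judges).
am <- rbind(c(25,30,26,22,25), c(17,17,18,15,20), c(28,26,27,26,27),
            c(21,26,23,22,23), c(25,24,23,26,24), c(23,31,20,25,28),
            c(20,19,24,23,25), c(20,16,19,21,25))
pm <- rbind(c(18,20,20,22,15), c(18,15,18,12,11), c(26,24,25,33,30),
            c(27,20,25,27,24), c(26,22,20,22,26), c(27,24,20,32,24),
            c(19,15,22,31,30))
mat <- rbind(cbind(am, matrix(NA, 8, 5)), cbind(matrix(NA, 7, 5), pm))
dimnames(mat) <- list(paste("s", 1:15, sep = ""), paste("j", 1:10, sep = ""))

# Column format, one row per (student, judge) pair, judge by judge.
scores <- data.frame(
  score   = as.vector(mat),
  student = factor(rep(rownames(mat), 10), levels = rownames(mat)),
  judge   = factor(rep(colnames(mat), each = 15), levels = colnames(mat)))

# Starting values: grand mean of the observed scores (Idea 2).
missing <- is.na(scores$score)
scores$score[missing] <- mean(scores$score, na.rm = TRUE)

# Hypothetical helper: alternate M-step and E-step until the largest
# change in any imputed score falls below tol (or max_iter is reached).
em_means <- function(scores, missing, tol = 1e-6, max_iter = 1000) {
  for (iter in 1:max_iter) {
    old <- scores$score[missing]
    fit <- lm(score ~ student + judge, data = scores)          # M-step
    scores$score[missing] <- predict(fit, scores[missing, ])   # E-step
    if (max(abs(scores$score[missing] - old)) < tol) break
  }
  list(means = sapply(split(scores$score, scores$student), mean),
       iterations = iter)
}

res <- em_means(scores, missing)
print(round(-sort(-res$means), 1))
print(res$iterations)
```

Because the stopping rule here is tighter than the handful of steps run
by hand, one or two of the rounded means may differ from listing (d) by
about 0.1, though under these assumptions the first three places
should be the same.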