Edoardo M. Airoldi and Stephen E. Fienberg
Peter Hannaford was Reagan's main aide in drafting texts for the radio addresses during the years 1976-79, whereas the situation was less clear in 1975. To address the prediction problem properly for speeches delivered in different epochs, we therefore learned how to discriminate between the writing styles of Reagan and Hannaford, and we also studied the stylistic differences between Reagan and the undifferentiated pool of his collaborators. We explored a wide range of off-the-shelf classification methods as well as fully Bayesian Poisson and Negative-Binomial models for word counts. Simple majority voting reinforced the cross-validated accuracies of our predictions on speeches of known authorship, which settled above 90% in most cases. We produced separate sets of predictions, using the most accurate classification methods and the fully Bayesian models, for the 314 speeches whose author is uncertain. The predictions of all methods agree on 135 of the ``unknown'' speeches, whereas the fully Bayesian models agree among themselves on 289 of them. We further approximated log-odds of authorship as a measure of the strength of our predictions.
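The voting and agreement counts described above can be illustrated with a minimal sketch. It assumes each classifier emits one author label per speech; the function names, the speech identifiers, and the toy votes are hypothetical and are not taken from the study.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most classifiers, or None on a tie."""
    ranked = Counter(predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no strict majority among the top labels
    return ranked[0][0]


def unanimous(predictions):
    """True when every classifier predicts the same author."""
    return len(set(predictions)) == 1


# Toy votes from three hypothetical classifiers (illustrative only).
votes = {
    "speech_A": ["Reagan", "Reagan", "Reagan"],
    "speech_B": ["Reagan", "Hannaford", "Reagan"],
    "speech_C": ["Hannaford", "Hannaford", "Hannaford"],
}

combined = {speech: majority_vote(v) for speech, v in votes.items()}
n_unanimous = sum(unanimous(v) for v in votes.values())
```

Counting the unanimous speeches, as `n_unanimous` does here, is the kind of tally behind the agreement figures reported above.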
Among the crucial issues we had to deal with were the marked difference in the number of ``known'' speeches available for each author, and the word-selection phase. In the original dataset there were 679 speeches drafted by Reagan ``in his own hand'' and only 39 drafted by a few close collaborators. With the help of Prof. Kiron Skinner and Prof. Annelise Anderson we looked into the Reagan files and found 30 newspaper columns originally drafted by Peter Hannaford, but published with Reagan's signature. We coded them to obtain a set of 69 texts drafted by Reagan's collaborators, on which we based our inferences. The process of word selection was critical in order to understand ``the secrets'' of Reagan's writing style. Word counts were fit well by the Negative-Binomial distribution, and we relied on this fact to compute p-values for a certain statistic () in order to capture structural elements of differential writing style. We also considered other criteria for finding words with discriminatory power, such as Information Gain scores computed for Multinomial and multivariate Bernoulli models, as well as a semantic decomposition of the speeches using the Docu-Scope software. Following Frederick Mosteller and David Wallace in their analysis of ``the Federalist Papers'', we aimed for non-contextual words, with possibly a few exceptions, that occurred with high, medium, and low frequency. Our decisions about contextuality were informed by a prior idea of Reagan's style, based on the text of the Reagan vs. Carter Presidential debate and on several notes, comments, and books about Ronald Reagan. As an example consider the word ``Carter'': our prior idea about Reagan's style suggested that Reagan would seldom talk about his opponent, Carter, his line of attack being more subtle. He would mostly address the government, Capitol Hill people, and similar figures instead.
Thus, when the word ``Carter'' passed severe testing, which established that its differential use by Reagan and Hannaford was too marked to be the outcome of pure chance and that it was likely to capture some element of Hannaford's writing style, we did not discard it as contextual. Some have argued that Reagan's writing style might be better captured by the idioms he used. We therefore extended our analysis to sequences of successive words, and found that idioms such as ``if we'', ``in our'', ``I'd like to'', or ``in America'' identify Reagan's writing style beyond reasonable doubt.
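As an illustration of how successive-word features might be scored, the sketch below extracts word bigrams and computes an Information Gain score for the presence or absence of a feature, in the spirit of the multivariate Bernoulli criterion mentioned earlier. The tokenization, the function names, and the four toy sentences are assumptions made for this example; they are not drawn from the Reagan corpus and do not reproduce the actual selection procedure.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Set of lower-cased word n-grams occurring in a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_sets, labels, feature):
    """Reduction in label entropy from knowing whether `feature` is present."""
    present = [y for feats, y in zip(feature_sets, labels) if feature in feats]
    absent = [y for feats, y in zip(feature_sets, labels) if feature not in feats]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Toy documents, invented for illustration only.
docs = [
    "if we cut taxes in our country",
    "in our time if we act",
    "the committee reported that spending rose",
    "the report noted that taxes rose",
]
labels = ["Reagan", "Reagan", "Hannaford", "Hannaford"]

bigram_sets = [ngrams(d, 2) for d in docs]
gain_if_we = information_gain(bigram_sets, labels, "if we")
gain_taxes_rose = information_gain(bigram_sets, labels, "taxes rose")
```

In this toy setting a bigram that appears in every text of one author and in none of the other's attains the maximum gain of one bit, while a bigram seen in a single text scores lower.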
We concluded that, in 1975, Ronald Reagan drafted 77 speeches, and
his collaborators drafted 71, whereas, over the years 1976-1979,
Reagan drafted 90 speeches and Hannaford drafted 74. The
cross-validated accuracy of our best fully Bayesian model based on
the Negative-Binomial distribution for word counts was above 90% in
all cases. Further, our inferences were not sensitive to
``reasonable'' variations in the sets of constants underlying the
prior distributions, which we bracketed with a small study on 90
high-frequency function words. Our predictions for the speeches
whose author is uncertain are accurate and reliable, and the
agreement of several methods in predicting the author of the
``unknown'' speeches in most cases reinforced our confidence.
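To make the notion of log-odds of authorship concrete, here is a minimal plug-in sketch. It assumes that each author's use of a word follows a Negative-Binomial distribution with a known rate per thousand words and a known shape parameter, and that words are treated as independent. This is a simplification: the fully Bayesian models integrate over parameter uncertainty rather than fixing the parameters. The words, rates, and shape values below are invented for illustration.

```python
import math


def nb_logpmf(count, mean, shape):
    """Log-probability of `count` under a Negative-Binomial(mean, shape)."""
    return (
        math.lgamma(count + shape) - math.lgamma(shape) - math.lgamma(count + 1)
        + shape * math.log(shape / (shape + mean))
        + count * math.log(mean / (shape + mean))
    )


def log_odds(counts, n_words, params_a, params_b):
    """Sum over words of log P(count | author A) - log P(count | author B)."""
    scale = n_words / 1000.0  # rates are expressed per thousand words
    total = 0.0
    for word, count in counts.items():
        rate_a, shape_a = params_a[word]
        rate_b, shape_b = params_b[word]
        total += nb_logpmf(count, rate_a * scale, shape_a)
        total -= nb_logpmf(count, rate_b * scale, shape_b)
    return total


# Hypothetical (rate per 1000 words, shape) pairs; not estimates from the study.
reagan = {"we": (12.0, 4.0), "our": (8.0, 4.0)}
hannaford = {"we": (5.0, 4.0), "our": (6.0, 4.0)}

# Word counts observed in a hypothetical 500-word speech.
score = log_odds({"we": 6, "our": 4}, 500, reagan, hannaford)
```

A positive `score` favors the first author and a negative one favors the second, with the magnitude indicating the strength of the evidence under these assumptions.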