I went back to the data for the 2013-14 season and calculated the results both with and without the Kentucky & Dartmouth round robins. For the most part, the results confirmed my expectations, but there were a few interesting issues that emerged along the way.
As a caveat, keep in mind that this analysis is based on the data from a single season, and even more narrowly, the results of two tournaments. It's not possible to draw truly rigorous conclusions from such a small sample. On the other hand, there are a couple of ways that we can try to minimize the limitations of the sample size, which I will get to toward the end of the post.
Quickly, a reminder concerning exactly what the Glicko ratings attempt to represent. In contrast to many other methods of rating competitors, Glicko is not primarily a representation of a team's previous competitive results. Instead, it is an estimate of a team's ability relative to other teams. It uses past results to make predictions about what future results would be. Specifically, the difference between two teams' ratings can be translated into a win probability prediction of a hypothetical round between them.
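To make the "rating difference becomes a win probability" idea concrete, here is a minimal sketch. It assumes the expected-score formula from the original Glicko system; the ratings discussed in this post may be computed with different parameters or a variant of it. The function names and the `opp_rd` parameter (the opponent's ratings deviation) are illustrative choices, not anything taken from the actual ratings code.

```python
import math

Q = math.log(10) / 400  # scaling constant used by the original Glicko formulas


def g(rd):
    """Discount factor: shrinks a rating gap when the opponent's rating is uncertain."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def win_probability(rating, opp_rating, opp_rd=0.0):
    """Predicted chance that a team beats one opponent (assumed Glicko expected score)."""
    return 1 / (1 + 10 ** (-g(opp_rd) * (rating - opp_rating) / 400))


if __name__ == "__main__":
    # Hypothetical ratings, for illustration only.
    print(win_probability(1700, 1500))              # gap taken at face value
    print(win_probability(1700, 1500, opp_rd=200))  # same gap, uncertain opponent
```

Under these assumptions, a 200-point gap with no uncertainty works out to roughly a 76% win probability, and the prediction moves closer to 50% as the opponent's ratings deviation grows.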
This is an important distinction because it means that ratings go up or down based not on raw success or failure, but rather how a team performs relative to expectations. As a result, a team with a high rating will not necessarily see a ratings gain after a strong performance because they were already expected to have a strong performance. It was "baked in" to their rating going in. Their rating will only go up if they perform even better than they were expected to.
This has important implications for how tournaments like the round robins would be weighed in the ratings calculation. Figuring out how to weigh a round robin has always bumped up against the fact that the difficulty is not easily comparable to a regular tournament. 5-3 at the Kentucky round robin is not equivalent to 5-3 at any other tournament, even the NDT. As a solution to this problem, some have suggested only "rewarding" success at a round robin, but not "punishing" failure. This is a strange logic to me. Should elim losses also not be considered? Any loss against a team that is considered to be good? This feels wrong to me, but I do agree that we should try to figure out a good way of incorporating these rounds in a way that best accounts for their difficulty while also not giving people an advantage for merely being present.
I think the Glicko rating system has a pretty good answer to the problem for a couple of reasons:
1. The aggregate average rating for all teams at the end of a tournament will always be nearly equivalent to the aggregate average at the beginning of the tournament. This is because ratings are determined relationally and any gain made by one team is always matched with a loss by another. A team's rating will never suffer because they did not attend a specific tournament. Similarly, a team's rating will never be advantaged through attendance at a specific tournament.
2. Since ratings changes are determined based on how a team performs relative to expectations, opponent strength (and thus, indirectly, the strength of the tournament) is the core element. A lower rated team will not be expected to win very many rounds, so going 1-7 or 2-6 wouldn't necessarily have much of any negative impact on their rating. However, if they went 3-5 or 4-4, then they might see a solid ratings bump despite having a mediocre record.
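The two points above can be sketched with a single-period rating update. This is a minimal sketch that assumes the update formulas of the original Glicko system; the function names, the sample ratings, and the ratings deviations are hypothetical, and the actual ratings here may be computed differently.

```python
import math

Q = math.log(10) / 400


def g(rd):
    """Discount factor for an opponent's ratings deviation."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def expected(rating, opp_rating, opp_rd):
    """Predicted score against one opponent."""
    return 1 / (1 + 10 ** (-g(opp_rd) * (rating - opp_rating) / 400))


def glicko_update(rating, rd, results):
    """Update one team after a rating period.

    results is a list of (opp_rating, opp_rd, score), where score is 1 for a
    win and 0 for a loss. Returns (new_rating, new_rd).
    """
    surprise = 0.0     # sum of (actual - expected), discounted by g
    information = 0.0  # how much the results tell us about the team
    for opp_rating, opp_rd, score in results:
        e = expected(rating, opp_rating, opp_rd)
        surprise += g(opp_rd) * (score - e)
        information += g(opp_rd) ** 2 * e * (1 - e)
    precision = 1 / rd**2 + Q**2 * information
    return rating + (Q / precision) * surprise, math.sqrt(1 / precision)


if __name__ == "__main__":
    # Hypothetical underdog (1500) facing eight much stronger opponents (1800).
    two_and_six = [(1800, 50, 1)] * 2 + [(1800, 50, 0)] * 6
    print(glicko_update(1500, 100, two_and_six))
```

In this made-up example the underdog is expected to win only a little more than one round, so a 2-6 record beats expectations and the rating goes up despite the losing record.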
Looking at the 2013-14 round robins, we can evaluate how teams performed relative to expectations and then how that impacted their final ratings.
How the Round Robins Impact Ratings
Below are two tables, one for Kentucky and one for Dartmouth. The tables include each team's end of season rating, a calculation of how many wins they would be expected to gain at the tournament, and how many they actually achieved. The expected wins stat was calculated by adding together a team's win probabilities (each of which can be understood as expressing a fraction of a win) against each opponent.
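The expected wins calculation described above can be sketched as follows. This is a simplified illustration: it uses a plain logistic win probability that ignores ratings deviation, and the ratings in the example are invented rather than taken from the tables.

```python
def win_probability(rating, opp_rating):
    """Simplified win probability from a rating gap (ratings deviation ignored)."""
    return 1 / (1 + 10 ** ((opp_rating - rating) / 400))


def expected_wins(team_rating, opponent_ratings):
    """Add up the fractional wins predicted against each opponent."""
    return sum(win_probability(team_rating, r) for r in opponent_ratings)


if __name__ == "__main__":
    # Hypothetical eight-round schedule.
    opponents = [1900, 1850, 1800, 1750, 1700, 1650, 1600, 1550]
    predicted = expected_wins(1750, opponents)
    actual = 5
    print(predicted, actual - predicted)  # wins above or below expectation
```

A team that faces eight opponents with exactly its own rating would be expected to win 4 rounds, since each round contributes half a win.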
At first glance we can spot a few things:
1. Michigan AP substantially outperformed expectations. Between the two tournaments, they won about 3 more rounds than expected.
2. Wake MQ was 1.24 wins short of expectations and Oklahoma LM was 1.03 wins short. Otherwise, all other teams were within a win of their expectation.
Concerning the impact that the tournament results should have on the ratings change post-tournament, we would expect that Michigan should be substantially advantaged by ratings that include the round robins, Wake & Oklahoma should be a little bit disadvantaged, and everybody else should stay about the same.
Below is a table that shows a comparison between the end-of-season (pre-NDT) ratings that include the round robins versus those that exclude them. The above predictions are certainly borne out, but there are actually a couple of unexpected developments as well...
First, Michigan definitely does benefit from inclusion of the tournaments. Removing the round robins drops them from 2nd overall to 5th (falling behind Wake's LeDuc & Washington as well, though this might not be fair because I left in the results of the Pittsburgh RR, where Wake LW did quite well).
Second, Wake and Oklahoma do benefit from excluding the round robins, but their gain is far larger than would be expected from their very modest deviations from the predictions. Wake gains a massive 71 points, enough to move from 13th overall up to 8th. Oklahoma gains only (a still large) 25 points, but that is enough to move them up 8 spots in the rankings from 24th to 16th.
Third, the impact on the majority of the teams is negligible. 8 teams stay within 1 ranking spot of where they would have been otherwise.
Fourth, somehow Harvard BS lost a couple of points with the inclusion of the round robin despite having outperformed expectations by 3/4 of a win.
Fifth, what's up with Kentucky GR?!?!?! They were only about half a win short of expectations, but somehow they lost 31 points and 8 ranking spots at the round robin.
Finally, and most subtly, there is something very peculiar happening with the aggregate averages. Remember, above I said that the aggregate average of all teams' ratings at a tournament should be about the same at the end of the tournament as it was at the beginning. Points are basically zero sum. However, in this instance, we see that the average is 15.5 points lower when the round robins are included. This seemingly shouldn't be possible.
After some digging, I was able to figure out that what is happening in these latter three observations springs from the same source, and it presents a real problem for how to evaluate the results of the Kentucky round robin in particular.
Glicko ratings are vulnerable to overall point deflation in one specific circumstance: when very good teams have been inactive or are significantly underrated. This is one consequence of the fact that all teams (in the unweighted version of the ratings) start their first tournament with a default rating of 1500. Despite the fact that we know this rating is too low, this is not usually a problem. Since the "average" is always necessarily the midpoint of a given pool, there will always be, by definition, the same amount of ability points (though not necessarily debaters) that are above average as below. However, when we start adding new debaters to the pool at subsequent tournaments, we don't know if they're going to be above average or below average. If they're above average, then the default of 1500 will underrate them (and vice versa). This can cause deflation because the only way for a significantly underrated team to reach its proper rating is to take points from somebody else. At a large invitational tournament, the impact of this will not be particularly noticeable for a few reasons, including that there will also be a set of overrated debaters from whom points will be redistributed to make up for it.
At the 2013 Kentucky round robin, however, it was very noticeable. Here are the ratings of each team heading into the tournament (after only 1 period of tournaments had been entered into the ratings - UMKC & GSU) compared to their end of season ratings:
Overall, what we have is a very small set of debaters, some of whom are massively underrated. Oklahoma is rated 1500 because they didn't compete in a season opener, so the round robin was their first tournament. Mary Washington had an extremely poor performance at Georgia State, going 5-3 and losing in doubles. To put it in perspective, Mary Washington's true rating would make them a 6:1 favorite over a team with a rating of 1546. In contrast, only one team entered the round robin with a better rating than they would end the season with - Wake Forest - and there was also one other team that was only moderately underrated, which was Kentucky.
The effect that these discrepancies can have on the ratings is not insignificant. If everybody were underrated, it wouldn't be a big deal. They would trade points among each other, and then go back to a normal invitational tournament, where they would start stealing points away from the larger pool. However, the large difference in the accuracy of each team's ratings created a situation in which a couple of teams were losing disproportionately large amounts of points because they were wrongly predicted to win rounds that they should have been predicted to lose.
To make it more concrete, Kentucky was actually considered to be a 64% favorite against Mary Washington. Their end of season ratings would indicate that in actuality Mary Washington was a 76% favorite (better than 3:1) in that debate. The result is that when Wake Forest and Kentucky lost, they lost big.
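For readers who want to move between the percentages and the odds quoted above, here is a small sketch of the conversion. The function names are illustrative only.

```python
def odds_in_favor(probability):
    """Convert a win probability into odds expressed as 'x to 1'."""
    return probability / (1 - probability)


def probability_from_odds(favor, against=1.0):
    """Convert 'favor to against' odds back into a win probability."""
    return favor / (favor + against)


if __name__ == "__main__":
    print(odds_in_favor(0.76))       # the 76% favorite, as odds
    print(odds_in_favor(0.64))       # the 64% favorite, as odds
    print(probability_from_odds(6))  # a 6:1 favorite, as a probability
```

A 76% favorite comes out at a little over 3:1, and a 6:1 favorite corresponds to a win probability of about 86%.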
This issue certainly does raise questions about how to best approach the inclusion of the Kentucky round robin. It honestly would not really have been as much of a problem if Mary Washington, Oklahoma and Michigan hadn't been such outliers. There would have still been some discrepancy, but the impact would have been substantially less.
For example, consider instead the ratings of the teams prior to the Dartmouth RR:
There is still some over- and under-rating, but that's how it should be or there would be no reason to have the debates. Here, the discrepancy is in fact within the expected amount of ratings deviance assigned to each team.
As a result, the effect of Dartmouth round robin results on post-tournament ratings is much more consistent with what we would expect than was Kentucky. Looking at the difference between the final season (pre-NDT) ratings when Dartmouth is included versus excluded, we see the following:
These results are much more like what we should expect. There is some movement in ratings, but nothing out of the ordinary. Mary Washington benefits from the inclusion of the round robin, which makes sense because they won 1.4 rounds more than predicted. Wake Forest loses ground because they fell 0.9 wins short of their prediction. Everybody else was within half a win of their prediction, so they saw only a marginal change in their rating.
How the Round Robins Impact Predictive Accuracy
As a simple test, we can compare how well each set of ratings is able to predict results at last year's NDT. Below are the mean square error and the mean absolute error of 4 different sets of ratings.
I'm not really going to get into what these averages mean here, but the upshot is that there is not a whole lot of difference. The ratings that totally exclude the round robins end up being the most accurate, but not by much. Oddly, excluding Dartmouth is comparatively better than excluding Kentucky despite the issues discussed above.
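For reference, the two error measures can be sketched in a few lines. This assumes the standard definitions: each prediction is a win probability between 0 and 1, and each outcome is the share of the round actually won. The sample numbers are made up.

```python
def mean_squared_error(predictions, outcomes):
    """Average of the squared gaps between predicted and actual results."""
    return sum((y - p) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)


def mean_absolute_error(predictions, outcomes):
    """Average of the absolute gaps between predicted and actual results."""
    return sum(abs(y - p) for p, y in zip(predictions, outcomes)) / len(predictions)


if __name__ == "__main__":
    # Hypothetical predictions and results (1 = win, 0 = loss).
    predictions = [0.67, 0.50, 0.80]
    outcomes = [1, 0, 1]
    print(mean_squared_error(predictions, outcomes))
    print(mean_absolute_error(predictions, outcomes))
```

Squaring penalizes badly missed predictions more heavily than the absolute version does, which is why the two measures can rank sets of ratings slightly differently.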
However, basing an assessment of the predictive accuracy of the ratings on how well they predict a single tournament isn't a terrifically great metric. There is a ton of noise in a single tournament. Given the small margins, it wouldn't take many upsets for the metric to be thrown off.
To solve this problem, we can run a bootstrap simulation that significantly expands our dataset. To do this, we take the entire set of 2013-14 rounds and repeatedly re-sample the data, creating as many hypothetical tournaments as we like. Then, we can see which set of ratings does the best job of predicting those hypothetical tournaments.
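The resampling idea can be sketched roughly as below. This is not the code used for the simulations; it assumes each round is stored as a `(team_a, team_b, outcome)` tuple, that hypothetical tournaments are drawn with replacement, and that `predict` is some function returning team A's win probability. All of those names are placeholders.

```python
import random


def bootstrap_mean_abs_error(rounds, predict, n_tournaments=1000,
                             tournament_size=None, seed=0):
    """Average prediction error over many resampled hypothetical tournaments."""
    rng = random.Random(seed)
    size = tournament_size or len(rounds)
    per_tournament = []
    for _ in range(n_tournaments):
        # Draw rounds with replacement to build one hypothetical tournament.
        sample = rng.choices(rounds, k=size)
        errors = [abs(outcome - predict(a, b)) for a, b, outcome in sample]
        per_tournament.append(sum(errors) / size)
    return sum(per_tournament) / n_tournaments


if __name__ == "__main__":
    # Toy data: three rounds and a predictor that always says 50/50.
    rounds = [("A", "B", 1.0), ("B", "C", 0.0), ("A", "C", 1.0)]
    print(bootstrap_mean_abs_error(rounds, lambda a, b: 0.5, n_tournaments=100))
```

Running the same function once per set of ratings (by swapping in a different `predict`) gives directly comparable error figures.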
I ran the ratings through 10,000 tournament simulations, averaging their errors.
You may notice that these numbers are substantially higher than the numbers given above regarding NDT predictions. This is due to the fact that every round at the NDT is judged by panels, whereas the results of most other debates during the season are binary. Split decisions on a panel are calculated as a fraction of a win, allowing the ratings to get significantly closer to an accurate prediction. For example, if a 67% favorite wins on a 2-1, then the ratings were essentially right on, whereas if a 67% favorite wins on a 1-0, then the prediction is calculated as being 33% off.
For this reason, I've included an additional metric for evaluating error, called binomial deviation. This stat is designed as a better way to evaluate error when faced with binary predictions. Since the majority of the rounds in the simulations will be binary win/loss debates, this might be a better way of comparing the ratings.
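As a rough sketch of what such a metric can look like: the version below assumes the stat follows the usual binomial deviance (log loss) form, computed with natural logarithms and with predictions clipped away from 0 and 1. The exact formula, log base, and clipping used for the numbers in this post may differ.

```python
import math


def binomial_deviance(predictions, outcomes, eps=1e-6):
    """Average log loss: confident predictions that miss are penalized heavily."""
    total = 0.0
    for p, y in zip(predictions, outcomes):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0 (assumed clipping)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(predictions)


if __name__ == "__main__":
    # Hypothetical upset: the favorite loses (outcome 0).
    print(binomial_deviance([0.6], [0]))  # mild favorite loses
    print(binomial_deviance([0.9], [0]))  # heavy favorite loses
```

Under this form, a 50/50 prediction always scores the same regardless of the result, while a confident prediction is rewarded when right and punished sharply when wrong.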
Returning to the numbers, we find results much more in line with what we would have expected given the problems with the Kentucky ratings. The most accurate predictions still come from the ratings that exclude both round robins, but almost all of the error of the round robins is accounted for by merely removing the Kentucky results. In fact, when it comes to mean absolute error, the ratings that include Dartmouth produce a nearly identical outcome to those that exclude both tournaments.
Again, a cautionary note about too quickly drawing conclusions from the data. Even though I simulated a large set of tournaments to compare the ratings against, we're still only talking about the impact that 1 or 2 tournaments are having on the ratings themselves. Once I get the 2012-13 data cleaned up, it will offer another point of comparison, but that's the end of the data available on tabroom.com. It would be possible to simulate the round robins for resampling, but I'm not sure that would necessarily be helpful.
What does seem clear from this analysis is that there is a need to rethink how the Kentucky data is handled. While its effect on the ratings overall is not exactly huge, the damage is clearly observable and somewhat predictable given how early the tournament is in the semester. It is notable, however, that this should only be a problem for the "unweighted" version of the ratings (for more on the difference between "weighted" and "unweighted" ratings, see the FAQ). A system that assigned weighted start values to team ratings at the beginning of the year (perhaps based on previous season ratings) would mitigate the risk of underrating.