The ratings, on the other hand, are singular in their purpose, and they are calibrated to maximize their accuracy at answering one question: who would win? A team's rating number is nothing more than an expression of a relative win probability. Take two ratings, feed them into the algorithm, and they will spit out the odds that one team or the other will win a given debate. For example, a team with a rating of 25 is expected to beat teams with ratings of 21 roughly three out of four times. This is pretty much the beginning and end of what the ratings do.
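For those curious about the mechanics, rating systems of this kind typically convert a rating gap into a win probability with a logistic curve. Here's a minimal sketch of that idea in Python; the `scale` constant is a hypothetical value chosen so that a 4-point gap works out to roughly 75%, not the exact formula behind the ratings:

```python
def win_probability(rating_a, rating_b, scale=8.4):
    """Estimated probability that team A beats team B.

    Uses an Elo-style logistic curve. The `scale` constant is a
    back-of-the-envelope assumption chosen so that a 4-point rating
    gap yields roughly a 75% win probability (25 vs. 21, as above).
    """
    return 1 / (1 + 10 ** (-(rating_a - rating_b) / scale))

print(round(win_probability(25, 21), 2))  # ~0.75
```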
Nevertheless, I think there is value in thinking about how the ratings relate to the bid process. I do hope that the ratings can be a useful tool for voters - one metric among the many that they may consider. Furthermore, even though they aren't in any way calibrated to replicate the bid vote, the bid vote remains something of an external check on their validity. We can think of voters as something like a proxy for the collective opinion of the community (with all of the attendant problems of representation). If the ratings don't tend to correlate with bid outcomes, then there would perhaps be reason to question their usefulness (or, I suppose, the bid process itself).
Toward that end, this blog post shares some data concerning how well the ratings match up with the bid votes. The short version is that they're not perfect, but they do pretty well. The ratings are well within the range of error that we find among human voters.
Method
I collected the first and second round bid votes for each season stretching back to 2012-13 (the first year in my ratings data set). For each season, I compared each individual voter's rankings against the aggregate average of all of the voters, giving me the "error" of each voter (using RMSE for those of you who are interested). Then I created hypothetical "ballots" for how the computer ratings would have voted in each race and found their error as well. Next I calculated the average amount of error among voters and how much each voter performed above or below average. Finally, I averaged each voter's performance over the course of the past 6 years, using standard deviation to normalize the data across seasons.
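For anyone who wants the mechanics spelled out, here is a rough sketch of the single-season comparison; the function names and the ballots-as-rank-lists data layout are illustrative rather than the actual code:

```python
import numpy as np

def ballot_error(ballot, consensus):
    """RMSE between one voter's ranks and the consensus (average) ranks."""
    ballot = np.asarray(ballot, dtype=float)
    consensus = np.asarray(consensus, dtype=float)
    return np.sqrt(np.mean((ballot - consensus) ** 2))

def season_scores(ballots, computer_ballot):
    """Express each voter's error in standard deviations above or below
    the voter average for one season, plus the computer's score.

    `ballots` maps voter -> list of ranks for the same slate of teams;
    `computer_ballot` is the hypothetical ballot built from the ratings.
    """
    consensus = np.mean(np.array(list(ballots.values()), dtype=float), axis=0)
    errors = {v: ballot_error(b, consensus) for v, b in ballots.items()}
    mean_err = np.mean(list(errors.values()))
    sd_err = np.std(list(errors.values()))
    scores = {v: (e - mean_err) / sd_err for v, e in errors.items()}
    scores["computer"] = (ballot_error(computer_ballot, consensus) - mean_err) / sd_err
    return scores
```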
Results
First Round Bids
Across all voters, the mean error for first round ballots was 1.472. Perhaps this is an oversimplification, but one way to think about this is that voters were on average off in each of their rankings by 1.472 slots (weighting to penalize larger misses more). By contrast, the computer ratings had an error of 1.759, meaning that they performed slightly worse than the average voter. However, they were still within the overall range of human error, ranking 17th out of the 21 voters in the data set, 0.559 standard deviations below average.
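The "standard deviations below average" figure is just a z-score of the ratings' error against the spread of voter errors; working backwards from the numbers above implies a voter-error standard deviation of roughly 0.51 (a back-calculated estimate, not a figure reported above):

```python
mean_voter_error = 1.472
ratings_error = 1.759
z_score = 0.559  # standard deviations below the voter average, as reported

# Implied spread (standard deviation) of voter error for first round ballots:
implied_sd = (ratings_error - mean_voter_error) / z_score
print(round(implied_sd, 3))  # ~0.513
```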
Although counting "hits" and "misses" isn't a very good metric for evaluating accuracy, it's still kind of interesting to look at. The ratings have correctly chosen 15 of the 16 first round recipients in each of the last six years, missing one each year. The average hit rate among human voters is 15.381 out of 16.
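To be explicit about what counts as a "hit" here: it's just the overlap between a ballot's top 16 and the teams that actually received first round bids. Something like the following (illustrative names only):

```python
def hit_count(ranked_teams, actual_recipients, n_bids=16):
    """Count how many actual bid recipients appear in a ballot's top n_bids."""
    return len(set(ranked_teams[:n_bids]) & set(actual_recipients))
```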
Second Round Bids
In contrast to the first round data, the computer rating system performed slightly above average in its second round rankings. The mean error among voters was 3.993, while the average error of the ratings was only 3.742. The ratings were the 8th most accurate out of the 21 voters, coming out 0.359 standard deviations better than average.
I didn't calculate hits/misses for second round bids because of the complications introduced by third teams.
Final Thoughts
I went into this assuming that the ratings would do better with first round bids than with second rounds. There's generally more data on first round teams, and there is greater separation between teams at the top. In contrast, teams in the middle of the pack tend to group together without much differentiation. I had assumed that the ratings would struggle more with the small differences found at the peak of the bell-shaped curve.
In a strict sense, the computer ratings were more accurate with first rounds. The error for the ratings in the first round votes was less than half what it was for the second round votes. However, their performance relative to human voters flipped around.
I can only speculate about why this might be the case. It's possible that factors outside the strict Ws and Ls of debaters' ballots play more of a role in first round voting ("elim depth" and narrower head-to-head comparisons come to mind as possibilities). Similarly, it's possible that the amount and/or type of data available for the second rounds just doesn't produce as clear a hierarchy for human voters to identify, and so the ratings' ability to assimilate a large amount of information allows them to gain ground on the humans.
All told, the ratings seem to be a reasonable indicator for bid vote outcomes. They can't be taken as determinative, and there are certainly occasions when they are significantly off about a team (which is also true of human voters). Nevertheless, they have been pretty squarely within the range of error displayed by human voters.