Perhaps more importantly, I have also posted a table of weighted averages that combines the data on each judge over the past three years to show what their average points have been and how they have trended this year (or the most recent year for which we have data on them). This table progressively weights the data so that more recent seasons count more than previous ones.

For a more detailed description of how to read the tables, refer to my previous post from April.

I have posted a table breaking down how each judge in the community distributes speaker points under the "Speaker Points" tab.

My goal is to provide people more information about how speaker points get assigned so that hopefully we can all make more informed decisions. My hope is that this information is not used to single out any particular judge or judges for criticism. Instead, it is an attempt to make a relatively opaque process somewhat more transparent. Assigning speaker points is not an exact science. Nor is it completely arbitrary or capricious. Hopefully, judges can use this information to better understand how their points relate to the community at large.

To be clear, I do not believe that there is any such thing as "correct" points. Similarly, there is no single rubric for what counts as a good speaker. Every judge values different things about a speaker, and that should be celebrated. Furthermore, beyond random variance, there may be a good reason for a judge's points to diverge if they value qualities in speakers that are disproportionately undervalued by the rest of the community. The aim in normalizing point distributions is not to get everybody to agree about what counts as a good speaker. Rather, it is to get everybody to use a common language in scoring. We may disagree about what "good" means, but for speaker points to work, we need to know that when I think a speech is good, I'm giving it similar points to the ones you give when you think a speech is good.

I apologize that the table is not necessarily presented in a format that is super easy to understand without some basic knowledge of statistics, but there is a glossary that defines each of the categories. Furthermore, to help clarify, I will work through my own line as an example.


Here is my breakdown:

I have judged 27 rounds this year that are included in the sample, and in those rounds the median point value that I have assigned is 28.5. As a point of reference, 28.5 is also the median point value assigned by judges across all debates, so at first glance my average points seem spot on with the community.

However, we can see that I give slightly below average points by looking at the "Deb Med" and "Med Diff" columns. "Deb Med" (or Debater Median) is the median points that the debaters I have judged have gotten in all of their debates over the course of the year -- 28.6, meaning that they were slightly above average speakers. "Med Diff" (Median Difference) is the average amount by which my points deviate from the points that those I judge typically receive. I have a -0.1 median difference, which means that on average, I give a tenth of a point less than what everybody else gives the same debaters. *Median difference is the simplest way to see if your points tend to deviate from average and by how much.*

The next two columns ("< Med" and "> Med") go together. They express the percentage of the time that you give points that are below ("< Med") or above ("> Med") the average points of those you judge. Ideally, these two numbers would be equal, meaning that you give out below average points as often as you give out above average points. However, we can see that my split is not even. I give out below average points 67% of the time and above average points only 17% of the time. This is consistent with what we would expect from the fact that my Median Difference is also negative.
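To make the first few columns concrete, here is a rough sketch of how they could be computed. The data here is made up purely for illustration; the actual tables are built from the full season's ballots.

```python
from statistics import median

# Toy data: each pair is (points I assigned in a round, that debater's
# median points across all of their rounds this season)
rounds = [
    (28.4, 28.6), (28.5, 28.5), (28.3, 28.6), (28.6, 28.5),
    (28.2, 28.4), (28.5, 28.7), (28.7, 28.6), (28.4, 28.5),
]

my_median = median(p for p, _ in rounds)                  # "Median"
deb_median = median(d for _, d in rounds)                 # "Deb Med"
med_diff = round(my_median - deb_median, 2)               # "Med Diff"

below = sum(1 for p, d in rounds if p < d) / len(rounds)  # "< Med"
above = sum(1 for p, d in rounds if p > d) / len(rounds)  # "> Med"
```

Note that "< Med" and "> Med" need not sum to 100%, since some rounds land exactly on the debater's median.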

The final four columns all go together and indicate how often the judge gives points that significantly deviate from a debater's average ("SD" meaning Standard Deviation). To be clear, we should expect this to happen. Debaters are not robots. They perform inconsistently, and different judges value different things in a speaker. However, if there is a large and consistent skew toward the positive or negative, then a judge might consider whether their points are out of tune with community norms for what points generally mean. Under the "-2 SD" column, I have a 2.3, which means that 2.3% of the time I give points that are more than 2 standard deviations worse than what those debaters usually receive. 10.3% of the time I give points that are between 1 and 2 standard deviations worse, and 2.3% of the time I give points that are between 1 and 2 standard deviations better. I never gave points that were more than 2 standard deviations better. To make this concrete: points outside of 2 standard deviations are at the extremes, basically what we would expect to be the highest or lowest ~2% of points that that debater will receive over the course of the year. Points that are more than 1 standard deviation away are about the highest/lowest ~16%.
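The ~2% and ~16% figures fall straight out of the normal distribution, assuming a debater's points are roughly normally distributed over the season. A quick check using the standard normal CDF:

```python
from math import erf, sqrt

def phi(z):
    # Standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

# Share of a debater's points expected to land more than
# 1 or 2 standard deviations above (or, symmetrically, below) average
beyond_1sd = 1 - phi(1.0)   # roughly 16%
beyond_2sd = 1 - phi(2.0)   # roughly 2%
```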

In sum, I gave out slightly low points, but I should be able to address it with a fairly minor correction.


Congratulations to everybody on a great season! In particular, congrats to Rutgers for their incredible performances at CEDA and the NDT, and also to Harvard for their ridiculous consistency on the way to the Copeland Award. More than that, however, congratulations to every debater that suited up and made it to a tournament. I was fortunate enough to catch a bunch of fantastic debates this year. I have been consistently excited by the quality of important and provocative scholarship that I have had the great fortune to witness being explored by so many of you.

This is the last set of ratings for the season. At some point during the summer, I will try to take stock of the current state of the ratings. Feel free to contact me directly if you have questions, concerns, or suggestions. I try to be as transparent and forthcoming as possible.

As usual, disclaimers:

- These are not my personal opinions. The algorithm is set and runs autonomously from how I may personally feel about teams. I do not put my finger on the scale.
- The ratings are determined by nothing more than the head-to-head outcome of debate rounds. No preconceptions about which schools or debaters are good, no weighting for perceived quality of tournaments, no eye test adjustments. If you beat somebody, your rating goes up and theirs goes down. If you beat somebody with a much higher rating, it goes up more. If you beat them in elims, it will go up by more than if you do so in prelims. That's it.
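As a toy illustration of that logic -- not the actual algorithm, and with a scale, k-factor, and elim multiplier invented purely for the sketch -- an Elo-style update looks something like this:

```python
def expected_win_prob(r_a, r_b, scale=4.0):
    # Probability that A beats B; the bigger the rating gap, the
    # closer this gets to 1 (illustrative logistic mapping)
    return 1 / (1 + 10 ** ((r_b - r_a) / scale))

def update(r_winner, r_loser, elim=False, k=0.5):
    # The less expected the win, the bigger the swing; elims count extra
    surprise = 1 - expected_win_prob(r_winner, r_loser)
    swing = k * surprise * (1.5 if elim else 1.0)
    return r_winner + swing, r_loser - swing
```

Beating a much higher-rated opponent produces a large `surprise` and thus a large swing; beating a clearly weaker opponent moves both ratings very little.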

For a sense of what the ratings number actually means:

- A 1 point ratings advantage translates roughly into 5:4 expected odds
- 2 points is about 3:2
- 3 points is about 2:1
- 4 points is about 3:1
- 5 points is about 4:1
- 8 points is about 9:1

Also, I am aware that Kansas's Robinson is listed twice (as are others potentially). This is because she had enough rounds with two separate partners to be listed, and I don't really want to be in the business of keeping up with all of the partnership changes that happen on every team.


Not a lot to say this time other than that there is still a lot of room for movement at the last couple tournaments of the regular season. Only about a point and a half separate the teams in the top five. This is also about the same distance separating #11 from #18. An extra win or two over quality opponents, particularly in elims, could have a big impact for these teams.


One unforeseen consequence of posting the previous edition of the ratings when there were so many quality teams without the required number of rounds to be listed is that it artificially inflated the rankings of a lot of teams. As a result, many teams that were previously ranked in the top 50 dropped a number of spots without actually performing any worse. They were just bumped down as new teams were added to the list. In the future, I'll have to consider whether it might be better to just wait until the end of the first semester for the first release.

I wanted to wait until the coaches poll was out to post the new ratings. I will refrain from commenting in any detail about specific teams, but it is interesting to think about the differences in where some teams are ranked. I doubt that there is a single factor that can explain all of the instances where there is divergence between the computer rating and the human poll. However, if I were to make a couple of guesses about what might be at work, I think the following might be relevant:

- It is possible that human voters are more likely to think in terms of team performance as a "resume" or "body of work." Thus, teams that the computer ratings like because of strong head-to-head results might be disadvantaged if they have been to fewer tournaments (or fewer prestigious tournaments).
- It may be possible that human voters are more likely to value "elim depth" with less regard for the specific opponents that teams defeated (or lost to). The computer ratings do give extra weight to elim wins, but what matters is *who* a team competes against in elims rather than which round they made it to. Thus, the algorithm might be more impressed with a team that took down two highly rated opponents and dropped in quarters than a team that had an easy draw to semis.
- For teams with fewer rounds than average, there may be a moderately outsized recency effect in their results. Less data makes a team's rating more volatile, which means that it can move up (or down) more quickly.
- It might be the case that in some instances the computer algorithm could be less forgiving of teams with inconsistent results. While this was only a quick dive into the data (and there are not very many data points to compare), it appeared on first glance that teams that possessed both a high rate of error (performed against expectation more often) and a large number of total rounds (which should tend to reduce error) performed slightly worse in the computer ratings versus the human poll. Just or Unjust? You decide.
- Finally, UMKC. Pretty much down the line, the human poll valued success at the UMKC tournament less than the computer did.

I hope to get my hands on the raw data from the coaches' ballots to see how much consensus/dissensus there was among the voters. It could be useful to evaluate whether the divergence that we see with the computer rankings is within the range of human disagreement internal to the poll itself.

The usual disclaimers:

- It is still early in the season, so the ratings are subject to a fair amount of volatility, especially for teams with fewer rounds. They grow more stable over time.


If you feel strongly that you belong on this list but don't see your name, it may be that the ratings simply do not have enough data on you. The deviation for your rating has to be below 2.2 to be listed (which roughly amounts to around 18 to 20 rounds). I count somewhere around 25 teams that might have a good enough rating but lack the number of rounds necessary to give the system enough confidence.

If you feel strongly that some teams are not correctly ranked, consider:

- The ratings are determined by nothing more than the head-to-head outcome of debate rounds. No preconceptions about which schools or debaters are good, no weighting for perceived quality of tournaments, no eye test adjustments. If you beat somebody, your rating goes up and theirs goes down. If you beat somebody with a much higher rating, it goes up more. If you beat them in elims, it will go up by more than if you do so in prelims. That's it. If you want all the gory details, follow the link given below.
- The quality of the ratings is limited by the quantity and quality of the data available. It is still early in the season and a whole lot of teams haven't seen one another. The geographic split in tournament travel makes things even more complicated. Teams listed high or low right now might see considerable changes in their rankings over the course of the season. It is entirely possible (even certain) that there are teams that have not performed in a way that's consistent with how good they "really are."

There have been significant changes to the ratings algorithm from last year. For a detailed description, please follow this link. The changes can be briefly summarized as follows:

- The ratings now use an implementation of the TrueSkill algorithm (TM Microsoft Corporation - Yay capitalism!) instead of the Glicko ratings system. The two systems share much of the same basic logic, but the core mathematical tools used are very different.
- The ratings now do a much better job accounting for opponent strength early in the season than in previous iterations. The algorithm does this by retroactively using data from "future" rounds to help it assess opponent strength when it is not sufficiently confident in its contemporaneous rating of them. An example: say you debated Team X at the first tournament of the year. The ratings do not have very much data at that point to determine how good they are. So, the algorithm will look at that team's future rounds at subsequent tournaments to create a provisional rating in order to give you more appropriate credit for your performance against them (it looks forward the minimum amount necessary to reach an adequate level of confidence). However, when you debate Team X again later in the year, say at the Wake Forest tournament, the ratings might now have a much better picture of how good your opponent is so they no longer have to create a provisional rating and will instead just use Team X's real (contemporaneous) one.
- The ratings now give extra weight to elimination round wins.
- I have stopped trying to account for partnership switches. Instead, teams will only be considered as a unit. If you debate with multiple partners, each pairing is considered a discrete unit. You will not get the credit from a win (or loss) with one partner on your rating with a different partner. While I feel bad for those that this might disadvantage, the obstacles are too large for me to overcome. It's possible that somebody might prove me wrong, but I have serious doubts that it is even possible to create an adequate model that accounts for partner switches given the type of information collected on ballots.
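The look-ahead logic from the second bullet can be sketched as follows. The real ratings use TrueSkill's machinery; the simple win rate here is only a stand-in for a rating, and the `min_rounds` threshold is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Round:
    time: int   # tournament order
    won: bool

def win_rate(rounds):
    # Stand-in for a real rating: just the share of rounds won
    return sum(r.won for r in rounds) / len(rounds)

def rating_at(opponent_rounds, t, min_rounds=4):
    # If the opponent has enough rounds by time t, trust the
    # contemporaneous rating; otherwise look ahead to just enough
    # later rounds to reach min_rounds (the minimum necessary)
    past = [r for r in opponent_rounds if r.time <= t]
    if len(past) >= min_rounds:
        return win_rate(past)
    future = sorted((r for r in opponent_rounds if r.time > t),
                    key=lambda r: r.time)
    return win_rate(past + future[:min_rounds - len(past)])
```

Early in the season the function borrows future results to build a provisional rating; once enough contemporaneous rounds exist, the look-ahead is never used.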



If you follow the link given above, you will find some graphics showing how the new ratings algorithm has performed when using data from the past four years. While the ratings don't explicitly set out to predict at large bids, they would have produced ballots well within the range of error of the actual human voters. They would have performed as a slightly below average voter for first round bids, but would have actually been an above average voter for second round bids.

For next season there will be fairly significant changes to the debate ratings algorithm that will, I believe, mark substantial improvements both in terms of the accuracy of its predictions as well as how well it meets the "eye test." This is a long post, so I'll summarize the short and sweet of it here. If you're interested in the details, there is a lot to read below. I go into some depth concerning rationale and supporting data for the decisions that I've made.

The ratings were previously based on the Glicko algorithm developed by Mark Glickman (which was itself inspired by the Elo rating system developed for chess competition by Arpad Elo). In the upcoming season, the debate ratings will instead shift to an adaptation of the TrueSkill algorithm which was developed by Microsoft Research. Some of you (who should feel guilty for not cutting enough cards) may be familiar with the TrueSkill system as the basis of the algorithm Microsoft uses for matchmaking in its Xbox video games. The logic undergirding TrueSkill is very similar to the Glicko rating system -- notably, they both use Bayesian inference methods and assume that an individual skill rating can be represented as a normal distribution with mean and deviation -- but they use different mathematical tools to get there. The debate ratings are my own adaptation that attempts to apply these mathematical tools to the peculiarities of college policy debate.
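The shared core idea of both systems fits in a few lines of code. The numbers below use TrueSkill-style defaults (a prior mean of 25 with deviation 25/3), and the mu - 3*sigma display rule is a common TrueSkill leaderboard convention rather than necessarily the exact one these ratings use.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    mu: float      # mean of the normal belief about a team's skill
    sigma: float   # deviation: how uncertain that belief is

    def conservative(self, k=3):
        # Display mu - k*sigma so unproven (high-sigma) teams
        # don't leapfrog established ones on the list
        return self.mu - k * self.sigma

newcomer = Skill(mu=25.0, sigma=25 / 3)  # wide prior: little is known yet
veteran = Skill(mu=25.0, sigma=1.0)      # same mean, far more certainty
```

Both teams have the same mean skill estimate, but the newcomer's conservative rating sits near zero while the veteran's sits at 22; winning rounds shrinks sigma and raises the displayed number.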

The reason for the change is simple: TrueSkill does a better job given the specific needs of the policy debate season. The bulk of this post will attempt to unpack the reasons why this is true. For now, it can just be said that TrueSkill results in ratings that more accurately predict actual round results as well as more closely reflect the wisdom of first and second round at-large bid votes.


I think if I were to summarize the fundamental limitation of the Glicko ratings, it would be that they are too conservative. The system has a tendency to underestimate the differences between teams and make predictions that understate the favorite's win probability. To put it in more concrete terms, the ratings might suggest that one team is a 2:1 favorite over another when, in reality, they are closer to a 3:1 favorite -- or in a more distorting scenario, a 10:1 favorite when the truth is actually that it would be a miracle for the underdog to win even one out of fifty.

The somewhat counter-intuitive consequence of this is that the ratings can be less accurate even as they are better at avoiding making big mistakes. The reason for this is that the accuracy of the ratings depends not just on them picking the right winners but also on their ability to *predict the rate at which the favorite will also lose*. For example, in order for the ratings to be accurate in the aggregate, the underdog should win one out of every five debates in which the odds are 4:1 against them. If favorites substantially outperform their expected record, then that indicates that the ratings are not making good predictions because they are not adequately spread out.

Here is a graph depicting the retrodictions made by the final 2015-16 Glicko ratings. The x-axis is the expected win probability of the favorite, and the y-axis is the rate at which the ratings chose the correct winner. Ideally, we would want to see a perfectly diagonal line (i.e., for all the times the ratings suggest a 75% favorite, the favorite should win 75% of the time). Instead, what we find is a curve, indicating that favorites are winning a fair deal more than the ratings think they "should" be. For example, those that the ratings expect to be 75% favorites are actually winning over 85% of their rounds.
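For the curious, a curve like this can be computed by binning the favorite's predicted win probability and comparing each bin's average prediction to the favorite's actual win rate:

```python
def calibration_curve(predictions, bins=5):
    # predictions: list of (favorite_win_prob, favorite_won) pairs,
    # with probabilities in [0.5, 1.0]
    curve = []
    for b in range(bins):
        lo = 0.5 + 0.5 * b / bins
        hi = 0.5 + 0.5 * (b + 1) / bins
        in_bin = [(p, won) for p, won in predictions
                  if lo <= p < hi or (b == bins - 1 and p == 1.0)]
        if in_bin:
            avg_pred = sum(p for p, _ in in_bin) / len(in_bin)
            actual = sum(won for _, won in in_bin) / len(in_bin)
            curve.append((avg_pred, actual))
    return curve
```

On a well-calibrated system the two numbers in each pair match; the 2015-16 Glicko data instead pairs predictions around 0.75 with actual win rates over 0.85.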


Why does this discrepancy matter? The biggest reason is that if the predictions are too conservative, then the favorite may get "too much credit" for the win (or not enough blame for the loss). Since a team's rating goes up or down based on the difference between what the ratings expect to happen and what actually happens, an erroneous pre-round expectation leads to an erroneous post-round ratings update. The place where this appears to have created the biggest problem is situations where above average teams have been able to feast on lesser competition. At a typical tournament, power matching will help to ensure that a team will have to face off against an appropriate level of competition. However, this check might disappear entirely if a solid national level competitor travels to a smaller regional tournament where they are clearly at the top. As a result, after exceptional runs through regional competition, some teams have seen extraordinary bumps to their ratings that were perhaps unearned.

The graph for the new algorithm looks much better:


It's not a perfectly diagonal line, but it is much closer (75% favorites are winning around 77.5% of their rounds). TrueSkill accomplishes this by more effectively spreading the ratings out. For example, out of a total of over 6000 rounds in 2015-16, there were only a bit over 400 that the Glicko ratings considered to have heavy favorites with over 9:1 odds. By contrast, there were nearly 2000 rounds that TrueSkill considered to have these odds (parenthetically, I'm not sure what moral we should take from the fact that somewhere between a quarter and a third of all debates are virtually decided before they even happen).

By giving more spread between teams, the new algorithm will make it much less likely for a team to disproportionately benefit from defeating lesser competition.


One of the major concerns about the previous iteration of the ratings was that the system failed to distinguish between results that happen in the prelims versus the elims. I don't care to rehearse all of the arguments made for why elim rounds are different from prelim rounds. In the past, I have expressed some reservation about these arguments not because I necessarily disagree with their logic, but rather because there are a number of mathematical as well as conceptual questions that have to be answered before moving forward.

The first, and maybe most important, question concerns what end we're trying to accomplish by weighing elims differently. Are we actually looking for a quantitative statistical measure, or do we just want a rating that validates our qualitative impressions of what it means to win an elim round or tournament? To be clear, I’m not trying to give any value to the terms qualitative/quantitative or treat them as a mutually exclusive binary. I just think it’s important to know what we want. Maybe another way of saying this, especially apropos of elim success, is: where do you come down on the Kobe/LeBron debate? Is Kobe the greater one because of the Ringzzz and the clutchness and the assassin’s mentality, or LeBron because of PER, BPM, adjusted +/-, etc? I had an exchange with Strauss recently and he argued that no amount of NBA finals losses could ever add up to a single championship. We could extend the question to debate. Does any amount of prelim wins add up to an elim win? A tournament championship? If the goal is to value tournament championships then you don’t need fancy math to figure that out. Just count ‘em up. It’s easy.

Assuming that we are actually looking for a quantitative measurement, that raises a couple of follow-up issues.

For any kind of statistical quantification, an "a priori" decision has to be made concerning what is being measured -- the referent that attaches the stat to some meaningful part of reality. For the current algorithm, the referent is the ballot. The quality of the ratings is measured against how well it predicts/retrodicts ballots. Any tweaks can be evaluated based on whether they successfully increase the accuracy of the predictions.

Without a specific and measurable object against which to measure, a stat runs the risk of becoming arbitrary. This is why I have big objections to any kind of stat that assigns arbitrary or indiscriminate weight to specific rounds or "elim depth" (“finals at Wake is 100 points, quarters at GSU is 20, octas at Navy is 5, round 6 at Emporia is 1, etc”). This is the worst kind of stat: a qualitative evaluation (which there’s nothing wrong with on its own terms) masquerading as “hard” numerical quantification.

Given the need for a referent, there are a couple of ways of going about differentiating elims that I can think of:

1. Keep the referent the same (the ballot) but weigh elims differently. This would maybe be the easiest to implement in terms of the inputs, but there would be a potentially problematic elision that happens. See, the trick of the ratings is that the input and the output are actually basically the same thing. You input ballot results, and what you get out is a number that serves as a predictor of … ballot results. If you weigh elims differently, then you are actually now inputting two different variables. Not necessarily a huge problem *unless* the new variable distorts the correspondence of the output to the referent (i.e. makes it worse at predicting ballot results).

2. Change the referent (and also weigh elims differently). It would potentially be possible to create a rating with the express purpose of predicting elim wins rather than overall wins. However, we would still need to be clear about what specifically we want to predict. Tournament championships? All elim wins? At all tournaments or only the majors (and how do you define the majors)? Just the NDT? The big obstacle this would run up against is sample size (both of inputs and of results against which to test retrodictions). While a handful of teams may get 20ish regular season elim rounds, even good teams will have far fewer (there were first round bids with fewer than 10). Then you have the vast majority of people who are lucky to see one elim round (especially at a major). At its extreme, it is possible that this would be a stat that could only hope to be statistically meaningful for a handful of teams (and even for those I would still have concerns about sample validity).

3. Running parallel overall and elim predictors would certainly be possible. However, beyond the fact that it would still have to address the elim sample size problem, my other concern here is ethical/political. Many (perhaps most) of the people that have contacted me to give support for doing the ratings have been from teams that are not at the very top. Maybe in some ways this is not surprising because those at the top already receive a lot of recognition. The ratings are one form of evidence of success for teams that otherwise may not receive a ton of it. It is not hard to predict that if one rating were designed expressly for the top 5-10 (or even 25) teams that the other would be devalued.

Nevertheless, **even given my concerns, I do still think there is potential value in figuring out an effective way to weigh elims differently**, and I think that I have discovered one that, while not perfect, does go a long way to help address the objections. The primary determinant of the ratings update algorithm will still be the quality of a team's opponent, but it is possible to apply a multiplier to the calculation for elim debates.
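To make the mechanics concrete, here is a minimal Elo-style sketch of what an elim multiplier does to a rating update. This is not the actual TrueSkill math (TrueSkill uses Gaussian skill distributions rather than a logistic curve), and every name and constant below is invented for illustration:

```python
def expected_score(r_a, r_b, scale=400.0):
    """Probability that team A beats team B (Elo-style logistic curve)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

def update(r_a, r_b, a_won, k=32.0, elim=False, ewm=3.0):
    """Update team A's rating after one round.

    `ewm` is the elim win multiplier: an elim result moves the rating
    ewm times as far as a prelim result. Constants are illustrative.
    """
    step = k * (ewm if elim else 1.0)
    actual = 1.0 if a_won else 0.0
    return r_a + step * (actual - expected_score(r_a, r_b))
```

With these toy numbers, beating an even opponent in a prelim moves a 1500 rating to 1516, while the same win in an elim moves it to 1548. The opponent-strength logic is untouched; only the step size changes.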

Below is a table that uses data from the past four seasons: a total of over 28,000 rounds, including about 2200 elims. There's a lot of information to process here, so I'll try to simplify it. Each row is a different iteration of the ratings using escalating elim win multipliers (EWM). I've also included the old Glicko rating as a point of comparison. The column boxes are four different metrics by which the accuracy of the ratings can be evaluated: 1) how well the final ratings retrodict all debate rounds from the year, 2) how well they retrodict only elimination rounds, 3) how well ratings that use only data through the holiday swing tournaments can predict all results after the swings, and 4) how well they predict only post-swing elim debates. "Correct" is the percentage that the ratings picked the right winner, MAE is the mean absolute degree of error, and MSE is the mean squared degree of error. Without going into great detail, MSE is different from MAE in that it magnifies the consequences of big misses.

Blue is good. Red is not so good.
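For readers who want the metrics pinned down, here is a small sketch of how the three accuracy measures can be computed from a set of win-probability predictions. The function name is mine, not part of the actual pipeline:

```python
def evaluate(preds, results):
    """Score predicted win probabilities against actual outcomes.

    preds:   predicted probability that the first-listed team wins
    results: 1.0 if that team actually won, else 0.0
    Returns (correct pick rate, MAE, MSE).
    """
    n = len(preds)
    # Correct: how often the predicted favorite actually won.
    correct = sum((p > 0.5) == (r == 1.0) for p, r in zip(preds, results)) / n
    # MAE: mean absolute degree of error.
    mae = sum(abs(p - r) for p, r in zip(preds, results)) / n
    # MSE: mean squared degree of error; magnifies big misses.
    mse = sum((p - r) ** 2 for p, r in zip(preds, results)) / n
    return correct, mae, mse
```

Note how a confident miss (say, 0.9 on the eventual loser) inflates MSE far more than MAE, which is why the two metrics can disagree about the best multiplier.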

One thing is immediately apparent: every version of TrueSkill performs substantially better than Glicko by just about every metric.

The second thing that sticks out is that there is no easy and direct relationship between weighing elims and helping or hurting the accuracy of the ratings. It depends on which metric you prioritize:

- Every bit of added elim weighting hurts the correct pick rate and MSE in the All Round Retrodictions test, but it does linearly decrease MAE.
- Up to a (fairly high) point, extra weight for elims does increase the accuracy of Elim Retrodictions across the board.
- A moderately high elim weighting improves the correct pick rate and the MAE for Post-Swing Predictions, but also harms MSE.
- The most confusing result comes in the Post-Swing Elim Predictions test, where a mid-range elim weighting is best at picking correct winners, high weighting best for reducing MAE, and no weighting best for minimizing MSE.

I want to briefly unpack what is happening in a couple of the sections of the table, especially in the Elim Retrodictions portion. While the Elim Retrodictions provides some important information, there is a risk of imputing too much significance to its findings. Because of limitations in the sample and the fact that these are retrodictions (rather than predictions), there is a risk of overfitting the model to the peculiarities, randomness, and noise of a small set of past results rather than providing a generalized model with predictive power. The stark contrast between the Elim Retrodictions and the Post-Swing Elim Predictions boxes helps to highlight the problem. Both recognize that weighting elim wins helps the accuracy of the ratings in its evaluation of elim debates, but they disagree over which level of weighting is optimal. When we try to *retrodict* the past, a very high elim win multiplier works best, but when we try to *predict* the future, things become more complicated.

The numbers in the table are relatively abstract, so I'm going to take a detour that I think should help to show how these numbers play out in more concrete terms.

Strictly speaking, the ratings do not attempt to replicate/predict the at-large bids for the NDT. However, the bid voters probably represent the clearest proxy that we have for "conventional wisdom" or the judgment of the community, and we can use the votes as a way to externally check how well the ratings produce results that are in line with the expert judgment of human beings that are able to account for various contextual factors outside the scope of the information available to the ratings algorithm.

Here is a table that uses data from the last four years that shows how each iteration of the ratings with different levels of elim win multipliers compare. It contrasts the actual bid vote results against the hypothetical ballots that would be produced by the ratings algorithm. MAE is the average (mean) amount that the computer ballot deviated per team from the final aggregate vote. MAE Rnk is how this would have ranked among the human ballots. So, for example, over the last four years, the basic TrueSkill algorithm without any elim multiplier produced first round at-large bid ballots that rated teams, on average, within about 1.65 spots of where they actually ended up. This would average as the 12th best human voter per year. MSE and MSE Rnk are similar, except they are weighted to magnify the consequence of larger errors (big misses).

For comparison, I've also included the average errors of the human voters themselves at the top. Once again, blue is good, red is not so good.
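As a sketch of how the MAE Rnk column can be read: treat each ballot as a mapping from team to rank, score it against the final aggregate vote, and count how many human ballots beat the computer's score. These helper names are mine, not part of the actual pipeline:

```python
def ballot_mae(ballot, final_rank):
    """Mean absolute deviation of one ballot's ranks from the final aggregate.
    Both arguments map team name -> rank."""
    return sum(abs(ballot[t] - final_rank[t]) for t in final_rank) / len(final_rank)

def mae_rank(computer_ballot, human_ballots, final_rank):
    """Where the computer ballot would place among the human voters (1 = best)."""
    score = ballot_mae(computer_ballot, final_rank)
    return 1 + sum(ballot_mae(h, final_rank) < score for h in human_ballots)
```

Squaring the per-team deviations instead of taking absolute values gives the MSE versions, which punish a ballot that misses badly on one team more than one that is slightly off on several.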

This is where things get interesting. While none of the computer algorithm ballots have been as good as the average bid voter when it comes to First Round Bids, the distance separating them is not massive. In particular, the TrueSkill algorithm with an elim win multiplier of 3 would rank as just a little below the average bid vote. Over the last four years, it would ring in as, on average, the 9th best voter as measured by MSE. While this may not seem spectacular, it does mean that the ratings produce results that are well within the range of expert human judgment.

Perhaps more significantly, **the ratings have actually been better than the average bid voter** when it comes to Second Round Bid voting. At lower levels of elim win multiplier, the ratings would rank, on average, as around the 5th or 6th most accurate voter over the last four years. In fact, the TrueSkill ratings with an elim win multiplier of 3 would have produced a ballot for the 2015-16 season that **would have resembled the final aggregate vote more closely than any of the human voters.**

The other thing that is quickly apparent from the table is that higher levels of EWM produce results that are wildly divergent from the judgment of bid voters. One of the big reasons for this is that excessive weight on elim rounds will drastically magnify the recency effect of the ratings. Those who do well at the last tournament of the year will see a significant boost in their rating that can't be checked back by subsequent information.

The numbers have convinced me that it is possible to give added weight to elim debates within the parameters of the TrueSkill algorithm in a way that helps the ratings more closely reflect the collective common sense of the community without jeopardizing the accuracy of the system's predictions -- in fact, in some ways the predictions may be enhanced.

While there is no clear answer on exactly how much extra weight elimination rounds should receive, I have decided on a multiplier of 3 as the Goldilocks option. It seems to hit the sweet spot with regard to the judgment of bid voters, and while not perfect by any of the prediction accuracy metrics, it does manage to make the ratings more accurate by many measures. At the end of the 2016-17 season, I will integrate the new data and reevaluate to determine if a change should be made.

There is one other major change in the ratings that may be just as important as the shift in the basic algorithm: a way of dealing with lack of information early in the season. The solution involves running the algorithm twice: once to form a data frame of provisional opponent ratings, and a second time to formulate each team's actual rating.

For both Glicko and TrueSkill (as well as Elo) rating systems, the difference in ratings between two opponents indicates the probability of the outcome of a debate between them. A small ratings difference indicates fairly evenly matched teams, while a large ratings difference suggests that one team is a heavy favorite. Team ratings go up or down based on the difference between the predicted outcome of the debate and the actual outcome. After the round, each team's ratings will be recalculated based on how they performed against expectations. So, if a team defeats an opponent that it was heavily expected to defeat, then its rating may barely move at all. But if an underdog overcomes the odds and wins a big upset, then their rating would move a much larger amount. Evenly matched opponents will experience changes somewhere in the middle. As a result, opponent strength is integrated from the beginning of the calculation. Wins over stronger opponents are worth more because there is a larger difference between actual outcome and predicted outcome.
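Concretely, using an Elo-style logistic curve as a stand-in for TrueSkill's actual (Gaussian) expectation, the size of the move tracks the surprise:

```python
def win_prob(r_a, r_b, scale=400.0):
    # Elo-style logistic expectation; TrueSkill's real curve is Gaussian,
    # so the numbers below are illustrative only.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

K = 32.0  # illustrative step size

# A heavy favorite (rated 400 points higher) wins: the rating barely moves.
favorite_gain = K * (1.0 - win_prob(1900, 1500))   # about +2.9

# The same team loses the upset: the rating moves an order of magnitude more.
upset_loss = K * (0.0 - win_prob(1900, 1500))      # about -29.1
```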

The difficulty arises at the beginning of the season when there is a lack of information to form a reliable rating. In order to formulate a prediction concerning the outcome of a given debate, the ratings need to be able to assess the strength of each team. If too few rounds have occurred, then the algorithm's prediction is far less reliable. This can be seen at the extreme before round one of the first tournament of the year, when zero information is available to form a prior expectation. The ratings are a blank slate in this moment and incapable of distinguishing whether you are debating against college novices fresh out of their first interest meeting or last year's national champions.

In its previous iteration, the ratings relied on one very helpful tool to cope with this problem: deviation. Each team's rating distribution is defined by two parts: the mean of its skill distribution and the breadth of the variation in that skill distribution. In more basic stats terms, this can be understood as similar to a confidence interval. The algorithm expresses more confidence in a team's rating as its deviation goes down. It uses this confidence to weight how much a single debate round can influence a team's rating. A team with a large deviation will see their rating fluctuate rapidly (and all teams start the season with very large deviations), while a team with a low deviation will have the weight of their previous rounds prevent a new round result from having too much influence. Deviation goes down as you get more rounds.

Deviation helps the ratings cope with lack of information at the beginning of the season. The default is that each team begins the season with a very large deviation, which is basically the algorithm's attempt to acknowledge that it is not confident in the mean rating. Since deviation is used to weight the post-round ratings update such that those with large deviations experience larger changes, this allows such a team's rating to more quickly self-correct from earlier inaccurate predictions. Additionally, losses to teams with high deviations have less of an effect than losses to teams with low deviations.
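A toy illustration of the two roles deviation plays: it scales the size of the update, and it shrinks as rounds accumulate. The formula and every constant below are invented for illustration; this is not Glicko's or TrueSkill's real update:

```python
def update_with_deviation(rating, dev, predicted, actual,
                          k=30.0, dev_decay=0.97, dev_floor=25.0):
    """Move the rating in proportion to the surprise (actual - predicted),
    scaled by the team's own deviation, then shrink the deviation.

    Invented constants; a sketch of the idea, not the production math.
    """
    new_rating = rating + k * (dev / 100.0) * (actual - predicted)
    new_dev = max(dev_floor, dev * dev_decay)
    return new_rating, new_dev
```

Under these toy numbers, a season-opening team (deviation 350) that wins a coin-flip round jumps 52.5 points, while an established team (deviation 100) gains only 15 from the identical result.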

While deviation helps to significantly mitigate the effect of limited information at the beginning of the season, it does not entirely resolve the problem. The effects of erroneous predictions are substantially evened out over time, but they are never completely eliminated and can add up. For an individual team the effect will be quite small, often negligible. However, the recent trend toward segregation in the early travel schedule magnifies the problem, especially if there is a disparity in the strength of competition at the different tournaments. If there is no prior information on the teams, the ratings are unable to distinguish between an undefeated team at one tournament versus another.

The current iteration of the ratings relies on the eventual merging of the segregated pools of debaters to even things out over time. Given enough time and intermingling of teams, it would. Unfortunately, the debate season is a closed cycle, and the ratings would be helped if they could accelerate the process.

The solution to this problem is actually relatively simple. If the problem is a lack of information to form a reliable rating for assessing how good your opponent is, then what we need to do is give the algorithm more information. One way to do this would be to go into the past, using results from previous seasons to form an estimate of the team's skill. However, beyond the fact that this doesn't address the lack of information on first year debaters or the complexities of new partnerships, I find this solution undesirable because it also violates the singularity of each season.

Instead, what we can do is use information from the *future* to gain a more accurate picture of opponent quality. It is possible to use results from subsequent rounds to form a better estimate of how good a given opponent is. What the new ratings do is effectively run the ratings algorithm twice. On the first pass, it creates a provisional rating for each team that uses all available information -- for example, when I update the ratings in January after the swing tournaments, it will use all rounds from the beginning of the season through those tournaments. On the second pass, it will use those provisional ratings in its predictions to estimate opponent strength until such a time as that opponent has a sufficiently reliable rating.

To be clear, this does not involve double counting. The provisional rating is only ever used to evaluate how strong an *opponent* is. The second pass starts each team's actual rating from scratch. When Team A debates Team B in round one of the season opener, the ratings create a separate prediction for each side. One prediction will be between the ratings for a blank slate Team A versus a reliable Team B; the other between a blank slate Team B and a reliable Team A. The first will be used to update Team A's rating, the latter to update Team B's rating. The algorithm eventually stops using a team's provisional rating once that team's actual rating becomes reliable enough (i.e., its deviation becomes small enough, a length of time that varies, but is usually reached in the neighborhood of 25 rounds).
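In simplified form, the two-pass scheme looks roughly like this. I've substituted a basic Elo update for the real TrueSkill one and used a round count as a stand-in for the deviation threshold, so treat it as a sketch of the control flow rather than the production code:

```python
def elo_update(r_a, r_b, a_won, k=24.0):
    """Simple Elo update standing in for the real TrueSkill calculation."""
    exp = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    return r_a + k * ((1.0 if a_won else 0.0) - exp)

def two_pass(rounds, start=1500.0, reliable_after=25):
    """rounds: chronological list of (team_a, team_b, a_won) tuples.

    Pass 1 builds provisional ratings from all available results. Pass 2
    rebuilds every rating from scratch, but judges *opponent* strength by
    the provisional number until the opponent's own second-pass rating is
    reliable (approximated here by a round count instead of deviation).
    """
    # Pass 1: provisional ratings over the full data set.
    provisional = {}
    for a, b, a_won in rounds:
        ra, rb = provisional.get(a, start), provisional.get(b, start)
        provisional[a] = elo_update(ra, rb, a_won)
        provisional[b] = elo_update(rb, ra, not a_won)

    # Pass 2: actual ratings, leaning on provisional opponent strength.
    final, seen = {}, {}
    for a, b, a_won in rounds:
        ra, rb = final.get(a, start), final.get(b, start)
        opp_for_a = rb if seen.get(b, 0) >= reliable_after else provisional[b]
        opp_for_b = ra if seen.get(a, 0) >= reliable_after else provisional[a]
        final[a] = elo_update(ra, opp_for_a, a_won)
        final[b] = elo_update(rb, opp_for_b, not a_won)
        seen[a] = seen.get(a, 0) + 1
        seen[b] = seen.get(b, 0) + 1
    return final
```

Each debate produces two one-sided predictions, so no result is double counted: Team A's update never consults Team A's own provisional rating.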

There are a couple of other small changes that will have some effect on how the rankings are calculated.

The first concerns how the final ratings are determined by subtracting a team's deviation from their rating to produce an "adjusted rating." The original reason for doing this is that it gives a more "confident" rating by adjusting downward those teams that we have less data about. In effect, it says that we are confident that a team is "at least" as good as their adjusted rating. If two teams have about the same mean rating but one has significantly fewer rounds than the other, then we should be less confident that their rating is accurate.

While helpful to weed out teams with high deviations, there is a limit to the usefulness of this procedure. When most regularly travelling teams end the season with somewhere between 80 and 100 rounds, it is somewhat silly to use deviation as a tool to delineate between them. This past year, there were a few examples even in the top 25 where a lower rated team was able to jump a higher rated team merely because they had a few more debates under their belts.

In the future, the ratings will continue to use deviation as a tool to adjust the ratings of teams, but it will stop making delineations once teams reach a certain threshold. This threshold will be calculated as the median of the 100 smallest deviations.
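A sketch of the new adjustment rule as described above (the function name and toy numbers are mine; the real threshold uses the 100 smallest deviations in the pool):

```python
import statistics

def adjusted_ratings(teams, top_n=100):
    """teams: {name: (mean rating, deviation)}.

    Subtract deviation from rating, but clamp every deviation at a floor:
    the median of the top_n smallest deviations in the pool. Teams under
    the floor are all docked the same amount, so they are no longer
    delineated by deviation alone.
    """
    devs = sorted(d for _, d in teams.values())
    floor = statistics.median(devs[:top_n])
    return {name: r - max(d, floor) for name, (r, d) in teams.items()}
```

Under this rule, two heavily travelled teams with deviations below the floor are ranked purely by their mean ratings, while a lightly travelled team still pays the full deviation penalty.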

A second change is that the ratings will no longer attempt to model individual debaters with multiple partnerships. Instead, it will treat each two person team as a discrete unit. The obstacles to being able to model multiple partnerships are just too large, primarily because we just don't collect the kind of data that would make it possible. How much is each partner responsible for a win? Does this question even make sense to ask? We all know that a great debater can carry a poor partner to a lot of wins. But we also know that even a good debater will lose rounds that they otherwise wouldn't have if they travel with a partner with lesser skill.

I know that this may disadvantage some debaters who are forced to frequently change partners, but it would be generous to even say that my previous attempts to solve the problem looked like trying to duct tape a windshield on. It kinda sorta worked, but mostly by luck, and even still had the effect of perhaps unfairly harming the ratings of some debaters.

Finally, I have removed the eigenvector centrality component of the previous system. This was originally a way to ensure that a team possessed a set of rounds that were adequately integrated into the larger community pool. TrueSkill doesn't need it.

I've attached a copy of the R code that I use to run the algorithm. I make no claims to being a good coder. What little I know is self-taught. It's slow, but it gets the job done. XML files of tournament results are available for download from Tabroom using their API.

Resources concerning TrueSkill can be found at Microsoft Research. There is a great summary here, and a really good in-depth explanation of the mathematical principles at work has been written by Jeff Moser.

trueskill_code.r

Congrats to everybody on a great season. If you went to a tournament, I applaud you. The first one is always the hardest.

Just a reminder to not just look at the rank order, but also check out the Adjusted Rating to get a sense of how close some teams are. A difference between two teams that is in the single digits basically means a coin flip. Teams 10 through 14 are functionally tied. As are teams 15 through 18. Very small point spreads can sometimes make for fairly large rank differences.
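To see why a single-digit gap amounts to a coin flip, one way to translate a rating difference into an approximate win probability is the standard TrueSkill formula, which divides the mean-skill gap by the combined uncertainty and feeds it through the normal CDF. The beta and sigma defaults below are TrueSkill's conventional values, not the ones configured for these rankings:

```python
from math import erf, sqrt

# Conventional TrueSkill default for performance variability (25/6);
# an assumption here, not the rankings' actual configuration.
BETA = 25.0 / 6.0

def phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def win_probability(mu_a: float, mu_b: float,
                    sigma_a: float = 25.0 / 3.0,
                    sigma_b: float = 25.0 / 3.0) -> float:
    # Gap in mean skill, scaled by the total uncertainty of the matchup.
    denom = sqrt(2.0 * BETA ** 2 + sigma_a ** 2 + sigma_b ** 2)
    return phi((mu_a - mu_b) / denom)
```

Plugging in a small mean-skill gap yields a probability only slightly above one half, which is the sense in which functionally tied teams are a coin flip.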

Second, these ratings do not include the Dartmouth Round Robin. Those results are not yet posted on Tabroom, and it would be a serious hassle for me to try to manually code them. If I were to

Third, I know there's a lot of stress about bid voting. I want to make it clear that these ratings are not intended to replicate the decisions of bid voters. I would not encourage anybody to take the top 16 here and assume that those teams are necessarily first rounds (nor would I expect that any bid voter would do this). The ratings provide one way of processing and understanding who the best teams in the country are. However, because of the nature of the beast, the ratings are also capable of some misses when something is limited/different/peculiar about the data for a team: maybe a team hasn't attended as many tournaments, maybe their schedule has included more regionals at the expense of national level tournaments, maybe they've had multiple partners, etc.

Fourth, caveats aside, the ratings are information. They are a way of processing results that produces a good picture of who is likely to beat whom. To that extent, I think that they can be a useful tool for bid voters.

Fifth, participation awards for everybody! Everybody's a winner!

Sixth, some teams are more winners than others! Congrats to