Estimates vs. head to head comparisons
Summary: when choosing between two options, it’s not always optimal to estimate the value of each option and then pick the better one.
Suppose I am choosing between two interventions, X and Y. One way to make my decision is to predict what will happen if I do X and predict what will happen if I do Y, and then pick the option which leads to the outcome that I prefer.
My predictions may be both vague and error-prone, and my value judgments might be very hard or nearly arbitrary. But it seems like I ultimately must make some predictions, and must decide how valuable the different outcomes are. So if I have to evaluate N options, I could do it by evaluating the goodness of each option, and then simply picking the option with the highest value. Right?
There are other possible procedures for evaluating which of two options is better. For example, I have often encountered advice of the form “If your error bars are too big, you should often just ignore the estimate.” To be most extreme, I could choose some particular axis on which options can be better or worse, and then pick the option which is best on that axis, ignoring all others. (E.g., I could choose the option which is cheapest, or the charity which is most competently administered, or whatever.)
If you have an optimistic quantitative outlook like mine, this probably looks pretty silly. After all, if one option is cheaper, that just gets figured into my estimate for how good it is. If my error bars are big, as long as I keep track of the error bars in my calculation it is still better than nothing. So why would I ever want to do anything other than estimate the value of each option?
In fact I don’t think the naive intuition is quite right. To see why, let’s start with a very simple case.
A simple model
Alice and Bob are picking between two interventions X and Y. They only have a year to make their decision, so they split up: Alice will produce an estimate of the value of X and Bob will produce an estimate of the value of Y, and they will both do whichever one is higher. Let’s suppose that Alice and Bob are perfectly calibrated and trust each other completely, so that each of them believes the other’s estimate to be unbiased.
Suppose that intervention X is good because it reduces carbon emissions. First Alice dutifully estimates the reductions in emissions that result from intervention X; call that number A1. Of course Alice doesn’t care about carbon emissions per se; she cares about the improvements in human quality of life that result from decreased emissions. (And she couldn’t compare her number with Bob’s unless she converts into units of goodness, not units of carbon emissions.) So she next estimates the gain in quality of life per unit of reduced emissions; call that number A2. She then reports that the value of X is A1 * A2. Because she is unbiased, as long as her estimates A1 and A2 are independent she obtains an unbiased estimate of the value of X.
Meanwhile, it happens to be the case that intervention Y is also good because it reduces carbon emissions. So Bob similarly estimates the reduction in carbon emissions from intervention Y, B1, and then the goodness of reduced emissions, B2, and reports B1 * B2. His estimate is also an unbiased estimate of the value of Y.
The pair decides to do intervention X iff it appears to have a higher value than Y, i.e. iff A1 * A2 > B1 * B2, which is not crazy. But it’s also not a very good idea. A2 and B2 are estimates of the same quantity (the goodness of reduced emissions), so X is actually better than Y iff it reduces emissions more, and the pair’s best guess about that is simply whether A1 > B1. But if estimates A2 and B2 are relatively noisy, and especially if the noise in those estimates is larger than the actual gap between A1 and B1, then Alice and Bob will make an essentially random decision.
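One way to see where the noise enters is to rewrite the pair's decision rule in logs. This is only a sketch, and it assumes all four estimates are positive:

```latex
% Assumes A_1, A_2, B_1, B_2 > 0 so that the logarithms are defined.
A_1 A_2 > B_1 B_2
\iff
\underbrace{\log A_1 - \log B_1}_{\text{evidence about X vs. Y}}
\;>\;
\underbrace{\log B_2 - \log A_2}_{\text{noise only}}
```

The right-hand side carries no information about which intervention is better, because A2 and B2 are two estimates of the same quantity. Whenever its spread is large compared to the left-hand side, the decision is driven mostly by that noise.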
What went wrong? Alice and Bob aren’t making a systematically bad decision, but they could have made a better decision by using a different technique for comparison. I think that a similar situation arises very often, in much less simple and slightly less severe situations. This often means that it is better to try and make comparisons of the form “Is X better than Y?” than to try and independently estimate the value of X and Y. When making a comparison between X and Y, we can minimize uncertainty by making the analyses as similar to each other as possible.
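To make the failure concrete, here is a minimal simulation sketch of the Alice and Bob model. Every number in it is an assumption chosen for illustration (X truly reduces emissions 10% more than Y, the emissions estimates are fairly precise, and the two independent estimates of the value per unit of emissions are very noisy), and the function name is hypothetical.

```python
import random


def compare_decision_rules(trials=20000, seed=0):
    """Return how often each rule picks the truly better option (X).

    Rule 1: choose X iff A1 * A2 > B1 * B2 (compare independent value estimates).
    Rule 2: choose X iff A1 > B1 (compare only the part that differs).
    """
    rng = random.Random(seed)
    true_x, true_y = 1.1, 1.0   # assumed true emission reductions; X is better
    sigma = 0.5                 # assumed log-scale noise in A2 and B2
    mu = -sigma ** 2 / 2        # makes the lognormal mean exactly 1 (unbiased)
    product_hits = 0
    direct_hits = 0
    for _ in range(trials):
        # A1, B1: unbiased and fairly precise estimates of the reductions.
        a1 = rng.gauss(true_x, 0.05)
        b1 = rng.gauss(true_y, 0.05)
        # A2, B2: independent, unbiased, noisy estimates of the SAME quantity.
        a2 = rng.lognormvariate(mu, sigma)
        b2 = rng.lognormvariate(mu, sigma)
        product_hits += a1 * a2 > b1 * b2
        direct_hits += a1 > b1
    return product_hits / trials, direct_hits / trials


product_accuracy, direct_accuracy = compare_decision_rules()
print(f"compare A1*A2 with B1*B2: correct in {product_accuracy:.0%} of trials")
print(f"compare A1 with B1:       correct in {direct_accuracy:.0%} of trials")
```

Under these assumptions the product rule should land not far above a coin flip, while the direct comparison is right most of the time. The exact figures depend entirely on the assumed noise levels.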
Of course this example was very simple, and there are lots of reasons you might expect more realistic estimates to be safe from these problems. I think that, despite all of these divergences, this simple model captures a common failure in estimation. The most basic reason to think this is that the argument above shows that there is no general reason to expect independent estimates of value to yield optimal results. Without a general reason to think that this procedure works, it seems to be on much shakier ground. But to make the point, here are responses to some of the most obvious objections:
1. If estimating the impact of X and estimating the impact of Y both involve estimating some common quantity, can’t I just recognize that in advance and make sure I use the same estimate in both cases? How could this ever be a problem in practice?
In practice different estimates rarely involve estimating exactly the same quantities. If I want to compare the goodness of health interventions and education interventions in the developing world, the most natural estimates might not have a single step in common. Nevertheless, each of those estimates would involve many uncertainties about social dynamics in the developing world, long-term global outcomes, and so on. I could do my analysis in a way that introduced analogies between the two estimates, and this would help me eliminate some of this uncertainty (even if the resulting estimates were noisier, or involved ignoring some apparently useful information).
The point is: in order to prove that comparing value estimates is optimal, it is not enough to assume that my beliefs are well-calibrated. I also need to assume that my beliefs make use of all available information (including having considered every alternative estimation strategy that sheds light on the question), which is unrealistic even for an idealized agent unless it is logically omniscient. When my beliefs don’t make use of all available information, then other techniques for comparison might do better, including using different estimates which have more elements in common. (But in some cases, even very simple approaches like “do the cheapest thing” will be predictably better.)
I don’t really know the extent of these problems in practice. I’m not familiar with a theoretical literature on this question. It seems like a pretty messy question in general, but I expect it would be possible to make meaningful headway on it.
2. Alice and Bob had a hard time because they are two different people. I agree that I shouldn’t compare estimates from different people, but if I do all of the estimates myself it seems like this isn’t a problem.
When I try to estimate the same thing several times, without remembering my earlier estimates, I tend to get different results. I strongly suspect this is universal, though I haven’t seen research on that question.
Moreover, when I try to estimate different things, my estimates tend not to obey the logical relationships that I know the estimated quantities must, unless I go back through with those particular relationships in mind and enforce them. For example, if I estimate A and B separately, the sum is rarely the same as if I estimated A+B. When the relationships amongst items are complicated, such consistency is unrealistically difficult to enforce. Of course, the prospects for making comparisons also suffer. It may be that there is some principled way to get around these problems, but I don’t know it.
Alice and Bob’s estimates don’t have to be very far from each other before they could have done better. I agree that estimates from a single person will have a higher degree of consistency than estimates from different people, but they won’t be consistent enough to remove the problem (or opportunity for improvement, if you want to look at it from a different angle).
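As a rough check on this claim, the sketch below varies how consistent the two estimates of the shared quantity are. It models a single estimator by giving the errors in A2 and B2 a correlation `rho` on the log scale. This modelling choice and all of the numbers are assumptions of mine for illustration, not something measured.

```python
import math
import random


def accuracy_with_shared_noise(rho, trials=20000, seed=1):
    """Fraction of trials where comparing A1*A2 with B1*B2 picks the better option.

    rho is the assumed correlation between the log-errors of A2 and B2:
    rho = 0 means fully independent estimates, rho = 1 means identical ones.
    """
    rng = random.Random(seed)
    true_x, true_y = 1.1, 1.0   # assumed true reductions; X is better
    sigma = 0.5                 # assumed log-scale noise in each of A2, B2
    hits = 0
    for _ in range(trials):
        a1 = rng.gauss(true_x, 0.05)
        b1 = rng.gauss(true_y, 0.05)
        shared = rng.gauss(0.0, 1.0)   # error common to both estimates
        own_a = rng.gauss(0.0, 1.0)    # error specific to A2
        own_b = rng.gauss(0.0, 1.0)    # error specific to B2
        log_a2 = sigma * (math.sqrt(rho) * shared + math.sqrt(1 - rho) * own_a)
        log_b2 = sigma * (math.sqrt(rho) * shared + math.sqrt(1 - rho) * own_b)
        if a1 * math.exp(log_a2) > b1 * math.exp(log_b2):
            hits += 1
    return hits / trials


for rho in (0.0, 0.5, 0.9, 1.0):
    print(f"rho = {rho:.1f}: correct in {accuracy_with_shared_noise(rho):.0%} of trials")
```

In this toy setup the accuracy climbs as the two estimates become more consistent, but it only matches the direct comparison of A1 with B1 when they are identical (`rho = 1`). Even highly correlated estimates leave room for improvement.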
3. The weird behavior in the example came from the artificial structure of the problem. How often could you do such factoring out for realistic estimates, even when they are similar?
If I’m trying to estimate the effect of different health interventions, the first step would be to separate the question “How much does this improve people’s health?” from “How much does improving people’s health matter?” So that already factors out a big piece of the uncertainty. I think most people do that, though, and so the question is: can you go farther?
I think it is still easier to estimate “Which of these interventions improve health more?” than to estimate the absolute improvement from either. We can break this comparison down into still smaller comparisons: “How many more or fewer people does X reach than Y?” and “Per person affected, what is the relative impact of X and Y?” etc. By focusing on the most important comparisons, and writing the others off as a wash, we might be able to reduce the total error in our comparison.
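As a sketch of what this decomposition might look like, suppose (as a simplifying assumption) that the value of each intervention factors into the number of people reached, the health gain per person reached, and the value of a unit of health. The symbols here are hypothetical, introduced only for illustration:

```latex
% Hypothetical symbols: n = people reached, e = health gain per person reached,
% h = value of a unit of health (shared by both interventions).
\frac{V_X}{V_Y}
= \frac{n_X \, e_X \, h}{n_Y \, e_Y \, h}
= \frac{n_X}{n_Y} \cdot \frac{e_X}{e_Y}
```

The shared factor h cancels, so its uncertainty never enters the comparison. Each remaining ratio is itself a head-to-head comparison, and any ratio we believe to be close to 1 can be written off as a wash.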
Trying to explicitly estimate the goodness of outcomes tends to draw a lot of criticism from pretty much every side. I think most of this criticism is unjustified (and often rooted in an aversion to making reasoning or motivations explicit, a desire to avoid offense or culpability, etc.). Nevertheless, there are problems with many straightforward approaches to quantitative estimation, and some qualitative processes improve on quantitative estimation in important ways. Many of these improvements are often dismissed by optimistic quantitative types (myself included), and I think that is an error. For example, I mentioned that I’ve often dismissed arguments of the form “If your error bars are too big, you are sometimes better off ignoring the data.” This looks obviously wrong on the Bayesian account, but as far as I can tell it may actually be the best behavior, even for idealized, bias-free humans.