# Expected value estimates we (cautiously) took literally - Oxford Prioritisation Project

*By Tom Sittler*

*Cross-posted from the Oxford Prioritisation Project blog. We're centralising all discussion on the Effective Altruism forum. To discuss this post, please comment here.*

**Summary:** This post describes how we turned the cost-effectiveness estimates of our four shortlisted organisations into a final decision. To give appropriately more weight to more robust estimates, we use the four model outputs to update a prior distribution over the impact of grantee organisations.

# Introduction

Inspired by Michael Dickens’ example, we decided to formalise the intuition that more robust estimates should get higher weights. We did this by treating the outputs of our models as providing an evidence distribution, which we use to update a prior distribution over the cost-effectiveness of potential grantee organisations.

# Code

The code we used to compute the Bayesian posterior estimates is here.

# The prior distribution

This is the unconditional (prior) probability distribution of the cost-effectiveness of potential grantees. We use a lognormal distribution. A theoretical justification for this is that we expect cost-effectiveness to be the result of a multiplicative rather than additive process. A possible empirical justification could be the distribution of cost-effectiveness of DCP-2 interventions. Again, this has been discussed at length elsewhere.

The parameters of a lognormal distribution are its scale and location. The scale is equal to the standard deviation of the natural logarithm of the values, and the location is equal to the mean of the natural logarithm of the values. The median of a lognormal distribution is the exponential of its location.
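As a minimal sketch of these relationships (in Python rather than the R used for the project; the function name `lognormal_summary` and the example values are illustrative assumptions, not part of our code), the median, mean and mode follow directly from the two parameters:

```python
import math


def lognormal_summary(location: float, scale: float) -> dict:
    """Summary statistics of a lognormal given its log-space parameters.

    location: mean of the natural logarithm of the values
    scale:    standard deviation of the natural logarithm of the values
    """
    return {
        "median": math.exp(location),
        "mean": math.exp(location + scale ** 2 / 2),
        "mode": math.exp(location - scale ** 2),
    }


# Illustrative parameters only: location 0 and scale 1.
summary = lognormal_summary(location=0.0, scale=1.0)
print(summary)
```

With a location of 0 the median is exactly 1, while the mean is pulled above the median by the right tail. That gap grows quickly with the scale, which is why the choice of scale matters so much for heavy-tailed estimates.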

We choose the location parameter such that the median of the distribution is as cost-effective as highly effective global health interventions such as those recommended by GiveWell, which we estimate to provide a QALY for $50. Intuitively, this means that the set of organisations we were considering funding at the start of the project had a median cost-effectiveness of 0.02 QALYs/$.

We set the scale parameter as 0.5, which means that the standard deviation of the natural logarithm of our prior is 0.5, or 25 times the prior median of 0.02 QALYs/$. This is a relatively poorly informed guess, which we arrived at mostly by looking at the choices of Michael Dickens and checking that they did not intuitively seem absurd to team members.
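To get a feel for what these two choices imply, here is a rough Python sketch (not the project's R code; the variable names are assumptions made for illustration). It derives the location from the $50 per QALY figure and computes the central 95% interval of the prior by working in log space, where the distribution is normal:

```python
import math
from statistics import NormalDist

COST_PER_QALY = 50.0   # dollars per QALY, as estimated in the text
PRIOR_SCALE = 0.5      # standard deviation of ln(cost-effectiveness), from the text

prior_median = 1.0 / COST_PER_QALY        # 0.02 QALYs per dollar
prior_location = math.log(prior_median)   # mean of ln(cost-effectiveness)

# The logarithm of a lognormal variable is normal, so take quantiles in
# log space and exponentiate back to QALYs per dollar.
log_prior = NormalDist(mu=prior_location, sigma=PRIOR_SCALE)
lower = math.exp(log_prior.inv_cdf(0.025))
upper = math.exp(log_prior.inv_cdf(0.975))

print(f"median: {prior_median:.4f} QALYs/$")
print(f"central 95% interval: {lower:.4f} to {upper:.4f} QALYs/$")
```

Under these parameters the prior puts 95% of its mass within a factor of roughly 2.7 of the median in either direction, which gives some intuition for why it pulls very large estimates down so strongly.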

Had we chosen a scale parameter more than about 2.2 times as large, the Machine Intelligence Research Institute would have had the highest posterior cost-effectiveness estimate.

# The evidence distribution

We fit the outputs of our models, which are lists of numbers, to a lognormal probability distribution. The fit is excellent, as you can see from the graphs below. On the log scale, the probability density function of our original data appears in black and the probability density function of data randomly generated from the lognormal distribution we fitted to the original data appears in red.
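One simple way to do such a fit, sketched here in Python under the assumption that the location and scale are estimated as the mean and standard deviation of the logged samples (the project's R code may use a different fitting routine), is shown below. The function name `fit_lognormal` and the simulated `fake_output` data are hypothetical stand-ins for a model's output:

```python
import math
import random
import statistics


def fit_lognormal(samples):
    """Estimate (location, scale) as the mean and sd of ln(samples)."""
    if any(x <= 0 for x in samples):
        raise ValueError("a lognormal fit needs strictly positive samples")
    logs = [math.log(x) for x in samples]
    return statistics.fmean(logs), statistics.stdev(logs)


# Simulated stand-in for a model output: draws from a known lognormal.
rng = random.Random(0)
fake_output = [rng.lognormvariate(3.0, 1.5) for _ in range(20_000)]

location, scale = fit_lognormal(fake_output)
print(f"fitted location: {location:.3f}, fitted scale: {scale:.3f}")
```

Because the simulated data really are lognormal, the fitted parameters should land close to the true values of 3.0 and 1.5. For real model outputs, the overlay of the two density curves described above is the check that the lognormal shape is a reasonable description.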

This is the graph for StrongMinds:

And the graph for MIRI:

The other graphs look very similar, so I’m not including them here. You can generate them using the code I provide.

# What about negative values?

The models for Animal Charity Evaluators and the Machine Intelligence Research Institute contain negative values, so they cannot be fitted to a lognormal distribution.

Instead, we split the data into a positive and a negative lognormal, which we update separately on a positive and a negative prior.
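The splitting step could look roughly like the following Python sketch. The function name and the `share_positive` field are assumptions added for illustration; in particular, this post does not specify how the two halves are recombined, so the sketch stops at fitting each half:

```python
import math
import statistics


def fit_signed_lognormals(samples):
    """Fit separate lognormals to the positive and negative parts of the data.

    Negative values are sign-flipped before taking logs, so the 'negative'
    fit describes the magnitude of harm. Zeros belong to neither half.
    """
    positives = [x for x in samples if x > 0]
    negatives = [-x for x in samples if x < 0]

    def fit(values):
        logs = [math.log(v) for v in values]
        return statistics.fmean(logs), statistics.stdev(logs)

    return {
        "positive": fit(positives),   # (location, scale) of the positive half
        "negative": fit(negatives),   # (location, scale) of the flipped negative half
        # Assumption: the share of positive draws is recorded so that the
        # two halves could later be weighted; the post does not describe this.
        "share_positive": len(positives) / len(samples),
    }


# Tiny hypothetical data set: two positive and two negative outcomes.
samples = [math.exp(1), math.exp(3), -math.exp(0), -math.exp(2)]
fits = fit_signed_lognormals(samples)
print(fits)
```

Each half needs at least two values for the standard deviation to be defined, so a model with only a handful of negative draws would need separate handling.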

Intuitively, we think that both interventions that do a large amount of good (in the tail of the positive prior) and interventions that do a large amount of harm (in the tail of the negative prior) are a priori unlikely.

# Updating when distributions are lognormal

In my other post, I derive a closed-form solution to the problem of updating a lognormal prior using a lognormal evidence distribution.
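For readers who want a concrete picture, the sketch below shows one common formulation: since the logarithm of a lognormal variable is normal, the standard normal-normal conjugate update can be applied in log space. This is an assumption for illustration and is not necessarily identical to the derivation in the other post. The function names and the evidence parameters are hypothetical:

```python
import math


def update_lognormal(prior_loc, prior_scale, evid_loc, evid_scale):
    """Precision-weighted update of log-space parameters.

    Assumes the normal-normal conjugate update applied to ln(value);
    returns the posterior (location, scale).
    """
    prior_precision = 1.0 / prior_scale ** 2
    evid_precision = 1.0 / evid_scale ** 2
    post_precision = prior_precision + evid_precision
    post_loc = (prior_loc * prior_precision + evid_loc * evid_precision) / post_precision
    post_scale = math.sqrt(1.0 / post_precision)
    return post_loc, post_scale


def lognormal_mean(loc, scale):
    """Mean of a lognormal with the given log-space parameters."""
    return math.exp(loc + scale ** 2 / 2)


# Hypothetical example: a prior with median 200 and scale 0.5, combined
# with tight evidence centred on a median of 20.
post_loc, post_scale = update_lognormal(
    prior_loc=math.log(200), prior_scale=0.5,
    evid_loc=math.log(20), evid_scale=0.1,
)
print(f"posterior median: {math.exp(post_loc):.1f}")
print(f"posterior mean:   {lognormal_mean(post_loc, post_scale):.1f}")
```

In this formulation the posterior location always lies between the prior and evidence locations, and it sits closer to whichever distribution is narrower. That is the formal version of the intuition that more robust estimates should get more weight.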

# Units

A word on units: inside each of the four models, we convert all estimates to “Human-equivalent well-being-adjusted life-years” (HEWALYs). One HEWALY is a QALY, or a year of life as a fully healthy, modern-day human. If an action produces zero HEWALYs, we are indifferent between doing it and not doing it. Negative HEWALYs correspond to lives not worth living, and -1 HEWALY is as bad as 1 HEWALY is good. In other words, we are indifferent between causing 0 HEWALYs and causing both 1 HEWALY and -1 HEWALY.

A being can accrue more than 1 HEWALY per year, because life can be better than life as a fully healthy modern-day human. Symmetrically, a being can accrue less than -1 HEWALY per year.

# Results

You can view the results and code here. If you disagree with our prior parameters, we encourage you to try your own values and see what you come up with, in the style of GiveWell, who provide their parameters as estimated by each staff member. We also include commented-out code to visualise how the posterior estimates depend on the prior parameters.
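To illustrate the kind of dependence involved, here is a small Python sketch that sweeps the prior scale and reports the resulting posterior mean. It is not the commented-out R code mentioned above. It assumes a normal-normal update in log space, a prior median of 200 HEWALYs/$10,000 (0.02 per dollar, treating one HEWALY as one QALY), and purely hypothetical evidence parameters:

```python
import math


def posterior_mean(prior_loc, prior_scale, evid_loc, evid_scale):
    """Posterior mean under a precision-weighted update in log space (assumed)."""
    prior_precision = 1.0 / prior_scale ** 2
    evid_precision = 1.0 / evid_scale ** 2
    post_precision = prior_precision + evid_precision
    post_loc = (prior_loc * prior_precision + evid_loc * evid_precision) / post_precision
    post_var = 1.0 / post_precision
    return math.exp(post_loc + post_var / 2)


PRIOR_LOC = math.log(200)   # prior median in HEWALYs/$10,000 (assumed conversion)
EVID_LOC = math.log(1e6)    # hypothetical: very high but very uncertain evidence
EVID_SCALE = 4.0            # hypothetical evidence scale

prior_scales = [0.25, 0.5, 1.0, 2.0]
posterior_means = [
    posterior_mean(PRIOR_LOC, s, EVID_LOC, EVID_SCALE) for s in prior_scales
]

for s, m in zip(prior_scales, posterior_means):
    print(f"prior scale {s:>4}: posterior mean {m:,.1f} HEWALYs/$10,000")
```

When the evidence is centred far above the prior but is very wide, a tight prior keeps the posterior near the prior median, and loosening the prior lets the posterior mean climb steeply. Rankings between organisations can therefore flip as the prior scale changes.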

# Interesting phenomena

**Our prior strongly punishes MIRI.** While the mean of its evidence distribution is 2,053,690,000 HEWALYs/$10,000, the posterior mean is only 180.8 HEWALYs/$10,000. If we set the prior scale parameter to larger than about 1.09, the posterior estimate for MIRI is greater than 1038 HEWALYs/$10,000, thus beating 80,000 Hours.

**Our estimate of StrongMinds is lower than our prior.** The StrongMinds evidence distribution had a mean of 17.9 HEWALYs/$10,000, which is lower than the posterior mean of 18.5 HEWALYs/$10,000. We can interpret this in the following way: we found evidence that StrongMinds has surprisingly low cost-effectiveness relative to our prior, so taking the prior into account leads us to increase our estimate of StrongMinds.

## Comments (12)

This suggests that it might be good in the long run to have a process that learns what prior is appropriate, e.g. by going back and seeing what prior would have best predicted previous years' impact.

With the possible exception of StrongMinds, it's not the case that the previous years' impact is much easier to estimate than 2017's impact.

My personal take on the issue is that the better we understand how the updating works (including how to select the prior), the more seriously we should take the results. Currently we don't seem to have a good understanding. For example, see Dickens' discussion: the way of selecting the median based on GiveDirectly seems reasonable, but there doesn't seem to be a principled way of selecting the variance, and this seems to be the best effort at it so far. So these updating exercises can be used as heuristics, but the results are not to be taken too seriously, and certainly not literally, especially since the input values are so speculative in some cases.

This is just my personal view and certainly many people disagree. E.g. my team decided to use the results of Bayesian updating to decide on the grant recipient.

My experience with the project led me to be not very positive that it's worth investing too much in improving this quantitative approach for the sake of decision making, if one could instead spend time on gathering qualitative information (or even quantitative information that doesn't fit neatly in the framework of cost-effectiveness calculations or updating) that could be much more informative for decision making. This is along the lines of this post and seems to also fit the current approach of the Open Philanthropy Project (of utilizing qualitative evidence rather than relying on quantitative estimates). Of course this is all based on the current state of such quantitative modeling, e.g. how little we understand how updating works as well as how to select speculative inputs for the quantitative models (and my judgment about how hard it would be to try to improve on these fronts). There could be a drastically better version of such quantitative prioritization that I haven't been able to imagine.

It could be very valuable to construct a quantitative model (or parts of one), think about the inputs and their values, etc., for reasons explained here. E.g. The MIRI model (in particular some inputs by Paul Christiano; see here) has really helped me realize the importance of AI safety. So does the "astronomical waste" argument, which gives one a sense of the scale even if one doesn't take the numbers literally. Still, when I make a decision of whether to donate to MIRI I wouldn't rely on a quantitative model (at least one like what I built) and would instead put a lot of weight on qualitative evidence that is likely impossible (for us yet) to model quantitatively.

I'm still undecided on the question of whether quantitative models can actually work better than qualitative analysis. (Indeed, how can you even ever know which works better?) But very few people actually use serious quantitative models to make decisions--even if quantitative models ultimately don't work as well as well-organized qualitative analysis, they're still underrepresented--so I'm happy to see more work in this area.

Some suggestions on ways to improve the model:

## Account for missing components

Quantitative models are hard, and it's impossible to construct a model that accounts for everything you care about. I think it's a good idea to consider which parts of reality you expect to matter most for the impact of a particular thing, and try to model those. Whatever your model is missing, try to figure out which parts of *that* matter most. You might decide that some things are too hard to model, in which case you should consider how those hard-to-model bits will likely affect the outcome and adjust your decision accordingly.

Examples of major things left out:

## Sensitivity analysis

The particular ordering you found (80K > MIRI > ACE > StrongMinds) depends heavily on certain input parameters. For example, for your MIRI model, "expected value of the far future" is doing tons of work. It assumes that the far future contains about 10^17 person-years; I don't see any justification given. What if it's actually 10^11? Or 10^50? This hugely changes the outcome. You should do some sensitivity analysis to see which inputs matter the most. If any one input matters too much, break it down into less sensitive inputs.

Retrospective analysis of track record? Looking into Tetlock-style research?

Suppose it's 10 years in the future, and we can look back at what ACE and MIRI have been doing for the past 10 years. We now know some new useful information, such as:

But even then, we still don't know nearly as much as we'd like. We don't know if ACE *really* moved money, or if that money would have been donated to animal charities anyway. Maybe MIRI took funding away from other research avenues that would have been more fruitful. We still have no idea how (dis)valuable the far future will be.

How do you do the "human equivalent" part?

The process for this is described in the post for each model. I'll be happy to clarify if you still have questions after reading that.

Found it, thanks!

Do you have these numbers published, broken down by staff member?

It also would be cool to see breakdowns of the HEWALYs/$ for each charity before and after the Bayesian update with the prior.

No, we only have our group estimate published.

To see HEWALYs/$ before updating, you can look at the model outputs. You can also get them on our R Shiny app here by simply adding a line to calculate e.g. mean(miri).

How do you edit the model code?