# Charity Evaluators: first model and open questions (Oxford Prioritisation Project)

*2017-04-25*

*By Dominik Peters (with Tom Sittler)*

*Cross posted to the Oxford Prioritisation Project blog. We're centralising all discussion on the Effective Altruism forum. To discuss this post, please comment here.*

**Abstract.** We describe a simple simulation model for the recommendations of a charity evaluator like GiveWell or ACE. The model captures some real-world phenomena, such as initial overconfidence in impact estimates. We are unsure how to choose the parameters of the underlying distributions, and are happy to receive feedback on this.

Charity evaluators, and in particular GiveWell, have been enormously influential and impactful for effective altruists: they seeded the idea of aiming for effectiveness in one’s giving, they incentivised charities to be more transparent and impact-focussed, and (most directly) they have moved dollars donated by effective altruists to higher-impact organisations (e.g., Peter Singer seems to have reallocated some of his donations from Oxfam towards AMF).

While GiveWell’s recommendations in the field of global health seem to be relatively robust (not having changed substantially over several years), charity evaluators in fields with more uncertainty about the best giving opportunities could have substantial impact through arriving at better recommendations. Among existing such organisations, Animal Charity Evaluators (ACE) appears to be a natural candidate: evidence for the effectiveness of interventions (such as leafleting) in the animal sector is rather weak, and some of ACE’s standout charities (such as GFI) engage in complicated mixes of activities that are difficult to model and evaluate rigorously.

To see whether ACE may be a good target for our OxPrio donation, we will set up a simple quantitative model of the activities of charity evaluators. The model is cause-neutral (so far), but we will call the generic charity evaluator “ACE” for short. Of course, the model is very much simpler than what the real ACE does in the real world. In the model, ACE gathers more evidence about the impact of various charities over time, and based on the available evidence recommends a single charity to its followers. In each time period, these followers give a fixed amount (say normalised to $1) to the top recommendation. Thus, the amount donated does not change with changing recommendations, and donors do not have an “outside option”.

**Definition of “evidence”.** The evidence gathered by ACE could come in various guises: survey results or RCTs, intuitive impressions after conversations with charity staff, new arguments heard in favour of or against a certain intervention, or even just intuitive hunches. The model is agnostic about what type of evidence is used; we only require that it comes in the form of a point estimate of the impact (per dollar) of a given charity. For now, we do not model the strength of this evidence, and ACE does not use anything like Bayesian updating. Rather, if ACE has gathered several pieces of evidence in the form of point estimates, it takes their average as its overall estimate.

**The model.** Here, then, is our proposed model in pseudocode form:

● There is a fixed pool of charities that ACE will evaluate.

● **Ground truth.** Each charity in the pool has a true impact (per dollar). For each charity, we decide this true impact by randomly sampling from a lognormal distribution. The true impact of the charity will stay fixed over time.

● For each time period *t = 1, …, T*:

○ **Evidence gathering strategy.** For each charity in the pool, ACE collects a single item of evidence in the form of a point estimate of its impact. We arrive at this piece of evidence by randomly sampling a number from a normal distribution centred at the true impact of the charity.

○ **Recommendation.** For each charity in the pool, we calculate the average of the point estimates that we have collected in this and previous time periods. We select and recommend the charity for which this impact is highest.

○ **Payoff.** ACE’s followers donate $1 to the recommended charity. ACE’s impact in this time period is the difference in the *true* impact of the charity recommended now versus the charity recommended in the previous round.

We have implemented this model using a simple Python script that simulates the process.
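
The script itself is not reproduced in this post. Purely as an illustration, the steps above could be sketched in Python as follows. The function names (`simulate`, `payoffs`), the default parameter values, and the choice to report payoffs only from *t = 2* onwards (the model does not say what the baseline is in the first round) are assumptions made for this sketch, not the authors' choices.

```python
import random


def simulate(n_charities=10, n_periods=10, mu=0.0, sigma=1.0,
             noise_sd=1.0, seed=None):
    """Run the model once.

    Returns the true impact of the recommended charity in each period.
    All default parameter values are placeholders (assumptions).
    """
    rng = random.Random(seed)
    # Ground truth: true impact per dollar, lognormal, fixed over time.
    true_impact = [rng.lognormvariate(mu, sigma) for _ in range(n_charities)]
    evidence_sum = [0.0] * n_charities
    recommended_true = []
    for t in range(1, n_periods + 1):
        # Evidence gathering: one normal draw per charity, centred on the truth.
        for i in range(n_charities):
            evidence_sum[i] += rng.gauss(true_impact[i], noise_sd)
        # Recommendation: the charity with the highest average estimate so far.
        best = max(range(n_charities), key=lambda i: evidence_sum[i] / t)
        recommended_true.append(true_impact[best])
    return recommended_true


def payoffs(recommended_true):
    """Change in true impact of the recommendation between consecutive periods.

    Assumption: no payoff is reported for the first period, because the
    model leaves its baseline unspecified.
    """
    return [now - before
            for before, now in zip(recommended_true, recommended_true[1:])]


if __name__ == "__main__":
    path = simulate(seed=0)
    print("true impact of recommendation:", path)
    print("payoff per period (from t = 2):", payoffs(path))
```

Because each payoff is a difference between consecutive recommendations, the payoffs of a run sum to the true impact of the final recommendation minus that of the first one.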

Two observations:

● **Initial overconfidence.** In most simulation runs, the impact estimate of the recommended charity at time *t = 1* is much higher than the true impact of the charity. This is not surprising: because ACE recommends whichever charity got the highest point estimate, it will almost certainly recommend an organisation for which it has seen badly inflated evidence. Arguably, the behaviour of this model mirrors certain real-life phenomena. For example, GiveWell’s estimate for the cost of saving a life has been revised up several times over the years; and in the animal space a few years ago, both online ads and leafleting seemed like they would have much more impact than what is commonly estimated today.

● **Decreasing returns over time.** The change in true impact of the recommended charity is higher in earlier rounds than in later ones. For example, with our current parameter choices, funding ACE at time *t = 1* has approximately 10 times as much impact as at time *t = 10*, averaging over 10,000 simulation runs. This is because the estimates get closer to the truth over time, so the top recommendation changes less frequently, and when it does, the difference in impact is often small. This observation matches our thought above that funding GiveWell (being relatively mature) seems less impactful than funding the relatively young ACE.
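
Both observations can be checked numerically. The self-contained sketch below is not the authors' script: the function name `average_paths`, the parameter values, and the use of 2,000 runs are assumptions for illustration. For each period it averages the gap between the estimated and the true impact of the recommended charity, and the true impact itself. The exact numbers depend heavily on the parameters chosen.

```python
import random


def average_paths(n_runs=2000, n_charities=10, n_periods=10,
                  mu=0.0, sigma=1.0, noise_sd=1.0, seed=0):
    """Average two per-period quantities for the recommended charity over
    many runs: (estimate - truth) and the true impact.
    All parameter defaults are placeholders (assumptions)."""
    rng = random.Random(seed)
    gap = [0.0] * n_periods
    truth = [0.0] * n_periods
    for _ in range(n_runs):
        true_impact = [rng.lognormvariate(mu, sigma)
                       for _ in range(n_charities)]
        sums = [0.0] * n_charities
        for t in range(n_periods):
            # One new noisy point estimate per charity.
            for i in range(n_charities):
                sums[i] += rng.gauss(true_impact[i], noise_sd)
            means = [s / (t + 1) for s in sums]
            # Recommend the charity with the highest running average.
            best = max(range(n_charities), key=means.__getitem__)
            gap[t] += (means[best] - true_impact[best]) / n_runs
            truth[t] += true_impact[best] / n_runs
    return gap, truth


if __name__ == "__main__":
    gap, truth = average_paths()
    for t, (g, v) in enumerate(zip(gap, truth), start=1):
        print(f"t={t:2d}  overestimate={g:6.3f}  true impact={v:6.3f}")
```

Under these placeholder parameters, the average overestimate should be largest at *t = 1* and shrink over time, while the average true impact of the recommendation rises with ever smaller increments.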

Currently, our model is somewhat underspecified, and some of the modelling choices lack good justification. Some open questions related to this, on which we would appreciate any ideas and feedback:

● What should be the distribution of true impact?

○ A lognormal distribution (according to which the order of magnitude of impact is normally distributed) seems like a sensible choice for the reasons Michael Dickens has outlined.

○ On the other hand, Brian Tomasik has argued that on an all-things-considered view (taking flow-through and other effects into account), overall true impact may be more normally distributed.

○ Location (mu):

■ For global health, this role could be filled by an estimate of the impact of GiveDirectly.

■ Analogously, for the animal sector, an unbiased estimate of the impact of online ads may be a reasonable choice. But are current estimates biased?

○ Scale (sigma):

■ Is it sensible to use variance information from other contexts? For example, there are various distributions of the impact of global health interventions that are sometimes circulated.

● Evidence distribution

○ Which type of distribution captures this best? Normal or lognormal?

○ How to get the right sigma parameter for this? Maybe look at distributions found in large meta-analyses in other contexts (such as health)?

● Is averaging the right way to aggregate different evidence samples?

○ Probably yes if they are normally distributed. What about lognormal?

○ Is there a simple way to make this process more Bayesian?

● For which charities are samples drawn?

○ Diminishing returns to sampling give a good reason for sampling the charity that has the fewest samples (which is roughly equivalent to sampling all charities in each round).

○ It may be a better use of resources to focus on borderline orgs, since for these, new evidence is most likely to change the recommendation. On the other hand, this strategy may never promote the best charity to the top, if the first few evidence samples were very bad.
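
On the question of making the aggregation more Bayesian: one textbook option, if both the prior over true impact and the evidence noise are taken to be normal with known standard deviations, is the conjugate normal-normal update, which shrinks the sample average towards the prior mean. The sketch below illustrates it under those assumptions and is not part of the model described above; the function name `normal_posterior` and all numbers are hypothetical. With a lognormal prior and multiplicative (lognormal) noise, the same update could be applied to the logarithms of the estimates.

```python
import math


def normal_posterior(prior_mean, prior_sd, samples, noise_sd):
    """Posterior mean and sd of a charity's true impact.

    Assumes a normal prior and independent normal evidence noise with a
    known standard deviation (both are modelling assumptions).
    """
    prior_prec = 1.0 / prior_sd ** 2           # precision = 1 / variance
    data_prec = len(samples) / noise_sd ** 2   # each sample adds precision
    post_prec = prior_prec + data_prec
    # Precision-weighted combination of the prior mean and the evidence.
    post_mean = (prior_prec * prior_mean
                 + sum(samples) / noise_sd ** 2) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)


if __name__ == "__main__":
    # One optimistic estimate of 4.0 against a prior centred on 1.0, with
    # equal prior and noise sd: the posterior mean lands halfway, at 2.5.
    print(normal_posterior(prior_mean=1.0, prior_sd=1.0,
                           samples=[4.0], noise_sd=1.0))
```

With few samples the posterior stays close to the prior mean, which dampens the initial overconfidence; as samples accumulate it converges to the plain average.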

## Comments (4)

This looks pretty similar to a model I wrote with Nick Dunkley way back in 2012 (part 1, part 2). I still stand by that as a reasonable stab at the problem, so I also think your model is pretty reasonable :)

Charity population:

You're assuming a fixed pool of charities, which makes sense given the evidence gathering strategy you've used (see below). But I think it's better to model charities as an unbounded population following the given distribution, from which we can sample.

That's because we do expect new opportunities to arise. And if we believe that the distribution is heavy-tailed, a large amount of our expected value may come from the possibility of eventually finding something *way* out in the tails. In your model we only ever get N opportunities to get a really exceptional charity - after that we are just reducing our uncertainty. I think we want to model the fact that we can keep looking for things out in the tails, even if they maybe don't exist yet.

I do think that a lognormal is a sensible distribution for charity effectiveness. The real distribution may be broader, but that just makes your estimate more conservative, which is probably fine. I just did the boring thing and used the empirical distribution of the DCP intervention cost-effectiveness (note: interventions, not charities).

Evidence gathering strategy:

You're assuming that the evaluator does a *lot* of evaluating: they evaluate every charity in the pool in every round. In some sense I suppose this is true, in that charities which are not explicitly "investigated" by an evaluator can be considered to have failed the first test by not being notable enough to even be considered. However, I still think this is somewhat unrealistic and is going to drive diminishing returns very quickly, since we're really just waiting for the errors for the various charities to settle down so that the best charity becomes apparent.

I modelled the process as the evaluator sequentially evaluating a single charity, chosen at random (with replacement). This is also unrealistic, because in fact an evaluator won't waste their time with things that are obviously bad, but even with this fairly conservative strategy things turned out pretty well.

I think it's interesting to think about what happens when we model the pool more explicitly, and consider strategies like investigating the top recommendation further to reduce error.

Increasing scale with money moved:

Charity evaluators have the wonderful feature that their effectiveness scales more or less linearly with the amount of money they move (assuming that the money all goes to their top pick). This is a pretty great property, so worth mentioning.

The big caveat there is room for more funding, or saturation of opportunities. I'm not sure how best to model this. We could model charities rather as "deposits" of effectiveness that are of a fixed size when discovered, and can be exhausted. I don't know how that would change things, but I'd be interested to see! In particular, I suspect it may be important how funding capacity co-varies with effectiveness. If we find a charity with a cost-effectiveness that's 1000x higher than our best, but it can only take a single dollar, then that's not so great.

The fact that sometimes people's estimates of impact are subsequently revised down by several orders of magnitude seems like strong evidence against evidence being normally distributed around the truth. I expect that if anything it is broader than lognormally distributed. I also think that extra pieces of evidence are likely to be somewhat correlated in their error, although it's not obvious how best to model that.

It might depend what we're using the model for.

In general, it does seem reasonable that direct (expected) net impact of interventions should be broader than lognormal, as Carl argued in 2011. On the other hand, it seems like the expected net impact all things considered shouldn't be broader than lognormal. For one argument, most charities probably funge against each other by at least 1/10^6. For another, you can imagine that funding global health improves the quality of research a bit, which does a bit of the work that you'd have wanted done by funding a research charity. These kinds of indirect effects are hard to map. Maybe people should think more about them.

AFAICT, the basic thing for a post like this one to get right is to compare apples with apples. Tom is trying to evaluate various charities, of which some are evaluators. If he's evaluating the other charities on direct estimates, and is not smoothing the results over by assuming indirect effects, then he should use a broader than lognormal assumption for the evaluators too (and they will be competitive). If he's taking into account that each of the other charities will indirectly support the cause of one another (or at least the best ones will), then he should assume the same for the charity evaluators.

I could be wrong about some of this. A couple of final remarks: it gets more confusing if you think lots of charities have negative value e.g. because of the value of technological progress. Also, all of this makes me think that if you're so convinced that flow-through effects cause many charities to have astronomical benefits, perhaps you ought to be studying these effects intensely and directly, although that admittedly does seem counterintuitive to me, compared with working on problems of known astronomical importance directly.

I largely agree with these considerations about the distribution of net impact of interventions (although with some possible disagreements, e.g. I think negative funging is also possible).

However, I actually wasn't trying to comment on this at all! I was talking about the distribution of people's estimates of impact around the true impact for a given intervention. Sorry for not being clearer :/