A model of Animal Charity Evaluators (Oxford Prioritisation Project)
By Dominik Peters, Fellow (with Tom Sittler)
Created: 2017-05-13
Crossposted from the Oxford Prioritisation Project blog. We're centralising all discussion on the Effective Altruism forum. To discuss this post, please comment here.
Summary. We describe a simple simulation model for the recommendations of a charity evaluator like Animal Charity Evaluators (ACE). In this model, the charity evaluator is unsure about the true impacts of the charities in a fixed pool, and can reduce its uncertainty by performing costly research, thereby improving the quality of its recommendation (in expectation). Better recommendations lead to better utilisation of the money moved by ACE. We also describe how we converted the model’s output, which is measured in chicken years averted / $, into “Human-equivalent well-being-adjusted life-years” (HEWALYs) / $.
(This post is an updated version of our previous post on this model. We would like to thank the commenters on this post for their helpful suggestions.)
The model
Charity evaluators, and in particular GiveWell, have been enormously influential and impactful for effective altruists: they seeded the idea of aiming for effectiveness in one’s giving, they incentivised charities to be more transparent and impact-focussed, and (most directly) they have moved dollars donated by effective altruists to higher-impact organisations (e.g., Peter Singer seems to have reallocated some of his donations from Oxfam towards AMF).
While GiveWell’s recommendations in the field of global health seem to be relatively robust (not having changed substantially over several years), charity evaluators in fields with more uncertainty about the best giving opportunities could have substantial impact through arriving at better recommendations. Among existing such organisations, Animal Charity Evaluators (ACE) appears to be a natural candidate: evidence for the effectiveness of interventions (such as leafleting) in the animal sector is rather weak, and some of ACE’s standout charities (such as GFI) engage in complicated mixes of activities that are difficult to model and evaluate rigorously.
To see whether ACE may be a good target for our grant, we will set up a simple quantitative model of the activities of charity evaluators. We model ACE, but the model is cause-neutral and can be applied to any charity evaluator. Of course, the model is a great simplification of what the real ACE does. In the model, ACE gathers more evidence about the impact of various charities over time, and based on the available evidence recommends a single charity to its followers. In each time period, these followers give a fixed amount to the top recommendation. Thus, the amount donated does not change with changing recommendations, and donors do not have an “outside option”.
Definition of “evidence”. The evidence gathered by ACE could come in various guises: it could be survey results or RCTs, intuitive impressions after conversations with charity staff, new arguments heard in favour of or against a certain intervention, or even just intuitive hunches. The model is agnostic about what type of evidence is used; we only require that it comes in the form of a point estimate of the impact (per dollar) of a given charity. For now, we do not model the strength of this evidence, and there is no Bayesian updating. Rather, if ACE has gathered several pieces of evidence in the form of point estimates, ACE will take their average as an overall estimate.
The model. Here, then, is our proposed model in pseudocode form. All model parameters are themselves chosen at random from a lognormal distribution specified by a [5%, 95%] confidence interval, as in our other models built with Guesstimate.

There is a fixed pool of charities that ACE will evaluate, which consists of 10–15 charities. This is approximately the number of top and standout charities that ACE currently recommends.

Ground truth. Each charity in the pool has a true impact (per dollar). For each charity, we decide this true impact by randomly sampling from a lognormal distribution. The parameters are chosen to go through a [5%, 95%] confidence interval, where the lower bound is given by ACE’s quantitative estimates of its lower-performing standout charities (such as VEBU and Vegan Outreach, for about 0.5 years of suffering averted / $) and the upper bound is given by ACE’s estimate for its top charities (Mercy for Animals and The Humane League, at about 10 years of suffering averted / $). The true impact of the charity will stay fixed over time.
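This ground-truth step can be sketched in a few lines of Python. The helper name `lognormal_from_ci` and the pool size of 12 are illustrative choices for this sketch, not necessarily what the actual script does:

```python
import math
import random

def lognormal_from_ci(lo, hi, z=1.645):
    """Parameters (mu, sigma) of a lognormal distribution whose
    [5%, 95%] interval is approximately [lo, hi]; z is the 95th
    percentile of the standard normal."""
    mu = (math.log(lo) + math.log(hi)) / 2
    sigma = (math.log(hi) - math.log(lo)) / (2 * z)
    return mu, sigma

# True impacts for a pool of 12 charities, in years of suffering
# averted / $, using the [0.5, 10] interval quoted above.
mu, sigma = lognormal_from_ci(0.5, 10)
true_impacts = [random.lognormvariate(mu, sigma) for _ in range(12)]
```

Note that the median of this distribution is exp(mu) = sqrt(0.5 × 10) ≈ 2.2 years of suffering averted / $.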

For each time period t = 1, …, T:

Evidence gathering strategy. For each charity in the pool, ACE collects a single item of evidence in the form of a point estimate of its impact. We arrive at this piece of evidence by randomly sampling a number from a normal distribution centred at the true impact of the charity. Based on comments on the previous version of this piece, a distribution with wider tails than those of a normal distribution might be preferable, but we did not identify a good alternative symmetric distribution. The standard deviation of the normal distribution we used was 10–20 years of suffering averted / $, which is the approximate standard deviation one can see in ACE’s Guesstimate models of total impact.

Recommendation. For each charity in the pool, we calculate the average of the point estimates that we have collected in this and previous time periods. We select and recommend the charity for which this average is highest.

Payoff. ACE’s followers donate approximately $3.5m to the recommended charity. ACE’s impact in this time period is the difference in the true impact of the charity recommended at time t versus the charity recommended at time t-1, multiplied by the money moved, divided by their operating costs of approximately $0.3m. The numbers for money moved and research costs approximately follow ACE’s annual report for 2016.
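The three steps above can be sketched as a single simulation run. This is a minimal Python version under the stated assumptions; the function name and the default parameter values (a noise standard deviation of 15 years of suffering averted / $, 5 periods) are illustrative, not the exact script we ran:

```python
import random

def simulate_run(true_impacts, sd=15.0, periods=5,
                 money_moved=3.5e6, costs=0.3e6):
    """One run of the model: each period adds one noisy point estimate
    per charity, the charity with the highest running average is
    recommended, and ACE is credited with the change in true impact of
    the recommendation, times money moved, per dollar of its costs."""
    n = len(true_impacts)
    estimates = [[] for _ in range(n)]
    previous = None
    impacts = []  # ACE's impact per dollar of its costs, by period
    for _ in range(periods):
        for i in range(n):
            estimates[i].append(random.gauss(true_impacts[i], sd))
        recommended = max(range(n),
                          key=lambda i: sum(estimates[i]) / len(estimates[i]))
        gain = 0.0 if previous is None else (
            true_impacts[recommended] - true_impacts[previous])
        impacts.append(gain * money_moved / costs)
        previous = recommended
    return impacts
```

The headline figure is then obtained by averaging the last few entries of this list over many runs.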

The model is then run for 4–6 rounds, and the impact is calculated for the last round. Most of the time, the top charity does not change in the very last round, so that 0 impact / $ is achieved. Occasionally, the quality of the recommendation decreases, because ACE has sampled wrong data, in which case value is destroyed. More often, the recommendation improves, creating value amplified by ACE’s money-moved factor of approximately 10x operating costs.
To smooth out our impact estimates, we actually take the average impact over the last 3 rounds in the model, so that the fraction of times in which 0 impact is achieved is smaller. This is to aid in the model aggregation process, where the final impact distribution will be fitted to a (double) lognormal distribution.
The model is then simulated 50,000 times, and the average impact over all model runs is calculated, which comes out at about 6 years of suffering averted / $. The list of 50,000 impact estimates is then passed to the central aggregation process.
We have implemented this model using a simple Python script that simulates the process; the code is available and can be run on repl.it.
Observations
Three observations:

It can be better to just donate to the top charity. For our model parameters, in expectation it is better value to donate to the charity that is currently recommended, rather than help ACE run its next evaluation round. This result is relatively robust to changes in the underlying distribution. The “problem” is that ACE will likely identify pretty good charities very early on, and additional rounds do not lead to much change. With our parameters, donating directly to the recommended organisation is ~30% more cost-effective. One can reverse this conclusion by assuming a money-moved factor (money moved divided by operating costs) that is higher than 10x. This suggests that charity evaluators should focus on increasing their money moved. Of course, one way this can be done is by producing higher-quality research that will then attract more donors. Nevertheless, this conclusion surprised us a lot.

Initial overconfidence. In most simulation runs, the impact estimate of the recommended charity at time t = 1 is much higher than the true impact of the charity. This is not surprising: because ACE recommends whichever charity got the highest point estimate, it will almost certainly recommend an organisation for which it has seen badly inflated evidence. Arguably, the behaviour of this model mirrors certain real-life phenomena. For example, GiveWell’s estimate for the cost of saving a life has been revised up several times over the years; and in the animal space a few years ago, both online ads and leafleting seemed like they would have much more impact than what is commonly estimated today.
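This selection effect (sometimes called the optimiser's curse) is easy to demonstrate in isolation. The sketch below uses hypothetical parameters, with all charities given the same true impact so that any overshoot of the selected estimate is pure noise:

```python
import random

random.seed(0)
trials = 10_000
pool, sd, true = 12, 15.0, 5.0  # 12 charities, equal true impact

inflated = 0
for _ in range(trials):
    # the recommended charity is the one with the highest point estimate
    best_estimate = max(random.gauss(true, sd) for _ in range(pool))
    if best_estimate > true:
        inflated += 1

fraction_inflated = inflated / trials
# with 12 independent draws, the selected estimate exceeds the truth
# with probability 1 - 0.5**12, i.e. in about 99.98% of runs
```

So at t = 1 the estimate attached to the winner is almost always an overestimate, even though each individual estimate is unbiased.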

Decreasing returns over time. The change in true impact of the recommended charity is higher in earlier time rounds than in later ones. For example, with our current parameter choices, funding ACE at time t = 1 has approximately 10 times as much impact as at time t = 10, averaging over 10,000 simulation runs. This is because the estimates get closer to the truth over time, and so the top recommendations will change less frequently, and when they do, the difference in impact will often be small. This observation matches our thought above that funding GiveWell (being relatively mature) seems less impactful than funding the relatively young ACE.
Unit conversion
A final step for this model is a unit conversion. Our models for our three other shortlisted organisations estimate human-equivalent well-being-adjusted life-years created per $, whereas this model estimates chicken years on a factory farm averted per $. How to convert between these units is not obvious, and any choice is controversial. We decided to obtain an “exchange rate” by querying team members’ intuitions, and taking medians.
To obtain this exchange rate, we proceeded in two steps. First, we asked ourselves how bad a year of life on a factory farm is compared to how good an average healthy year of life is, keeping species fixed. That is, we considered a thought experiment where there are “human factory farms”, and asked how many extra years of healthy life we would demand in order to accept being kept on a factory farm as a human. Reports ranged from 10 to 100 years, with a median of 50 years. Next, we asked how bad a year of life for a chicken on a factory farm is compared to the same situation for a human. That is, we asked for what value of n we would be indifferent between saving n chickens from a factory farm versus saving 1 human from a factory farm. Reports ranged from 10 to 1,000, with a median of 400. These values together can be used to obtain the required exchange rate.
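One way to combine the two medians into an exchange rate is the simple ratio below. This is a plausible reading of the procedure, not necessarily the exact aggregation the team used:

```python
# Median answers from the two thought experiments above.
badness_ratio = 50   # factory-farm year is ~50x as bad as a healthy year is good
species_ratio = 400  # ~400 chickens are equivalent to 1 human at indifference

# HEWALYs gained per chicken-year on a factory farm averted
hewalys_per_chicken_year = badness_ratio / species_ratio  # 0.125

# Applied to the model's average output of ~6 chicken-years averted / $
hewalys_per_dollar = 6 * hewalys_per_chicken_year
```

Under these medians, one averted chicken-year is worth 0.125 HEWALYs, so the model's output of ~6 chicken-years averted / $ corresponds to roughly 0.75 HEWALYs / $.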
This post was submitted for comment to Animal Charity Evaluators before publication.
Comments (4)
While I think this is a good simplifying assumption, it's incorrect and potentially makes a dramatic change to your model. The reason is that I think this assumption is what implies that "ACE will likely identify prettygood charities very early on, and additional rounds do not lead to much change".
However, I'd view ACE as potentially still building research capacity to eventually evaluate more speculative and harder-to-understand options (such as the recent recommendation of The Good Food Institute) that previously could not be evaluated and may end up being cost-effective.
I also think this capacity will produce lumpy breakthroughs in evaluating costeffectiveness and refining accuracy. Many of these breakthroughs have not happened yet and I could see them potentially dramatically changing the top charity list for ACE.
I don't have strong views on whether ACE is the best place for donations, all charities and causes considered, but I do strongly think that assuming ACE has already hit diminishing returns to research investment is a mistake and I do weakly think that building more research capacity and direct research are the most important investments in the animal-interested EA space.
(Disclaimer: I'm on the board of Animal Charity Evaluators, but only speak for myself here. I do not speak for ACE and I may have (and often do have) differing opinions than the ACE consensus.)
So there are a couple of claims here.
(i) ACE is building research capacity
(ii) ACE having more of this capacity in the future will enable them to evaluate a larger number of charities (including ones that are harder to evaluate).
(iii) ACE having more of this capacity in the future will enable them to evaluate charities with higher expected impact.
I'm not sure whether you're claiming (ii) or (iii) or both. And could you say a bit about what evidence you see for these three claims? Thanks for the useful comment!
I would strongly assert (i) and (ii): ACE has used money to hire more research staff, and a good amount of marginal research staff time is going into harder-to-evaluate charities (Good Food Institute is the most prominent example, but there's ongoing work into cultured meat, protests, wild animal suffering, etc.).
I would weakly assert (iii): I don't really know right now whether the EV for these "new" charities is higher (if I did know that, we could argue that maybe they should be recommended now). However, the confidence bounds on something like political work or wild animal suffering, in my personal intuitions, dwarf current ACE top charities in both directions (i.e., a decent chance of being a lot better or a lot worse than current top charities). I hope to see ACE tackle these kinds of things soon!
Tom and Peter:
For an early-stage charity like ACE it seems that capacity building is indeed a very important consideration (related to Ben Todd's point about the growth approach). E.g. it would allow them to move much more money later, and at the moment moving not that much money is a reason why they don't look so good in our model. Unfortunately we aren't able to incorporate this in our quantitative model (IMO another reason to look beyond quantitative models for decision making at this point, but people may have ways of incorporating it quantitatively; it won't be hard to make a theoretical model of R&D, but fitting it empirically will be the big challenge).
On (i), Open Phil's Lewis Bollard's recommendation and ACE's own plan make it look like capacity building is something they try to do.
On (ii) and (iii), these have been true for GiveWell historically. E.g. on (iii), last year they added quite a few top charities. But I don't know ACE enough to say if they will grow in the way GiveWell did.