I recently interviewed a member of the EA community - Prof Eva Vivalt - at length about her research into the value of using trials in medical and social science to inform development work. The benefits of 'evidence-based giving' have been a core message of both GiveWell and Giving What We Can since they started.
Vivalt's findings somewhat challenge this, and are not as well known as I think they should be. The bottom line is that results from existing studies only weakly predict the results of similar future studies. They appear to have poor 'external validity' - they don't reliably indicate the effect an intervention will appear to have in future studies. This means that developing an evidence base to figure out how well projects will work is more expensive than it otherwise would be.
Perversely, in some cases this can make further studies more informative, because we currently know less than we would if past results generalized well.
Note that Eva discussed an earlier version of this paper at EAG 2015.
Another result that conflicts with messages 80,000 Hours has used before is that experts on average are fairly good at guessing the results of trials (though you need to average many guesses). Aggregating these guesses may be a cheaper alternative to running studies, though the guesses may become worse without trial results to inform them.
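The aggregation point can be illustrated with a toy simulation. This is a sketch with invented numbers, not Vivalt's actual forecasting data: each hypothetical expert guesses the true effect with independent noise, and the averaged guess ends up closer to the truth than the typical individual guess.

```python
# Illustrative simulation (all numbers invented): averaging many noisy expert
# guesses about a trial result tends to beat the typical single guess.
import random
import statistics

random.seed(0)

true_effect = 0.30   # hypothetical true effect size of the trial
n_experts = 100

# Each expert's guess is the truth plus independent noise.
guesses = [true_effect + random.gauss(0, 0.2) for _ in range(n_experts)]

aggregate = statistics.mean(guesses)
err_of_aggregate = abs(aggregate - true_effect)
mean_individual_err = statistics.mean(abs(g - true_effect) for g in guesses)

print(f"error of aggregated guess: {err_of_aggregate:.3f}")
print(f"average individual error:  {mean_individual_err:.3f}")
```

The caveat in the text shows up here too: the simulation assumes each guess is unbiased noise around the truth, which plausibly holds only while trial results exist to anchor the experts.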
Eva's view is that there isn't much alternative to collecting evidence like this - if it's less useful we should just accept that, but continue to run and use studies of this kind.
I'm more inclined to say this should shift our approach. Here's one division of the sources of information that inform our beliefs:
1. Foundational priors
2. Trials in published papers
3. Everything else (e.g. our model of how things work based on everyday experience).
Inasmuch as 2 looks less informative, we should rely more on the alternatives (1 and 3).
Of course Eva's results may also imply that 3 won't generalize between different situations either. In that case, we also have more reason to work within our local environment. It should nudge us towards thinking that useful knowledge is more local and tacit, and less universal and codified. We would then have greater reason to become intimately familiar with a particular organisation or problem and try to have most of our impact through those areas we personally understand well.
It also suggests that problems which can be tackled with published social science may be less tractable - relative to alternative problems we could work on - than they first seem.
You can hear me struggle to figure out how much these results actually challenge conventional wisdom in the EA community later in the episode, and I'm still unsure.
For an alternative perspective from another economist in the community, Rachel Glennerster, you can read this article: Measurement & Evaluation: The Generalizability Puzzle. Glennerster believes that generalisability is much less of an issue than Vivalt does, and is not convinced by how Vivalt has tried to measure it.
There are more useful links and a full transcript on the blog post associated with the podcast episode.
I also conduct research on the generalizability issue, but from a different perspective. In my view, any attempt to measure effect heterogeneity (and by extension, research generalizability) is scale dependent. It is very difficult to tease apart genuine effect heterogeneity from the appearance of heterogeneity due to using an inappropriate scale to measure the effects.
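To see what scale dependence means here, consider a toy example (all numbers invented, and not drawn from my paper): the same treatment effect can look perfectly homogeneous on one scale and strongly heterogeneous on another, so any heterogeneity estimate is relative to the scale chosen.

```python
# Toy illustration (numbers invented): effect heterogeneity depends on the
# scale used to measure the effect.

# Baseline risk of the outcome in two study populations.
baseline = {"study_A": 0.10, "study_B": 0.40}

# Suppose the treatment halves the risk in both populations.
risk_ratio = 0.5
treated = {k: risk_ratio * p for k, p in baseline.items()}

# On the ratio scale the effect is identical across the studies...
rr_A = treated["study_A"] / baseline["study_A"]   # 0.5
rr_B = treated["study_B"] / baseline["study_B"]   # 0.5

# ...but on the difference scale it appears to vary fourfold.
rd_A = treated["study_A"] - baseline["study_A"]   # -0.05
rd_B = treated["study_B"] - baseline["study_B"]   # -0.20

print(rr_A, rr_B)   # identical risk ratios
print(rd_A, rd_B)   # different risk differences
```

A meta-analyst pooling risk ratios would conclude the effect generalizes perfectly; one pooling risk differences would conclude it does not. Neither scale is privileged a priori, which is the motivation for looking for a more natural one.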
To get around this, I have constructed a new scale for measuring effects, which I believe is more natural than the alternative measures. My work on this is available on arXiv at https://arxiv.org/abs/1610.00069 . The paper has been accepted for publication at the journal Epidemiologic Methods, and I plan to post a full explanation of the idea here and on Less Wrong when it is published (presumably, this will be a couple of weeks from now).
I would very much appreciate feedback on this work, and as always, I operate according to Crocker's Rules.
I think 'counterfactual outcome state transition parameters' is a bad name, in that it doesn't help people identify where and why they should use it, nor does it communicate all that well what it really is. I'd want to thesaurus each of the key terms in order to search for something punchier. You might object that essentially 'marketing' an esoteric statistics concept seems perverse, but papers with memorable titles do in fact outperform according to the data AFAIK. Sucks but what can you do?
I bother to go into this because this research area seems important enough to warrant attention and I worry it won't get it.