
Summary

  1. I have an intuition that crowd forecasting could be a useful tool for important decisions like cause prioritization and AI strategy.
  2. I’m not aware of many success stories of crowd forecasts impacting important decisions, so I define a simple framework for how crowd forecasts could be impactful:
    1. Forecasting questions get written such that their forecasts will affect important decisions.
    2. The forecasts are accurate and trustworthy enough.
    3. Organizations and individuals (stakeholders) making important decisions are willing to use crowd forecasts to inform decision making.
  3. I describe illustrations of issues with achieving (1) and (2) above.
  4. I discuss 3 bottlenecks to success stories and possible solutions:
    1. Creating the important questions
    2. Incentivizing time spent on important questions
    3. Incentivizing forecasters to collaborate

Background

I’ve been forecasting at least semi-actively on Metaculus since March 2020, and on GJO and Foretell since mid-2020. I have an intuition that crowd forecasting could be a useful tool for important decisions like cause prioritization and AI strategy, but I haven’t seen many success stories of it being used on these important questions. When I say crowd forecasting, I’m referring mostly to platforms like Metaculus, Foretell, and Good Judgment, though prediction markets like Kalshi and PredictIt share many similar features.

I feel very uncertain about many high-level questions like AI strategy and find it hard to understand how others seem comparatively confident. I’ve been reflecting recently on what would need to happen to bridge the gap in my mind between “crowd forecasting seems like it could be useful” and “crowd forecasting is useful for informing important decisions”. This post represents my best guesses on some ways to bridge that gap.

Success story

Framework

I’ll define the primary impact of forecasting based on how actionable the forecasts are: forecasts are impactful to the extent that they directly affect important decisions. This is the pathway discussed as part of the improving institutional decision making cause area. Note that there are other impact pathways, but I believe this is the one with the most expected impact.

When I say “important decisions”, I’m mostly thinking of important decisions from an EA perspective. Some examples:

  1. What causes should 80,000 Hours recommend as priority paths for people to work in?
  2. How should Open Philanthropy allocate grants within global health & development?
  3. Should AI alignment researchers be preparing more for a world with fast or slow takeoff?
  4. What can the US government do to minimize risk of a catastrophic pandemic?

Crowd forecasting’s success story can then be broken down into:

  1. Forecasting questions get written such that their forecasts will affect important decisions.
    1. Bottleneck here: Creating important questions.
  2. The forecasts are accurate and trustworthy enough.
    1. Bottlenecks here: Incentivizing time spent on important questions and Incentivizing forecasters to collaborate.
  3. Institutions and individuals (stakeholders) making important decisions are willing to use crowd forecasts to inform decision making.
    1. I’m skeptical that this is presently the primary bottleneck in the EA community, because I’ve observed cases where stakeholders seemed interested in using crowd forecasting but useful forecasts didn’t get produced; that said, many reasonable people disagree. See the illustration of issues for more.
    2. Either way, I’ll focus on the bottlenecks in (1) and (2), as those are where I have the most experience.

Examples of successes

Potential positive examples of the success story described above thus far include (but aren’t limited to):

  1. Metaculus El Paso COVID predictions being useful for the El Paso government (according to this Forbes article).
  2. Metaculus tournament supporting the decision-making of the Virginia Department of Health (h/t Charles Dillon). See video presentation.
  3. Metaculus + GJO COVID predictions being sent to CDC, see e.g. Forecast: the Impacts of Vaccines and Variants on the US COVID Trajectory.
    1. Though I’m not sure to what extent this actually affected the CDC’s decisions.

Illustration of issues

My impression is that even when organizations are enthusiastic about using crowd forecasting to inform important decisions, it’s difficult for them to get high-quality, action-relevant forecasts. Speaking from the perspective of a forecaster, I personally wouldn't have trusted the forecasts produced from the following questions as a substantial input into important decisions.

A few examples:

[Disclaimer: These are my personal impressions. Creating impactful questions and incentivizing forecaster effort are really hard, and I respect OP/RP/Metaculus a lot for giving it a shot; I would love to be proven wrong about the impact of current initiatives like these.]

  1. The Open Philanthropy/Metaculus Forecasting AI Progress Tournament is the most well-funded initiative I know of, potentially besides those contracting Good Judgment superforecasters, but my best guess is that the forecasts resulting from it will not be impactful. An example is the "deep learning" longest-time-horizon round, where despite Metaculus' best efforts most questions have few or no comments, and at least to me it felt like the bulk of the forecasting skill lay in fitting a continuous distribution via trend extrapolation. See also this question, where the community failed to update appropriately on record-breaking scores. Also note that each question attracted only 25-35 forecasters.
  2. Rethink Priorities’ animal welfare questions, authored by Neil Dullaghan, seem to have the majority of their comments written by Neil himself. I feel intuitively skeptical that most of the 25-45 forecasters per question are doing more than skimming and making minor adjustments to the current community forecast, and this feels like an area where getting up to speed on domain knowledge is important for accurate forecasts.

Bottlenecks

Note: Each bottleneck could merit a post of its own; I’ll aim to give my rough thoughts on each one here.

Creating the important questions

In general, I think it’s just really hard to create questions on which crowd forecasting will be impactful.

Heuristics for impactful questions

Some heuristics that in my view correlate with more impactful questions:

  1. Targets an area of disagreement (“crux”) that, if resolved, would have a clear path to affecting actions.
  2. Targets an important decision from an EA lens.
    1. i.e. in an EA cause area, or cause prioritization.
  3. All else equal, near-term is better than long-term.
    1. It’s much harder to incentivize good predictions on long-term questions.
    2. Quicker resolution can help us determine more quickly which world models are correct (e.g. faster vs. slower AI timelines).
      1. Similarly, near-term questions provide quicker feedback to forecasters on how to improve and to stakeholders on which forecasters are more accurate.
  4. In a domain where generalists can do well relatively quickly.
    1. If a question requires someone to do anywhere close to the work of a report from scratch to make a coherent initial forecast, it’s unrealistic to expect this to happen.
  5. Requires open-ended/creative rather than formulaic reasoning.
    1. We already have useful automatic forecasts for questions for which we have lots of applicable past data. Crowd-based judgmental forecasts are likely more useful in cases requiring more qualitative reasoning, e.g. when we have little past data, good reason to expect trends to break, or the right reference class is unclear.

Failure modes

The following are some failure modes I’ve observed with some existing questions that haven’t felt impactful to me.

  1. Too much trend extrapolation resulting in questions that are hard to imagine being actionable.
    1. See e.g. this complaint regarding Forecasting AI Progress.
  2. Not targeting near-term areas of disagreement.
    1. It’s helpful to target near-term areas of disagreement to determine which camps’ world models are more accurate.
    2. When questions aren’t framed around a disagreement between people’s models, it’s often hard to know how to update on results.
  3. Too long-term / abstract to feel like you can trust without associated reasoning (e.g. AI timelines without associated reasoning).
  4. Requires too much domain expertise / time.
    1. I mostly conceptualize crowd forecasting as an aggregator of opinions given existing research on important questions, since ideally there is rigorous research going on in the area independent of crowd forecasting.
      1. Example: People can read about Ajeya and Tom’s AI forecasts and aggregate these + other considerations into an AI timeline prediction.
      2. But if a question requires someone to do anywhere close to the work of a report from scratch, it’s unrealistic to expect this to happen.
    2. My sense is that for a lot of questions on Metaculus, forecasters don’t really have enough expertise to know what they’re doing.

See this related article: An estimate of the value of Metaculus questions.

Solution ideas

  1. Experiment with best practices for creating impactful questions. Some ideas:
    1. Idea for question creation process: double crux creation.
      1. E.g. take someone whose TAI median is 2030 and someone whose TAI median is >2100, and try to find double cruxes that are as near-term as possible for question creation.
      2. Get signal on whose world model is more correct, + get crowd’s opinion on as concrete a thing as possible.
      3. Related: decomposing questions into multiple smaller questions to find cruxes.
        1. Foretell has done this with stakeholders to predict the future of the DoD-Silicon Valley relationship (see report).
    2. Early warning signs for e.g. pandemics, nukes.
    3. Orgs decompose decision making then release either all questions or some sub-questions to crowd forecasting.
      1. For big questions, some sub-questions will be a better fit for crowd forecasting than others; it’s good for orgs to have an explicit plan for how forecasts will be incorporated into decision-making.
        1. This may already be happening to some extent, but transparency seems good + motivating for forecasters.
      2. Metaculus Causes like Feeding Humanity, in collaboration with GFI, are a great step in the right direction in terms of explicitly aiding orgs’ decision-making.
        1. Ideally there’d be more transparency about which questions will affect GFI’s decision-making in which ways.
      3. Someone should translate the most important cruxes discovered by MTAIR into forecasting questions.
    4. Include text in the question explaining why the question matters and what disagreement exists, as in Countervalue Detonations by the US by 2050?
    5. More research into conditional questions like the Possible Worlds Series.
  2. Write up best practices for creating impactful questions.
    1. https://www.metaculus.com/question-writing/ is a good start, but I’d love to see a longer version with more focus on the "Explain why your question is interesting or important" section, specifically "Explain what decisions will be affected differently by your question."
  3. Incentivize impactful question creation e.g. via leaderboards or money.

Incentivizing time spent on important questions

Failure modes

Some questions are orders of magnitude more important than others but usually don’t get nearly orders of magnitude more effort on crowd forecasting platforms. Effort doesn’t track impact well since the incentives aren’t aligned.

Some more details about the incentive issues, including my experience with them:

  1. Metaculus incentive issue: the incentives push forecasters to spend a little time on lots of questions, but really contributing to the conversation on important questions may require hours if not days of research.
    1. I spent several months predicting and updating on every Metaculus question resolving within the next 1.5 years to move up the sweet Metaculus leaderboard.
      1. On reflection, I realized I should do deep dives into fewer impactful questions, rather than speed forecasting on many questions.
    2. Important questions like Neil’s (action-relevant for Rethink Priorities) often don’t get the extra attention they deserve.
  2. Incentive issue on most platforms: performance on all questions is weighted approximately equally, despite some questions seeming much more impactful than others.

Alignment Problems With Current Forecasting Platforms contains relevant content.

Solution ideas

Some unordered ideas to help with the incentives here:

  1. Give higher weight on leaderboards or cash prizes to more important questions (a rough sketch of one way to do this follows this list).
    1. Note that there could be disagreement over which questions are important, which could be mitigated by (h/t Michael Aird for this point):
      1. Being transparent about the importance-rating process.
      2. Having separate leaderboards for specific tournaments and/or cause areas.
      3. Giving higher weight to some questions within tournaments.
    2. Important questions could be in part determined by people rating questions based on how cruxy they seem.
  2. Metaculus is already moving in the right direction with its prize-backed Forecasting Causes (e.g. Feeding Humanity), the AI progress tournament, and an EA question author.
    1. I’d like to see things continue to move in this direction and similar efforts on other platforms.
  3. Foretell itself has a narrower focus and goal than Metaculus and GJOpen, focusing the whole site on an important area.
  4. Good Judgment model: organizations pay for forecasts on particular questions.
    1. This has strengths and weaknesses; willingness-to-pay is a proxy for impact, but ideally there can also be focus on questions for which good forecasts are a public good.
  5. Note that we could also just increase total forecaster time. I’m a little skeptical because:
    1. It feels more tractable on the margin to shift how existing forecaster time is allocated, and this feels like a bottleneck that should be improved before scaling up.
    2. The pool of very good forecasters might not be that large.
    3. (but we should do some of both)
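As a rough sketch of the first idea above: one simple implementation is an importance-weighted leaderboard, where each question gets a weight set by tournament organizers and a forecaster’s leaderboard score is the weighted average of their per-question relative scores. The function names, example data, and weights below are hypothetical illustrations under those assumptions, not a description of any existing platform’s scoring rule.

```python
# A minimal sketch of an importance-weighted leaderboard (hypothetical).
# Assumes each forecaster already has a per-question relative score
# (e.g. performance relative to the community median) and that question
# weights are assigned by whoever runs the tournament.
from collections import defaultdict

def weighted_leaderboard(results, weights):
    """results: iterable of (forecaster, question_id, relative_score) tuples,
    where a higher relative_score means better-than-crowd performance.
    weights: dict mapping question_id -> importance weight."""
    totals = defaultdict(float)
    weight_sums = defaultdict(float)
    for forecaster, question_id, score in results:
        w = weights.get(question_id, 1.0)  # unrated questions default to weight 1
        totals[forecaster] += w * score
        weight_sums[forecaster] += w
    # Weighted average, so a careful forecast on an important question
    # counts for more than strong performance on unimportant ones.
    return sorted(
        ((f, totals[f] / weight_sums[f]) for f in totals),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Hypothetical example: "q_ai_timelines" is weighted 5x more than "q_misc".
results = [
    ("alice", "q_ai_timelines", 0.12),
    ("alice", "q_misc", 0.02),
    ("bob", "q_ai_timelines", 0.01),
    ("bob", "q_misc", 0.30),
]
print(weighted_leaderboard(results, {"q_ai_timelines": 5.0, "q_misc": 1.0}))
```

Whether to use a weighted average or a weighted sum is itself a design choice (an average rewards accuracy per unit of importance, while a sum still rewards breadth); either way, the key change is that importance enters the score explicitly.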

Ideally, we want to incentivize people to make detailed forecasts on important questions: for example, reading Ajeya’s draft timelines report, updating their AI timelines predictions based on it, and writing up their reasoning for others to see. Which brings me to…

Incentivizing forecasters to collaborate

Failure modes

Due to the competitive nature of forecasting platforms, forecasters are incentivized to not share reasoning that could help other forecasters do better. Some notes on this:

  1. My personal experience:
    1. I noticed that when I shared my reasoning, the community would trend toward my prediction, so it was better to stay silent.
    2. I like to think of myself as altruistic but am also competitive, and similar to how I got sucked in by the Metaculus leaderboard to predict shallowly on many questions, I got sucked into not sharing my reasoning.
  2. An illustrative example: Metaculus AI Progress tournament:
    1. At first, few forecasters left any comments despite the importance of the topic; then, when Metaculus required 3 comments, forecasters generally left very brief, uninformative comments. See the questions for the final round here.
    2. See further discussion here: an important point is that this problem gets worse as monetary incentives are added, which may be needed as part of a solution to incentivizing time spent on important questions.
  3. Another example: In the Hypermind Arising Intelligence tournament, there was very little discussion and transparency of forecasters’ reasoning on the question pages.
  4. Note that incentivizing the sharing of reasoning matters both for improving collaboration (and thus forecast quality) and for getting stakeholders to trust forecasts, so solutions to this seem especially valuable.

See Alignment Problems With Current Forecasting Platforms for more on why collaboration isn’t incentivized currently.

Solution ideas

Some undeveloped solution ideas, again unordered:

  1. Collaborative scoring rules such as Shapley values (see the sketch after this list)
  2. Prizes for insightful comments
  3. More metrics taking into account comment upvotes
  4. Fortified essay prizes (which Metaculus is trying out)
  5. Prizes for comments which affect others’ forecasts
  6. Encourage forecasters to work in teams which compete against each other. Foretell has a teams feature.
  7. Reducing the visibility of comments (h/t Michael Aird). Note that these options have the downsides of less legible expertise for forecasters and slower collaboration.
    1. Only visible to non-forecasting stakeholders (question authors, tournament runners, etc.)
    2. Comments become visible X time after they’re made, e.g. 2 weeks later.
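As a sketch of the first idea in the list above (and not a description of any platform’s actual scoring rule): Shapley-style credit assignment would reward each forecaster according to their average marginal improvement to the aggregate forecast’s accuracy, across all orders in which forecasters could have joined the question. The version below assumes a simple mean aggregator, Brier scoring on a resolved binary question, and a 50% uninformative prior for an empty pool; all of those choices and the example numbers are illustrative assumptions.

```python
# A minimal sketch of Shapley-style credit for forecasters' contributions
# to the accuracy of an aggregate forecast (hypothetical, brute-force).
from itertools import permutations
from math import factorial

def brier(p, outcome):
    """Brier score (lower is better) for probability p of a binary outcome."""
    return (p - outcome) ** 2

def aggregate(probs):
    """Toy aggregator: simple mean; 0.5 uninformative prior if no forecasts yet."""
    return sum(probs) / len(probs) if probs else 0.5

def shapley_credit(forecasts, outcome):
    """Each forecaster's average marginal reduction in the aggregate's Brier
    score, averaged over all orders in which forecasters could have joined."""
    names = list(forecasts)
    n = len(names)
    credit = {name: 0.0 for name in names}
    for order in permutations(names):
        pool = []
        prev_score = brier(aggregate(pool), outcome)
        for name in order:
            pool.append(forecasts[name])
            new_score = brier(aggregate(pool), outcome)
            credit[name] += (prev_score - new_score) / factorial(n)
            prev_score = new_score
    return credit

# Hypothetical example: three forecasters on a question that resolved "yes" (1).
print(shapley_credit({"A": 0.9, "B": 0.7, "C": 0.4}, outcome=1))
```

Brute-force enumeration is exponential in the number of forecasters, so a real platform would need sampling-based approximations; the sketch is only meant to show concretely what "credit for marginal contribution to the crowd" could look like, as opposed to purely relative rankings.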

Other impact pathways

In my current mental model, crowd forecasting will have something like a power-law distribution of impact, where most of its impact comes from affecting a relatively small number of important decisions. Identifying which questions have a large expected impact and producing such questions seems tractable. An alternative impact model is that crowd forecasting will "raise the sanity waterline" and its impact will be more diffuse and hard to measure, making many relatively low-impact decisions better.

Some reasons I’m skeptical of this alternative model:

  1. Some decisions seem so much more important than others.
  2. There’s a fairly large fixed cost to operationalizing a question well and getting a substantial number of reasonable forecasters to forecast on it, a cost I’m not very optimistic about reducing, which makes me skeptical about the diffuse, tons-of-questions model of impact.
  3. I have trouble thinking of plausible ways most Metaculus questions will impact decisions that are even low impact.

A few other impact models seem relevant but not the primary path to impact:

  1. Improving forecasters’ reasoning
  2. Vetting people for EA jobs/grants via forecasting performance

See also this forecasting impact pathways diagram.

Conclusion

I’m excited for further research to be done on using crowd forecasting to impact important decisions.

I’d love to see more work on both:

  1. Estimating the value of crowd forecasting.
  2. Testing approaches for improving the value of crowd forecasting. For example, it would be great if there were a field for studying the creation of impactful forecasting questions.

I’m especially excited about work testing out ways of integrating crowd forecasting with EA-aligned research/modeling, as Michael Aird has been doing with the Nuclear Risk Tournament.

Acknowledgements

This post is revised from an earlier draft outline which I linked to on my shortform. Thanks to Michael Aird, Misha Yagudin, Lizka, Jungwon Byun, Ozzie Gooen, Charles Dillon, Nuño Sempere, and Devin Kim for providing feedback.

Comments (2)

Thanks for writing this up, I think those are a bunch of good points. 

Regarding "Incentivizing forecasters to collaborate", I wonder how cheap of a win it could be to make the comment upvoting on Metaculus more like our karma system. On Metaculus I don’t get notified about upvotes and I don’t have a visible overall upvote score.

As a side note, I personally already get more out of people upvoting my comments on Metaculus than out of my ranking, feels much more like I directly contributed to the wisdom of the crowd with my hot take analysis.

You might be interested in the "Most Likes" and "h-Index" metrics on MetaculusExtras, which does have a visible upvote score. (Although I agree it would be nice to have it on Metaculus proper.)
