
Briefly: There are some good reasons to assess how good you or others are at forecasting (alongside some less-good ones). This is much harder than it sounds, even with a fairly long track record on something like Good Judgement Open to review: all the natural candidate metrics are susceptible to distortion (or 'gaming'). Making comparative judgements more prominent also has other costs. Perhaps all of this should have been obvious at first glance. But it wasn't to me, hence this post.

Motivation

I'm a fan of cultivating better forecasting performance in general and in the EA community in particular. Perhaps one can break down the benefits this way.

Training: Predicting how the future will go seems handy for intervening upon it to make it go better. Given forecasting seems to be a skill that can improve with practice, practising it could be a worthwhile activity.

Community accuracy: History augurs poorly for those who claim to know the future. Although even unpracticed forecasters typically beat chance, they tend to be inaccurate and overconfident. My understanding (tendentiously argued elsewhere) is that taking aggregates of these forecasts - ditto all other beliefs we hold (ibid.) - allows us to fare better than we would each on our own. Forecasting platforms are one useful way to coordinate in such an exercise, and so participation supplies a common epistemic good.[1] Although this good is undersupplied throughout the intellectual terrain, it may be particularly valuable for more 'in house' topics of the EA community, given there is less in the way of 'outside interest'.

Self-knowledge/'calibration': Knowing one's ability as a forecaster is a useful piece of self-knowledge. It can inform how heavily we should weigh our own judgement in those rare cases where our opinion comprises a non-trivial proportion of the opinions we are modestly aggregating (ibid. ad nauseam). Sometimes others ask us for forecasts, often under the guise of advice (I have been doing quite a lot of this with the ongoing COVID-19 pandemic): our accuracy (absolute or relative) would be useful to provide alongside our forecast, so our advice can be weighed appropriately by its recipient.

Epistemic peer evaluation: It has been known for some to offer their opinion despite their counsel not being invited. In such cases, public disagreement can result. We may be more accurate in adjudicating these disagreements by weighing the epistemic virtue of the opposing 'camps' instead of the balance of argument as it appears to us (ibid. - peccavi).

Alas, direct measures of epistemic accuracy can be elusive: people are apt to better remember (and report) their successes than their failures, and track records from things like prop betting or publicly registered predictions tend to be low-resolution. Other available proxy measures for performance - subject matter expertise, social status, a writing style suffused with fulminant cadenzas of melismatic and mellifluous (yet apropos and adroit) limerence of language [sic] - are inaccurate. Forecasting platforms allow people to build a public track record, and paying greater attention to these track records likely improves on whatever rubbish approach is the status quo for judging others' judgement.

Challenges

The latter two objectives require some means of comparing forecasters to one another.[2] This evaluation is tricky for a few reasons:

1. Metrics which allow good inter-individual comparison can interfere with the first two objectives, alongside other costs.

2. Probably in principle (and certainly in practice) natural metrics for this introduce various distortions.

3. (In consequence, said metrics are extremely gameable and vulnerable to Goodhart's law).

Forecasting and the art of slaking one's fragile and rapacious ego

Suppose every EA started predicting on a platform like Metaculus. Also suppose there was a credible means to rank all of them by their performance (more later). Finally, suppose this 'Metaculus rank' became an important metric used in mutual evaluation.

Although it goes without saying effective altruists almost perfectly act to further the common good, all-but-unalloyed with any notion of self-regard, insofar as this collective virtue is not adamantine, perverse incentives arise. Such as:

  • Fear of failure has a mixed reputation as an aid to learning. Prevalent worry about 'tanking one's rank' could slow learning and improvement, and result in poorer individual and collective performance.
  • People can be reluctant to compete when they believe they are guaranteed to lose. Whoever finds themselves in the bottom 10% may find excusing themselves from forecasting more appealing than continuing to broadcast their inferior judgement (even your humble author [sic - ad nau- nvm] might not have written this post if he were miles below par on Good Judgement Open). This is bad for these forecasters (getting better in absolute terms still matters), and for the forecasting community (relatively poorer forecasters still provide valuable information).
  • Competing over relative rank is zero-sum. To win in zero-sum competition, it is not enough that you succeed - all others must fail. Good reasoning techniques and new evidence are better kept as jealously guarded 'trade secrets' than publicly communicated. Yet sharing them helps one another to get better, and the 'wisdom of the crowd' to be wiser.

Places like GJO and Metaculus are aware of these problems, and so do not reward relative accuracy alone, whether by offering separate metrics (badges for giving your rationale, 'upvotes' on comments, etc.) or by making their 'ranking' measures composites of accuracy and other things like activity (more later).[3]

These composite metrics are often better. Alice, who starts off a poor forecaster but through diligent practice becomes a good (but not great) and regular contributor to a prediction platform has typically done something more valuable and praiseworthy than Bob, who was naturally brilliant but only stuck around long enough to demonstrate a track record to substantiate his boasting. Yet, as above, sometimes we really do (and really should) care about performance alone, and would value Bob's judgement over Alice's.

Hacking scoring metrics for minimal fun and illusory profit

Even if we ignore the above, constructing a good metric of relative accuracy is much easier said than done. Even if we want to (as Tetlock recommends) 'keep score' of our performance, essentially all means of keeping score either introduce distortions, are easy to Goodhart, or are uninterpretable. To illustrate, I'll use all the metrics available for participating in Good Judgement Open as examples (I'm not on Metaculus, but I believe similar things apply).

Incomplete evaluation and strategic overconfidence: Some measures are only reported for single questions or a small set of questions ('forecast challenges' in the GJO). This can inadvertently reward overconfidence. 'Best performers' for a single question are typically overconfident (and typically inaccurate) forecasters who maxed out their score by betting 0/100% the day a question opened and got lucky.

Sets of questions ('challenges') do a bit better (good forecasters tend to find themselves frequently near the top of the leaderboard), but their small number still allows a lot of volatility. My percentile across question sets on GJO varies from top 0.1% to significantly below average. The former was on a set where I was on the 'right' side of the crowd for all dozen of the questions in the challenge. Yet for many of these I was at something like 20% whilst the crowd was at 40% - even presuming I had an edge rather than overconfidence, I got lucky that none of these events happened. Contrariwise, being (rightly) less confident than the crowd will pay out in the long run, but the modal result over a small question set is getting punished. The latter was a set where I 'beat the crowd' on most of the low probabilities, but tanked on an intermediate-probability one - Brier scoring and the non-normalised addition of absolute differences meant this one question explained most of the variance in performance across the set.[4]

If one kept score by one's 'best rankings', one's number of 'top X finishes', or similar, the measure would reward overconfidence: although overconfidence costs you in the long run, it amplifies good fortune.
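To make the 'amplifies good fortune' point concrete, here is a minimal Monte Carlo sketch (my own toy model with made-up numbers, not GJO's actual scoring): a calibrated forecaster and an overconfident one answer 12-question challenges; the overconfident forecaster has the clearly worse long-run Brier score, yet still 'wins' a substantial minority of individual challenges, and by larger margins when they do.

```python
# Toy model: 'best finish' style metrics flatter overconfidence on small question sets.
# Assumptions (mine): 12 binary questions per challenge, two-outcome Brier (0-2 range),
# a calibrated forecaster who reports the true probability, and an overconfident one
# who extremizes it toward 0 or 1.
import random

def brier(p, outcome):
    """Two-outcome Brier score for a binary question (0 = perfect, 2 = worst)."""
    return 2 * (p - outcome) ** 2

def extremize(p, gamma=3.0):
    """Push a probability toward 0 or 1 (a crude model of overconfidence)."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma)

def run_challenge(n_questions=12):
    """Return (calibrated, overconfident) average Brier for one small challenge."""
    cal = over = 0.0
    for _ in range(n_questions):
        p = random.uniform(0.05, 0.5)               # true probability of 'Yes'
        outcome = 1 if random.random() < p else 0
        cal += brier(p, outcome)
        over += brier(extremize(p), outcome)
    return cal / n_questions, over / n_questions

random.seed(0)
n_challenges = 20_000
wins = 0
cal_total = over_total = 0.0
for _ in range(n_challenges):
    cal, over = run_challenge()
    cal_total += cal
    over_total += over
    wins += over < cal

print(f"Long-run mean Brier: calibrated {cal_total / n_challenges:.3f}, "
      f"overconfident {over_total / n_challenges:.3f}")
print(f"Share of challenges 'won' by the overconfident forecaster: {wins / n_challenges:.0%}")
# The overconfident forecaster is worse on average, but tops the calibrated one in a
# sizeable fraction of 12-question challenges - and by larger margins when they do.
```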

Activity loading: The leaderboard for challenges isn't ranked by Brier score (more later), but by 'accuracy': essentially your Brier score minus the crowd's Brier score. GJO evaluates each day, and 'carries forward' forecasts made before (i.e. if you say 52% on Monday, you are counted as forecasting 52% on Tuesday, Wednesday, and every day until the question closes unless you change it). Thus - if you are beating the crowd - your 'accuracy score' is also partly an activity score: answering all the questions, having active forecasts as soon as questions open, and so on all improve one's score (presuming one is typically beating the crowd) without being measures of good judgement per se.

Metaculus ranks all users by a point score which (like this forum's karma system) rewards a history of activity rather than 'present performance': even if Alice were more accurate than all current Metaculus users, if she joined today it would take her a long time to rise to the top of the rankings.

Raw performance scores are meaningless without a question set: Happily, GJO uses a fairly pure 'performance metric' front and centre: Brier score across all of your forecasts. Although there are ways to 'grind' activity into accuracy (updating very frequently, having a hair trigger to update on news a couple of days before others get around to it, etc.) it loads much more heavily on performance. It also (at least in the long run) punishes overconfidence.

The problem is that this measure has little meaning on its own. A Brier score of (e.g.) 0.3 may be good or bad depending on how hard it was to forecast your questions - and one can get arbitrarily close to 0 by 'forecasting the obvious' (i.e. putting 100% on 'No' the day before a question closes and the event has not happened yet). One can fairly safely say your performance isn't great if you're underperforming a coin flip, but more than that is hard to say.
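To illustrate both points, here is a minimal sketch of daily Brier averaging with carry-forward (my own simplification of the GJO scheme described above, with made-up numbers): a forecaster who only 'forecasts the obvious' on the final day posts a near-perfect score, while a forecaster who grappled with the question all year looks worse despite doing something far harder.

```python
# A simplified model of GJO-style scoring: a binary question, scored only from one's
# first forecast onward, with each forecast carried forward until replaced.
def average_daily_brier(forecasts, outcome, n_days):
    """forecasts: dict mapping day -> probability of 'Yes'; outcome: 1 or 0."""
    daily = []
    current = None
    for day in range(1, n_days + 1):
        if day in forecasts:
            current = forecasts[day]      # a new forecast replaces the old one
        if current is not None:           # no score before the first forecast
            daily.append(2 * (current - outcome) ** 2)
    return sum(daily) / len(daily)

N_DAYS = 365
OUTCOME = 0  # the event did not happen

# Alice forecasts 30% on day 1 and walks it down every month as evidence accrues.
alice = {day: 0.3 * (1 - day / N_DAYS) for day in range(1, N_DAYS + 1, 30)}
# Bob only shows up the day before close, when 'No' is obvious, and bets 1%.
bob = {N_DAYS: 0.01}

print(f"Alice: {average_daily_brier(alice, OUTCOME, N_DAYS):.4f}")   # roughly 0.06
print(f"Bob:   {average_daily_brier(bob, OUTCOME, N_DAYS):.4f}")     # 0.0002
# Bob's score looks nearly perfect despite adding nothing we didn't already know.
```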

Comparative performance is messy with disjoint evaluation sets: To help with this, GJO provides a 'par score' as a benchmark, composed of median performance across other forecasts on the same questions and time periods as one's own. This gives a fairly reliable single bit of information: a Brier score of 0.3 is good if it is lower than this median, as it suggests one has outperformed the typical forecasts on these questions (and vice versa).

However, it is hard to say much more than that. If everyone answered the same questions across the same time periods, one could get a sense of (e.g.) 'how much better than the median' someone is, or which of Alice and Bob (who are both 'better than average') is better. But when (as is typically the case on forecasting platforms) people forecast disjoint sets of questions, this goes up in the air again. 'How much you beat the median by' can depend as much on 'picking the right questions' as on 'forecasting well':

  • If the median forecast is already close to the forecast you would make, making a forecast can harm your 'average outperformance' even if you are right and the median is wrong (at the extreme, making a forecast identical to the median inevitably drags one closer to par). This can bite a lot for low-likelihood questions: if the market is at 2% but you think it should be 8%, even if you're right you will a) typically lose in the short run, and b) even in the long run earn a payoff small enough that, aggregated across your forecasts, you still look close to the average (rough numbers in the sketch below).
  • On GJO the more political questions tend to attract less able forecasters, and the niche/technical ones more able ones. For questions where the comments are often "I know X is a fighter, they will win this!" or "Definitely Y, as [conspiracy theory]", the bar to 'beat the crowd' is much lower than for questions where the comments are more "I produced this mathematical model for volatility to give a base-rate for exceeding the cut-off value which does well back-tested on the last 2 years - of course, back-testing on training data risks overfitting (schoolboy error, I know), but this also corresponds with the probability inferred from the price of this exotic financial instrument".

This is partly a good thing, as it incentivises forecasters to prioritise questions where the current consensus forecast is less accurate. But it remains a bad one for the purposes discussed here: even if Alice is, in some platonic sense, a superior forecaster, Bob may still appear better if he happens to (or strategizes to) benefit from these trends more than her.[5]
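To put rough numbers on the first bullet above (my own toy calculation, using the two-outcome Brier convention): suppose the crowd says 2%, you say 8%, and the true probability really is 8%.

```python
def brier(p, outcome):
    """Two-outcome Brier score for a binary question."""
    return 2 * (p - outcome) ** 2

p_true, p_you, p_crowd = 0.08, 0.08, 0.02

# Short run: most of the time the event does not happen, and you look worse.
print(f"If 'No' (probability {1 - p_true:.0%}): you {brier(p_you, 0):.4f} "
      f"vs crowd {brier(p_crowd, 0):.4f} -> you look worse")
print(f"If 'Yes' (probability {p_true:.0%}): you {brier(p_you, 1):.4f} "
      f"vs crowd {brier(p_crowd, 1):.4f} -> you look much better")

# Long run: your expected edge per question is real but small (~0.007 Brier),
# easily swamped once averaged with questions where you simply matched the median.
edge = (p_true * (brier(p_crowd, 1) - brier(p_you, 1))
        + (1 - p_true) * (brier(p_crowd, 0) - brier(p_you, 0)))
print(f"Expected Brier advantage over the crowd per question: {edge:.4f}")
```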

Admiring the problem: an Ode to Goodhart's law

One might have hoped that, as well as describing this problem, I would have some solutions to propose. Sadly, this is beyond me.

The problem seems fundamentally tied to Goodhart's law (q.v.). People may have a mix of objectives for forecasting, and this mix may differ between people. Even a metric closely tailored to one objective probably will not line up with it perfectly, and the incentives to 'game' it would amplify the divergence. With respect to other important objectives, one expects a greater mismatch: a metric that rewards accuracy can discourage participation (see above); a metric that rewards participation can encourage people to 'do a lot' without necessarily 'adding much' (or trying to get much better). Composite metrics can help with this problem, but introduce another one in turn: the difficulty of isolating and evaluating individual aspects of the composite.

Perhaps best is a mix of practical wisdom and balance - taking the metrics as useful indicators but not as targets for monomaniacal focus. Some may be better at this than others.


[1]: Aside: one skill in forecasting which I think is neglected is formulating good questions. Typically our convictions are vague gestalts rather than particular crisp propositions. Finding crisp 'proxy propositions' which usefully inform these broader convictions is an under-appreciated art.

[2]: Relative performance is a useful benchmark for peer weighting as well as self-knowledge. If Alice tends to be worse than the average person at forecasting, she would be wise to be upfront about this lest her 'all things considered' judgements inadvertently lead others (who would otherwise aggregate better by discounting her view rather than giving it presumptive equal weight) astray.

[3]: I imagine it is also why (e.g.) the GJP doesn't spell out exactly how it selects the best GJO users for superforecaster selection.

[4]: One can argue the toss about whether there are easy improvements. One could make a scoring rule more sensitive to accuracy on rare events (Brier is infamously insensitive), or do some intra-question normalisation of accuracy. The downside is that this would be intensely gameable, encouraging a 'picking up pennies in front of a steamroller' strategy - overconfidently predicting that rare events definitely won't happen will typically net one a lot of points, with the occasional massive bust.

[5]: One superforecaster noted taking a similar approach (but see).

Comments

Great post! As you allude to, I'm increasingly of the opinion that the best way to evaluate forecaster performance is via how much respect other forecasters give them. This has a number of problems:

  • The signal is not fully transparent: people who don't do at least a bit of forecasting (or are otherwise engaged with forecasters) will be at a loss about which forecasters others respect.
  • The signal is not fully precise: I can give you a list of forecasters I respect and a loose approximation of how much I respect them, but I'd be hard-pressed to give a precise rank ordering.
  • Forecasters are not immune to common failures of human cognition: we might expect demographic or ideological biases to creep into forecasters' evaluations of each other.
    • Though at least in GJP/Metaculus-style forecasting, the frequent pattern of (relative) anonymity hopefully alleviates this a lot
  • There are other systematic biases in subjective evaluation of ability that may diverge from "Platonic" forecasting skill
    • One that's especially salient to me is that (I suspect) verbal ability likely correlates much more poorly with accuracy than it does with respect.
    • I also think it's plausible that, especially in conversation, forecasters tend to overweight complex explanations/nuance more than is warranted by the evidence.
  • It just pushes the evaluation problem up one level: how do forecasters evaluate each other?

However, as you mention, other metrics have as many if not more problems. So on balance, I think as of 2020, the metric "who do other forecasters respect" currently carries more signal than any other metric I'm aware of.

That said, part of me still holds out hope that "as of 2020" is doing most of the work here. Forecasting in many ways seems to me like a nascent and preparadigm field, and it would not shock me if in 5-15 years we have much better ontologies/tools of measurement so that (as with other more mature fields) more quantified metrics will be better in the domain of forecaster evaluation than loose subjective human impressions.

Alice, who starts off a poor forecaster but through diligent practice becomes a good (but not great) and regular contributor to a prediction platform has typically done something more valuable and praiseworthy than Bob

I think this is an underrated point. Debating praiseworthiness seems like it can get political real fast, but I want to emphasize the point about value: there are different reasons you may care about participation in a forecasting platform, for example:

  • "ranking" people on a leaderboard, so you can use good forecasters for other projects
  • you care about the results of the actual questions and the epistemic process used to gain those results.

For the latter use case, I think people who participate regularly on the forecasting platforms, contribute a lot of comments, etc, usually improve group epistemics much more than people who are unerringly accurate on just a few questions.

Metaculus, as you mention, is aware of this, and (relative to GJO) rewards activity more than accuracy. I think this has large costs (in particular I think it makes the leaderboard a worse signal for accuracy), but is still on balance better.

__

A side note about Goodhart's law: I directionally agree with you. Goodhart's law (related: optimizer's curse, specification gaming) is a serious issue to be aware of, but (as with nuance) I worry that in EA discussions of Goodhart's law there's a risk of being "too clever." Any time you try to collapse the complex/subtle/multivariate/multidimensional nature of reality to a small set of easily measurable/quantifiable dimensions (sometimes just one), you end up losing information. You hope that none of the information you lose is particularly important, but in practice this is rarely true.

Nonetheless, it is the case that, to a first approximation, imperfect metrics often work in getting the things you want to get done. For example, the image/speech recognition benchmarks often have glaring robustness holes that are easy to point out, yet I think it's relatively uncontroversial that in many practical use cases, ML perception classifiers, created in large part by academics and industry optimizing along those metrics, are currently at or will soon approach superhuman quality.

Likewise, in many businesses, a common partial solution for principal-agent problems is for managers to give employees metrics of success (usually gameable ones that are only moderately correlated with the eventual goal of profit maximization). This can result in wasted effort via specification gaming, but nonetheless many businesses still end up being profitable as a direct result of employees having concrete targets.

Perhaps best is a mix of practical wisdom and balance - taking the metrics as useful indicators but not as targets for monomaniacal focus. Some may be better at this than others.

I think (as with some of our other "disagreements") I am again violently agreeing with you. Your position seems to be "we should take metrics as useful indicators but we should be worried about taking them too seriously" whereas my position is closer to "we should be worried about taking metrics too seriously, but we should care a lot about the good metrics, and in the absence of good metrics, try really hard to find better ones."

The rewarding-more-active-forecasters problem seems severe and I'm surprised it's not getting more attention. If Alice and Bob both forecast the result of an election, but Alice updates her forecast every day (based on the latest polls) while Bob only updates his forecast every month, it doesn't make sense to compare their average daily Brier score.

Aha, off the top of my head one might go in the directions of (a) a TD-learning type of reward; (b) variance reduction for policy evaluation.

After thinking for a few more minutes, it seems that forecasting more often but at random moments shouldn't impact the expected Brier score. But in practice frequent forecasters are evaluated with respect to a different distribution of days (one which favors information gain/"something relevant just happened") — so maybe some sort of importance sampling might help to equalize these two groups?

After thinking for a few more minutes, it seems that forecasting more often but at random moments shouldn't impact the expected Brier score.

In my toy example (where the forecasting moments are predetermined), Alice's Brier score for day X will be based on a "fresh" prediction made on that day (perhaps influenced by a new surprising poll result), while Bob's Brier score for that day may be based on a prediction he made 3 weeks earlier (not taking into account the new poll result). So we should expect that the average daily Brier score will be affected by the forecasting frequency (even if the forecasting moments are uniformly sampled).

In this toy example the best solution seems to be using the average Brier score over the set of days in which both Alice and Bob made a forecast. If in practice this tends to leave us with too few data points, a more sophisticated solution is called for. (Maybe partitioning days into bins and sampling a random forecast from each bin? [EDIT: this mechanism can be gamed.])
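A minimal sketch of the 'common days' comparison suggested above (my own illustration with made-up numbers): restrict scoring to the days on which both forecasters made a fresh forecast.

```python
# Toy data: Alice updates daily, Bob every 30 days, on a question trending toward 'No'.
outcome = 0
alice_forecasts = {d: 0.30 - 0.0007 * d for d in range(0, 90)}       # daily updates
bob_forecasts = {d: 0.35 - 0.0007 * d for d in range(0, 90, 30)}     # monthly updates

def avg_brier(forecasts, days):
    """Average two-outcome Brier score over the given days (fresh forecasts only)."""
    return sum(2 * (forecasts[d] - outcome) ** 2 for d in days) / len(days)

common_days = sorted(set(alice_forecasts) & set(bob_forecasts))
print(f"Common days: {common_days}")
print(f"Alice: {avg_brier(alice_forecasts, common_days):.4f}, "
      f"Bob: {avg_brier(bob_forecasts, common_days):.4f}")
# Scoring only days 0, 30, 60 compares like with like (no stale carried-forward
# forecasts for Bob) - at the cost of discarding most of the data, as noted above.
```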

The long-term solution here is to allow forecasters to predict functions rather than just static values. This solves problems of things like people needing to update for time left.

In terms of the specific example though, I think if a significant new poll comes out and Alice updates and Bob doesn't, Alice is a better forecaster and deserves more reward than Bob.

The long-term solution here is to allow forecasters to predict functions rather than just static values. This solves problems of things like people needing to update for time left.

Do these functions map events to conditional probabilities (i.e. mapping an event to the probability of something conditioned on that event happening)? What would this look like for the example of forecasting an election result?

In terms of the specific example though, I think if a significant new poll comes out and Alice updates and Bob doesn't, Alice is a better forecaster and deserves more reward than Bob.

Suppose Alice encountered the important poll result because she was looking for it (as part of her effort to come up with a new forecast). At the end of the day what we really care about is how much weight we should place on any given forecast made by Alice/Bob. We don't directly care about the average daily Brier score (which may be affected by the forecasting frequency). [EDIT: this isn't true if the forecasting platform and the forecasters' incentives are the same when we evaluate the forecasters and when we ask the questions we care about.]

Suppose Alice encountered the important poll result because she was looking for it (as part of her effort to come up with a new forecast).

This makes Alice a better forecaster, at least if the primary metric is accuracy. (If the metric includes other factors like efficiency, then we need to know eg. how many more minutes, if any, Alice spends than Bob).

At the end of the day what we really care about is how much weight we should place on any given forecast made by Alice/Bob.

If Alice updates daily and Bob updates once a month, and Alice has a lower average daily Brier score, then all else being equal, if you saw their forecasts at a random day, you should trust Alice's forecasts more*.

If you happen to see their forecasts on the day Bob updates, I agree this is a harder comparison, but I also don't think this is an unusually common use case.

I think part of the thing driving our intuition differences here is that I think lack of concurrency of forecasts (timeliness of opinions) is often a serious problem "in real life," rather than just an artifact of the platforms. In other words, you are imagining that whether to trust Alice at time t vs Bob at time t-1 is an unfortunate side effect of forecasting platforms, and "in real life" you generally have access to concurrent predictions by Alice and Bob. Whereas I think the timeliness tradeoff is a serious problem in most attempts to get accurate answers.

If you're trying to decide whether e.g. a novel disease is airborne, you might have the choice of a meta-analysis from several months back, an expert opinion from 2 weeks ago, the median of a prediction market that closed last week, or a single forecaster's opinion today.

___

Griping aside, I agree that there are situations where you do want to know "conditional upon two people making a forecast at the same time, whose forecasts do I trust more?" There are different proposed and implemented approaches around this. For example, prediction markets implicitly get around this problem since the only people trading are people who believe that their forecasts are current, so the latest trades reflect the most accurate market beliefs, etc. (though markets have other problems like the greater fool, especially since the existing prediction markets are much smaller than other markets).

*I've noticed this in myself. I used to update my Metaculus forecasts several times a week, and climbed the leaderboard fairly quickly in March and April. I've since slowed down to averaging an update once every 3-6 weeks for most questions (except for a few "hot" ones or ones I'm unusually interested in). My score has slipped as a result. On the one hand I think this is a bit unfair, since I feel like there's an important "meta" sense in which I've gotten better (more intuitive sense of probability, more acquired subject matter knowledge on the questions I'm forecasting). On the other, I think there's a very real sense that alex alludes to in which LinchSeptember is just a worse object-level forecaster than LinchApril, even if in some important meta-level senses (I like to imagine) I've gotten better.

This makes Alice a better forecaster

As long as we keep asking Alice and Bob questions via the same platform, and their incentives don't change, I agree. But if we now need to decide whether to hire Alice and/or Bob to do some forecasting for us, comparing their average daily Brier score is problematic. If Bob just wasn't motivated enough to update his forecast every day like Alice did, his lack of motivation can be fixed by paying him.

Here is a sketch of a formal argument, which will show that freshness doesn't matter much.

Let's calculate the average Brier score of a forecaster. The contribution of a hypothetical forecast made on day $t$ towards the sum is $b_t$ multiplied by the number of days that forecast stays active. If the forecasting days are sufficiently random, the expected number of days each forecast is active should be equal. Because the weights are then equal in expectation, the expected average Brier score is equal to the average of the Brier scores for all days.

I'm also not sure I follow your exact argument here. But frequency clearly matters whenever the forecast is essentially resolved before the official resolution date, or when the best forecast based on evidence at time t behaves monotonically (think of questions of the type "will event x, which has (approximately) a small fixed probability of happening each day, happen before day y?", where each day passing without x happening should reduce your credence).

I mildly disagree. I think the intuition to use here is that the sample mean is an unbiased estimator of the expectation (this doesn't depend on the frequency/number of samples). One complication here is that we are weighing samples potentially unequally, but if we expect each forecast to be active for an equal number of days this doesn't matter.

 

ETA: I think the assumption of "forecasts have an equal expected number of active days" breaks around the closing date, which impacts things in the monotonic example (this effect is linear in the expected number of active days and could be quite big in extremes).

I'm afraid I'm also not following. Take an extreme case (which is not that extreme, given I think the average number of forecasts per forecaster per question on GJO is 1.something). Alice predicts a year out P(X) = 0.2 and never touches her forecast again, whilst Bob predicts P(X) = 0.3, but decrements proportionately as time elapses. Say X doesn't happen (and say the right ex ante probability a year out was indeed 0.2). Although Alice > Bob on the initial forecast (and so if we just scored that day she would be better), if we carry forward Bob overtakes her overall [I haven't checked the maths for this example, but we can tweak the initial forecasts so he does].

As time elapses, Alice's forecast steadily diverges from the 'true' ex ante likelihood, whilst Bob's converges to it. A similar story applies if new evidence emerges which dramatically changes the probability, if Bob updates on it and Alice doesn't. This seems roughly consonant with things like the stock-market - trading off month (or more) old prices rather than current prices seems unlikely to go well.
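For what it's worth, a quick numeric check of this example (my own reading of 'decrements proportionately', with the two-outcome Brier convention and daily carry-forward) suggests Bob overtakes Alice even without tweaking the initial forecasts:

```python
# Alice: 20% on day 0, never updated. Bob: 30% on day 0, scaled down linearly to ~0
# as the year elapses. The event does not happen; each day is scored against 'No'.
N_DAYS = 365

alice_daily = [2 * 0.20 ** 2 for _ in range(N_DAYS)]                             # constant 0.08
bob_daily = [2 * (0.30 * (N_DAYS - day) / N_DAYS) ** 2 for day in range(N_DAYS)]

print(f"Alice's average daily Brier: {sum(alice_daily) / N_DAYS:.3f}")   # 0.080
print(f"Bob's average daily Brier:   {sum(bob_daily) / N_DAYS:.3f}")     # ~0.060
# Bob's later (smaller) errors outweigh his worse opening forecast.
```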

Thanks, everyone, for engaging with me. I will summarize my thoughts and will likely not actively comment here anymore:

  • I think the argument holds given the assumptions I made: (a) the probabilities of forecasting on each day are proportional across forecasters (previously we assumed uniformity), plus (b) an equal expected number of active days per forecast.
    • > I think intuition to use here is that the sample mean is an unbiased estimator of expectation (this doesn't depend on the frequency/number of samples). One complication here is that we are weighing samples potentially unequally, but if we expect each forecast to be active for an equal number of days this doesn't matter.
  • The second assumption seems to be approximately correct assuming uniformity, but stops working at the edge [around the resolution date], which impacts the average score by something on the order of the expected number of active days per forecast (relative to the question's length).
    • This effect could be noticeable; this is an update for me.
  • Overall, given the setup, I think that forecasting weekly vs. daily shouldn't differ much for forecasts with a resolution date in 1y.
  • I intended to use this toy model to emphasize that the important difference between the active and semi-active forecasters is the distribution of days they forecast on.
  • This difference, in my opinion, is mostly driven by 'information gain' (e.g. breaking news, a poll being published, etc).
    • This makes me skeptical about features such as automatic decay and so on.
    • This makes me curious about ways to integrate information sources automatically.
    • And less so about notifications that community/followers forecasts have significantly changed. [It is already possible to sort by the magnitude of crowd update since your last forecast on GJO].

On a meta-level, I am

  • Glad I had the discussion and wrote this comment :)
  • Confused about people's intuitions about the linearity of EV.
    • I would encourage people to think more carefully through my argument.
  • This makes me doubt I am correct, but still, I am quite certain. I undervalued the corner cases in my initial reasoning. I think I might undervalue other phenomena where models don't capture reality well and hence trigger people's intuitions:
    • E.g. randomness of the resolution day might magnify the effect of the second assumption not holding, but it seems like it shouldn't be given that in expectation one resolves the question exactly once.
  • Confused about not being able to communicate my intuitions effectively.
    • I would appreciate any feedback [not necessary on communication], I have a way to submit it anonymously: https://admonymous.co/misha

This example is somewhat flawed (because forecasting only once breaks the assumption I am making) but might challenge your intuitions a bit :)

I didn't follow that last sentence.

Notice that in the limit it's obvious we should expect the forecasting frequency to affect the average daily Brier score: Suppose Alice makes a new forecast every day while Bob only makes a single forecast (which is equivalent to him making an initial forecast and then blindly making the same forecast every day until the question closes).

re: limit — a nice example. Please notice that Bob makes a forecast on a (uniformly) random day, so when you take an expectation over the days he is making forecasts on, you get the average of the scores for all days, as if he forecasted every day.

Let $T$ be the total number of days, $p_t$ be the probability that Bob forecasts on day $t$ (here $p_t = 1/T$), and $b_t$ be the Brier score of the forecast made on day $t$:

$\mathbb{E}[\text{Bob's average daily Brier}] = \sum_{t=1}^{T} p_t \, b_t = \frac{1}{T} \sum_{t=1}^{T} b_t$

I am a bit surprised that it worked out here, because it breaks the assumption of equality of the expected number of days a forecast will be active. The lack of this assumption will play out when aggregating over multiple questions [weighted by the number of active days]. Still, I hope this example gives helpful intuitions.
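For what it's worth, a minimal numeric check of the identity above (my own illustration, using the carry-forward convention from the post): if Bob forecasts once on a uniformly random day and that forecast is carried forward, every scored day from then until close uses the same forecast, so his average is just $b_t$, and the expectation over $t$ matches the everyday forecaster's average exactly.

```python
# Daily 'fresh' Brier scores b_t for a question drifting toward an obvious 'No'.
T = 100
b = [2 * (0.3 * (T - t) / T) ** 2 for t in range(T)]

daily_forecaster = sum(b) / T                         # forecasts every day
bob_expected = sum((1 / T) * b[t] for t in range(T))  # one forecast, on a uniform random day

print(f"{daily_forecaster:.4f}  {bob_expected:.4f}")  # identical
# The subtlety discussed above (weighting by active days) only enters once Bob makes
# several forecasts, which is where the edge effects appear.
```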

Thanks for the explanation!

I don't think this formal argument conflicts with the claim that we should expect the forecasting frequency to affect the average daily Brier score. In the example that Flodorner gave, where the forecast is essentially resolved before the official resolution date, Alice will have perfect daily Brier scores ($b_t = 0$ for any day $t$ after the de facto resolution), while on those days Bob will have imperfect Brier scores ($b_t > 0$).

Thanks for challenging me :) I wrote my takes after this discussion above.

Do you have a source for the "carrying forward" on gjopen? I usually don't take the time to update my forecasts if I don't think I'd be able to beat the current median but might want to adjust my strategy in light of this.

Also, because the Median score is the median of all Brier scores (and not the Brier score of the median forecast), it might still be good for your Accuracy score to forecast something close to the community's median.

https://www.gjopen.com/faq says:

To determine your accuracy over the lifetime of a question, we calculate a Brier score for every day on which you had an active forecast, then take the average of those daily Brier scores and report it on your profile page. On days before you make your first forecast on a question, you do not receive a Brier score. Once you make a forecast on a question, we carry that forecast forward each day until you update it by submitting a new forecast.

I guess you're right (I read this before and interpreted "active forecast" as "forecast made very recently").

If they also used this way of scoring things for the results in Superforecasting, this seems like an important caveat for forecasting advice derived from the book: for example, the efficacy of updating your beliefs frequently might mostly be explained by this. I previously thought that the results meant that a person who forecasts a question daily will make better forecasts on Sundays than a person who only forecasts on Sundays.
