42

AlexMennen comments on My current thoughts on MIRI's "highly reliable agent design" work - Effective Altruism Forum

You are viewing a comment permalink. View the original post to see all comments and the full post content.

Comments (52)

You are viewing a single comment's thread. Show more comments above.

Comment author: AlexMennen 10 July 2017 06:11:08AM *  2 points [-]

There's a strong possibility, even in a soft takeoff, that an unaligned AI would not act in an alarming way until after it achieves a decisive strategic advantage. In that case, the fact that it takes the AI a long time to achieve a decisive strategic advantage wouldn't do us much good, since we would not pick up an indication that anything was amiss during that period.

Reasons an AI might act in a desirable manner before but not after achieving a decisive strategic advantage:

Prior to achieving a decisive strategic advantage, the AI relies on cooperation with humans to achieve its goals, which provides an incentive not to act in ways that would result in it getting shut down. An AI may be capable of following these incentives well before achieving a decisive strategic advantage.

It may be easier to give an AI a goal system that aligns with human goals in familiar circumstances than it is to give it a goal system that aligns with human goals in all circumstances. An AI with such a goal system would act in ways that align with human goals if it has little optimization power but in ways that are not aligned with human goals if it has sufficiently large optimization power, and it may attain that much optimization power only after achieving a decisive strategic advantage (or before achieving a decisive strategic advantage, but after acquiring the ability to behave deceptively, as in the previous reason).

Comment author: Kaj_Sotala 10 July 2017 06:45:32PM 2 points [-]

There's a strong possibility, even in a soft takeoff, that an unaligned AI would not act in an alarming way until after it achieves a decisive strategic advantage.

That's assuming that the AI is confident that it will achieve a DSA eventually, and that no competitors will do so first. (In a soft takeoff it seems likely that there will be many AIs, thus many potential competitors.) The worse the AI thinks its chances are of eventually achieving a DSA first, the more rational it becomes for it to risk non-cooperative action at the point when it thinks it has the best chances of success - even if those chances were low. That might help reveal unaligned AIs during a soft takeoff.

Interestingly this suggests that the more AIs there are, the easier it might be to detect unaligned AIs (since every additional competitor decreases any given AI's odds of getting a DSA first), and it suggests some unintuitive containment strategies such as explicitly explaining to the AI when it would be rational for it to go uncooperative if it was unaligned, to increase the odds of unaligned AIs really risking hostile action early on and being discovered...

Comment author: Daniel_Eth 19 July 2017 03:41:40AM 0 points [-]

Or it could just assume the AI has an unbounded utility function (or bounded very highly). An AI could guess it only has a 1 in 1/B chance of reaching DSA, but that the payoff from reaching this is 100B higher than defecting early. Since there are 100B stars in the galaxy, it seems likely that in a multipolar situation with decent diversity of AIs, some would fulfill this criteria and decide to gamble.