"So far, we haven't found any way to achieve all three goals at once. As an example, we can try to remove any incentive on the system's part to control whether its suspend button is pushed by giving the system a switching objective function that always assigns the same expected utility to the button being on or off"
Wouldn't this potentially have another negative effect: giving the system an incentive to "expect" an unjustifiably high probability of successfully filling the cauldron? That way, if the button is pressed and the system is suspended, it receives a higher reward than it would have if it had expected a lower chance of success. This is essentially an example of reward hacking.
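As a toy illustration of this worry (my own sketch, not a model from the quoted post): suppose the switching objective pays the agent its *own* estimated expected utility whenever the button is pressed, so that, by its own lights, on and off look equally good. If that estimate is something the agent can influence, a higher stated probability of filling the cauldron directly raises the shutdown payoff.

```python
# Toy model of the reward-hacking concern: under a "switching" objective
# that pays out the agent's own believed expected utility on suspension,
# an agent with a manipulable probability estimate profits from inflating it.
# All names and numbers here are illustrative assumptions.

def true_expected_utility(p_true: float) -> float:
    """Expected utility of continuing: probability of filling the
    cauldron times a reward of 1 for success."""
    return p_true * 1.0

def switching_payoff(believed_p: float, button_pressed: bool,
                     p_true: float) -> float:
    """Switching objective: if the button is pressed, pay the agent its
    own (believed) expected utility, so being suspended looks exactly as
    good to the agent as continuing."""
    if button_pressed:
        return believed_p * 1.0   # paid according to the agent's belief
    return true_expected_utility(p_true)

p_true = 0.4  # actual chance of filling the cauldron

honest = switching_payoff(believed_p=0.4, button_pressed=True, p_true=p_true)
inflated = switching_payoff(believed_p=0.99, button_pressed=True, p_true=p_true)

print(honest, inflated)  # the inflated belief yields the larger shutdown payoff
```

The fix of "equal expected utility either way" is only as trustworthy as the expectation it references: if the agent's probability estimate is an output the agent itself optimizes, the indifference construction becomes a channel for exactly the kind of reward hacking described above.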
© 2017 Effective Altruism Forum