To clarify: I don't think it will be especially fruitful to try to ensure AIs are conscious, for the reason you mention: multipolar scenarios don't really work that way, what will happen is determined by what's efficient in a competitive world, which doesn't allow much room to make changes now that will actually persist.
And yes, if a singleton is inevitable, then our only hope for a good future is to do our best to align the singleton, so that it uses its uncontested power to do good things rather than just to pursue whatever nonsense goal it will have been given otherwise.
What I'm concerned about is the possibility that a singleton is not inevitable (which seems to me the most likely scenario) but that folks attempt to create one anyway. This includes realities where a singleton is impossible or close to it, as well as where a singleton is possible but only with some effort made to push towards that outcome. An example of the latter would just be a soft takeoff coupled with an attempt at forming a world government to control the AI - such a scenario certainly seems to me like it could fit the "possible but not inevitable" description.
A world takeover attempt has the potential to go very, very wrong - and then there's the serious possibility that the creation of the singleton would be successful but the alignment of it would not. Given this, I don't think it makes sense to push unequivocally for this option, with the enormous risks it entails, until we have a good idea of what the alternative looks like. That we can't control that alternative is irrelevant - we can still understand it! When we have a reasonable picture of that scenario, then we can start to think about whether it's so bad that we should embark on dangerous risky strategies to try to avoid it.
One element of that understanding would be on how likely AIs are to be conscious; another would be how good or bad a life conscious AIs would have in a multipolar scenario. I agree entirely that we don't know this yet - whether for rabbits or for future AIs - that's part of what I'd need to understand before I'd agree that a singleton seems like our best chance at a good future.
Subscribe to RSS Feed
I haven't thought about this in detail, but you might think that whether the evidence in this section justifies the claim in 3c might depend, in part, on what you think the AI Safety project is trying to achieve.
On first pass, the "learning to reason from humans" project seems like it may be able to quickly and substantially reduce the chance of an AI catastrophe by introducing human guidance as a mechanism for making AI systems more conservative.
However, it doesn't seem like a project that aims to do either of the following:
(1) Reduce the risk of an AI catastrophe to zero (or near zero) (2) Produce an AI system that can help create an optimal world
If you think either (1) or (2) are the goals of AI Safety, then you might not be excited about the "learning to reason from humans" project.
You might think that "learning to reason from humans" doesn't accomplish (1) because a) logic and mathematics seem to be the only methods we have for stating things with extremely high certainty, and b) you probably can't rule out AI catastrophes with high certainty unless you can "peer inside the machine" so to speak. HRAD might allow you to peer inside the machine and make statements about what the machine will do with extremely high certainty.
You might think that "learning to reason from humans" doesn't accomplish (2) because it makes the AI human-limited. If we want an advanced AI to help us create the kind of world that humans would want "if we knew more, thought faster, were more the people we wished we were" etc. then the approval of actual humans might, at some point, cease to be helpful.
FWIW, I don't think (1) or (2) plays a role in why MIRI researchers work on the research they do, and I don't think they play a role in why people at MIRI think "learning to reason from humans" isn't likely to be sufficient. The shape of the "HRAD is more promising than act-based agents" claim is more like what Paul Christiano said here:
With a clarification I made in the same thread: