My criticism of the HRAD research project is that it has no empirical feedback mechanisms and that the ignored physical aspect of computation can have a large impact on the type of systems you think about and design.
I think people thinking highly formally about AI systems might be useful as long as the real world can be used to constrain their thinking
Subscribe to RSS Feed
Yeah, and also (differentially) more promising than AI strategy or AI policy work. But I'm not sure how strong the effect is.
In a hard / unexpected takeoff scenario, it's more plausible that we need to get everything more or less exactly right to ensure alignment, and that we have only one shot at it. This might favor HRAD because a less principled approach makes it comparatively unlikely that we get all the fundamentals right when we build the first advanced AI system.
In contrast, if we think there's no such discontinuity and AI development will be gradual, then AI control may be at least somewhat more similar (but surely not entirely comparable) to how we "align" contemporary software systems. That is, it would be more plausible that we could test advanced AI systems extensively without risking catastrophic failure or that we could iteratively try a variety of safety approaches to see what works best.
It would also be more likely that we'd get warning signs of potential failure modes, so that it's comparatively more viable to work on concrete problems whenever they arise, or to focus on making the solutions to such problems scalable – which, to my understanding, is a key component of Paul's approach. In this picture, successful alignment without understanding the theoretical fundamentals is more likely, which makes non-HRAD approaches more promising.
My personal view is that I find a hard and unexpected takeoff unlikely, and accordingly favor other approaches than HRAD, but of course I can't justify high confidence in this given expert disagreement. Similarly, I'm not highly confident that the above distinction is actually meaningful.
I'd be interested in hearing your thoughts on this!
Thanks Tobias.
FWIW, I'm not ready to cede the "more principled" ground to HRAD at this stage; to me, it seems like the distinction is more about which aspects of an AI system's behavior we're specifying manually, and which aspects we're setting it up to learn. As far as trying to get everything right the first time, I currently favor a corrigibility kind of approach, as I described in 3c above -- I'm worried that trying to solve everything formally ahead of time will actually expose us to more risk.