Thanks Nate!

The end goal is to prevent global catastrophes, but if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are "your team runs into a capabilities roadblock and can't achieve AGI" or "your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time."

This is particularly helpful to know.

We worry about "unknown unknowns", but I’d probably give them less emphasis here. We often focus on categories of failure modes that we think are easy to foresee. As a rule of thumb, when we prioritize a basic research problem, it’s because we expect it to help in a general way with understanding AGI systems and make it easier to address many different failure modes (both foreseen and unforeseen), rather than because of a one-to-one correspondence between particular basic research problems and particular failure modes.

Can you give an example or two of failure modes or "categories of failure modes that are easy to foresee" that you think are addressed by some HRAD topic? I'd thought previously that thinking in terms of failure modes wasn't a good way to understand HRAD research.

As an example, the reason we work on logical uncertainty isn’t that we’re visualizing a concrete failure that we think is highly likely to occur if developers don't understand logical uncertainty. We work on this problem because any system reasoning in a realistic way about the physical world will need to reason under both logical and empirical uncertainty, and because we expect broadly understanding how the system is reasoning about the world to be important for ensuring that the optimization processes inside the system are aligned with the intended objectives of the operators.

I'm confused by this as a follow-up to the previous paragraph. This doesn't look like an example of "focusing on categories of failure modes that are easy to foresee," it looks like a case where you're explicitly not using concrete failure modes to decide what to work on.

“how do we ensure the system’s cognitive work is being directed at solving the right problems, and at solving them in the desired way?”

I feel like this fits with the "not about concrete failure modes" narrative that I believed before reading your comment, FWIW.

So8res · 7y

Can you give an example or two of failure modes or "categories of failure modes that are easy to foresee" that you think are addressed by some HRAD topic? I'd thought previously that thinking in terms of failure modes wasn't a good way to understand HRAD research.

I want to steer clear of language that might make it sound like we’re saying:

X 'We can't make broad-strokes predictions about likely ways that AGI could go wrong.'
X 'To the extent we can make such predictions, they aren't important for informing research directions.'
X 'The best

My current thoughts on MIRI's "highly reliable agent design" work

by Daniel_Dewey

Jul 7 2017 · 23 min read

Existential risk, AI safety, Building effective altruism, AI alignment, Criticism of effective altruist organizations, Donation writeup, Criticism of work in effective altruism, Machine Intelligence Research Institute

Interpreting this writeup:

I lead the Open Philanthropy Project's work on technical AI safety research. In our MIRI grant writeup last year, we said that we had strong reservations about MIRI’s research, and that we hoped to write more about MIRI's research in the future. This writeup explains my current thinking about the subset of MIRI's research referred to as "highly reliable agent design" in the Agent Foundations Agenda. My hope is that this writeup will help move the discussion forward, but I definitely do not consider it to be any kind of final word on highly reliable agent design. I'm posting the writeup here because I think this is the most appropriate audience, and I'm looking forward to reading the comments (though I probably won't be able to respond to all of them).

After writing the first version of this writeup, I received comments from other Open Phil staff, technical advisors, and MIRI staff. Many comments were disagreements with arguments or credences stated here; some of these disagreements seem plausible to me, some comments disagree with one another, and I place significant weight on all of them because of my confidence in the commentators. Based on these comments, I think it's very likely that some aspects of this writeup will turn out to have been miscalibrated or mistaken – i.e. incorrect given the available evidence, and not just cases where I assign a reasonable credence or make a reasonable argument that may turn out to be wrong – but I'm not sure which aspects these will turn out to be.

I considered spending a lot of time heavily revising this writeup to take these comments into account. However, it seems pretty likely to me that I could continue this comment/revision process for a long time, and this process offers very limited opportunities for others outside of a small set of colleagues to engage with my views and correct me where I'm wrong. I think there's significant value in instead putting an imperfect writeup into the public record, and giving others a chance to respond in their own words to an unambiguous snapshot of my beliefs at a particular point in time.

1. What is "highly reliable agent design"?
2. What's the basic case for HRAD?
3. What do I think about HRAD?
3a. Low credence that HRAD will be applicable (25%?)
3b. HRAD has few advocates among AI researchers
3c. Other research, especially "learning to reason from humans," looks more promising than HRAD (75%?)
3d. MIRI staff are thoughtful, aligned with our values, and have a good track record
4. How much should Open Phil support HRAD work?

1. What is "highly reliable agent design"?

I understand MIRI's "highly reliable agent design" work (coined in this research agenda, "HRAD" for short) as work that aims to describe basic aspects of reasoning and decision-making in a complete, principled, and theoretically satisfying way. Here's a non-exhaustive list of research topics in this area:

Epistemology: developing a formal theory of induction that accounts for the facts that an AI system will be implemented in the physical world it is reasoning about ("naturalistic world models") and that other intelligent agents may be simulating the AI system ("benign universal prior").
Decision theory: developing a decision theory that behaves appropriately when an agent's decisions are logically entangled with other parts of the environment (e.g. in the presence of other copies of the agent, other very similar systems, or other agents that can predict the agent), and that can't be profitably threatened by other agents.
Logical uncertainty: developing a rigorous, satisfying theory of probabilistic reasoning over facts that are logical consequences of an agent's current beliefs, but that are too expensive to reason out deductively.
Vingean reflection: developing a theory of formal reasoning that allows an agent to reason with high reliability about similar agents, including agents with considerably more computational resources, without simulating those agents.

To be really satisfying, it should be possible to put these descriptions together into a full and principled description of an AI system that reasons and makes decisions in pursuit of some goal in the world, not taking into account issues of efficiency; this description might be understandable as a modified/expanded version of AIXI. Ideally this research would also yield rigorous explanations of why no other description is satisfying.
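
As a rough illustration of the "logical uncertainty" item in the list above, here is a minimal Python sketch of assigning a credence to a fact that is fully determined by logic but costly to verify: whether a given integer is prime. This is not MIRI's formalism or any published algorithm from their agenda. The function names (`prime_prior`, `prime_credence`), the `budget` parameter, and the density-based heuristic are all assumptions made purely for illustration.

```python
import math


def prime_prior(n: int) -> float:
    """Heuristic credence that n is prime before spending any computation.

    Assumption for illustration: by the prime number theorem, the density
    of primes near n is roughly 1 / ln(n).
    """
    if n < 3:
        return 1.0 if n == 2 else 0.0
    return min(1.0, 1.0 / math.log(n))


def prime_credence(n: int, budget: int) -> float:
    """Credence that n is prime after trial division by primes up to `budget`.

    Returns 0.0 or 1.0 once the question is settled deductively, and a
    heuristic value in between when the budget runs out first.
    """
    if n < 2:
        return 0.0
    survive = 1.0       # fraction of integers with no factor among the primes tried
    small_primes = []   # primes already used as trial divisors
    for d in range(2, budget + 1):
        if d * d > n:
            return 1.0  # no divisor up to sqrt(n): primality is now proven
        if any(d % p == 0 for p in small_primes):
            continue    # d is composite, so its prime factors were already tried
        small_primes.append(d)
        if n % d == 0:
            return 0.0  # found a divisor: n is certainly composite
        survive *= 1.0 - 1.0 / d
    # Budget exhausted: rescale the prior by how rare "survivors" are.
    return min(1.0, prime_prior(n) / survive)
```

With a small budget the function returns an intermediate credence (for example for `prime_credence(10007, 10)`), and with a larger budget it collapses to certainty. The sketch only shows the flavor of the problem; it says nothing about whether such credences are coherent across many related statements, which is the hard part a satisfying theory would need to address.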

2. What's the basic case for HRAD?

My understanding is that MIRI (or at least Nate and Eliezer) believe that if there is not significant progress on many problems in HRAD, the probability that an advanced AI system will cause catastrophic harm is very high. (They reserve some probability for other approaches being found that could render HRAD unnecessary, but they aren't aware of any such approaches.)

I've engaged in many conversations about why MIRI believes this, and have often had trouble coming away with crisply articulated reasons. So far, the basic case that I think is most compelling and most consistent with the majority of the conversations I've had is something like this (phrasing is mine / Holden's):

1. Advanced AI systems are going to have a huge impact on the world, and for many plausible systems, we won't be able to intervene after they become sufficiently capable.
2. If we fundamentally "don't know what we're doing" because we don't have a satisfying description of how an AI system should reason and make decisions, then we will probably make lots of mistakes in the design of an advanced AI system.
3. Even minor mistakes in an advanced AI system's design are likely to cause catastrophic misalignment.
4. Because of 1, 2, and 3, if we don't have a satisfying description of how an AI system should reason and make decisions, we're likely to make enough mistakes to cause a catastrophe. The right way to get to advanced AI that does the right thing instead of causing catastrophes is to deeply understand what we're doing, starting with a satisfying description of how an AI system should reason and make decisions.

This case does not revolve around any specific claims about specific potential failure modes, or their relationship to specific HRAD subproblems. This case revolves around the value of fundamental understanding for avoiding "unknown unknown" problems.

I also find it helpful to see this case as asserting that HRAD is one kind of "basic science" approach to understanding AI. Basic science in other areas – i.e. work based on some sense of being intuitively, fundamentally confused and unsatisfied by the lack of explanation for something – seems to have an outstanding track record of uncovering important truths that would have been hard to predict in advance, including the work of Faraday/Maxwell, Einstein, Nash, and Turing. Basic science can also provide a foundation for high-reliability engineering, e.g. by giving us a language to express guarantees about how an engineered system will perform in different circumstances or by improving an engineer's ability to design good empirical tests. Our lack of satisfying explanations for how an AI system should reason and make decisions and the importance of "knowing what we're doing" in AI make a basic science approach appealing, and HRAD is one such approach. (I don't think MIRI would say that there couldn't be other kinds of basic science that could be done in AI, but they don't know of similarly valuable-looking approaches.)

We've spent a lot of effort (100+ hours) trying to write down more detailed cases for HRAD work. This time included conversations with MIRI, conversations among Open Phil staff and technical advisors, and writing drafts of these arguments. These other cases didn't feel like they captured MIRI's views very well and were not very understandable or persuasive to me and other Open Phil staff members, so I've fallen back on this simpler case for now when thinking about HRAD work.

3. What do I think about HRAD?

I have several points of agreement with MIRI's basic case:

I agree that existing formalisms like AIXI, Solomonoff induction, and causal decision theory are unsatisfying as descriptions of how an AI system should reason and make decisions, and I agree with most (maybe all) of the ways that MIRI thinks they are unsatisfying.
I agree that advanced AI is likely to have a huge impact on the world, and that for certain advanced AI systems there will be a point after which we won't be able to intervene.
I agree that some plausible kinds of mistakes in an AI system's design would cause catastrophic misalignment.
I agree that without some kind of description of "what an advanced AI system is doing" that makes us confident that it will be aligned, we should be very worried that it will cause a catastrophe.

The fact that MIRI researchers (who are thoughtful, very dedicated to this problem, aligned with our values, and have a good track record in thinking about existential risks from AI) and some others in the effective altruism community are significantly more positive than I am about HRAD is an extremely important factor to me in favor of HRAD. These positive views significantly raise the minimum credence I'm willing to put on HRAD research being very helpful.

In addition to these positive factors, I have several reservations about HRAD work. In relation to the basic case, these reservations make me think that HRAD isn't likely to be significantly helpful for getting a confidence-generating description of how an advanced AI system reasons and makes decisions.

1. It seems pretty likely that early advanced AI systems won't be understandable in terms of HRAD's formalisms, in which case HRAD won't be useful as a description of how these systems should reason and make decisions.

Note: I'm not sure to what extent MIRI and I disagree about how likely HRAD is to be applicable to early advanced AI systems. It may be that our overall disagreement about HRAD is more about the feasibility of other AI alignment research options (see 3 below), or possibly about strategic questions outside the scope of this document (e.g. to what extent we should try to address potential risks from advanced AI through strategy, policy, and outreach rather than through technical research).

2. HRAD has gained fewer strong advocates among AI researchers than I'd expect it to if it were very promising -- including among AI researchers whom I consider highly thoughtful about the relevant issues, and whom I'd expect to be more excited if HRAD were likely to be very helpful.

Together, these two concerns give me something like a 20% credence that if HRAD work reached a high level of maturity (and relatively little other AI alignment research were done) HRAD would significantly help AI researchers build aligned AI systems around the time it becomes possible to build any advanced AI system.

3. The above considers HRAD in a vacuum, instead of comparing it to other AI alignment research options. My understanding is that MIRI thinks it is very unlikely that other AI alignment research can make up for a lack of progress in HRAD. I disagree; HRAD looks significantly less promising to me (in terms of solving object-level alignment problems, ignoring factors like field-building value) than learning to reason and make decisions from human-generated data (described more below), and HRAD seems unlikely to be helpful on the margin if reasonable amounts of other AI alignment research are done.

This reduces my credence in HRAD being very helpful to around 10%. I think this is the decision-relevant credence.

In the next few sections, I'll go into more detail about the factors I just described. Afterward, I'll say what I think this implies about how much we should support HRAD research, briefly summarizing the other factors that I think are most relevant.

3a. Low credence that HRAD will be applicable (25%?)

The basic case for HRAD being helpful depends on HRAD producing a description of how an AI system should reason and make decisions that can be productively applied to advanced AI systems. In this section, I'll describe my reasons for thinking this is not likely. (As noted above, I'm not sure to what extent MIRI and I disagree about how likely HRAD is to be applicable to early advanced AI systems; nevertheless, it's an important factor in my current beliefs about the value of HRAD work.)

I understand HRAD work as aiming to describe basic aspects of reasoning and decision-making in a complete, principled, and theoretically satisfying way, and ideally to have arguments that no other description is more satisfying. I'll refer to this as a "complete axiomatic approach," meaning that an end result of HRAD-style research on some aspect of reasoning would be a set of axioms that completely describe that aspect and that are chosen for their intrinsic desirability or for the desirability of the properties they entail. This property of HRAD work is the source of several of my reservations:

I haven't found any instances of complete axiomatic descriptions of AI systems being used to mitigate problems in those systems (e.g. to predict, postdict, explain, or fix them) or to design those systems in a way that avoids problems they'd otherwise face. AIXI and Solomonoff induction are particularly strong examples of work that is very close to HRAD, but don't seem to have been applicable to real AI systems. While I think the most likely explanation for this lack of precedent is that complete axiomatic description is not a very promising approach, it could be that not enough effort has been spent in this direction for contingent reasons; I think that attempts at this would be very informative about HRAD's expected usefulness, and seem like the most likely way that I'll increase my credence in HRAD's future applicability. (Two very accomplished machine learning researchers have told me that AIXI is a useful source of inspiration for their work; I think it's plausible that e.g. logical uncertainty could serve a similar role, but this is a much weaker case for HRAD than the one I understand MIRI as making.) If HRAD work were likely to be applicable to advanced AI systems, it seems likely to me that some complete axiomatic descriptions (or early HRAD results) should be applicable to current AI systems, especially if advanced AI systems are similar to today's.
From conversations with researchers and from my own familiarity with the literature, my understanding is that it would be extremely difficult to relate today's cutting-edge AI systems to complete axiomatic descriptions. It seems to me that very few researchers think this approach is promising relative to other kinds of theory work, and that when researchers have tried to describe modern machine learning methods in this way, their work has generally not been very successful (compared to other theoretical and experimental work) in increasing researchers' understanding of the AI systems they are developing.
It seems plausible that the kinds of axiomatic descriptions that HRAD work could produce would be too taxing to be usefully applied to any practical AI system. HRAD results would have to be applied to actual AI systems via theoretically satisfying approximation methods, and it seems plausible that this will not be possible (or that the approximation methods will not preserve most of the desirable properties entailed by the axiomatic descriptions). I haven't gathered evidence about this question.
It seems plausible that the conceptual framework and axioms chosen during HRAD work will be very different from the conceptual framework that would best describe how early advanced AI systems work. In theory, it may be possible to describe a recurrent neural network learning to predict future inputs as a particular approximation of Solomonoff induction, but in practice the differences in conceptual framework may be significant enough that this description would not actually be useful for understanding how neural networks work or how they might fail.
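
To make the reference to Solomonoff induction in the last point a little more concrete, here is a toy Python sketch of a Solomonoff-style predictor. It is only a finite stand-in: real Solomonoff induction mixes over all programs and is uncomputable, whereas this sketch assumes a tiny hypothesis class (repeating bit patterns) and an unnormalized simplicity prior of `2 ** -length`. The function name `predict_next` and the `max_len` parameter are hypothetical.

```python
from itertools import product


def predict_next(observed: str, max_len: int = 4) -> float:
    """Posterior probability that the next bit is '1'.

    Each hypothesis says "the sequence repeats this bit pattern forever".
    Shorter patterns get more prior weight (2 ** -length), and hypotheses
    that contradict the observed bits are discarded.
    """
    weight_one = 0.0    # posterior mass on hypotheses predicting '1' next
    weight_total = 0.0  # posterior mass on all surviving hypotheses
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            pattern = "".join(bits)
            # Deterministic hypothesis: likelihood is 1 if consistent, else 0.
            consistent = all(
                observed[i] == pattern[i % length] for i in range(len(observed))
            )
            if not consistent:
                continue
            prior = 2.0 ** (-length)
            weight_total += prior
            if pattern[len(observed) % length] == "1":
                weight_one += prior
    if weight_total == 0.0:
        return 0.5  # nothing in this small class fits; fall back to ignorance
    return weight_one / weight_total
```

For example, `predict_next("0101")` puts all of its surviving weight on patterns that continue with `0`. The gap between a predictor written this way and a trained recurrent network is the kind of difference in conceptual framework the point above is worried about: both predict future inputs, but nothing in the network's weights corresponds obviously to an explicit list of weighted hypotheses.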

Overall, this makes me think it's unlikely that HRAD work will apply well to advanced AI systems, especially if advanced AI is reached soon (which would make it more likely to resemble today's machine learning methods). A large portion of my credence in HRAD being applicable to advanced AI systems comes from the possibility that advanced AI systems won't look much like today's. I don't know how to gain much evidence about HRAD's applicability in this case.

3b. HRAD has few advocates among AI researchers

HRAD has gained fewer strong advocates among AI researchers than I'd expect it to if it were very promising, despite other aspects of MIRI's research (the alignment problem, value specification, corrigibility) being strongly supported by a few prominent researchers. Our review of five of MIRI's HRAD papers last year provided more detailed examples of how a small number of AI researchers (seven computer science professors, one graduate student, and our technical advisors) respond to HRAD research; these reviews made it seem to us that HRAD research has little potential to decrease potential risks from advanced AI relative to other technical work with the same goal, though we noted that this conclusion was "particularly tentative, and some of our advisors thought that versions of MIRI’s research direction could have significant value if effectively pursued".

I interpret these unfavorable reviews and lack of strong advocates as evidence that:

HRAD is less likely to be good basic science of AI; I'd expect a reasonable number of external AI researchers to recognize good basic science of AI, even if its aesthetic is fairly different from the most common aesthetics in AI research.
HRAD is less likely to be applicable to AI systems that are similar to today's; I would expect applicability to AI systems similar to today's to make HRAD research significantly more interesting to AI researchers, and our technical advisors agreed strongly that HRAD is especially unlikely to apply to AI systems that are similar to today's.

I'm frankly not sure how many strong advocates among AI researchers it would take to change my mind on these points – I think a lot would depend on details of who they were and what story they told about their interest in HRAD.

I do believe that some of this lack of interest should be explained by social dynamics and communication difficulties – MIRI is not part of the academic system, and the way MIRI researchers write about their work and motivation is very different from many academic papers, and both of these could cause mainstream AI researchers to be less interested in HRAD research than they would be if these factors weren't in play. However, I think our review process and conversations with our technical advisors each provide some evidence that this isn't likely to be sufficient to explain AI researchers' low interest in HRAD.

Reviewers' descriptions of the papers' main questions, conclusions, and intended relationship to potential risks from advanced AI generally seemed thoughtful and (as far as I can tell) accurate, and in several cases (most notably Fallenstein and Kumar 2015) some reviewers thought the work was novel and impressive; if reviewers' opinions were more determined by social and communication issues, I would expect reviews to be less accurate, less nuanced, and more broadly dismissive.

I only had enough interaction with external reviewers to be moderately confident that their opinions weren't significantly attributable to social or communication issues. I've had much more extensive, in-depth interaction with our technical advisors, and I'm significantly more confident that their views are mostly determined by their technical knowledge and research taste. I think our technical advisors are among the very best-qualified outsiders to assess MIRI's work, and that they have genuine understanding of the importance of alignment as well as being strong researchers by traditional standards. Their assessment is probably the single biggest data point for me in this section.

Outside of HRAD, some other research topics that MIRI has proposed have been the subject of much more interest from AI researchers. For example, researchers and students at CHAI have published papers on and are continuing to work on value specification and error-tolerance (particularly corrigibility), these topics have consistently seemed more promising to our technical advisors, and Stuart Russell has adopted the value alignment problem as a central theme of his work. In light of this, I am more inclined to take AI researchers' lack of interest in HRAD as evidence about its promisingness than as evidence of severe social or communication issues.

The most convincing argument I know of for not treating other researchers' lack of interest as significant evidence about the promisingness of HRAD research is:

1. I'm pretty sure that MIRI's work on decision theory is a very significant step forward for philosophical decision theory. This is based mostly on conversations with a very small number of philosophers who I know to have seriously evaluated MIRI's work, partially on an absence of good objections to their decision theory work, and a little on my own assessment of the work (which I'd discard if the first two considerations had gone the other way).
2. MIRI's decision theory work has gained significantly fewer advocates among professional philosophers than I'd expect it to if it were very promising.

I'm strongly inclined to resolve this conflict by continuing to believe that MIRI's decision theory work is good philosophy, and to explain 2 by appealing to social dynamics and communication difficulties. I think it's reasonable to consider an analogous situation with HRAD and AI researchers to be plausible a priori, but the analogue of point 1 above doesn't apply to HRAD work, and the other reasons I've given in this section lead me to think that this is not likely.

3c. Other research, especially "learning to reason from humans," looks more promising than HRAD (75%?)

How promising does HRAD look compared to other AI alignment research options? The most significant factor to me is the apparent promisingness of designing advanced AI systems to reason and make decisions from human-generated data ("learning to reason from humans"); if an approach along these lines is successful, it doesn't seem to me that much room would be left for HRAD to help on the margin. My views here are heavily based on Paul Christiano's writing on this topic, but I'm not claiming to represent his overall approach, and in particular I'm trying to sketch out a broader set of approaches that includes Paul's. It's plausible to me that other kinds of alignment research could play a similar role, but I have a much less clear picture of how that would work, and finding out about significant problems with learning to reason from humans would make me both more pessimistic about technical work on AI alignment in general and more optimistic that HRAD would be helpful. The arguments in this section are pretty loose, but the basic idea seems promising enough to me to justify high credence that something in this general area will work.

"Learning to reason from humans" is different from the most common approaches in AI today, where decision-making methods are implicitly learned in the process of approximating some function – e.g. a reward-maximizing policy, an imitative policy, a Q-function or model of the world, etc. Instead, learning to reason from humans would involve directly training a system to reason in ways that match human demonstrations or are approved of by human feedback, as in Paul's article here.

If we are able to become confident that an AI system is learning to reason in ways that meet human approval or match human demonstrations, it seems to me that we could also become confident that the AI system would be aligned overall; a very harmful decision would need to be generated by a series of human-endorsed reasoning steps (and unless human reasoning endorses a search for edge cases, edge cases won't be sought). Human endorsement of reasoning and decision-making could not only incorporate valid instrumental reasoning (in parts of epistemology and decision theory that we know how to formalize), but also rules of thumb and sanity checks that allow humans to navigate uncertainty about which epistemology and decision theory are correct, as well as human value judgements about which decisions, actions, short-term consequences, and long-term consequences are desirable, undesirable, or of uncertain value.

Another factor that is important to me here is the potential to design systems to reason and make decisions in ways that are calibrated or conservative. The idea here is that we can become more confident that AI systems will not make catastrophic decisions if they can reliably detect when they are operating in unfamiliar domains or situations, have low confidence that humans would approve of their reasoning and decisions, have low confidence in predicted consequences, or are considering actions that could cause significant harm; in those cases, we'd like AI systems to "check in" with humans more intensively and to act more conservatively. It seems likely to me that these kinds of properties would contribute significantly to alignment and safety, and that we could pursue these properties by designing systems to learn to reason and make decisions in human-approved ways, or by directly studying statistical properties like calibration or "conservativeness".
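
A minimal Python sketch of the "check in" behavior described above might look like the following. It is an illustration under stated assumptions, not a proposal from this writeup or from Paul's work: the `Proposal` fields, the `decide` function, and every threshold value are hypothetical, and the hard research problem (producing trustworthy confidence and novelty estimates in the first place) is simply assumed away.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Proposal:
    """A candidate action plus the system's own (assumed) self-assessments."""
    action: str
    approval_confidence: float     # estimated probability humans would approve
    consequence_confidence: float  # confidence in the predicted consequences
    novelty: float                 # 0.0 = familiar situation, 1.0 = very unfamiliar
    harm_estimate: float           # estimated probability of significant harm


def decide(
    p: Proposal,
    min_approval: float = 0.95,
    min_consequence: float = 0.9,
    max_novelty: float = 0.5,
    max_harm: float = 0.01,
) -> Tuple[str, List[str]]:
    """Return ("act", []) or ("check_in", reasons) for a proposed action.

    Any single triggered condition is enough to defer to a human, which
    makes the policy conservative by construction.
    """
    reasons = []
    if p.novelty > max_novelty:
        reasons.append("unfamiliar situation")
    if p.approval_confidence < min_approval:
        reasons.append("low confidence in human approval")
    if p.consequence_confidence < min_consequence:
        reasons.append("low confidence in predicted consequences")
    if p.harm_estimate > max_harm:
        reasons.append("possible significant harm")
    if reasons:
        return "check_in", reasons
    return "act", []
```

A routine, well-understood action passes straight through, while an action in an unfamiliar situation returns `"check_in"` along with the reasons. Whether the four inputs can be made reliably calibrated is exactly the open question the paragraph above points to.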

"Learning to reason and make decisions from human examples and feedback" and "learning to act 'conservatively' where 'appropriate'" don't seem to me to be many orders of magnitude more difficult than the kinds of learning tasks AI systems are good at today. If it were necessary for an AI system to imitate human judgement perfectly, I would be much more skeptical of this approach, but that doesn't seem to be necessary, as Paul argues:

"You need only the vaguest understanding of humans to guess that killing the user is: (1) not something they would approve of, (2) not something they would do, (3) not in line with their instrumental preferences.

So in order to get bad outcomes here you have to really mess up your model of what humans want (or more likely mess up the underlying framework in an important way).

If we imagine a landscape of possible interpretations of human preferences, there is a 'right' interpretation that we are shooting for. But if you start with a wrong answer that is anywhere in the neighborhood, you will do things like 'ask the user what to do, and don’t manipulate them.' And these behaviors will eventually get you where you want to go.

That is to say, the 'right' behavior is surrounded by a massive crater of 'good enough' behaviors, and in the long-term they all converge to the same place. We just need to land in the crater."

Learning to reason from humans is a good fit with today's AI research, and is broad enough that it would be very surprising to me if it were not productively applicable to early advanced AI systems.

It seems to me that this kind of approach is also much more likely to be robust to unanticipated problems than a formal, HRAD-style approach would be, since it explicitly aims to learn how to reason in human-endorsed ways instead of relying on researchers to notice and formally solve all critical problems of reasoning before the system is built. There are significant open questions about whether and how we could make machine learning robust and theoretically well-understood enough for high confidence, but it seems to me that this will be the case for any technical pathway that relies on learning about human preferences in order to act desirably.

Finally, it seems to me that if a lack of HRAD-style understanding does leave us exposed to many important "unknown unknown" problems, there is a good chance that some of those problems will be revealed by failures or difficulties in achieving alignment in earlier AI systems, and that researchers who are actively thinking about the goal of aligning advanced AI systems will be able to notice these failings and relate them to a need for better HRAD-style understanding. This kind of process seems very likely to be applicable to learning to reason from humans, but could also apply to other approaches to AI alignment. I do not think that this process is guaranteed to reveal a need for HRAD-style understanding in the case that it is needed, and I am fairly sure that some failure modes will not appear in earlier advanced AI systems (the failure modes Bostrom calls "treacherous turns", which only appear when an AI system has a large range of general-purpose capabilities, can reason very powerfully, etc.). It's possible that earlier failure modes will be too rare, too late, or not clearly enough related to a need for HRAD-style research. However, if a lack of fundamental understanding does expose us to many important "unknown unknown" failure modes, it seems more likely to me that some informative failures will happen early than that all such failures will appear only after systems are advanced enough to be extremely high-impact, and that researchers motivated by alignment of advanced AI will notice if those failures could be addressed through HRAD-style understanding. (I'm uncertain about how researchers who aren't thinking actively about alignment of advanced AI would respond, and I think one of the most valuable things we can do today is to increase the number of researchers who are thinking actively about alignment of advanced AI and are therefore more likely to respond appropriately to evidence.)

My credence for this section isn't higher for three basic reasons:

It may be significantly harder to build an aligned AI system that's much more powerful than a human if we use learned reasoning rules instead of formally specified ones. Very little work has been done on this topic.
It may be that some parts of HRAD – e.g. logical uncertainty or benign universal priors – will turn out to be necessary for reliability. This currently looks unlikely to me, but seems like the main way that parts of HRAD could turn out to be prerequisites for learning to reason from humans.
Unknown unknowns; my arguments in this section are pretty loose, and little work has been done on this topic.

3d. MIRI staff are thoughtful, aligned with our values, and have a good track record

As I noted above, I believe that MIRI staff are thoughtful, very dedicated to this problem, aligned with our values, and have a good track record in thinking about existential risk from AI. The fact that some of them are much more optimistic than I am about HRAD research is a very significant factor in favor of HRAD. I think it would be incorrect to place a very low credence (e.g. 1%) on their views being closer to the truth than mine are.

I don't think it is helpful to try to list a large amount of detail here; I'm including this as its own section in order to emphasize its importance to my reasoning. My views come from many in-person and online conversations with MIRI researchers over the past 5 years, reports of many similar conversations by other thoughtful people I trust, and a large amount of online writing about existential risk from AI spread over several sites, most notably LessWrong.com, agentfoundations.org, arbital.com, and intelligence.org.

The most straightforward thing to list is that MIRI was among the first groups to strongly articulate the case for existential risk from artificial intelligence and the need for technical and strategic research on this topic, as noted in our last writeup:

"We believe that MIRI played an important role in publicizing and sharpening the value alignment problem. This problem is described in the introduction to MIRI’s Agent Foundations technical agenda. We are aware of MIRI writing about this problem publicly and in-depth as early as 2001, at a time when we believe it received substantial attention from very few others. While MIRI was not the first to discuss potential risks from advanced artificial intelligence, we believe it was a relatively early and prominent promoter, and generally spoke at more length about specific issues such as the value alignment problem than more long-standing proponents."

4. How much should Open Phil support HRAD work?

My 10% credence that "if HRAD reached a high level of maturity it would significantly help AI researchers build aligned AI systems" doesn't fully answer the question of how much we should support HRAD work (with our funding and with our outreach to researchers) relative to other technical work on AI safety. It seems to me that the main additional factors are:

Field-building value: I expect that the majority of the value of our current funding in technical AI safety research will come from its effect of increasing the total number of people who are deeply knowledgeable about technical research on artificial intelligence and machine learning, while also being deeply versed in issues relevant to potential risks. HRAD work appears to be significantly less useful for this goal than other kinds of AI alignment work, since HRAD has not gained much support among AI researchers. (I do think that in order to be effective for field-building, AI safety research directions should be among the most promising we can think of today; this is not an argument for work on non-promising, but attractive "AI safety" research.)

Replaceability: HRAD work seems much more likely than other AI alignment work to be neglected by AI researchers and funders. If HRAD work turns out to be significantly helpful, we could make a significant counterfactual difference by supporting it.

Shovel-readiness: My understanding is that HRAD work is currently funding-constrained (i.e. MIRI could scale up its program given more funds). This is not generally true of technical AI safety work, which in my experience has also required significant staff time.

The difference in field-building value between HRAD and the other technical AI safety work we support makes me significantly more enthusiastic about supporting other technical AI safety work than about supporting HRAD. However, HRAD's low replaceability and my 10% credence in HRAD being useful make me excited to support at least some HRAD work.

In my view, enough HRAD work should be supported to continue building evidence about its chance of applicability to advanced AI, to have opportunities for other AI researchers to encounter it and become advocates, and to generally make it reasonably likely that if it is more important than it currently appears then we can learn this fact. MIRI's current size seems to me to be approximately right for this purpose, and as far as I know MIRI staff don't think MIRI is too small to continue making steady progress. Given this, I am ambivalent (along the lines of our previous grant writeup) about recommending that Good Ventures funds be used to increase MIRI's capacity for HRAD research.


So8res7y28

Thanks for this solid summary of your views, Daniel. For others’ benefit: MIRI and Open Philanthropy Project staff are in ongoing discussion about various points in this document, among other topics. Hopefully some portion of those conversations will be made public at a later date. In the meantime, a few quick public responses to some of the points above:

2) If we fundamentally "don't know what we're doing" because we don't have a satisfying description of how an AI system should reason and make decisions, then we will probably make lots of mistakes in the design of an advanced AI system.

3) Even minor mistakes in an advanced AI system's design are likely to cause catastrophic misalignment.

I think this is a decent summary of why we prioritize HRAD research. I would rephrase 3 as "There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators’ intentions.” I’d compare these mistakes to the “small” decision in the early 1970s to use null-terminated instead of length-prefixed strings in the C programming language, which continues to be a major source of software vulnerabilities decades later.
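To make the string analogy concrete, here is a minimal sketch of the two conventions, simulated in Python over a flat byte buffer rather than written in C. The function names and the buffer contents are hypothetical and chosen only for illustration. It shows how a reader that trusts a terminator can run past the intended data when the terminator is missing, while a reader that trusts a stored length stops where it was told to.

```python
def read_nul_terminated(memory: bytes, start: int) -> bytes:
    """Read from `start` up to (not including) the first zero byte."""
    end = memory.index(0, start)  # scans forward until a zero byte is found
    return memory[start:end]


def read_length_prefixed(memory: bytes, start: int) -> bytes:
    """Read a string whose first byte stores its length."""
    length = memory[start]
    return memory[start + 1 : start + 1 + length]


# Hypothetical memory layout: the writer stored "hi" and forgot the
# terminator, so unrelated data sits directly after it.
nul_memory = b"hi" + b"SECRET\x00"
prefixed_memory = bytes([2]) + b"hi" + b"SECRET\x00"

leaked = read_nul_terminated(nul_memory, 0)      # runs on into the neighbor
safe = read_length_prefixed(prefixed_memory, 0)  # stops after two bytes
```

A length prefix is not a cure-all (a wrong length is also a bug), but it moves the boundary from "wherever a zero byte happens to be" to an explicit, checkable field, which is the kind of early design decision the analogy points at.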

I’d also clarify that I expect any large software product to exhibit plenty of actually-trivial flaws, and that I don’t expect that AGI code needs to be literally bug-free or literally proven-safe in order to be worth running. Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect is not catastrophe; it’s just that the system doesn’t work. Another likely consequence is that the system is misaligned, but in an obvious way that makes it easy for developers to recognize that deployment is a very bad idea. The end goal is to prevent global catastrophes, but if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are "your team runs into a capabilities roadblock and can't achieve AGI" or "your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time."

This case does not revolve around any specific claims about specific potential failure modes, or their relationship to specific HRAD subproblems. This case revolves around the value of fundamental understanding for avoiding "unknown unknown" problems.

We worry about "unknown unknowns", but I’d probably give them less emphasis here. We often focus on categories of failure modes that we think are easy to foresee. As a rule of thumb, when we prioritize a basic research problem, it’s because we expect it to help in a general way with understanding AGI systems and make it easier to address many different failure modes (both foreseen and unforeseen), rather than because of a one-to-one correspondence between particular basic research problems and particular failure modes.

As an example, the reason we work on logical uncertainty isn’t that we’re visualizing a concrete failure that we think is highly likely to occur if developers don't understand logical uncertainty. We work on this problem because any system reasoning in a realistic way about the physical world will need to reason under both logical and empirical uncertainty, and because we expect broadly understanding how the system is reasoning about the world to be important for ensuring that the optimization processes inside the system are aligned with the intended objectives of the operators.

A big intuition behind prioritizing HRAD is that solutions to “how do we ensure the system’s cognitive work is being directed at solving the right problems, and at solving them in the desired way?” are likely to be particularly difficult to hack together from scratch late in development. An incomplete (empirical-side-only) understanding of what it means to optimize objectives in realistic environments seems like it will force designers to rely more on guesswork and trial-and-error in a lot of key design decisions.

I haven't found any instances of complete axiomatic descriptions of AI systems being used to mitigate problems in those systems (e.g. to predict, postdict, explain, or fix them) or to design those systems in a way that avoids problems they'd otherwise face.

This seems reasonable to me in general. I’d say that AIXI has had limited influence in part because it’s combining several different theoretical insights that the field was already using (e.g., complexity penalties and backtracking tree search), and the synthesis doesn’t add all that much once you know about the parts. Sections 3 and 4 of MIRI's Approach provide some clearer examples of what I have in mind by useful basic theory: Shannon, Turing, Bayes, etc.

My perspective on this is a combination of “basic theory is often necessary for knowing what the right formal tools to apply to a problem are, and for evaluating whether you're making progress toward a solution” and “the applicability of Bayes, Pearl, etc. to AI suggests that AI is the kind of problem that admits of basic theory.” An example of how this relates to HRAD is that I think that Bayesian justifications are useful in ML, and that a good formal model of rationality in the face of logical uncertainty is likely to be useful in analogous ways. When I speak of foundational understanding making it easy to design the right systems, I’m trying to point at things like the usefulness of Bayesian justifications in modern ML. (I’m unclear on whether we miscommunicated about what sort of thing I mean by “basic insights”, or whether we have a disagreement about how useful principled justifications are in modern practice when designing high-reliability systems.)

Daniel_Dewey7y5

Thanks Nate!

The end goal is to prevent global catastrophes, but if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are "your team runs into a capabilities roadblock and can't achieve AGI" or "your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time."

This is particularly helpful to know.

We worry about "unknown unknowns", but I’d probably give them less emphasis here. We often focus on categories of failure modes that we think are easy to foresee. As a rule of thumb, when we prioritize a basic research problem, it’s because we expect it to help in a general way with understanding AGI systems and make it easier to address many different failure modes (both foreseen and unforeseen), rather than because of a one-to-one correspondence between particular basic research problems and particular failure modes.

As an example, the reason we work on logical uncertainty isn’t that we’re visualizing a concrete failure that we think is highly likely to occur if developers don't understand logical uncertainty. We work on this problem because any system reasoning in a realistic way about the physical world will need to reason under both logical and empirical uncertainty, and because we expect broadly understanding how the system is reasoning about the world to be important for ensuring that the optimization processes inside the system are aligned with the intended objectives of the operators.

“how do we ensure the system’s cognitive work is being directed at solving the right problems, and at solving them in the desired way?”

I feel like this fits with the "not about concrete failure modes" narrative that I believed before reading your comment, FWIW.

So8res7y7

Can you give an example or two of failure modes or "categories of failure modes that are easy to foresee" that you think are addressed by some HRAD topic? I'd thought previously that thinking in terms of failure modes wasn't a good way to understand HRAD research.

I want to steer clear of language that might make it sound like we’re saying:

X 'We can't make broad-strokes predictions about likely ways that AGI could go wrong.'
X 'To the extent we can make such predictions, they aren't important for informing research directions.'
X 'The best way to address AGI risk is just to try to advance our understanding of AGI in a general and fairly undirected way.'

The things I do want to communicate are:

All of MIRI's research decisions are heavily informed by a background view in which there are many important categories of predictable failure, e.g., 'the system is steering toward edges of the solution space', 'the function the system is optimizing correlates with the intended function at lower capability levels but becomes uncorrelated at high capability levels', 'the system has incentives to obfuscate and mislead programmers to the extent it models its programmers’ beliefs and expects false programmer beliefs to result in it better-optimizing its objective function.'
The main case for HRAD problems is that we expect them to help in a gestalt way with many different known failure modes (and, plausibly, unknown ones). E.g., 'developing a basic understanding of counterfactual reasoning improves our ability to understand the first AGI systems in a general way, and if we understand AGI better it's likelier we can build systems to address deception, edge instantiation, goal instability, and a number of other problems'.
There usually isn't a simple relationship between a particular open problem and a particular failure mode, but if we thought there were no way to predict in advance any of the ways AGI systems can go wrong, or if we thought a very different set of failures were likely instead, we'd have different research priorities.
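One of the failure categories listed above, an optimized function that tracks the intended one at low capability and diverges at high capability, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two functions, the idea of modeling "capability" as the size of the search range, and all names. It is not a model from MIRI's work.

```python
def intended(x: float) -> float:
    """Hypothetical intended objective: improves up to x = 5, then degrades."""
    return x - 0.1 * x * x


def proxy(x: float) -> float:
    """Hypothetical proxy the system actually optimizes: more is always better."""
    return x


def optimize_proxy(capability: int) -> int:
    """Pick the proxy-maximizing action among 0..capability.

    'Capability' is modeled (as an assumption) as how far the system can search.
    """
    return max(range(capability + 1), key=proxy)


# Intended value achieved by the proxy-optimal action at each capability level.
results = {c: intended(optimize_proxy(c)) for c in (2, 5, 10, 20)}
for capability, value in results.items():
    print(f"capability={capability:>2}  intended value={value:.1f}")
```

In this toy setup, pushing harder on the proxy helps the intended objective up to a search range of 5 and hurts it beyond that, even though nothing about the proxy itself changed.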

Daniel_Dewey7y3

My perspective on this is a combination of “basic theory is often necessary for knowing what the right formal tools to apply to a problem are, and for evaluating whether you're making progress toward a solution” and “the applicability of Bayes, Pearl, etc. to AI suggests that AI is the kind of problem that admits of basic theory.” An example of how this relates to HRAD is that I think that Bayesian justifications are useful in ML, and that a good formal model of rationality in the face of logical uncertainty is likely to be useful in analogous ways. When I speak of foundational understanding making it easy to design the right systems, I’m trying to point at things like the usefulness of Bayesian justifications in modern ML. (I’m unclear on whether we miscommunicated about what sort of thing I mean by “basic insights”, or whether we have a disagreement about how useful principled justifications are in modern practice when designing high-reliability systems.)

Just planting a flag to say that I'm thinking more about this so that I can respond well.

Wei Dai7y21

3c. Other research, especially "learning to reason from humans," looks more promising than HRAD (75%?)

From the perspective of an observer who can only judge from what's published online, I'm worried that Paul's approach only looks more promising than MIRI's because it's less "mature", having received less scrutiny and criticism from others. I'm not sure what's happening internally in various research groups, but the amount of online discussion about Paul's approach has to be at least an order of magnitude less than what MIRI's approach has received.

(Looking at the thread cited by Rob Bensinger, various people including MIRI people have apparently looked into Paul's approach but have not written down their criticisms. I've been trying to better understand Paul's ideas myself and point out some difficulties that others may have overlooked, but this is hampered by the fact that Paul seems to be the only person who is working on the approach and can participate on the other side of the discussion.)

I think Paul's approach is certainly one of the most promising approaches we currently have, and I wish people paid more attention to it (and/or wrote down their thoughts about it more), but it seems much too early to cite it as an example of an approach that is more promising than HRAD and therefore makes MIRI's work less valuable.

Paul_Christiano7y12

I agree with this basic point, but I think on the other side there is a large gap in concreteness that makes it much easier to usefully criticize my approach (I'm at the stage of actually writing pseudocode and code which we can critique).

So far I think that the problems in my approach will also appear for MIRI's approach. For example:

Solomonoff induction or logical inductors have reliability problems that are analogous to reliability problems for machine learning. So to carry out MIRI's agenda either you need to formulate induction differently, or you need to somehow solve these problems. (And as far as I can tell, the most promising approaches to this problem apply both to MIRI's version and the mainstream ML version.) I think Eliezer has long understood this problem and has alluded to it, but it hasn't been the topic of much discussion (I think largely because MIRI/Eliezer have so many other problems on their plates).
Capability amplification requires breaking cognitive work down into smaller steps. MIRI's approach also requires such a breakdown. Capability amplification is easier in a simple formal sense (that if you solve the agent foundations you will definitely solve capability amplification, but not the other way around).
I've given some concrete definitions of deliberation/extrapolation, and there's been public argument about whether they really capture human values. I think CEV has avoided those criticisms not because it solves the problem, but because it is sufficiently vague that it's hard to criticize along these lines (and there are sufficiently many other problems that this one isn't even at the top of the list). If you want to actually give a satisfying definition of CEV, I feel you are probably going to have to go down the same path that started with this post. I suspect Eliezer has some ideas for how to avoid these problems, but at this point those ideas have been subject to even less public discussion than my approach.

I agree there are further problems in my agenda that will be turned up by more discussion. But I'm not sure there are fewer such problems than for the MIRI agenda, since I think that being closer to concreteness may more than outweigh the smaller amount of discussion.

If you agree that many of my problems also come up eventually for MIRI's agenda, that's good news about the general applicability of MIRI's research (e.g. the reliability problems for Solomonoff induction may provide a good bridge between MIRI's work and mainstream ML), but I think it would also be a good reason to focus on the difficulties that are common to both approaches rather than to problems like decision theory / self-reference / logical uncertainty / naturalistic agents / ontology identification / multi-level world models / etc.

Wei Dai7y5

And as far as I can tell, the most promising approaches to this problem apply both to MIRI's version and the mainstream ML version.

I'm not sure which approaches you're referring to. Can you link to some details on this?

Capability amplification requires breaking cognitive work down into smaller steps. MIRI's approach also requires such a breakdown. Capability amplification is easier in a simple formal sense (that if you solve the agent foundations you will definitely solve capability amplification, but not the other way around).

I don't understand how this is true. I can see how solving FAI implies solving capability amplification (just emulate the FAI at a low level *), but if all you had was a solution that allows a specific kind of agent (e.g., with values well-defined apart from its implementation details) to keep those values as it self-modifies, how does that help a group of short-lived humans who don't know their own values break down an arbitrary cognitive task and perform it safely and as well as an arbitrary competitor?

(* Actually, even this isn't really true. In MIRI's approach, an FAI does not need to be competitive in performance with every AI design in every domain. I think the idea is to either convert mainstream AI research into using the same FAI design, or gain a decisive strategic advantage via superiority in some set of particularly important domains.)

My understanding is that MIRI's approach is to figure out how to safely increase capability by designing a base agent that can make safe use of arbitrary amounts of computing power and can safely improve itself by modifying its own design/code. The capability amplification approach is to figure out how to safely increase capability by taking a short-lived human as the given base agent, making copies of it, and organizing how the copies work together. These seem like very different problems with their own difficulties.

I think CEV has avoided those criticisms not because it solves the problem, but because it is sufficiently vague that it's hard to criticize along these lines (and there are sufficiently many other problems that this one isn't even at the top of the list).

I agree that in this area MIRI's approach and yours face similar difficulties. People (including me) have criticized CEV for being vague and likely very difficult to define/implement though, so MIRI is not exactly getting a free pass by being vague. (I.e., I assume Daniel already took this into account.)

But I'm not sure there are fewer such problems than for the MIRI agenda, since I think that being closer to concreteness may more than outweigh the smaller amount of discussion.

This seems like a fair point, and I'm not sure how to weight these factors either. Given that discussion isn't particularly costly relative to the potential benefits, an obvious solution is just to encourage more of it. Someone ought to hold a workshop to talk about your ideas, for example.

I think it would also be a good reason to focus on the difficulties that are common to both approaches

This makes sense.

Paul_Christiano7y3

On capability amplification:

MIRI's traditional goal would allow you to break cognition down into steps that we can describe explicitly and implement on transistors, things like "perform a step of logical deduction," "adjust the probability of this hypothesis," "do a step of backwards chaining," etc. This division does not need to be competitive, but it needs to be reasonably close (close enough to obtain a decisive advantage).

Capability amplification requires breaking cognition down into steps that humans can implement. This decomposition does not need to be competitive, but it needs to be efficient enough that it can be implemented during training. Humans can obviously implement more than transistors; the main difference is that in the agent foundations case you need to figure out every response in advance (but then can have a correspondingly greater reason to think that the decomposition will work / will preserve alignment).

I can talk in more detail about the reduction from (capability amplification --> agent foundations) if it's not clear whether it is possible and it would have an effect on your view.

On competitiveness:

I would prefer to be competitive with non-aligned AI, rather than count on forming a singleton, but this isn't really a requirement of my approach. When comparing difficulty of two approaches you should presumably compare the difficulty of achieving a fixed goal with one approach or the other.

On reliability:

On the agent foundations side, it seems like plausible approaches involve figuring out how to peer inside the previously-opaque hypotheses, or understanding what characteristic of hypotheses can lead to catastrophic generalization failures and then excluding those from induction. Both of these seem likely applicable to ML models, though would depend on how exactly they play out.

On the ML side, I think the other promising approaches involve either adversarial training or ensembling / unanimous votes, which could be applied to the agent foundations problem.
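As a rough illustration of the unanimous-vote idea, here is a minimal sketch in which an action is taken only when every member of an ensemble independently approves it, and the system otherwise abstains. The threshold "models" and the function name are hypothetical stand-ins, and this is one reading of "ensembling / unanimous votes", not a description of any specific proposal.

```python
from typing import Callable, Sequence

# A "model" here is any function that maps an input to approve / reject.
Model = Callable[[float], bool]


def unanimous_approve(models: Sequence[Model], x: float) -> bool:
    """Approve only if every model in the ensemble approves; otherwise abstain."""
    return all(model(x) for model in models)


# Hypothetical ensemble: three models that agree on typical inputs
# but place their decision boundaries in slightly different spots.
models: list[Model] = [
    lambda x: x < 10,
    lambda x: x < 8,
    lambda x: x < 12,
]

typical = unanimous_approve(models, 5.0)     # all three approve
borderline = unanimous_approve(models, 9.0)  # one dissents, so abstain
```

The design trade-off is that a single dissenting model can veto an action, which lowers the chance of acting on a shared generalization failure at the cost of abstaining more often.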

Wei Dai7y3

I can talk in more detail about the reduction from (capability amplification --> agent foundations) if it's not clear whether it is possible and it would have an effect on your view.

Yeah, this is still not clear. Even if we had a solution to agent foundations, I don't see how that necessarily helps me figure out what to do as H in capability amplification. For example, the agent foundations solution could say "use (some approximation of) exhaustive search in the following way, with your utility function as the objective function", but that doesn't help me because I don't have a utility function.

When comparing difficulty of two approaches you should presumably compare the difficulty of achieving a fixed goal with one approach or the other.

My point was that HRAD potentially enables the strategy of pushing mainstream AI research away from opaque designs (which are hard to compete with while maintaining alignment, because you don't understand how they work and you can't just blindly copy the computation that they do without risking safety), whereas in your approach you always have to worry about "how do I compete with an AI that doesn't have an overseer or has an overseer who doesn't care about safety and just lets the AI use whatever opaque and potentially dangerous technique it wants".

On the agent foundations side, it seems like plausible approaches involve figuring out how to peer inside the previously-opaque hypotheses, or understanding what characteristic of hypotheses can lead to catastrophic generalization failures and then excluding those from induction.

Oh I see. In my mind the problems with Solomonoff Induction mean that it's probably not the right way to define how induction should be done as an ideal, so we should look for something kind of like Solomonoff Induction but better, not try to patch it by doing additional things on top of it. (Like instead of trying to figure out exactly when CDT would make wrong decisions and add more complexity on top of it to handle those cases, replace it with UDT.)

capybaralet7y0

My point was that HRAD potentially enables the strategy of pushing mainstream AI research away from opaque designs (which are hard to compete with while maintaining alignment, because you don't understand how they work and you can't just blindly copy the computation that they do without risking safety), whereas in your approach you always have to worry about "how do I compete with an AI that doesn't have an overseer or has an overseer who doesn't care about safety and just lets the AI use whatever opaque and potentially dangerous technique it wants".

I think both approaches potentially enable this, but are VERY unlikely to deliver. MIRI seems more bullish that fundamental insights will yield AI that is just plain better (Nate gave me the analogy of Judea Pearl coming up with Causal PGMs as such an insight), whereas Paul just seems optimistic that we can get a somewhat negligible performance hit for safe vs. unsafe AI.

But I don't think MIRI has given very good arguments for why we might expect this; it would be great if someone can articulate or reference the best available arguments.

I have a very strong intuition that dauntingly large safety-performance trade-offs are extremely likely to persist in practice, thus the only answer to the "how do I compete" question seems to be "be the front-runner".

jsteinhardt7y9

Shouldn't this cut both ways? Paul has also spent far fewer words justifying his approach to others, compared to MIRI.

Personally, I feel like I understand Paul's approach better than I understand MIRI's approach, despite having spent more time on the latter. I actually do have some objections to it, but I feel it is likely to be significantly useful even if (as I, obviously, expect) my objections end up having teeth.

Wei Dai7y10

Shouldn't this cut both ways? Paul has also spent far fewer words justifying his approach to others, compared to MIRI.

The fact that Paul hasn't had a chance to hear from many of his (would-be) critics and answer them means we don't have a lot of information about how promising his approach is, hence my "too early to call it more promising than HRAD" conclusion.

I actually do have some objections to it, but I feel it is likely to be significantly useful even if (as I, obviously, expect) my objections end up having teeth.

Have you written down these objections somewhere? My worry is basically that different people looked at Paul's approach and each thought of a different set of objections, and they think, "that's not so bad", without knowing that there's actually a whole bunch of other objections out there, including additional ones that people would find if they thought and talked about Paul's ideas more.

Daniel_Dewey7y3

I think there's something to this -- thanks.

To add onto Jacob and Paul's comments, I think that while HRAD is more mature in the sense that more work has gone into solving HRAD problems and critiquing possible solutions, the gap seems much smaller to me when it comes to the justification for thinking HRAD is promising vs justification for Paul's approach being promising. In fact, I think the arguments for Paul's work being promising are more solid than those for HRAD, despite it only being Paul making those arguments -- I've had a much harder time understanding anything more nuanced than the basic case for HRAD I gave above, and a much easier time understanding why Paul thinks his approach is promising.

Wei Dai7y2

Daniel, while re-reading one of Paul's posts from March 2016, I just noticed the following:

[ETA: By the end of 2016 this problem no longer seems like the most serious.] ... [ETA: while robust learning remains a traditional AI challenge, it is not at all clear that it is possible. And meta-execution actually seems like the ingredient furthest from existing ML practice, as well as having non-obvious feasibility.]

My interpretation of this is that between March 2016 and the end of 2016, Paul updated the difficulty of his approach upwards. (I think given the context, he means that other problems, namely robust learning and meta-execution, are harder, not that informed oversight has become easier.) I wanted to point this out to make sure you updated on his update. Clearly Paul still thinks his approach is more promising than HRAD, but perhaps not by as much as before.

Wei Dai7y1

the gap seems much smaller to me when it comes to the justification for thinking HRAD is promising vs justification for Paul's approach being promising

This seems wrong to me. For example, in the "learning to reason from humans" approaches, the goal isn't just to learn to reason from humans, but to do it in a way that maintains competitiveness with unaligned AIs. Suppose a human overseer disapproves of their AI using some set of potentially dangerous techniques; how can we then ensure that the resulting AI is still competitive? Once someone points this out, proponents of the approach, to continue thinking their approach is promising, would need to give some details about how they intend to solve this problem. Subsequently, justification for thinking the approach is promising is more subtle and harder to understand. I think conversations like this have occurred for MIRI's approach far more than Paul's, which may be a large part of why you find Paul's justifications easier to understand.

jsteinhardt7y5

This doesn't match my experience of why I find Paul's justifications easier to understand. In particular, I've been following MIRI since 2011, and my experience has been that I didn't find MIRI's arguments (about specific research directions) convincing in 2011*, and since then have had a lot of people try to convince me from a lot of different angles. I think pretty much all of the objections I have are ones I generated myself, or would have generated myself. Although, the one major objection I didn't generate myself is the one that I feel most applies to Paul's agenda.

( * There was a brief period shortly after reading the sequences that I found them extremely convincing, but I think I was much more credulous then than I am now. )

jsteinhardt7y6

I think the argument along these lines that I'm most sympathetic to is that Paul's agenda fits more into the paradigm of typical ML research, and so is more likely to fail for reasons that are in many people's collective blind spot (because we're all blinded by the same paradigm).

Wei Dai7y31

That actually didn't cross my mind before, so thanks for pointing it out. After reading your comment, I decided to look into Open Phil's recent grants to MIRI and OpenAI, and noticed that of the 4 technical advisors Open Phil used for the MIRI grant investigation (Paul Christiano, Jacob Steinhardt, Christopher Olah, and Dario Amodei), all either have an ML background or currently advocate an ML-based approach to AI alignment. For the OpenAI grant, however, Open Phil didn't seem to have similarly engaged technical advisors who might be predisposed to be critical of the potential grantee (e.g., HRAD researchers), and in fact two of the Open Phil technical advisors are also employees of OpenAI (Paul Christiano and Dario Amodei). I have to say this doesn't look very good for Open Phil in terms of making an effort to avoid potential blind spots and bias.

jsteinhardt7y14

(Speaking for myself, not OpenPhil, who I wouldn't be able to speak for anyways.)

For what it's worth, I'm pretty critical of deep learning, which is the approach OpenAI wants to take, and still think the grant to OpenAI was a pretty good idea; and I can't really think of anyone more familiar with MIRI's work than Paul who isn't already at MIRI (note that Paul started out pursuing MIRI's approach and shifted in an ML direction over time).

That being said, I agree that the public write-up on the OpenAI grant doesn't reflect that well on OpenPhil, and it seems correct for people like you to demand better moving forward (although I'm not sure that adding HRAD researchers as TAs is the solution; also note that OPP does consult regularly with MIRI staff, though I don't know if they did for the OpenAI grant).

Wei Dai7y6

I can't really think of anyone more familiar with MIRI's work than Paul who isn't already at MIRI (note that Paul started out pursuing MIRI's approach and shifted in an ML direction over time).

The Agent Foundations Forum would have been a good place to look for more people familiar with MIRI's work. Aside from Paul, I see Stuart Armstrong, Abram Demski, Vadim Kosoy, Tsvi Benson-Tilsen, Sam Eisenstat, Vladimir Slepnev, Janos Kramar, Alex Mennen, and many others. (Abram, Tsvi, and Sam have since joined MIRI, but weren't employees of it at the time of the Open Phil grant.)

That being said, I agree that the public write-up on the OpenAI grant doesn't reflect that well on OpenPhil, and it seems correct for people like you to demand better moving forward

I had previously seen some complaints about the way the OpenAI grant was made, but until your comment, hadn't thought of a possible group blind spot due to a common ML perspective. If you have any further insights on this and related issues (like why you're critical of deep learning but still think the grant to OpenAI was a pretty good idea, what are your objections to Paul's AI alignment approach, how could Open Phil have done better), would you please write them down somewhere?

Kerry_Vaughan7y20

This was the most illuminating piece on MIRI's work and on AI Safety in general that I've read in some time. Thank you for publishing it.

Ben Pace7y10

Agreed! It was nice to see the clear output of someone who had put a lot of time and effort into a good-faith understanding of the situation.

I was really happy with the layout of four key factors; this will help me have more clarity in further discussions.

Daniel_Dewey7y6

Thanks Kerry, Benito! Glad you found it helpful.

TaraMacAulay7y16

I know it's outside the scope of this writeup, but just wanted to say that I found this really helpful, and I'm looking forward to seeing an evaluation of MIRI's other research.

I'd also be really excited to see more posts about which research pathways you think are most promising in general, and how you compare work on field building, strategy and policy approaches and technical research.

Daniel_Dewey7y9

Thanks Tara! I'd like to do more writing of this kind, and I'm thinking about how to prioritize it. It's useful to hear that you'd be excited about those topics in particular.

MikeJohnson7y4

I too found this post very helpful/illuminating. I hope you can continue to do this sort of writing!

Daniel_Dewey7y3

Thanks!

Kaj_Sotala7y13

I haven't found any instances of complete axiomatic descriptions of AI systems being used to mitigate problems in those systems (e.g. to predict, postdict, explain, or fix them) or to design those systems in a way that avoids problems they'd otherwise face. [...] It seems plausible that the kinds of axiomatic descriptions that HRAD work could produce would be too taxing to be usefully applied to any practical AI system.

I wonder if a slightly analogous example could be found in the design of concurrent systems.

As you may know, it's surprisingly difficult to design software that has multiple concurrent processes manipulating the same data. You typically either screw up by letting the processes edit the same data at the same time or in the wrong order, or by having them wait for each other forever.

So to help reason more clearly about this kind of thing, people developed different forms of temporal logic that let them express in a maximally unambiguous form different desiderata that they have for the system. Temporal logic lets you express statements that say things like "if a process wants to have access to some resource, it will eventually enter a state where it has access to that resource". You can then use temporal logic to figure out how exactly you want your system to behave, in order for it to do the things you want it to do and not run into any problems.
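As a rough illustration, the quoted property is a standard "response" (liveness) property, and in linear temporal logic it is usually written along the following lines. The proposition names `wants` and `has` are made up here for the example; G reads "always" and F reads "eventually".

```latex
\[
  \mathbf{G}\,\bigl(\mathit{wants} \rightarrow \mathbf{F}\,\mathit{has}\bigr)
\]
```

Read aloud: at every point in every execution, if the process wants the resource, then at that point or some later point it has the resource.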

Building a logical model of how you want your system to behave is not the same thing as building the system. The logic only addresses one set of desiderata: there are many others it doesn't address at all, like what you want the UI to be like and how to make the system efficient in terms of memory and processor use. It's a model that you can use for a specific subset of your constraints, both for checking whether the finished system meets those constraints, and for building a system so that it's maximally easy for it to meet those constraints. Although the model is not a whole solution, having the model at hand before you start writing all the concurrency code is going to make things a lot easier for you than if you didn't have any clear idea of how you wanted the concurrent parts to work and were just winging it as you went.
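To make the "checking whether the finished system meets those constraints" step a little more concrete, here is a minimal sketch in Python. It only checks one recorded, finite execution trace against the "every request is eventually granted" property; real temporal-logic tools reason about all possible (and infinite) executions of a model, so this is an assumption-laden toy rather than how such tools work. The function name and the labels `"wants"` and `"has"` are hypothetical.

```python
def every_request_eventually_granted(trace, request="wants", grant="has"):
    """Check 'always (request -> eventually grant)' on one finite trace.

    `trace` is a list of sets; each set holds the labels that are true
    in that state. 'Eventually' includes the current state, so a state
    containing both labels satisfies its own request.
    """
    pending = False  # True while some request has not yet been granted
    for state in trace:
        if request in state:
            pending = True
        if grant in state:
            pending = False
    # The property fails on this trace only if a request is still pending
    # when the trace ends.
    return not pending


good_trace = [{"wants"}, set(), {"has"}]
bad_trace = [{"wants"}, {"has"}, {"wants"}]  # second request never granted
print(every_request_eventually_granted(good_trace))
print(every_request_eventually_granted(bad_trace))
```

Note that this says nothing about the UI, memory use, or anything else outside the one property being checked, which is the point made above: the model covers a specific subset of the constraints.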

So similarly, if MIRI developed HRAD into a sufficiently sophisticated form, it might yield a set of formal desiderata of how we want the AI to function, as well as an axiomatic model that can be applied to a part of the AI's design, to make sure everything goes as intended. But I would guess that it wouldn't really be a "complete axiomatic description" of the system, in the way that temporal logics aren't a complete axiomatic description of modern concurrent systems.

Daniel_Dewey7y3

Thanks for this suggestion, Kaj -- I think it's an interesting comparison!

Peter Wildeford7y7

If one disagreed with an HRAD-style approach for whatever reason but still wanted to donate money to maximize AI safety, where should one donate? I assume the Far Future EA Fund?

Daniel_Dewey7y5

I am very bullish on the Far Future EA Fund, and donate there myself. There's one other possible nonprofit that I'll publicize in the future if it gets to the stage where it can use donations (I don't want to hype this up as an uber-solution, just a nonprofit that I think could be promising).

I unfortunately don't spend a lot of time thinking about individual donation opportunities, and the things I think are most promising often get partly funded through Open Phil (e.g. CHAI and FHI), but I think diversifying the funding source for orgs like CHAI and FHI is valuable, so I'd consider them as well.

LawrenceC7y5

Not super relevant to Peter's question, but I would be interested in hearing why you're bullish on the Far Future EA Fund.

WillPearson7y2

On the meta side of things:

I found AI Impacts recently. There is a group I am loosely affiliated with that is trying to make a MOOC about AI safety.

If you care about doing something about immense suffering risks (s-risks), you might like the Foundational Research Institute.

There is an overview of other charities, but it is more favourable towards HRAD-style papers.

I would like to set up an organisation that studies autonomy and our response to making more autonomous things (especially with regard to administrative autonomy). I have a book slowly brewing. So if you are interested in that, get in contact.

Elityre5y5

(Eli's personal notes, mostly for his own understanding. Feel free to respond if you want.)

1. It seems pretty likely that early advanced AI systems won't be understandable in terms of HRAD's formalisms, in which case HRAD won't be useful as a description of how these systems should reason and make decisions.

My current guess is that the finalized HRAD formalisms would be general enough that they will provide meaningful insight into early advanced AI systems (even supposing that the development of those early systems is not influenced by HRAD ideas), in much the same way that Pearlian causality and Bayes nets give (a little) insight into what neural nets are doing.

Michael_Cohen7y5

MIRI's current size seems to me to be approximately right for this purpose, and as far as I know MIRI staff don't think MIRI is too small to continue making steady progress.

My guess is that this intuition is relatively inelastic to MIRI's size. It might be worth trying to generate the counterfactual intuition here if MIRI were half its size or double its size. If that process outputs a similar intuition, it might be worth attempting to forget how many people MIRI employs in this area, and ask how many people should be working on a topic that by your estimation has a 10% chance of being instrumental to an existential win. Though my number is higher than 10%, I think even if I had that estimate, my answer to the number of people that should be working on that topic would be "as many as are available."

WillPearson7y2

My criticism of the HRAD research project is that it has no empirical feedback mechanisms and that the ignored physical aspect of computation can have a large impact on the type of systems you think about and design.

I think people thinking highly formally about AI systems might be useful as long as the real world can be used to constrain their thinking.

Daniel_Dewey7y4

Thanks for these thoughts. (Your second link is broken, FYI.)

On empirical feedback: my current suspicion is that there are some problems where empirical feedback is pretty hard to get, but I actually think we could get more empirical feedback on how well HRAD can be used to diagnose and solve problems in AI systems. For example, it seems like many AI systems implicitly do some amount of logical-uncertainty-type reasoning (e.g. AlphaGo, which is really all about logical uncertainty over the result of expensive game-tree computations) -- maybe HRAD could be used to understand how those systems could fail?

I'm less convinced that the "ignored physical aspect of computation" is a very promising direction to follow, but I may not fully understand the position you're arguing for.

WillPearson7y1

Fixed, thanks.

I agree that HRAD might be useful. I read some of the stuff. I think we need a mix of theory and practice, and only when we have a community where they can feed into each other will we actually get somewhere. When an AI safety theory paper says, "Here is an experiment we can do to disprove this theory," then I will pay more attention than I do.

The "ignored physical aspect of computation" is less a direction to follow and more an argument about the type of systems that are likely to be effective, and so an argument about which ones we should study. There is no point studying how to make ineffective systems safe if the lessons don't carry over to effective ones.

You don't want a system that puts in the same computational resources trying to decide what brand of oil is best for its bearings as it does to deciding the question of what is a human or not. If you decide how much computational resources you want to put into each class of decision, you start to get into meta-decision territory. You also need to decide how much of your pool you want to put into making that meta-decision as making it will take away from making your other decisions.
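A toy sketch of that regress, in Python. Everything here is a hypothetical illustration rather than a proposal: the function name, the "stakes" weights, and especially the fixed `meta_fraction`, which simply assumes away the question of how much of the pool the meta-decision itself deserves.

```python
def allocate_budget(total, stakes, meta_fraction=0.05):
    """Split a compute budget across decision classes.

    `stakes` maps each decision class to a rough importance weight.
    A fixed fraction of the pool is spent on making the allocation
    itself (the meta-decision); the rest is split in proportion to
    the weights.
    """
    if not 0 <= meta_fraction < 1:
        raise ValueError("meta_fraction must be in [0, 1)")
    total_stakes = sum(stakes.values())
    if total_stakes <= 0:
        raise ValueError("stakes must sum to a positive number")

    meta_cost = total * meta_fraction  # spent deciding how to decide
    remaining = total - meta_cost      # left for the object-level decisions
    allocation = {
        name: remaining * weight / total_stakes
        for name, weight in stakes.items()
    }
    allocation["meta"] = meta_cost
    return allocation


print(allocate_budget(100.0, {"bearing_oil": 1.0, "is_this_a_human": 99.0},
                      meta_fraction=0.1))
```

Spending more on the meta level might produce better weights, but every unit spent there is a unit not spent on the decisions themselves, and the sketch has no principled way to choose `meta_fraction`.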

I am thinking about a possible system which can allocate resources among decision-making systems, and this can be used to align the programs (at least somewhat). It cannot align a super intelligent malign program, so work needs to be done on the initial population of programs in the system to make sure such programs do not appear. Or we need a different way of allocating resources entirely.

I don't pick this path because it is an easy path to safety, but because I think it is the only path that leads anywhere interesting/dangerous and so we need to think about how to make it safe.

capybaralet7y0

Will - I think "meta-reasoning" might capture what you mean by "meta-decision theory". Are you familiar with this research (e.g. Nick Hay did a thesis w/Stuart Russell on this topic recently)?

I agree that bounded rationality is likely to loom large, but I don't think this means MIRI is barking up the wrong tree... just that other trees also contain parts of the squirrel.

LawrenceC7y0

My suspicion is that MIRI agrees with you - if you read their job post on their software engineering internship, it seems that they're looking for people who can rapidly prototype and test AI Alignment ideas that have implications in machine learning.

Kerry_Vaughan7y2

3c. Other research, especially "learning to reason from humans," looks more promising than HRAD (75%?)

I haven't thought about this in detail, but you might think that whether the evidence in this section justifies the claim in 3c might depend, in part, on what you think the AI Safety project is trying to achieve.

On first pass, the "learning to reason from humans" project seems like it may be able to quickly and substantially reduce the chance of an AI catastrophe by introducing human guidance as a mechanism for making AI systems more conservative.

However, it doesn't seem like a project that aims to do either of the following:

(1) Reduce the risk of an AI catastrophe to zero (or near zero)
(2) Produce an AI system that can help create an optimal world

If you think either (1) or (2) are the goals of AI Safety, then you might not be excited about the "learning to reason from humans" project.

You might think that "learning to reason from humans" doesn't accomplish (1) because a) logic and mathematics seem to be the only methods we have for stating things with extremely high certainty, and b) you probably can't rule out AI catastrophes with high certainty unless you can "peer inside the machine" so to speak. HRAD might allow you to peer inside the machine and make statements about what the machine will do with extremely high certainty.

You might think that "learning to reason from humans" doesn't accomplish (2) because it makes the AI human-limited. If we want an advanced AI to help us create the kind of world that humans would want "if we knew more, thought faster, were more the people we wished we were" etc. then the approval of actual humans might, at some point, cease to be helpful.

Paul_Christiano7y10

You might think that "learning to reason from humans" doesn't accomplish (2) because it makes the AI human-limited. If we want an advanced AI to help us create the kind of world that humans would want "if we knew more, thought faster, were more the people we wished we were" etc. then the approval of actual humans might, at some point, cease to be helpful.

A human can spend an hour on a task, and train an AI to do that task in milliseconds.

Similarly, an aligned AI can spend an hour on a task, and train its successor to do that task in milliseconds.

So you could hope to have a sequence of nice AI's, each significantly smarter than the last, eventually reaching the limits of technology while still reasoning in a way that humans would endorse if they knew more and thought faster.

(This is the kind of approach I've outlined and am working on, and I think that most work along the lines of "learn from human reasoning" will make a similar move.)

RobBensinger7y9

FWIW, I don't think (1) or (2) plays a role in why MIRI researchers work on the research they do, and I don't think they play a role in why people at MIRI think "learning to reason from humans" isn't likely to be sufficient. The shape of the "HRAD is more promising than act-based agents" claim is more like what Paul Christiano said here:

As far as I can tell, the MIRI view is that my work is aimed at [a] problem which is not possible, not that it is aimed at a problem which is too easy. [...] One part of this is the disagreement about whether the overall approach I'm taking could possibly work, with my position being "something like 50-50" the MIRI position being "obviously not" [...]

There is a broader disagreement about whether any "easy" approach can work, with my position being "you should try the easy approaches extensively before trying to rally the community behind a crazy hard approach" and the MIRI position apparently being something like "we have basically ruled out the easy approaches, but the argument/evidence is really complicated and subtle."

With a clarification I made in the same thread:

I think Paul's characterization is right, except I think Nate wouldn't say "we've ruled out all the prima facie easy approaches," but rather something like "part of the disagreement here is about which approaches are prima facie 'easy.'" I think his model says that the proposed alternatives to MIRI's research directions by and large look more difficult than what MIRI's trying to do, from a naive traditional CS/Econ standpoint. E.g., I expect the average game theorist would find a utility/objective/reward-centered framework much less weird than a recursive intelligence bootstrapping framework. There are then subtle arguments for why intelligence bootstrapping might turn out to be easy, which Nate and co. are skeptical of, but hashing out the full chain of reasoning for why a daring unconventional approach just might turn out to work anyway requires some complicated extra dialoguing. Part of how this is framed depends on what problem categories get the first-pass "this looks really tricky to pull off" label.

Daniel_Dewey7y2

Thanks for linking to that conversation -- I hadn't read all of the comments on that post, and I'm glad I got linked back to it.

Daniel_Dewey7y4

I'm going to try to answer these questions, but there's some danger that I could be taken as speaking for MIRI or Paul or something, which is not the case :) With that caveat:

I'm glad Rob sketched out his reasoning on why (1) and (2) don't play a role in MIRI's thinking. That fits with my understanding of their views.

(1) You might think that "learning to reason from humans" doesn't accomplish (1) because a) logic and mathematics seem to be the only methods we have for stating things with extremely high certainty, and b) you probably can't rule out AI catastrophes with high certainty unless you can "peer inside the machine" so to speak. HRAD might allow you to peer inside the machine and make statements about what the machine will do with extremely high certainty.

My current take on this is that whatever we do, we're going to fall pretty far short of proof-strength "extremely high certainty" -- the approaches I'm familiar with, including HRAD, are after some mix of

a basic explanation of why an AI system designed a certain way should be expected to be aligned, corrigible, or some mix or other similar property
theoretical and empirical understanding that makes us think that an actual implementation follows that story robustly / reliably

HRAD makes different trade-offs than other approaches do, and it does seem to me like successfully-done HRAD would be more likely to be amenable to formal arguments that cover some parts of our confidence gap, but it doesn't look to me like "HRAD offers proof-level certainty, other approaches offer qualitatively less".

(2) Produce an AI system that can help create an optimal world... You might think that "learning to reason from humans" doesn't accomplish (2) because it makes the AI human-limited. If we want an advanced AI to help us create the kind of world that humans would want "if we knew more, thought faster, were more the people we wished we were" etc. then the approval of actual humans might, at some point, cease to be helpful.

It's true that I'm more focused on "make sure human values keep steering the future" than on the direct goal of "optimize the world"; I think that making sure human values keep steering the future is the best leverage point for creating an optimal world.

My hope is that for some decisions, actual humans (like us) would approve of "make this decision on the basis of something CEV-like -- do things we'd approve of if we knew more, thought faster, etc., where those approvals can be predicted with high confidence, don't pose super-high risk of lock-in to a suboptimal future, converge among different people, etc." If you and I think this is a good idea, it seems like an AI system trained on us could think this as well.

Another way of thinking about this is that the world is currently largely steered by human values, AI threatens to introduce another powerful steering force, and we're just making sure that that power is aligned with us at each timestep. A not-great outcome is that we end up with the world humans would have made if AI were not possible in the first place, but we don't get toward optimality very quickly; a more optimistic outcome is that the additional steering power accelerates us very significantly along the track to an optimal world, steered by human values along the way.

Kerry_Vaughan2y1

Review for the Decade Review

One of the most straightforward and useful introductions to MIRI's work that I've read.

capybaralet7y1

My main comments:

As others have mentioned: great post! Very illuminating!
I agree value-learning is the main technical problem, although I’d also note that value-learning related techniques are becoming much more popular in mainstream ML these days, and hence less neglected. Stuart Russell has argued (and I largely agree) that things like IRL will naturally become a more popular research topic (but I’ve also argued this might not be net-positive for safety: http://lesswrong.com/lw/nvc/risks_from_approximate_value_learning/)
My main comment wrt the value of HRAD (3a) is: I think HRAD-style work is more about problem definitions than solutions. So I find it to be somewhat orthogonal to the other approach of “learning to reason from humans” (L2R). We don’t have the right problem definitions, at the moment; we know that the RL framework is a leaky abstraction. I think MIRI has done the best job of identifying the problems which could result from our current leaky abstractions, and working to address them by improving our understanding of what problems need to be solved.
It’s also not clear that human reasoning can be safely amplified; the relative safety of existing humans may be due to our limited computational / statistical resources, rather than properties of our cognitive algorithms. But this argument is not as strong as it seems; see comment #3 below.

A few more comments:

1. RE 3b: I don’t really think the AI community’s response to MIRI’s work is very informative, since it’s just not on people’s radar. The problems are not well known or understood, and the techniques are (AFAIK) not very popular or in vogue (although I’ve only been in the field for 4 years, and only studied machine-learning based approaches to AI). I think decision theory was already a relatively well known topic in philosophy, so I think philosophy would naturally be more receptive to these results.
2. I’m unconvinced about the feasibility of Paul’s approach**, and share Wei Dai’s concerns about it hinging on a high level of competitiveness. But I also think HRAD suffers from the same issues of competitiveness (this does not seem to be MIRI’s view, which I’m confused by). This is why I think solving global coordination is crucial.
3. A key missing (assumed?) argument here is that L2R can be a stepping stone, e.g. providing narrow or non-superintelligent AI capabilities which can be applied to AIS problems (e.g. making much more progress on HRAD than MIRI). To me this is a key argument for L2R over HRAD, and generally a source of optimism. I’m curious if this argument plays a significant role in your thought; in other words, is it that HRAD problems don’t need to be solved, or just that the most effective solution path goes through L2R? I’m also curious about the counter-argument for pursuing HRAD now: i.e. what role does MIRI anticipate safe advanced (but not general / superhuman) intelligent systems to play in HRAD?
4. An argument for more funding for MIRI which isn’t addressed is the apparent abundance of wealth at the disposal of Good Ventures. Since funding opportunities are generally scarce in AI Safety, I think every decent opportunity should be aggressively pursued. There are 3 plausible arguments I can see for the low amount of funding to MIRI: 1) concern of steering other researchers in unproductive directions 2) concern about bad PR 3) internal politics.
5. Am I correct that there is a focus on shorter timelines (e.g. <20 years)?

Briefly, my overall perspective on the future of AI and safety relevance is:

There ARE fundamental insights missing, but they are unlikely to be key to building highly capable OR safe AI.
Fundamental insights might be crucial for achieving high confidence in a putatively safe AI (but perhaps not for developing an AI which is actually safe).
The HRAD line of research is likely to uncover mostly negative results (à la AIXI’s arbitrary dependence on prior).
Theory is behind empiricism, and the gap is likely to grow; this is the main reason I’m a bit pessimistic about theory being useful. On the other hand, I think most paths to victory involve using capability-control for as long as possible while transitioning to completely motivation-control based approaches, so conditioning on victory, it seems more likely that we solve more fundamental problems (i.e. “we have to solve these problems eventually”).

** the two main reasons are: 1) I don’t think it will be competitive and 2) I suspect it will be difficult to prevent compounding errors in a bootstrapping process that yields superintelligent agents.

JesseClifton7y1

Great piece, thank you.

Regarding "learning to reason from humans", to what extent do you think having good models of human preferences is a prerequisite for powerful (and dangerous) general intelligence?

Of course, the motivation to act on human preferences is another matter - but I wonder if at least the capability comes by default?

Daniel_Dewey7y0

My guess is that the capability is extremely likely, and the main difficulties are motivation and reliability of learning (since in other learning tasks we might be satisfied with lower reliability that gets better over time, but in learning human preferences unreliable learning could result in a lot more harm).

WillPearson7y0

My own 2 cents. It depends a bit what form of general intelligence is made first. There are at least two possible models.

1. Super intelligent agent with a specified goal
2. External brain lobe

With the first, you need to be able to specify human preferences in the form of a goal, which enables it to pick the right actions.

The external brain lobe would start not very powerful and not come with any explicit goals but would be hooked into the human motivational system and develop goals shaped by human preferences.

HRAD is explicitly about the first. I would like both to be explored.

JesseClifton7y0

Right, I'm asking how useful or dangerous your (1) could be if it didn't have very good models of human psychology - and therefore didn't understand things like "humans don't want to be killed".

Tobias_Baumann7y1

Great post! I agree with your overall assessment that other approaches may be more promising than HRAD.

I'd like to add that this may (in part) depend on our outlook on which AI scenarios are likely. Conditional on MIRI's view that a hard or unexpected takeoff is likely, HRAD may be more promising (though it's still unclear). If the takeoff is soft or AI will be more like the economy, then I personally think HRAD is unlikely to be the best way to shape advanced AI.

(I wrote a related piece on strategic implications of AI scenarios.)

Daniel_Dewey7y2

Thanks!

Conditional on MIRI's view that a hard or unexpected takeoff is likely, HRAD is more promising (though it's still unclear).

Do you mean more promising than other technical safety research (e.g. concrete problems, Paul's directions, MIRI's non-HRAD research)? If so, I'd be interested in hearing why you think hard / unexpected takeoff differentially favors HRAD.

Tobias_Baumann7y1

Do you mean more promising than other technical safety research (e.g. concrete problems, Paul's directions, MIRI's non-HRAD research)?

Yeah, and also (differentially) more promising than AI strategy or AI policy work. But I'm not sure how strong the effect is.

If so, I'd be interested in hearing why you think hard / unexpected takeoff differentially favors HRAD.

In a hard / unexpected takeoff scenario, it's more plausible that we need to get everything more or less exactly right to ensure alignment, and that we have only one shot at it. This might favor HRAD because a less principled approach makes it comparatively unlikely that we get all the fundamentals right when we build the first advanced AI system.

In contrast, if we think there's no such discontinuity and AI development will be gradual, then AI control may be at least somewhat more similar (but surely not entirely comparable) to how we "align" contemporary software systems. That is, it would be more plausible that we could test advanced AI systems extensively without risking catastrophic failure or that we could iteratively try a variety of safety approaches to see what works best.

It would also be more likely that we'd get warning signs of potential failure modes, so that it's comparatively more viable to work on concrete problems whenever they arise, or to focus on making the solutions to such problems scalable – which, to my understanding, is a key component of Paul's approach. In this picture, successful alignment without understanding the theoretical fundamentals is more likely, which makes non-HRAD approaches more promising.

My personal view is that a hard and unexpected takeoff is unlikely, so I favor approaches other than HRAD, but of course I can't justify high confidence in this given expert disagreement. Similarly, I'm not highly confident that the above distinction is actually meaningful.

I'd be interested in hearing your thoughts on this!

AlexMennen7y2

There's a strong possibility, even in a soft takeoff, that an unaligned AI would not act in an alarming way until after it achieves a decisive strategic advantage. In that case, the fact that it takes the AI a long time to achieve a decisive strategic advantage wouldn't do us much good, since we would not pick up an indication that anything was amiss during that period.

Reasons an AI might act in a desirable manner before but not after achieving a decisive strategic advantage:

1. Prior to achieving a decisive strategic advantage, the AI relies on cooperation with humans to achieve its goals, which provides an incentive not to act in ways that would result in it getting shut down. An AI may be capable of following these incentives well before achieving a decisive strategic advantage.

2. It may be easier to give an AI a goal system that aligns with human goals in familiar circumstances than it is to give it a goal system that aligns with human goals in all circumstances. An AI with such a goal system would act in ways that align with human goals if it has little optimization power but in ways that are not aligned with human goals if it has sufficiently large optimization power, and it may attain that much optimization power only after achieving a decisive strategic advantage (or before achieving a decisive strategic advantage, but after acquiring the ability to behave deceptively, as in the previous reason).

Kaj_Sotala7y4

There's a strong possibility, even in a soft takeoff, that an unaligned AI would not act in an alarming way until after it achieves a decisive strategic advantage.

That's assuming that the AI is confident that it will achieve a DSA eventually, and that no competitors will do so first. (In a soft takeoff it seems likely that there will be many AIs, thus many potential competitors.) The worse the AI thinks its chances are of eventually achieving a DSA first, the more rational it becomes for it to risk non-cooperative action at the point when it thinks it has the best chances of success, even if those chances are low. That might help reveal unaligned AIs during a soft takeoff.

Interestingly, this suggests that the more AIs there are, the easier it might be to detect unaligned AIs (since every additional competitor decreases any given AI's odds of getting a DSA first). It also suggests some unintuitive containment strategies, such as explicitly explaining to the AI when it would be rational for it to go uncooperative if it were unaligned, to increase the odds of unaligned AIs really risking hostile action early on and being discovered...
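To make the competitor argument concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration and not something claimed in the comment: the AIs are symmetric, each has a 1/n chance of reaching a DSA first if it waits, success is worth the same either way, failure is worth nothing, and the 5% early-success figure is hypothetical.

```python
def prefers_early_defection(n_ais: int, p_early_success: float) -> bool:
    """Toy decision rule (illustrative assumption, not from the thread).

    Waiting succeeds with probability 1 / n_ais (symmetric competitors);
    defecting early succeeds with probability p_early_success. Success is
    worth the same in both cases and failure is worth zero, so the AI just
    compares the two probabilities.
    """
    p_dsa_first = 1.0 / n_ais
    return p_early_success > p_dsa_first


# Hypothetical figure: an early hostile move succeeds 5% of the time.
P_EARLY = 0.05

# Smallest number of competing AIs at which early defection becomes preferable.
threshold = next(n for n in range(1, 1000) if prefers_early_defection(n, P_EARLY))
print(threshold)
```

Under these assumed numbers the preference flips once there are more than 20 AIs, which is the sense in which additional competitors could push unaligned AIs toward revealing themselves earlier. A less symmetric model could easily give a different picture.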

Daniel_Eth7y0

Or we could just assume the AI has an unbounded utility function (or one bounded very highly). An AI could guess it has only a 1 in 1B chance of reaching a DSA, but that the payoff from reaching it is 100B times higher than from defecting early. Since there are 100B stars in the galaxy, it seems likely that in a multipolar situation with a decent diversity of AIs, some would meet this criterion and decide to gamble.
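The arithmetic behind this can be sketched as a simple expected-value comparison. This reads the comment's figures as a one-in-a-billion chance of reaching a DSA and a payoff 100 billion times that of defecting early (an interpretation of the "B" shorthand), and it additionally assumes that a failed gamble is worth zero, that early defection pays off with certainty, and that utility is linear in payoff. None of those extra assumptions come from the comment.

```python
# Figures as read from the comment above ("B" taken to mean billion).
P_DSA = 1e-9          # assumed chance of reaching a DSA by waiting: 1 in 1B
PAYOFF_RATIO = 100e9  # assumed DSA payoff relative to defecting early: 100B times


def expected_value_of_waiting(p_dsa: float, payoff_ratio: float) -> float:
    """Expected payoff of gambling on a DSA, in units of the early-defection payoff.

    Assumes (for illustration only) that failing to reach a DSA is worth zero
    and that utility is linear in payoff, i.e. effectively unbounded.
    """
    return p_dsa * payoff_ratio


ev_wait = expected_value_of_waiting(P_DSA, PAYOFF_RATIO)
ev_defect = 1.0  # normalised payoff of defecting early, assumed certain

print(f"wait: {ev_wait:.1f}, defect early: {ev_defect:.1f}")
```

With these numbers, waiting has an expected value of about 100 in units of the early-defection payoff, so an expected-utility maximiser with linear utility would take the gamble despite the tiny odds. A utility function bounded below 100B times the early payoff would shrink that gap.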

Daniel_Dewey7y1

Thanks Tobias.

In a hard / unexpected takeoff scenario, it's more plausible that we need to get everything more or less exactly right to ensure alignment, and that we have only one shot at it. This might favor HRAD because a less principled approach makes it comparatively unlikely that we get all the fundamentals right when we build the first advanced AI system.

FWIW, I'm not ready to cede the "more principled" ground to HRAD at this stage; to me, it seems like the distinction is more about which aspects of an AI system's behavior we're specifying manually, and which aspects we're setting it up to learn. As far as trying to get everything right the first time, I currently favor a corrigibility kind of approach, as I described in 3c above -- I'm worried that trying to solve everything formally ahead of time will actually expose us to more risk.

capybaralet7y0

It's cross-posted on LW: http://lesswrong.com/lw/p85/daniel_dewey_on_miris_highly_reliable_agent/