A challenge for AGI organizations, and a challenge for readers

RobBensinger

(Note: This post is a write-up by Rob of a point Eliezer wanted to broadcast. Nate helped with the editing, and endorses the post’s main points.)

Eliezer Yudkowsky and Nate Soares (my co-workers) want to broadcast strong support for OpenAI’s recent decision to release a blog post ("Our approach to alignment research") that states their current plan as an organization.

Although Eliezer and Nate disagree with OpenAI's proposed approach — a variant of "use relatively unaligned AI to align AI" — they view it as very important that OpenAI has a plan and has said what it is.

We want to challenge Anthropic and DeepMind, the other major AGI organizations with a stated concern for existential risk, to do the same: come up with a plan (possibly a branching one, if there are crucial uncertainties you expect to resolve later), write it up in some form, and publicly announce that plan (with sensitive parts fuzzed out) as the organization's current alignment plan.

Currently, Eliezer’s impression is that neither Anthropic nor DeepMind has a secret plan that's better than OpenAI's, nor a secret plan that's worse than OpenAI's. His impression is that they don't have a plan at all.^[1]

Having a plan is critically important for an AGI project, not because anyone should expect everything to play out as planned, but because plans force the project to concretely state their crucial assumptions in one place. This provides an opportunity to notice and address inconsistencies, and to notice updates to the plan (and fully propagate those updates to downstream beliefs, strategies, and policies) as new information comes in.

It's also healthy for the field to be able to debate plans and think about the big picture, and for orgs to be in some sense "competing" to have the most sane and reasonable plan.

We acknowledge that there are reasons organizations might want to be abstract about some steps in their plans — e.g., to avoid immunizing people to good-but-weird ideas, in a public document where it’s hard to fully explain and justify a chain of reasoning; or to avoid sharing capabilities insights, if parts of your plan depend on your inside-view model of how AGI works.

We’d be happy to see plans that fuzz out some details, but are still much more concrete than (e.g.) “figure out how to build AGI and expect this to go well because we'll be particularly conscientious about safety once we have an AGI in front of us”.

Eliezer also hereby gives a challenge to the reader: Eliezer and Nate are thinking about writing up their thoughts at some point about OpenAI's plan of using AI to aid AI alignment. We want you to write up your own unanchored thoughts on the OpenAI plan first, focusing on the most important and decision-relevant factors, with the intent of rendering our posting on this topic superfluous.

Our hope is that challenges like this will test how superfluous we are, and also move the world toward a state where we’re more superfluous / there’s more redundancy in the field when it comes to generating ideas and critiques that would be lethal for the world to never notice.^[2]^[3]

^[1] We didn't run a draft of this post by DM or Anthropic (or OpenAI), so this information may be mistaken or out-of-date. My hope is that we’re completely wrong!
Nate’s personal guess is that the situation at DM and Anthropic may be less “yep, we have no plan yet”, and more “various individuals have different plans or pieces-of-plans, but the organization itself hasn’t agreed on a plan and there’s a lot of disagreement about what the best approach is”.
In which case Nate expects it to be very useful to pick a plan now (possibly with some conditional paths in it), and make it a priority to hash out and document core strategic disagreements now rather than later.
^[2] Nate adds: “This is a chance to show that you totally would have seen the issues yourselves, and thereby deprive MIRI folk of the annoying ‘y'all'd be dead if not for MIRI folk constantly pointing out additional flaws in your plans’ card!”
^[3] Eliezer adds: "For this reason, please note explicitly if you're saying things that you heard from a MIRI person at a gathering, or the like."


Comments (13)


Peter Wildeford · 1y · 15

> We didn't run a draft of this post by DM or Anthropic (or OpenAI), so this information may be mistaken or out-of-date. My hope is that we’re completely wrong!

Why not run a draft of the post by them? Not sure what you had to lose there and seems like it could've been better (both from a politeness/cooperativeness perspective and from a tactical perspective) to have done so.

Habryka · 1y · 57

I continue to object to a norm of running posts by organizations that the posts are talking about. From many interviews with posters to LW and the EA Forum over the years I know that the chilling effects would be massive, and this norm has already multiple times prevented important things from being said, because it doubled or tripled the cost of publishing things that talk about organizations.

RobBensinger · 1y · 8

Yeah, I agree with this too. I don't think MIRI staff are scared to poke DM about things, but I like taking opportunities to make it clear "it's OK to talk about MIRI, DM, etc. without checking in with us privately first", because I expect that a lot of people with good thoughts and questions will get stuck on scenarios like 'intimidated by the idea of shooting MIRI an email', 'doesn't know who to contact at MIRI', 'doesn't want to deal with the hassle of an email back-and-forth', etc.

I think it's good to have 'send drafts to the org in advance' as an option that feels available to you. I just don't want it to feel like a requirement.

(It also seems fine to me to send posts about MIRI to me after posting them. This makes it less likely that I just don't notice the post exists, and gives me a chance to respond while the post is fresh and people are paying attention to it, while reducing the risk that good thoughts just never get posted.)

RobBensinger · 1y · 25

If folks at DM/Anthropic/OpenAI ask us to run this kind of thing by them in advance, I assume we'll be happy to do so; we've sent them many other drafts of things before, and I expect we'll send them many more in the future.

I do like the idea of MIRI staff regularly or semi-regularly sharing our thoughts about things without running them by a bunch of people -- e.g., to encourage more of the conversation, pushback, etc. to happen in public, so information doesn't end up all bottled up in a few brains on a private email thread.

I think there are many cases where it's actively better for EAs to screw up in public and be corrected in the comments, rather than working out all disagreements and info-asymmetries in private channels and then putting out an immaculate, smoothed-over final product. (Especially if the post is transparent about this, so we have more-polished and less-polished stuff and it's pretty clear which is which.)

Screwing up in public has real costs (relative to the original essay Just Being Correct about everything), but hiding all the cognitive work that goes into consensus-building and airing of disagreements has real costs too.

This is not me coming out against running drafts by people in general; it's great tech, and we should use it. I just think there are subtle advantages to "just say what's on your mind and have a back-and-forth with people who disagree" that are worth keeping in view too.

Part of it is a certain attitude that I want to encourage more in EA, that I'm not sure how to put into words, but is something like: tip-toeing less; blurting more; being bolder, and proactively doing things-that-seem-good-to-you-personally rather than waiting for elite permission/encouragement/management; trying less to look perfect, and more to do the epistemically cooperative thing "wear your exact strengths and weaknesses on your sleeve so others can model you well"; etc.

All of that is compatible with running drafts by folks, but I think it can be valuable for more EAs to visibly be more relaxed (on the current margin) about stuff like draft-sharing, to contribute to a social environment where people feel chiller about making public mistakes, stating their current impressions and updating them in real time, etc. I don't think we want maximum chillness, but I think we want EA's best and brightest to be more chill on the current margin.

David Johnston · 1y · 6

I don’t think this makes sense. Your group, in the EA community, regarding AI safety, gets taken seriously whatever you write. This is not the paradigmatic example of someone who feels worried about making public mistakes. A community that gives you even more leeway to do sloppy work is not one that encourages more people to share their independent thoughts about the problem. In fact, I think the reverse is true: when your criticisms carry a lot of weight even when they’re flawed, this has a stifling effect on people in more marginal positions who disagree with you.

If you want to promote more open discussion, your time would be far better spent seeking out flawed but promising work by lesser known individuals and pointing out what you think is valuable in it.

Am I correct in my belief that you are paid to do this work? If this is so, then I think the fact that you are both highly regarded and compensated for your time means your output should meet higher standards than a typical community post. Contacting the relevant labs is a step that wouldn’t take you much time, can’t be done by the vast majority of readers, and has a decent chance of adding substantial value. I think you should have done it.

RobBensinger · 1y · 5

What sort of substantial value would you expect to be added? It sounds like we either have a different belief about the value-add, or a different belief about the costs. Maybe if you sketched 2-3 scenarios that strike you as a relatively likely way for this particular post to have benefited from private conversations, I'd know better what the shape of our disagreement is.

If your objection is less "this particular post would benefit" and more "every post that discusses an AGI org should run a draft by that org (at least if you're doing EA work full-time)", then I'd respond that stuff like "EAs candidly arguing about things back and forth in the comments of a post", the 80K Podcast, and unredacted EA chat logs are extremely valuable contributions to EA discourse, and I think we should do far, far more things like that on the current margin.

Writing full blog posts that are likewise "real" and likewise "part of a genuine public dialogue" can be valuable in much the same way; and some candid thoughts are a better fit for this format than for other formats, since some candid thoughts are more complicated, etc.

It's also important that intellectual progress like "long unedited chat logs" gets distilled and turned into relatively short, polished, and stable summaries; and it's also important that people feel free to talk in private. But having some big chunks of the intellectual process be out in public is excellent for a variety of reasons. Indeed, I'd say that there's more value overall in seeing EAs' actual cognitive processes than in seeing EAs' ultimate conclusions, when it comes to the domains that are most uncertain and disagreement-heavy (which include a lot of the most important domains for EAs to focus on today, in my view).

> This is not the paradigmatic example of someone who feels worried about making public mistakes. A community that gives you even more leeway to do sloppy work is not one that encourages more people to share their independent thoughts about the problem.

I don't think that sharing in-process snapshots of your views is "sloppy", in the sense of representing worse epistemic standards than a not-in-process Finished Product.

E.g., I wouldn't say that a conversation on the 80K Podcast is more epistemically sloppy than a summary of people's take-aways from the conversation. I think the opposite is often true, and people's in-process conversations often reflect higher epistemic standards than their attempts to summarize and distill everything after-the-fact.

In EA, being good at in-process, uncertain, changing, under-debate reasoning is more the thing I want to lead by example on. I think that hiding process is often setting a bad example for EAs, and making it harder for them to figure out what's true.

I agree that I'm not a paradigmatic example of the EAs who most need to hear this lesson; but I think non-established EAs heavily follow the example set by established EAs, so I want to set an example that's closer to what I actually want to see more of.

> In fact, I think the reverse is true: when your criticisms carry a lot of weight even when they’re flawed, this has a stifling effect on people in more marginal positions who disagree with you.

If my reasoning process is actually flawed, then I want other EAs to be aware of that, so they can have an accurate model of how much weight to put on my views.

If established EAs in general have such flawed reasoning processes (or such false beliefs) that rank-and-file EAs would be outraged and give up on the EA community if they knew this fact, then we should want to outrage rank-and-file EAs, in the hope that they'll start something else that's new and better. EA shouldn't pretend to be better than it is; this causes way too many dysfunctions, even given that we're unusually good in a lot of ways.

(But possibly we agree about all that, and the crux here is just that you think sharing rougher or more uncertain thoughts is an epistemically bad practice, and I think it's an epistemically good practice. So you see yourself as calling for higher standards, and I see you as calling for standards that are actually lower but happen to look more respectable.)

> If you want to promote more open discussion, your time would be far better spent seeking out flawed but promising work by lesser known individuals and pointing out what you think is valuable in it.

That seems like a great idea to me too! I'd advocate for doing this along with the things I proposed above.

> Contacting the relevant labs is a step that wouldn’t take you much time, can’t be done by the vast majority of readers

Is that actually true? Seems maybe true, but I also wouldn't be surprised if >50% of regular EA Forum commenters can get substantive replies pretty regularly from knowledgeable DeepMind, OpenAI, and Anthropic staff, if they try sending a few emails.

David Johnston · 1y · 11

> What sort of substantial value would you expect to be added? It sounds like we either have a different belief about the value-add, or a different belief about the costs.

I'd be very surprised if the actual amount of big-picture strategic thinking at either organisation was "very little". I'd be less surprised if they didn't have a consensus view about big-picture strategy, or a clearly written document spelling it out. If I'm right, I think the current content is misleading-ish. If I'm wrong and actually little thinking has been done - there's some chance they say "we're focused on identifying and tackling near-term problems", which would be interesting to me given what I currently believe. If I'm wrong and something clear has been written, then making this visible (or pointing out its existence) would also be a useful update for me.

Polished vs sloppy

Here are some dimensions I think of as distinguishing sloppy from polished:

- Vague hunches <-> precise theories
- First impressions <-> thorough search for evidence/prior work
- Hard <-> easy to understand
- Vulgar <-> polite
- Unclear <-> clear account of robustness, pitfalls and so forth

All else equal, I don't think the left side is epistemically superior. It can be faster, and that might be worth it, but there are obvious epistemic costs to relying on vague hunches, first impressions, failures of communication and overlooked pitfalls (politeness is perhaps neutral here). I think these costs are particularly high in, as you say, domains that are uncertain and disagreement-heavy.

I think it is sloppy to stay too close to the left if you think the issue is important and you have time to address it properly. You have to manage your time, but I don't think there are additional reasons to promote sloppy work.

You say that there are epistemic advantages to exposing thought processes, and you give the example of dialogues. I agree there are pedagogical advantages to exposing thought processes, but exposing thoughts clearly also requires polish, and I don't think pedagogy is a high priority most of the time. I'd be way more excited to see more theory from MIRI than more dialogues.

> If my reasoning process is actually flawed, then I want other EAs to be aware of that, so they can have an accurate model of how much weight to put on my views.

I don't think it's realistic to expect Lightcone forums to do serious reviews of difficult work. That takes a lot of individual time and dedication; maybe you occasionally get lucky, but you should mostly expect not to.

> I agree that I'm not a paradigmatic example of the EAs who most need to hear this lesson [of exposing the thought process]; but I think non-established EAs heavily follow the example set by established EAs, so I want to set an example that's closer to what I actually want to see more of

Maybe I'll get into this more deeply one day, but I just don't think sharing your thoughts freely is a particularly effective way to encourage other people to share theirs. I think you've been pretty successful at getting the "don't worry about being polite to OpenAI" message across, less so the higher level stuff.

RobBensinger · 1y · 4

I agree with a lot of what you say! I still want to move EA in the direction of "people just say what's on their mind on the EA Forum, without trying to dot every i and cross every t; and then others say what's on their mind in response; and we have an actual back-and-forth that isn't carefully choreographed or extremely polished, but is more like a real conversation between peers at an academic conference".

(Another way to achieve many of the same goals is to encourage more EAs who disagree with each other to regularly talk to each other in private, where candor is easier. But this scales a lot more poorly, so it would be nice if some real conversation were happening in public.)

A lot of my micro-decisions in making posts like this are connected to my model of "what kind of culture and norms are likely to result in EA solving the alignment problem (or making a lot of progress)?", since I think that's the likeliest way that EA could make a big positive difference for the future. In that context, I think building conversations around heavily polished, "final" (rather than in-process) cognition tends to be insufficient for fast and reliable intellectual progress:

- Highly polished content tends to obscure the real reasons and causes behind people's views, in favor of reasons that are more legible, respectable, impressive, etc. (See Beware defensibility.)
  - AGI alignment is a pre-paradigmatic proto-field where making good decisions will probably depend heavily on people having good technical intuitions, intuiting patterns before they know how to verbalize those patterns, and generally becoming adept at noticing what their gut says about a topic and putting their gut in contact with useful feedback loops so it can update and learn.
  - In that context, I'm pretty worried about an EA where everyone is hyper-cautious about saying anything that sounds subjective, "feelings-ish", hard-to-immediately-transmit-to-others, etc. That might work if EA's path to improving the world is via donating more money to AMF or developing better vaccine tech, but it doesn't fly if making (and fostering) conceptual progress on AI alignment is the path to impact.
  - Ideally, it shouldn't merely be the case that EA technically allows people to candidly blurt out their imperfect, in-process thoughts about things. Rather, EA as a whole should be organized around making this the expected and default culture (at least to the degree that EAs agree with me about AI being a top priority), and this should be reflected in a thousand small ways in how we structure our conversation. Normal EA Forum conversations should look more like casual exchanges between peers at an academic conference, and less like polished academic papers (because polished academic papers are too inefficient a vehicle for making early-stage conceptual progress).
  - I think this is not only true for making direct AGI alignment progress, but is also true for converging about key macrostrategy questions (hard vs. soft takeoff; overall difficulty of alignment; probability of a sharp left turn; impressiveness of GPT-3; etc.). Insofar as we haven't already converged a lot on these questions, I think a major bottleneck is that we've tried too much to make our reasoning sound academic-paper-ish before it's really in that format, with the result that we confuse ourselves about our real cruxes, and people end up updating a lot less than they would in a normal back-and-forth.
- Highly polished, heavily privately reviewed and edited content tends to reflect the beliefs of larger groups, rather than the beliefs of a specific individual.
  - This often results in deference cascades, double-counting evidence, and herding: everyone is trying (to some degree) to bend their statements in the direction of what everyone else thinks. I think it also often creates "phantom updates" in EA, where there's a common belief that X is widely believed, but the belief is wrong to some degree (at least until everyone updates their outside views because they think other EAs believe X).
  - It also has various directly distortionary effects (e.g., a belief might seem straightforwardly true to all the individuals at an org, but doesn't feel like "the kind of thing" an organization writ large should endorse).

In principle, it's not impossible to push EA in those directions while also passing drafts a lot more in private. But I hope it's clearer why that doesn't seem like the top priority to me (and why it could be at least somewhat counter-productive) given that I'm working with this picture of our situation.

I'm happy to heavily signal-boost replies from DM and Anthropic staff (including editing the OP), especially if it shows that MIRI was just flatly wrong about how much those orgs already have a plan. And I endorse people docking MIRI points insofar as we predicted wrongly, here; and I'd prefer the world where people knew our first-order impressions of where the field's at in this case, and were able to dock us some points if we turn out to be wrong, as opposed to the world where everything happens in private.

(I think I still haven't communicated fully why I disagree here, but hopefully the pieces I have been able to articulate are useful on their own.)

lauren · 1y · 3

this approach to reasoning assumes authorities are valid. do not trust organizations this way. It is one of effective altruism's key failings. how can we increase pro-social distrust in effective altruism so that authorities are not trusted?

pseudonym · 1y · 2

I would be curious to hear the pushbacks from people who disagree-voted this!

Ofer · 1y · 4

From a cooperativeness perspective, people probably should not unilaterally create for-profit AGI companies.

(Note: Anthropic is a for-profit company that raised $704M according to Crunchbase, and is looking for engineers who want to build "large scale ML systems", but I wouldn't call them an "AGI company".)

RobBensinger · 1y · 4

Well, I wouldn't say that MIRI decided not to send drafts to DM etc. out of revenge, to punish them for making a strategic decision that seems extremely bad to me. What I'd say is that the norm 'savvy people freely talk about mistakes they think AGI orgs are making, without a bunch of friction' tends to save the world more often than the norm 'savvy people are unusually cautious about criticizing AGI orgs' does.

Indeed, I'd say this regardless of whether it was a good idea for someone to found the relevant AGI orgs in the first place. (I think it was a bad idea to create DM and to create OpenAI, but I don't think it's always a bad idea to make an AGI org, since that would be tantamount to saying that humanity should never build AGI.)

And we aren't totally helpless to follow the more world-destroying norm just because we think other people expect us to follow it; we can notice the problem and act to try to fix it, rather than contributing to a norm that isn't good. The pool of people who need to deliberately select the more-reasonable norm is not actually that large; it's a smallish professional network, not a giant slice of society.

kokotajlod · 1y · 4

I'm happy to see OpenAI and OpenAI Alignment Team get recognition/credit for having a plan and making it public. Well deserved I'd say. (ETA: To be clear, like the OP I don't currently expect the plan to work as stated; I expect us to need to pivot eventually & hope a better plan comes along before then!)