OCB (Owen Cotton-Barratt)

Sequences

Reflection as a strategic goal
On Wholesomeness
Everyday Longtermism

Comments

While not providing anything like a solution to the central issue here, I want to note that it looks likely to be the middle classes that get hollowed out first -- human labour for all kinds of physical tasks is likely to retain value for longer than many kinds of desk-based work, because scaling up and deploying robotics to replace physical labour would take significant time, whereas scaling up the automation of desk-based tasks can happen relatively quickly.

Thanks for exploring this, I found it quite interesting. 

I'm worried that casual readers might come away with the impression that "these risk-compensation dynamics for safety work are obviously a big deal for AI risk". But I think this is unclear, because we may not have the key property (which you call assumption (b)).

Intuitively I'd describe this property as "meaningful restraint", i.e. people are holding back a lot from what they might achieve if they weren't worried about safety. I don't think this is happening in the world at the moment. It seems plausible that it will never happen -- i.e. the world will be approximately full steam ahead until it gets death or glory. In this case there is no compensation effect, and safety work is purely good in the straightforward way.

To spell out the scenario in which safety work now could be bad because of risk compensation: perhaps in the future everyone is meaningfully restrained, but if more work on how to build things safely has been done ahead of time, they're less worried and so less restrained. I think this is a realistic possibility. But I think that world is made much safer if there is less variance between different actors' models of how much risk there is, so that the actor who presses ahead isn't an outlier in not expecting risk. Relatedly, I think we're much more likely to reach such a scenario if many people have got onto a similar page about the levels of risk. And a lot of "technical safety" work at the moment (and certainly not just "evals") is importantly valuable for helping people build common pictures of the character of the risk, and of how high risk levels are under various degrees of safety measures. So a lot of what people think of as safety work actually looks good even in exactly the scenario where we might get >100% risk compensation.

All of this isn't to say "risk compensation shouldn't be a concern", but more like "I think we're going to have to model this at a finer granularity to get a sense of when it might or might not be a concern in the particular case of technical AI safety work".

A small point of confusion: taking U(C) = C (+ a constant) by appropriate parametrization of C is an interesting move. I'm not totally sure what to think of it; I can see that it helps here, but it seems to make it quite hard to develop good intuitions about the shape of P. The one clear intuition I do have about the shape of P is that there should be some C > 0 where P is 0, regardless of S, because there are clearly some useful applications of AI which pose no threat of existential catastrophe. But your baseline functional form for P excludes this possibility. I'm not sure how much this matters, because as you say the conclusions extend to a much broader class of possible functions (not all of which exclude this kind of shape), but the tension makes me want to check I'm not missing something?
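To state the tension slightly more formally (a sketch; the illustrative functional form below is mine, not from your post): the intuition is

    there exists some C* > 0 such that P(C*, S) = 0 for all S,

whereas any baseline which is strictly positive whenever C > 0 (for instance something of the shape P(C, S) = 1 - exp(-C * g(S)) with g > 0, purely for illustration) gives P(C, S) > 0 for every C > 0, and so cannot accommodate that intuition.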

Maybe? It seems a bit extreme for that; I think 5/6 of the "disagree" votes came in over a period of an hour or two mid-evening UK time. But it could certainly just be coincidence, or a group of people happening to discuss it and all disagree, or something.

OK actually there's been a funny voting pattern on my top-level comment here, where I mostly got a bunch of upvotes and agree-votes, and then a whole lot of downvotes and disagree-votes in one cluster, and then mostly upvotes and agree-votes since then. Given the context, I feel like I should be more open than usual to a "shenanigans" hypothesis, which feels like it would be modest supporting evidence for the original conclusion.

Anyone with genuine disagreement -- sorry if I'm rounding you into that group unfairly, and I'm still interested in hearing about it.

(If anyone disagreeing wants to get into explaining why, I'm interested. Honestly it would be more comforting to be wrong about this.)

When I first read this article I assumed it was written in good faith (and found it quite helpful). However, at this point I think it’s correct to assume that “Mark Fuentes” (an admitted pseudonym which has been used only to write about Torres) is misrepresenting their identity, and in particular likely has some substantial history of involvement with the EA community, and perhaps a history of beef with Torres, rather than having come to this topic as a disinterested party.

This view is based on:

  • Torres’s claims about patterns they’ve seen in criticism (part 3 of this; evidence I take as suggestive but by no means conclusive)
  • Mark’s refusal to consider any steps to verify their identity, instead inviting people to disregard the content in the section called “my story”
  • Some impressions I can’t fully unpack about the tone and focus of Mark’s comments on this post (and their private message to me), which seem better explained by their not having been a disinterested party than by their having been one
  • A view that we’re not supposed to give fully anonymous accounts the benefit of the doubt:
    • … in order not to be open to abuse by people claiming whatever identity most supports their points;
    • … because they’re not putting their reputation on the line;
    • … because the costs are smaller if they are incorrectly smeared (nothing attaches to any real person’s reputation).

With that assumption, I feel kind of upset. I’m not a fan of Torres, but I think grossly misrepresenting authorship is unacceptable, and it’s all the more important to call it out when it’s coming from someone I might otherwise find myself on the same side of an argument as. And while I expect that much of the content of the post is still valid, it’s harder to take at face value now that I more strongly suspect that the examples have been adversarially selected.

Hi Mark,

I wonder if you'd be willing to do something along the lines of privately verifying that your identity is roughly as described in your post? I think this could be pretty straightforward, and might help a bunch in making things clear and low-drama. (At present you're stating that the claims about your identity are a fabrication, but there's no way for external parties to verify this.)

I think that, from something like a game-theoretic perspective (i.e. to avoid creating incentives for certain types of escalation by someone willing to engage in bad faith), absent some verification it will be reasonable for observers to assume that Torres is correct that the anonymous account "Mark Fuentes" is misrepresenting itself as a disinterested party. (That would be relevant information for readers in interpreting the post, even if much of the content remained valid.)

Thanks for this exploration.

I do think that there are some real advantages to using the intentional stance for LLMs, and I think these will get stronger in the future when applied to agents built out of LLMs. But I don't think you've contrasted this with the strongest version of the design stance. My feeling is that the strongest version takes not humans-as-designers (which I agree is apt for software but not for ML), but the training-process-as-designer. I think this is more obvious if you think of an image classifier -- it's still ML, so it's not "designed" in a traditional sense, but the intentional stance seems much less helpful than thinking of it as having been designed-by-the-training-process to sort images into categories. This is analogous to understanding evolutionary adaptations of animals or plants as having been designed-by-evolution.

Taking this design stance on LLMs can lead you to "simulator theory", which I think has been fairly helpful in giving some insights about what's going on: https://www.lesswrong.com/tag/simulator-theory

I want to say thank you for holding the pole of these perspectives and keeping them in the dialogue. I think that they are important and it's underappreciated in EA circles how plausible they are.

(I definitely don't agree with everything you have here, but typically my view is somewhere between what you've expressed and what is commonly expressed in x-risk focused spaces. Often I'm also drawn to say "yeah, but ..." -- e.g. I agree that a treacherous turn is not so likely at global scale, but I don't think it's completely out of the question, and given that, I think safeguarding against it is worth serious attention.)
