Tom McGrath

80 karma

Comments
17

Seeing this is a positive update on how much to trust safety research on LLM dangers. A bit disappointing it’s getting much less traction than other papers with worse/nonexistent baselines though.

Cool, apologies if that came across a bit snarky (on rereading it does to me). I think this was instance N+1 of this phrasing and I'd gotten a bit annoyed by instances 1 through N which you obviously bear no responsibility for! I'm happy to have pushed back on the phrasing but hope I didn't cause offence.

A more principled version of (1) would be to appeal to moral uncertainty, or to the idea that a regulator should represent all the stakeholders, and I worry that an EA-dominated regulator would fail to do so.

A regulator overrepresenting EA seems bad to me (not an EA) because:

  1. I don't agree with a lot of the beliefs of the EA community on this subject and so I'd expect an EA-dominated regulator to take actions I don't approve of.
  2. Dominance by a specific group makes legitimacy much harder.
  3. The EA community is pretty strongly intertwined with the big labs so most of the concerns from there carry over.

I don't expect (1) to be particularly persuasive for you but maybe (2) and (3) are. I find some of the points in Ways I Expect AI Regulation To Increase X-Risk relevant to issues with overrepresentation of big labs. I think the overrepresentation of big labs would lead to a squashing of open-source, for instance, which I think is currently beneficial and would remain beneficial on the margin for a while.

More generally, I don't particularly like the flattening of specific disagreements on matters of fact (and thus subsequent actions) to "wants people to be safe"/"doesn't want people to be safe". I expect that most people who disagree about the right course of action aren't doing so out of some weird desire to see people harmed/replaced by AI (I'm certainly not) and it seems a pretty unfair dismissal.

I've certainly heard of your work but it's far enough out of my research interests that I've never taken a particularly strong interest. Writing this in this context makes me realise I might have made a bit of a one-man echo chamber for myself... Do you mind if we leave this as 'undecided' for a while?

Regarding ELK - I think the core of the problem as I understand it is fairly clear once you begin thinking about interpretability. Understanding the relation between AI and human ontologies was part of the motivation behind my work on AlphaZero (as well as an interest in the natural abstractions hypothesis). Section 4 "Encoding of human conceptual knowledge" and Section 8 "Exploring activations with unsupervised methods" are the places to look. I think the section on challenges and limitations in concept probing echoes a lot of the concerns in ELK.

In terms of subsequent work on ELK, I don't think much of the work on solving ELK was particularly useful, and it often reinvented existing methods (e.g. sparse probing, causal interchange interventions). If I were to try to work on it then I think the best way to do so would be to embed the core challenge in a tractable research program, for instance trying to extract new scientific knowledge from ML models like AlphaFold.

To move this in a more positive direction, the most fruitful/exciting conceptual work I've seen is probably (1) the natural abstractions hypothesis and (2) debate. When I think a bit about why I particularly like these, for (1) it's because it seems plausibly true, extremely useful if true, and amenable to both formal theoretical work and empirical study. For (2) it's because it's a pretty striking new idea that seems very powerful/scalable, but also can be put into practice a bit ahead of really powerful systems.

Yeah, this sounds right to me. At present I feel like a regulator would end up massively overrepresenting at least one of (a) the EA community and (b) large tech corporations with pretty obviously bad incentives.

(I'm spinning this comment out because it's pretty different in style and seems worth being able to reply to separately. Please let me know if this kind of chain-posting is frowned upon here.)

Another downside to declaring things empirically out of reach and relying on priors for your EV calculations and subsequent actions is that it more-or-less inevitably converts epistemic disagreements into conflict. 

If it seems likely to you that this is the way things are (and so we should pause indefinitely) but it seems highly unlikely to me (and so we should not) then we have no choice but to just advocate for different things. There's not even the prospect of having recourse to better evidence to win over third parties, so the conflict becomes no-holds-barred. I see this right now on Twitter and it makes me very sad. I think we can do better.

Thanks. I mean more in terms of "how can we productively resolve our disagreements about this?", which the EV calculations are downstream of. To be clear, it doesn't seem to me that this is necessarily the hand we've been dealt, but I'm not sure how to reduce the uncertainty.

At the risk of sidestepping the question, the obvious move seems to be "try harder to make the claim empirically testable"! For example, in the case of deception, which I think is a central example, we could (not claiming these ideas are novel):

  1. Test directly for deception behaviourally and/or mechanistically (I'm aware that people are doing this, think it's good and wish the results were more broadly shared).
  2. Think about what aspects of deception make it particularly hard, and try to study those in isolation and test those. The most important example seems to me to be precursors: finding more testable analogues to the question of "before we get good, undetectable deception do we get kind of crappy detectable deception?"

Obviously these all run some (imo substantially lower) risks but seem well worth doing. Before we declare the question empirically inaccessible we should at least do these and synthesise the results (for instance, what does grokking say about (2)?).

Thanks - this is clarifying. I think my confusion was down to not understanding the remit of the pause you're proposing. How about we carry on the discussion in the other comment on this?

Interesting - what do you have in mind for fast-progressing architectures explicitly aimed at creating AGI?

On your 2nd point on x-risks from non-LLM AI, am I right in thinking that you would also hope to catch dual-use scientific AI (for instance) in a compute governance scheme and/or pause? That's a considerably broader remit than I've seen advocates of a pause/compute restrictions argue for and seems much harder to achieve both politically and technically.

I’m trying to make “FLOPstacles” happen for things that mean we can’t just take max FLOP per GPU and multiply by number of GPUs, e.g. mem or interconnect bandwidth.
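To make that a bit more concrete, here's a rough sketch of the kind of calculation I have in mind. It uses the standard roofline bound (attainable FLOP/s is the smaller of peak compute and memory bandwidth times FLOPs per byte moved) plus a single made-up scaling-efficiency factor standing in for interconnect overhead. All function names and numbers are illustrative assumptions, not measurements of any real GPU or cluster.

```python
def naive_cluster_flops(peak_flops_per_gpu, num_gpus):
    """The naive estimate: max FLOP/s per GPU times number of GPUs."""
    return peak_flops_per_gpu * num_gpus


def roofline_flops_per_gpu(peak_flops_per_gpu, mem_bandwidth_bytes_per_s,
                           flops_per_byte):
    """Roofline bound for one GPU.

    A workload doing `flops_per_byte` FLOPs per byte moved from memory
    cannot exceed bandwidth * intensity, nor the chip's peak compute.
    """
    return min(peak_flops_per_gpu, mem_bandwidth_bytes_per_s * flops_per_byte)


def effective_cluster_flops(peak_flops_per_gpu, mem_bandwidth_bytes_per_s,
                            flops_per_byte, num_gpus, scaling_efficiency):
    """Per-GPU roofline bound, scaled by a crude efficiency factor.

    `scaling_efficiency` (between 0 and 1) is a hypothetical stand-in for
    interconnect and communication overhead; a real estimate would model
    the communication pattern explicitly.
    """
    per_gpu = roofline_flops_per_gpu(
        peak_flops_per_gpu, mem_bandwidth_bytes_per_s, flops_per_byte
    )
    return per_gpu * num_gpus * scaling_efficiency


# Illustrative numbers only (not taken from any real hardware spec).
PEAK = 1e15        # peak FLOP/s per GPU
BANDWIDTH = 2e12   # memory bandwidth in bytes/s
INTENSITY = 100.0  # FLOPs performed per byte moved
NUM_GPUS = 1000
EFFICIENCY = 0.8   # assumed fraction retained after interconnect overhead

naive = naive_cluster_flops(PEAK, NUM_GPUS)
effective = effective_cluster_flops(
    PEAK, BANDWIDTH, INTENSITY, NUM_GPUS, EFFICIENCY
)
print(f"naive estimate:     {naive:.3g} FLOP/s")
print(f"effective estimate: {effective:.3g} FLOP/s")
print(f"fraction of naive:  {effective / naive:.2f}")
```

With these made-up inputs the workload is memory-bound, so the effective figure lands well below the naive product. The gap between the two is what I'd call the FLOPstacles.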
