stuhlmueller
223 karma · Joined

Comments (22)
Another potential windfall I just thought of: the kind of AI scientist system discussed by Bengio in this talk (older writeup). The idea is to build a non-agentic system that uses foundation models and amortized Bayesian inference to construct compositional, interpretable world models and do inference over them. One application would be high-quality estimates of p(harm|action) for online monitoring of AI systems, but if it worked it would likely have other profitable use cases as well.
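To make the monitoring use case concrete, here is a minimal sketch of estimating p(harm|action) by sampling from a world model and gating actions on that estimate. The `WorldModel` class, its sampling interface, and the threshold are stand-ins I made up for illustration; Bengio's proposal involves amortized Bayesian inference over learned compositional world models, not a lookup table.

```python
# Toy sketch of the guardrail use case: estimate p(harm | state, action) by
# Monte Carlo sampling from a stand-in world model, and only allow actions
# whose estimated harm probability stays below a threshold. All interfaces
# and numbers here are illustrative assumptions.

import random
from dataclasses import dataclass


@dataclass
class WorldModel:
    """Toy stand-in: maps (state, action) pairs to a harm probability."""
    harm_rates: dict  # {(state, action): p_harm}

    def sample_outcome(self, state: str, action: str) -> bool:
        """Sample whether harm occurs under this (state, action)."""
        p = self.harm_rates.get((state, action), 0.0)
        return random.random() < p


def estimate_p_harm(model: WorldModel, state: str, action: str, n: int = 10_000) -> float:
    """Monte Carlo estimate of p(harm | state, action)."""
    return sum(model.sample_outcome(state, action) for _ in range(n)) / n


def guardrail(model: WorldModel, state: str, action: str, threshold: float = 0.01) -> bool:
    """Allow the action only if the estimated harm probability is below threshold."""
    return estimate_p_harm(model, state, action) < threshold


if __name__ == "__main__":
    model = WorldModel(harm_rates={("deploy", "send_email"): 0.001,
                                   ("deploy", "execute_shell"): 0.2})
    print(guardrail(model, "deploy", "send_email"))     # True: low estimated risk
    print(guardrail(model, "deploy", "execute_shell"))  # False: blocked
```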

A concrete version of this I've been wondering about the last few days: To what extent are the negative results on Debate (single-turn, two-turn) intrinsic to small-context supervision vs. a function of relatively contingent design choices about how people get to interact with the models?

I agree that misuse is a concern. Unlike alignment, I think it's relatively tractable because it's more similar to problems people are encountering in the world right now.

To address it, we can monitor and restrict usage as needed. The same tools that Elicit provides for reasoning can also be used to reason about whether a use case constitutes misuse.

This isn't to say we won't eventually need to invest a lot of resources in this, and it's interestingly related to alignment ("misuse" is relative to some set of values), but it feels a bit less open-ended.
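As a hedged sketch of what "using the same reasoning tools to reason about misuse" could look like: a single screening step that asks a language model whether a usage pattern should be escalated for human review. `ask_model`, the prompt, and the escalation rule are illustrative assumptions, not Elicit's actual monitoring setup.

```python
# Hypothetical misuse-screening step. `ask_model` stands in for whatever
# language-model call a monitoring pipeline would use; the prompt wording
# and the YES/NO escalation rule are assumptions for illustration only.

from typing import Callable

AskModel = Callable[[str], str]  # prompt -> model response


def screen_for_misuse(usage_description: str, ask_model: AskModel) -> bool:
    """Return True if the described usage should be escalated for human review."""
    prompt = (
        "You are reviewing how a research assistant tool is being used.\n"
        f"Usage description: {usage_description}\n"
        "Could this plausibly constitute misuse (e.g. generating disinformation "
        "or assisting with harmful activities)? Answer YES or NO, then explain."
    )
    verdict = ask_model(prompt)
    return verdict.strip().upper().startswith("YES")
```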

Elicit is using the Semantic Scholar Academic Graph dataset. We're working on expanding to other sources. If there are particular sources that would be helpful, please message me.
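For anyone curious what that data source looks like, here is a minimal sketch that queries the public Semantic Scholar Graph API for papers matching a search term. This only illustrates the underlying dataset; it is not Elicit's ingestion or search pipeline, and the field list is just an example.

```python
# Minimal sketch of searching the Semantic Scholar Academic Graph via its
# public API (api.semanticscholar.org). Illustrative only; not Elicit's
# actual pipeline.

import requests


def search_papers(query: str, limit: int = 5) -> list[dict]:
    """Search the Semantic Scholar Graph API for papers matching `query`."""
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": query, "limit": limit, "fields": "title,year,abstract"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])


if __name__ == "__main__":
    for paper in search_papers("factored cognition"):
        print(paper.get("year"), "-", paper.get("title"))
```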

Have you listened to the 80k episode with Nova DasSarma from Anthropic? They might have cybersecurity roles. The closest we have right now is devops, which (btw, if anyone is reading this comment) we're really bottlenecked on; we'd love intros to great people.

No, it's that our case for alignment doesn't rest on "the system is only giving advice" as a step. I sketched the actual case in this comment.

Oh, forgot to mention Jonathan Uesato at DeepMind, who's also very interested in advancing the ML side of factored cognition.

The properties we're aiming for that make submodels easier to align (see the sketch after this list):

  • (Inner alignment) Smaller models, making it less likely that there’s scheming happening that we’re not aware of; making the bottom-up interpretability problem easier
  • (Outer alignment) More well-specified tasks, making it easier to generate a lot of in-distribution feedback data; making it easier to do targeted red-teaming
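A minimal sketch of what this could look like in practice, under my own assumptions about the interfaces: a task decomposed into small, narrowly specified subtasks, each handled by its own model call, so each step can get in-distribution feedback and be red-teamed in isolation. `SmallModel` and the subtask prompts are hypothetical stand-ins, not Ought's implementation.

```python
# Hedged sketch of factored cognition with small, well-specified submodels.
# Each subtask has a narrow input/output contract, so adversarial probes and
# feedback data can target one step at a time rather than the end-to-end system.

from typing import Callable

SmallModel = Callable[[str], str]  # prompt -> response from a small, auditable model


def answer_with_submodels(question: str, search: SmallModel,
                          summarize: SmallModel, synthesize: SmallModel) -> str:
    """Answer a question by composing three narrowly specified subtasks."""
    evidence = search(f"Find passages relevant to: {question}")
    summary = summarize(f"Summarize the key claims in: {evidence}")
    return synthesize(f"Given the summary '{summary}', answer: {question}")


def red_team_subtask(subtask: SmallModel, probes: list[str]) -> list[str]:
    """Targeted red-teaming: run adversarial probes against one subtask
    and return its outputs for human review."""
    return [subtask(probe) for probe in probes]
```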

For AGI there isn't much of a distinction between giving advice and taking actions, so this isn't part of our argument for safety in the long run. But between now and AGI, it's better to focus on supporting reasoning to help us figure out how to manage this precarious situation.
