
AI Safety was, a decade ago, nearly synonymous with obscure mathematical investigations of hypothetical agentic systems. Fortunately or unfortunately, this has largely been overtaken by events; the successes of machine learning and the promise, or threat, of large language models have pushed thoughts of mathematics aside for many in the “AI Safety” community. The once pre-eminent advocate of this class of “agent foundations” research for AI safety, Eliezer Yudkowsky, has more recently said that timelines are too short to allow this agenda to have a significant impact. This conclusion seems at best premature.

Foundational research is useful for prosaic alignment

First, the value of foundational and mathematical research can be synergistic both with technical progress on safety and with insight into how and where safety is critical. Many machine learning research agendas for safety are investigating issues identified years earlier by foundational research, and are at least partly informed by that research. Current mathematical research could play a similar role in the coming years, as funding and researchers become increasingly available for safety work. We have also repeatedly seen the importance of foundational research arguments in discussions of policy, from Bostrom’s book to policy discussions at OpenAI, Anthropic, and DeepMind. These connections may be more conceptual than direct, but they are still relevant.

Long timelines are possible

Second, timelines are uncertain. Many claim that timelines based on technical progress are short, so that we have years, not decades, until safety must be solved. But this assumes that policy and governance approaches fail, and that we therefore need a full technical solution in the short term. It also seems likely that short timelines make all approaches less likely to succeed. On the other hand, if timelines for technical progress are longer, fundamental advances in understanding, such as those provided by more foundational research, are even more likely to assist in finding or building more technical routes toward safer systems.

Aligning AGI ≠ aligning ASI

Third, even if safety research succeeds at “aligning” AGI systems, via both policy and technical solutions, the challenges of ASI (Artificial SuperIntelligence) still loom large. One critical claim of AI-risk skeptics is that recursive self-improvement is speculative, so we do not need to worry about ASI, at least yet. They also often assume that policy and prosaic alignment are sufficient, or that approximate alignment of near-AGI systems will allow them to approximately align more powerful systems. Given any of those assumptions, they imagine a world where humans and AGI will coexist, so that even if AGI captures an increasing fraction of economic value, it won’t be fundamentally uncontrollable. And even according to so-called Doomers, in that scenario it is likely that, for some period of time, policy changes, governance, limited AGI deployment, and human-in-the-loop and similar oversight methods to limit or detect misalignment will be enough to keep AGI in check. This provides a stop-gap solution, optimistically for a decade or even two - a critical period - but is insufficient later. And despite OpenAI’s recent announcement that they plan to solve Superalignment, there are strong arguments that control of strongly superhuman AI systems will not be amenable to prosaic alignment, and that policy-centric approaches will not allow control.

Resource Allocation

Given the above claims, a final objection is based on resource allocation, in two parts. First, if language model safety were still strongly funding-constrained, those areas would be higher leverage, and foundational and mathematical research would be a less beneficial marginal use of funds. Similarly, if the individuals likely to contribute to mathematical AI safety were all just as well suited to computational deep learning safety research, their skills might be better directed towards machine learning safety. Neither of these is the case.

Of course, investments in agent foundations research are unlikely to directly lead to safety within a few years, and it would be foolish to abandon or short-change efforts that are critical to the coming decade. But even in the short term, these approaches may continue to have important indirect effects, including both deconfusion and informing other approaches.

As a final point, pessimistically, these types of research are among the least capabilities-relevant AI safety work being considered, so they are low risk. Optimistically, this type of research is very useful in the intermediate-term future, and is invaluable should we manage to partially align language models and need to consider what comes next for alignment.

Thank you to Vanessa Kosoy and Edo Arad for helpful suggestions and feedback. All errors are, of course, my own.

Comments

I enjoyed this post. Short and to the point. 

I'd like to add that the stakes are high enough to justify pushing resources into every angle we might reasonably have on the problem. Even if foundational research has only a sliver of a chance of impacting future alignment, that sliver contains quite a lot of value. And I do think it's in fact quite a bit more than a sliver. 

My only caveat is that lots of work that is supposed to "help" with reducing existential AI risk is net-negative, due to accelerating capabilities, creating race dynamics, enabling dangerous misuse, etc. But it seems much less likely to be a risk for the type of work described in the post.

Nice post; agree with most of it.

One key strength of mathematical AI alignment work is that it's probably extremely cheap compared to 'technical AI alignment work' that requires a lot of skilled programming and computational resources. (Just as mathematical evolutionary theory is much, much cheaper to fund than empirical evolutionary genomics research.)

I would just make a plea for more use of game theory in mathematical AI alignment work. The Yudkowsky-style agent foundations work is valuable. But I think a lot of alignment issues boil down to game-theoretic issues, and we've had a huge amount of increasingly sophisticated work on game theory since the foundational work in the 1940s and 1950s. This goes far, far beyond the pop science accounts of the Prisoner's Dilemma, the Ultimatum Game, the Tragedy of the Commons, and other 'Game Theory 101' examples that many EAs are familiar with.
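As a concrete reminder of the ‘Game Theory 101’ baseline this comment refers to, here is a minimal sketch of the one-shot Prisoner’s Dilemma, using standard textbook payoff values (assumed for illustration), which verifies that mutual defection is the unique Nash equilibrium:

```python
# One-shot Prisoner's Dilemma with conventional textbook payoffs.
# Each entry maps an action profile to (row player payoff, column player payoff).
payoffs = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # row cooperates, column defects
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection
}
actions = ["C", "D"]

def best_response(opponent_action, player):
    """Action maximizing this player's payoff against a fixed opponent action."""
    def payoff(a):
        profile = (a, opponent_action) if player == 0 else (opponent_action, a)
        return payoffs[profile][player]
    return max(actions, key=payoff)

# A profile is a (pure) Nash equilibrium if each action is a best response
# to the other player's action.
nash = [
    (r, c) for r in actions for c in actions
    if best_response(c, 0) == r and best_response(r, 1) == c
]
print(nash)  # [('D', 'D')] - defection dominates despite (C, C) being better for both
```

The more sophisticated modern work the comment alludes to (repeated games, mechanism design, games with incomplete information) builds on exactly this kind of best-response analysis.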

I’m curious whether people (e.g., David, MIRI folk) think that LLMs now or in the near future would be able to substantially speed up this kind of theoretical safety work?

I would prefer a pause on more capable LLMs, in part to give us time to figure out how to align these systems. As I argued, I think mathematical approaches are potentially critical there. But yes, general intelligences could help - I just don't expect them to be differentially valuable for mathematical safety over capabilities, so if they are capable of this type of work, it's a net loss.

Great post. I basically agree, but in a spirit of devil's advocating, I will say: when I turn my mind to agent foundations thinking, I often find myself skirting queasily close to concepts which feel also capabilities-relevant (to the extent that I have avoided publicly airing several ideas for over a year).

I don't know if that's just me, but it does seem that some agent foundations content from the past has also had bearing on AI capabilities - especially if we include decision theory stuff, dynamic programming and RL, search, planning etc. which it's arguably artificial to exclude. How would you ex ante distinguish e.g. work which explores properties and constraints of hypothetical planning routines from work which informs creation of more effective planning routines? This sort of thinking seems relevant.
