
A crucial consideration in assessing the risks of advanced AI is the moral value we place on "unaligned" AIs—systems that do not share human preferences—which could emerge if we fail to make enough progress on technical alignment.

In this post I'll consider three potential moral perspectives, and analyze what each of them has to say about the normative value of the so-called "default" unaligned AIs that humans might eventually create:

  1. Standard total utilitarianism combined with longtermism: the view that what matters most is making sure the cosmos is eventually filled with numerous happy beings.
  2. Human species preservationism: the view that what matters most is making sure the human species continues to exist into the future, independently from impartial utilitarian imperatives.
  3. Near-termism or present-person affecting views: what matters most is improving the lives of those who currently exist, or will exist in the near future.

I argue that from the first perspective, unaligned AIs don't seem clearly bad in expectation relative to their alternatives, since total utilitarianism is impartial to whether AIs share human preferences or not. A key consideration here is whether unaligned AIs are less likely to be conscious, or less likely to bring about consciousness, compared to alternative aligned AIs. On this question, I argue that there are considerations both ways, and no clear answers. Therefore, it tentatively appears that the normative value of alignment work is very uncertain, and plausibly approximately neutral, from a total utilitarian perspective.

However, technical alignment work is much more clearly beneficial from the second and third perspectives. This is because AIs that share human preferences are likely to both preserve the human species and improve the lives of those who currently exist. That said, under the third perspective, pausing or slowing down AI is far less valuable than under the second, since it forces existing humans to forgo benefits from advanced AI, which I argue will likely be very large.

I personally find moral perspectives (1) and (3) most compelling, and by contrast find view (2) to be uncompelling as a moral view. Yet it is only from perspective (2) that significantly delaying advanced AI for alignment reasons seems clearly beneficial, in my opinion. This is a big reason why I'm not very sympathetic to pausing or slowing down AI as a policy proposal.

While these perspectives do not exhaust the scope of potential moral views, and I do not address every relevant consideration in this discussion, I think this analysis can help to sharpen what goals we intend to pursue by promoting particular forms of AI safety work.

Unaligned AIs from a total utilitarian point of view

Let's first consider the normative value of unaligned AIs from the first perspective. From a standard total utilitarian perspective, entities matter morally if they are conscious (under hedonistic utilitarianism) or if they have preferences (under preference utilitarianism). From this perspective, it doesn't actually matter much intrinsically if AIs don't share human preferences, so long as they are moral patients and have their preferences satisfied.

The following is a prima facie argument that utilitarians shouldn't care much about technical AI alignment work. Utilitarianism is typically not seen as partial to human preferences in particular. Therefore, efforts to align AI systems with human preferences—the core aim of technical alignment work—may be considered approximately morally neutral from a utilitarian perspective.

The reasoning here is that changing the preferences of AIs to better align them with the preferences of humans doesn't by itself clearly seem to advance the aims of utilitarianism, in the sense of filling the cosmos with beings who experience positive lives. That's because AI preferences will likely be satisfied either way, whether we do the alignment work or not. In other words, on utilitarian grounds, it doesn't really matter whether the preferences of AIs are aligned with human preferences, or whether they are distinct from human preferences, per se: all that matters is whether the preferences get satisfied.

As a result, prima facie, technical alignment work is not clearly valuable from a utilitarian perspective. That doesn't mean such work is harmful, only that it's not obviously beneficial, and it's very plausibly neutral, from a total utilitarian perspective.

Will unaligned AIs be conscious, or create moral value?

Of course, the argument I have just given is undermined considerably if AI alignment work makes it more likely that future beings will be conscious. In that case, alignment work could clearly be beneficial on total hedonistic utilitarian grounds, as it would make the future more likely to be filled with beings who have rich inner experiences, rather than unconscious AIs with no intrinsic moral value.

But this proposition should be proven, not merely assumed, if we are to accept it. We can consider two general arguments for why the proposition might be true:

Argument one: Aligned AIs are more likely to be conscious, or have moral value, than unaligned AIs.

Argument two: Aligned AIs are more likely to have the preference of creating additional conscious entities and adding them to the universe than unaligned AIs, which would further the objectives of total utilitarianism better than the alternative.

Argument one: aligned AIs are more likely to be conscious, or have moral value, than unaligned AIs

As far as I can tell, the first argument appears to rest on a confusion. There seems to be no strong connection between alignment work and making AIs conscious. Intuitively, whether AIs are conscious is a fundamental property of their underlying cognition, rather than a property of their preferences. Yet, AI alignment work largely only targets AI preferences, rather than trying to make AIs more conscious. Therefore, AI alignment work seems to target AI consciousness only indirectly, if at all.

My guess is that the intuition behind this argument often derives from a stereotyped image of what unaligned AIs might be like. The most common stereotyped image of an unaligned AI is the paperclip maximizer. More generally, it is often assumed that, in the absence of extraordinary efforts to align AIs with human preferences, they are likely to be alien-like and/or have "random" preferences instead. Based on a Twitter poll of mine, I believe this stereotyped image likely plays a major role in why many EAs think that unaligned AI futures would have very little value from a utilitarian perspective. 

In a previous post about the moral value of unaligned AI, Paul Christiano wrote,

Many people have a strong intuition that we should be happy for our AI descendants, whatever they choose to do. They grant the possibility of pathological preferences like paperclip-maximization, and agree that turning over the universe to a paperclip-maximizer would be a problem, but don’t believe it’s realistic for an AI to have such uninteresting preferences.

I disagree. I think this intuition comes from analogizing AI to the children we raise, but that it would be just as accurate to compare AI to the corporations we create. Optimists imagine our automated children spreading throughout the universe and doing their weird-AI-analog of art; but it’s just as realistic to imagine automated PepsiCo spreading throughout the universe and doing its weird-AI-analog of maximizing profit.

By contrast, I think I'm broadly more sympathetic to unaligned AIs. I personally don't think my intuition here comes much from analogizing AI to the children we raise, but instead comes from trying to think clearly about the type of AIs humans are likely to actually build, even if we fall short of solving certain technical alignment problems.

First, it's worth noting that even if unaligned AIs have "random" preferences (such as maximizing PepsiCo profit), they could still be conscious. For example, one can imagine a civilization of conscious paperclip maximizers, who derive conscious satisfaction from creating more paperclips. There does not seem to be anything contradictory about such a scenario. And if, as I suspect, consciousness arises naturally in minds with sufficient complexity and sophistication, then it may even be difficult to create an AI civilization without consciousness, either aligned or unaligned.

To understand my point here, consider that humans already routinely get conscious satisfaction from achieving goals that might at first seem "arbitrary" when considered alone. For example, one human's personal desire for a luxury wristwatch does not seem, on its own, to be significantly more morally worthy to me than an AI's desire to create a paperclip. However, in both cases, achieving the goal could have the side effect of fulfilling preferences inside of a morally relevant agent, creating positive value from a utilitarian perspective, even if the goal itself is not inherently about consciousness.

In this sense, even if AIs have goals that seem arbitrary from our perspective, this does not imply that the satisfaction of those goals won't have moral value. Just like humans and other animals, AIs could be motivated to pursue their goals because it would make them feel good, in the broad sense of receiving subjective reward or experiencing satisfaction. For these reasons, it seems unjustified to hastily move from "unaligned AIs will have arbitrary goals" to "therefore, an unaligned AI civilization will have almost no moral value". And under a preference utilitarian framework, this reasoning step seems even less justified, since in that case it shouldn't matter much at all whether the agent is conscious in the first place.

Perhaps more important to my point, it is not clear why we should assume that a default "unaligned" AI will be more alien-like than an aligned AI in morally relevant respects, given that both will likely have similar pretraining. Even if we just focus on AI preferences rather than AI consciousness, it seems most likely to me that unaligned AI preferences will be similar to, but not exactly like, the preferences implicit in the distribution they were trained on. Since AIs will likely be pretrained in large part on human data, this distribution will likely include tons of human concepts about what things in the world hold value. These concepts will presumably play a large role in AI preference formation, even if not in a way that exactly matches the actual preferences of humans.

To be clear: I think it's clearly still possible for an AI to pick up human concepts while lacking human preferences. However, a priori, I don't think there's much reason to assume that value misalignment among AGIs that humans actually create will be as perverse and "random" as wanting to maximize the number of paperclips in existence, assuming these AIs are pretrained largely on human data. In other words, I'm not convinced that alignment work per se will make AIs less "alien" in the morally relevant sense here. 

For the reasons stated above, I find the argument that technical alignment work makes AIs more likely to be conscious very uncompelling.

Is consciousness rare and special?

My guess is that many people hold the view that unaligned AIs will have little moral value because they think consciousness might be very rare and special. In this view, it may be argued that consciousness is not something that's likely to arise by chance in almost any circumstance, unless there exist humans that deliberately aim to bring it about. However, I suspect this argument relies on a view of consciousness that is likely overly parochial and simplistic.

Consciousness has arguably already arisen independently multiple times in evolutionary history. For example, it is widely believed in many circles (such as among my Twitter followers) that both octopuses and humans are conscious, despite the fact that the last common ancestor of these species was an extremely primitive flatworm-like creature that lived prior to the Cambrian explosion 530 million years ago—a point which is often taken to be the beginning of complex life on Earth. If consciousness could arise independently multiple times in evolutionary history—in species that share almost no homologous neural structures—it seems unlikely to be a rare and special part of our universe.

More generally, most theories of consciousness given by philosophers and cognitive scientists do not appear to give much significance to properties that are unique to biology. Instead, these theories tend to explain consciousness in terms of higher-level information processing building blocks that AIs and robots will likely share with the animal kingdom.

For example, under either Global Workspace Theory or Daniel Dennett's Multiple Drafts Model, it seems quite likely that very sophisticated AIs—including those that are unaligned with human preferences—will be conscious in a morally relevant sense. If future AIs are trained to perform well in complex, real-world physical environments, and are subject to the same types of constraints that animals had to deal with during their evolution, the pressure for them to evolve consciousness seems plausibly as strong as it was for biological organisms.

Why do I have sympathy for unaligned AI?

As a personal note, part of my moral sympathy for unaligned AI comes from generic cosmopolitanism about moral value, which I think is intrinsically downstream from my utilitarian inclinations. As someone with a very large moral circle, I'm happy to admit that strange—even very alien-like—beings could have substantial moral value in their own right, even if they do not share human preferences. 

In addition to various empirical questions, I suspect a large part of the disagreement about whether unaligned AIs will have moral value comes down to how much people think these AIs need to be human-like in order for them to be important moral patients. In contrast to perhaps most effective altruists, I believe it is highly plausible that unaligned AIs will have just as much of a moral right to exist and satisfy their preferences as we humans do, even if they are very different from us.

Argument two: aligned AIs are more likely to have a preference for creating new conscious entities, furthering utilitarian objectives

The second argument for the proposition seems more plausible to me. The idea here is simply that one existing human preference is to bring about more conscious entities into existence. For example, total utilitarians have such a preference, and some humans are (at least partly) total utilitarians. If AIs broadly share the preferences of humans, then at least some AIs will share this particular preference, and we can therefore assume that at least some aligned AIs will try to further the goals of total utilitarianism by creating conscious entities.

While I agree with some of the intuitions behind this argument, I think it's ultimately quite weak. The fraction of humans who are total utilitarians is generally considered to be small. And outside of a desire to have children—which is becoming progressively less common worldwide—most humans do not regularly express explicit and strong preferences to add new conscious entities to the universe.

Indeed, the most common motive humans have for bringing into existence additional conscious entities seems to be to use them instrumentally to satisfy some other preference, such as the human desire to eat meat. This is importantly distinct from wanting to create conscious creatures as an end in itself. Plus, many people have moral intuitions that are directly contrary to total utilitarian recommendations. For example, while utilitarianism generally advocates intervening in wild animal habitats to improve animal welfare, most people favor habitat preservation, and keeping habitats "natural" instead.

At the least, one can easily imagine an unaligned alternative that's better from a utilitarian perspective. Consider a case where unaligned AIs place less value on preserving nature than humans do. In this situation, unaligned AIs might generate more utilitarian benefit by transforming natural resources to optimize some utilitarian objective, compared to the actions of aligned AIs constrained by human preferences. Aligned AIs, by respecting human desires to conserve nature, would be forgoing potential utilitarian gains. They would refrain from exploiting substantial portions of the natural world, thus preventing the creation of physically constructed systems that could hold great moral worth, such as datacenters built on protected lands capable of supporting immense numbers of conscious AIs.

The human desire to preserve nature is not the only example of a human preference that conflicts with utilitarian imperatives. See this footnote for an additional concrete example.[1]

My point here is not that humans are anti-utilitarians in general, but merely that most humans have a mix of moral intuitions, and some of these intuitions act against the recommendations of utilitarianism. Therefore, empowering human preferences does not, on a first approximation, look a lot like "giving control over to utilitarians" but rather something different entirely. In general, the human world does not seem well-described as a bunch of utilitarian planners trying to maximize global utility.

And again, unaligned AI preferences seem unlikely to be completely alien or "random" compared to human preferences if AIs are largely trained from the ground-up on human data. In that case, I expect AI moral preferences will most likely approximate human moral preferences to some degree by sharing high-level concepts with us, even if their preferences do not exactly match up with human preferences. Furthermore, as I argued in the previous section, if AIs themselves are conscious, it seems natural for some of them to care about—or at least be motivated by—conscious experience, similar to humans and other animals.

In my opinion, the previous points further undermine the idea that unaligned AI moral preferences will be clearly less utilitarian than the (already not very utilitarian) moral preferences of most humans. In fact, by sharing moral concepts with us, unaligned AIs could be more utilitarian than humans (clearing an arguably already low bar), even if they do not share human preferences.

Moreover, I have previously argued that, if humans solve AI alignment in the technical sense, the main thing that we'll do with our resources is maximize the economic consumption of existing humans at the time of alignment. These preferences are distinct from utilitarian preferences because they are indexical: people largely value happiness, comfort, and wealth for themselves and their families, not for the world as a whole or for all future generations. Notably, this means that human moral preferences are likely to be comparatively unimportant even in an aligned future, relative to the more ordinary economic forces that already shape our world.

Consequently, in a scenario where AIs are aligned with human preferences, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations. To put it another way, the key factor influencing whether AIs are conscious in this scenario will be the relative efficiency of creating conscious AIs compared to unconscious ones for producing the goods and services demanded by future people. As these efficiency factors are likely to be similar in both aligned and unaligned scenarios, we have little reason to believe that aligned AIs will generate more consciousness as a byproduct of consumption compared to unaligned AIs.

To summarize my argument in this section:

  1. To the extent AI preferences are unaligned with human moral preferences, it's not clear this is worse than the alternative under an aligned scenario from a utilitarian perspective, since human moral preferences are a complex sum of both utilitarian and anti-utilitarian intuitions, and the relative strength of these forces is not clear. Some human moral intuitions, if empowered, would directly act against the recommendations of utilitarianism. Overall, I don't see strong reasons to think that empowering human moral preferences would advance the objectives of utilitarianism better than a "default" unaligned alternative on net. Since unaligned AIs may share moral concepts with humans, they could plausibly even care more about achieving utilitarian objectives than humans do. At the least, it is easy to imagine unaligned alternatives that perform better by utilitarian lights compared to "aligned" scenarios.
  2. To the extent that AI preferences are unaligned with human economic consumption preferences, I also see no clear argument for why this would be worse, from a utilitarian perspective, than the alternative. Utilitarianism has no intrinsic favoritism for human consumption preferences over e.g. alien-like consumption preferences. In both the case of human consumption and alien-like consumption, consciousness will likely arise as a byproduct of consumption activity. However, if consciousness merely arises as a byproduct of economic activity, there's no clear reason to assume it will be more likely to arise when it's a byproduct of human consumption preferences compared to when it's a byproduct of non-human consumption preferences.
  3. Aligned AIs will likely primarily be aligned to human economic consumption preferences, rather than human moral preferences, strengthening point (2).

In conclusion, I find only weak reasons to believe that utilitarian objectives, such as filling the universe with happy beings, are more likely to be achieved if AIs are aligned with human preferences than if they are unaligned. While I do not find the premise here particularly implausible, I also think there are reasonable considerations in both directions. In other words, it seems plausible to me that AI alignment could be either net-bad or net-good from a total utilitarian perspective, and I currently see no strong reasons to think the second possibility is more likely than the first.

As a consequence, this line of reasoning doesn't move me very strongly towards thinking that AI alignment is morally valuable from a utilitarian perspective. Instead, competing moral considerations about AI alignment—and in particular, its propensity to make humans specifically better off—appear to be stronger reasons to think AI alignment is morally worth pursuing.

Human species preservationism

The case for AI alignment being morally valuable is much more straightforward from the perspective of avoiding human extinction. The reason is that, by definition, AI alignment is about ensuring that AIs share human preferences, and one particularly widespread and strong human preference is the desire to avoid death. If AIs share human preferences, it seems likely they will try to preserve the individuals in the human species, and as a side effect, they will likely preserve the human species itself.

According to a standard argument popular in EA and longtermism, reducing existential risk should be a global priority above most other issues, as it threatens not only currently living people, but also the lives of the much more numerous population of people who could one day inhabit the reachable cosmos. As Nick Bostrom put it, "For standard utilitarians, priority number one, two, three and four should consequently be to reduce existential risk. The utilitarian imperative “Maximize expected aggregate utility!” can be simplified to the maxim “Minimize existential risk!”."

Traditionally, human extinction has been seen as the prototypical existential risk. However, as some have noted, human extinction from AI differs fundamentally from scenarios like a giant Earth-bound asteroid. This is because unaligned AIs would likely create a cosmic civilization in our absence. In other words, the alternative to a human civilization in the case of an AI existential catastrophe is merely an AI civilization, rather than an empty universe devoid of any complex life.

The preceding logic implies that we cannot simply assume that avoiding human extinction from AI is a utilitarian imperative, as we might assume for other existential risks. Indeed, if unaligned AIs are more utility-efficient compared to humans, it may even be preferred under utilitarianism for humans to create unaligned AIs, even if that results in human extinction.

Nonetheless, it is plausible that we should not be strict total longtermist utilitarians, and instead hold the (reasonable) view that human extinction would still be very bad, even if we cannot find a strong utilitarian justification for this conclusion. I concur with this perspective, but dissent from the view that avoiding human extinction should be a priority that automatically outranks other large-scale concerns, such as reducing ordinary death from aging, abolishing factory farming, reducing global poverty, and solving wild animal suffering. 

In my view, the main (though not only) reason why human extinction from AI would be bad is that it would imply the deaths of all humans who exist at the time of AI development. But as bad as such a tragedy would be, in my view, it would not be far worse than the gradual death of billions of humans, over a period of several decades, which is the literal alternative humans already face in the absence of radical life extension.

I recognize that many people disagree with my moral outlook here and think that human extinction would be far worse than the staggered, individual deaths of all existing humans from aging over several decades. My guess is that many people disagree with me on this point because they care about the preservation of the human species over and above the lives and preferences of individual people who currently exist. We can call this perspective "human species preservationism".

This view is not inherently utilitarian, as it gives priority to protecting the human species rather than trying to promote the equal consideration of interests. In this sense, it is a speciesist view, in the basic sense that it discriminates on the basis of species membership, ruling out even in principle the possibility of an equally valuable unaligned AI civilization. The fact that this moral view is speciesist is, in my opinion, a decent reason to reject it.[2]

Human species preservationism is also importantly distinguished from near-termism or present-person affecting views, i.e., the view that what matters is improving the lives of people who either already exist or will exist in the near future. That's because, under the human species preservationist view, it is often acceptable to impose large costs on the current generation of humans, so long as it does not significantly risk the long-term preservation of the human species. For example, the human species preservationist view would find it acceptable to delay a cure to biological aging by 100 years (thereby causing billions of people to die premature deaths) if this had the effect of decreasing the probability of human extinction by 1 percentage point.

Near-termist view of AI risk

As alluded to previously, if we are neither strong longtermist total utilitarians nor care particularly strongly about the preservation of the human species inherently, a plausible alternative is to care primarily about the current generation of humans, or the people who will exist in say, the next 100 years.[3] We can call this the "near-termist view". In both the human species preservationist view and the near-termist view, AI alignment is clearly valuable, as aligned AIs would have strong reasons to protect the existence of humans. In fact, alignment is even more directly valuable in the near-termist view, as it would obviously be good for currently-existing humans for AIs to share their preferences.

However, the near-termist ethical view departs significantly from the human species preservationist view in the way it views slowing down or pausing technological progress, including AI development. This is because slowing down technological progress would likely impose large opportunity costs on the present generation of humans.

Credible economic models suggest that AIs could make humans radically richer. And it is highly plausible that, if AIs can substitute for scientific researchers, they could accelerate technological progress, including in medicine, extending human lifespan and health-span. This would likely dramatically raise human well-being over a potentially short time period. Since these anticipated gains are very large, they can plausibly outweigh even relatively large probabilities of death on an individual level.

As an analogy, most humans are currently comfortable driving in cars, even though the lifetime probability of a car killing you is greater than 1%. In other words, most humans appear to judge—through their actions—the benefits and convenience of cars as outweighing a 1% lifetime probability of death. Given the credible economic models of AI, it seems likely to me that the benefits of adopting AI are far larger than even the benefits of having access to cars. As a consequence, from the perspective of people who currently exist, it is not obvious that we should delay AI progress substantially even if there is a non-negligible risk that AIs will kill all humans.
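As a rough sanity check on that 1% figure, here is a back-of-the-envelope sketch in Python. The inputs (roughly 40,000 US motor-vehicle deaths per year, a population of about 330 million, and an 80-year exposure window) are assumed round numbers of mine rather than figures from this post, so the output should be read as an approximation only.

```python
# Back-of-the-envelope check of the ~1% lifetime driving-risk figure.
# All inputs are assumed round numbers, not authoritative statistics.

annual_deaths = 40_000          # approximate US motor-vehicle deaths per year
population = 330_000_000        # approximate US population
years_of_exposure = 80          # approximate years of lifetime exposure

annual_risk = annual_deaths / population                      # ~1.2 in 10,000 per year
lifetime_risk = 1 - (1 - annual_risk) ** years_of_exposure    # compound over a lifetime

print(f"approximate annual risk:   {annual_risk:.5f}")
print(f"approximate lifetime risk: {lifetime_risk:.3f}")      # roughly 0.01, i.e. ~1%
```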

As far as I'm aware, the state-of-the-art for modeling this trade-off comes from Chad Jones. A summary of his model is provided as follows,

The curvature of utility is very important. With log utility, the models are remarkably unconcerned with existential risk, suggesting that large consumption gains that A.I. might deliver can be worth gambles that involve a 1-in-3 chance of extinction.

For CRRA utility with a risk aversion coefficient (γ) of 2 or more, the picture changes sharply. These utility functions are bounded, and the marginal utility of consumption falls rapidly. Models with this feature are quite conservative in trading off consumption gains versus existential risk. 

These findings even extend to singularity scenarios. If utility is bounded — as it is in the standard utility functions we use frequently in a variety of applications in economics — then even infinite consumption generates relatively small gains. The models with γ ≥ 2 remain conservative with regard to existential risk. 

A key exception to this conservative view of existential risk emerges if the rapid innovation associated with A.I. leads to new technologies that extend life expectancy and reduce mortality. These gains are “in the same units” as existential risk and do not run into the sharply declining marginal utility of consumption. Even with a future-oriented focus that comes from low discounting, A.I.-induced mortality reductions can make large existential risks bearable. [emphasis mine]

In short, because of the potential benefits to human wealth and lifespan—from the perspective of people who exist at the time of AI development—it may be beneficial to develop AI even in the face of potentially quite large risks of human extinction, including perhaps a 1-in-3 chance of extinction. If true, this conclusion significantly undermines the moral case for delaying AI for safety or alignment reasons.
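To make the qualitative pattern in the quoted summary more concrete, the following is a minimal sketch of a toy one-period tradeoff, written in Python. To be clear, this is my own illustrative construction and not Jones's actual model: the value-of-life constant u_bar and the consumption multiplier k are arbitrary assumed parameters, and the point is only to show why bounded (gamma >= 2) utility tolerates far less extinction risk than log utility.

```python
# Toy one-period model (an illustrative assumption, not Jones's actual model).
# Flow utility: u(c) = u_bar + c^(1-gamma) / (1-gamma), or u_bar + ln(c) when gamma = 1,
# where u_bar is a constant standing in for the value of being alive, and extinction is
# normalized to utility 0. The agent compares the status quo (consumption c = 1) against
# a gamble that multiplies consumption by k with probability 1 - p and causes extinction
# with probability p. The maximum acceptable risk p* solves (1 - p*) * u(k) = u(1).

import math

def flow_utility(c: float, gamma: float, u_bar: float) -> float:
    """CRRA flow utility plus a constant capturing the value of being alive."""
    if gamma == 1.0:
        return u_bar + math.log(c)
    return u_bar + (c ** (1.0 - gamma)) / (1.0 - gamma)

def max_acceptable_extinction_risk(k: float, gamma: float, u_bar: float) -> float:
    """Extinction probability p* at which the agent is indifferent to the gamble."""
    return 1.0 - flow_utility(1.0, gamma, u_bar) / flow_utility(k, gamma, u_bar)

# u_bar = 5 and k = 1000 are arbitrary illustrative choices.
for gamma in (1.0, 2.0):
    p_star = max_acceptable_extinction_risk(k=1000.0, gamma=gamma, u_bar=5.0)
    print(f"gamma = {gamma}: max acceptable extinction risk ~ {p_star:.2f}")

# With log utility (gamma = 1), u(k) grows without bound, so the acceptable risk is large
# (~0.58 here). With gamma = 2, utility is bounded above (u -> u_bar as c -> infinity),
# so the acceptable risk stays modest (~0.20 here), mirroring the quoted summary.
```

Under these assumed parameters, the log-utility agent accepts an extinction risk of roughly 58%, while the gamma = 2 agent accepts only about 20%, echoing the conservatism Jones describes for bounded utility.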

One counter-argument is that the potential for future advanced AIs to radically extend human lifespan is very speculative, and therefore it is foolish to significantly risk the extinction of humanity for a speculative chance at dramatically raising human lifespans. However, in my opinion, this argument fails because both the radically good and radically bad possibilities from AI are speculative, and both ideas are ultimately supported by the same underlying assumption: that future advanced AIs will be very powerful, smart, or productive.

To the extent you think that future AIs would not be capable of creating massive wealth for humans, or extending their lifespans, this largely implies that you think future AIs will not be very powerful, smart, or productive. Thus, by the same argument, we should also not think future AIs will be capable of making humanity go extinct. Since this argument symmetrically applies to both bad and good AI potential futures, it is not a strong reason to delay AI development.

A final point to make here is that pausing AI may still be beneficial to currently existing humans if the pause is brief and it causes AI to be much safer as a result (and not merely very slightly safer). This depends on an empirical claim that I personally doubt, although I recognize it as a reasonable counter-point within the context of this discussion. I am not claiming to have discussed every relevant crux in this debate in this short essay.

Conclusion

I have not surveyed anything like an exhaustive set of arguments or moral views regarding AI alignment work or AI pause advocacy. Having said that, I believe the following tentative conclusions likely hold:

  1. It seems difficult to justify AI alignment work via straightforward utilitarian arguments. Arguments that aligned AIs will be more likely to be conscious than unaligned AIs appear strained and confused. Arguments that aligned AIs will be more likely to pursue utilitarian objectives than unaligned AIs appear generally weak, although not particularly implausible either.
  2. Regarding whether we should delay AI development, a key consideration is whether you are a human species preservationist, or whether you care more about the lives and preferences of people who currently exist. In the second case, delaying AI development can be bad even if AI poses a large risk of human extinction. Both of these views come apart from longtermist total utilitarianism, as the first view is speciesist, and the second view is relatively unconcerned with what will happen to non-humans in the very long-term.

The table below summarizes my best guesses on how I suspect each of the three moral perspectives I presented should view the value of AI alignment work and attempts to delay AI development, based on the discussion I have given in this article.

| Moral view | Value of AI alignment | Value of delaying advanced AI |
|---|---|---|
| Total longtermist utilitarianism | Unclear value, plausibly approximately neutral. | Unclear value. If the delay is done for AI alignment reasons, the value is plausibly neutral, since AI alignment is plausibly neutral under this perspective. |
| Human species preservationism | Clearly valuable in almost any scenario. | Clearly valuable if AI poses any non-negligible risk to the preservation of the human species. |
| Near-termism or present-person affecting views | Clearly valuable in almost any scenario. | In my opinion, the value seems likely to be net-negative if AI poses less than a 1-in-3 chance of human extinction or similarly bad outcomes (from the perspective of existing humans). |

Perhaps my primary intention while writing this post was to argue against what I perceive to be the naive application of Nick Bostrom's argument in Astronomical Waste for the overwhelming value of reducing existential risk at the cost of delaying technological progress. My current understanding is that this argument—as applied to AI risk—rests on a conflation of existential risk with the risk of human replacement by another form of life. However, from an impartial utilitarian perspective, these concepts are sharply different.

In my opinion, if one is not committed to the preservation of the human species per se (independent of utilitarian considerations, and independent of the individual people who comprise the human species), then the normative case for delaying AI to solve AI alignment is fairly weak. On the other hand, the value of technical AI alignment by itself appears strong from the ordinary perspective that currently-existing people matter. For this reason, among others, I'm generally supportive of (useful) AI alignment work, but I'm not generally supportive of AI pause advocacy.

  1. ^

    Another example of an anti-total-utilitarian moral intuition that most humans have is the general human reluctance to implement coercive measures to increase human population growth. Total utilitarianism generally recommends increasing the population size as much as possible, so long as new lives make a net-positive contribution to total utility. However, humanity is currently facing a fertility crisis in which birth rates are falling around the world. As far as I'm aware, no country in the world has suggested trying radical policies to increase fertility that would plausibly be recommended under a strict total utilitarian framework, such as forcing people to have children, or legalizing child labor and allowing parents to sell their children's labor, which could greatly increase the economic incentive to have children.

  2. ^

    A central point throughout this essay is that it's important to carefully consider one's reasons for wanting to preserve the human species. If one's reasons are utilitarian, then the arguments in the first section of this essay apply. If one's reasons are selfish or present-generation-focused, then the arguments in the third section apply. If neither of these explain why you want to preserve the human species, then it is worth reflecting why you are motivated to preserve an abstract category like species rather than actually-existing individuals or things like happiness and preference satisfaction.

  3. ^

    An alternative way to frame near-termist ethical views is that near-termism is an approximation of longtermism under the assumption that our actions are highly likely to "wash out" over the long-term, and have little to no predictable impact in any particular direction. This perspective can be understood through the lens of two separate considerations: 

    1. Perhaps our best guess for how to best help the long-term is to do what's best in the short-term in the expectation that the values we helped promote in the short-term might propagate into the long-term future, even though this propagation of values is not guaranteed.

    2. If we cannot reliably impact the long-term future, perhaps it is best to focus on actions that affect the short-term future, since this is the only part of the future that we have predictable influence over.

Comments

I think this post misses the key considerations for perspective (1): longtermist-style scope sensitive utilitarianism. In this comment, I won't make a positive case for the value of preventing AI takeover from a perspective like (1), but I will argue why I think the discussion in this post mostly misses the point.

(I separately think that preventing unaligned AI control of resources makes sense from perspective (1), but you shouldn't treat this comment as my case for why this is true.)

You should treat this comment as (relatively : )) quick and somewhat messy notes rather than a clear argument. Sorry, I might respond to this post in a more clear way later. (I've edited this comment to add some considerations which I realized I neglected.)

I might be somewhat biased in this discussion as I work in this area and there might be some sunk costs fallacy at work.

First:

Argument two: aligned AIs are more likely to have a preference for creating new conscious entities, furthering utilitarian objectives

It seems odd to me that you don't focus almost entirely on this sort of argument when considering total utilitarian style arguments. Naively these views are fully dominated by the creation of new entities, who are far more numerous and likely could be much more morally valuable than economically productive entities. So, I'll just be talking about a perspective basically like this, where creating new beings with "good" lives dominates.

With that in mind, I think you fail to discuss a large number of extremely important considerations from my perspective:

  • Over time (some subset of) humans (and AIs) will reflect on their views and preferences and will consider utilizing resources in different ways.
  • Over time (some subset of) humans (and AIs) will get much, much smarter, or more minimally receive advice from entities which are much smarter.
  • It seems likely to me that the vast, vast majority of moral value (from this sort of utilitarian perspective) will be produced via people trying to improve moral value rather than incidentally via economic production. This applies for both aligned and unaligned AI. I expect that only a tiny fraction of available computation goes toward optimizing economic production, that only a smaller fraction of this is morally relevant, and that the weight on this moral relevance is much lower than for computation specifically optimized for moral relevance when operating from a similar perspective. This bullet is somewhere between a consideration and a claim, though it seems like possibly our biggest disagreement. I think it's possible that this disagreement is driven by some of the other considerations I list.
  • Exactly what types of beings are created might be much more important than quantity.
  • Ultimately, I don't care about a simplified version of total utilitarianism; I care about what preferences I would endorse on reflection. There is a moderate a priori argument for thinking that other humans who bother to reflect on their preferences might end up in a similar epistemic state. And I care less about the preferences which are relatively contingent among people who are thoughtful about reflection.
  • Large fractions of current wealth of the richest people are devoted toward what they claim is altruism. My guess is that this will increase over time.
  • Just doing a trend extrapolation on people who state an interest in reflection and scope sensitive altruism already indicates a non-trivial fraction of resources if we weight by current wealth/economic power. (I think, I'm not totally certain here.) This case is even stronger if we consider groups with substantial influence over AI.
  • Being able to substantially affect the preferences of (at least partially unaligned) AIs that will seize power/influence still seems extremely leveraged under perspective (1) even if we accept the arguments in your post. I think this is less leveraged than retaining human control (as we could always later create AIs with the preferences we desire and I think people with a similar perspective to me will have substantial power). However, it is plausible that under your empirical views the dominant question in being able to influence the preferences of these AIs is whether you have power, not whether you have technical approaches which suffice.
  • I think if I had your implied empirical views about how humanity and unaligned AIs use resources, I would be very excited about a proposal like "politically agitate for humanity to defer most resources to an AI successor which has moral views that people can agree are broadly reasonable and good behind the veil of ignorance". I think your views imply that massive amounts of value are left on the table in either case such that humanity (hopefully willingly) forfeiting control to a carefully constructed successor looks amazing.
  • Humans who care about using vast amounts of computation might be able to use their resources to buy this computation from people who don't care. Suppose 10% of people (really, resource-weighted people) care about reflecting on their moral views and doing scope sensitive altruism of a utilitarian bent, and 90% of people care about jockeying for status without reflecting on their views. It seems plausible to me that the 90% will jockey for status via things that consume relatively small amounts of computation, like buying fancier pieces of land on earth or the coolest-looking stars, while the 10% of people who care about using vast amounts of computation can buy this for relatively cheap. Thus, most of the computation will go to those who care. Probably most people who don't reflect and buy purely positional goods will care less about computation than things like random positional goods (e.g. land on earth which will be bid up to (literally) astronomical prices). I could see fashion going either way, but it seems like computation as a dominant status good seems unlikely unless people do heavy reflection. And if they heavily reflect, then I expect more altruism etc.
  • Your preference based arguments seem uncompelling to me because I expect that the dominant source of beings won't be due to economic production. But I also don't understand a version of preference utilitarianism which seems to match what you're describing, so this seems mostly unimportant.

Given some of our main disagreements, I'm curious what you think humans and unaligned AIs will be economically consuming.

Also, to be clear, none of the considerations I listed make a clear and strong case for unaligned AI being less morally valuable, but they do make the case that the relevant argument here is very different from the considerations you seem to be listing. In particular, I think value won't be coming from incidental consumption.

With that in mind, I think you fail to discuss a large number of extremely important considerations from my perspective:

If you could highlight only one consideration that you think I missed in my post, which one would you highlight? And (to help me understand it) can you pose the consideration in the form of an argument, in a way that directly addresses my thesis?

Hmm, this is more of a claim than a consideration, but I'd highlight:

  • It seems likely to me that the vast, vast majority of moral value (from this sort of utilitarian perspective) will be produced via people trying to improve moral value rather than incidentally via economic production. This applies for both aligned and unaligned AI. I expect that only a tiny fraction of available computation goes toward optimizing economic production, that only a smaller fraction of this is morally relevant, and that the weight on this moral relevance is much lower than for computation specifically optimized for moral relevance when operating from a similar perspective. This bullet is somewhere between a consideration and a claim, though it seems like possibly our biggest disagreement. I think it's possible that this disagreement is driven by some of the other considerations I list.

The main thing this claim disputes is:

Consequently, in a scenario where AIs are aligned with human preferences, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations.

(and some related points).

Sorry, I don't think this exactly addresses your comment. I'll maybe try to do a better job in a bit. I think a bunch of the considerations I mention are relatively diffuse, but important in aggregate.

Maybe the most important single consideration is something like:

Value can be extremely dense in computation relative to the density of value from AIs used for economic activity (rather than for value).

So, we should focus on the question of entities trying to create morally valuable lives (or experience or whatever relevant similar property we care about) and then answer this.

(You do seem to talk about "will AIs have more/less utilitarian impulses than humans", but you seem to talk about this almost entirely from the perspective of growing the economy rather than questions like how good the lives will be.)

Do you have an argument for why humans are more likely to try to create morally valuable lives compared to unaligned AIs?

I personally feel I addressed this particular question already in the post, although I framed it slightly differently than you have here. So I'm trying to get a better sense as to why you think my argument in the post about this is weak.

A short summary of my position is that unaligned AIs could be even more utilitarian than humans are, and this doesn't seem particularly unlikely either given that (1) humans are largely not utilitarians themselves, (2) consciousness doesn't seem special or rare, so it's likely that unaligned AIs could care about it too, and (3) unaligned AIs will be trained on human data, so they'll likely share our high-level concepts about morality even if not our exact preferences.

Let me know what considerations you think I'm still missing here.

[ETA: note that after writing this comment, I sharpened the post slightly to make it a little more clear that this was my position in the post, although I don't think I fundamentally added new content to the post.]

Do you have an argument for why humans are more likely to try to create morally valuable lives compared to unaligned AIs?

TBC, the main point I was trying to make was that you didn't seem to be presenting arguments about what seems to me like the key questions. Your summary of your position in this comment seems much closer to arguments about the key questions than I interpreted your post being. I interpreted your post as claiming that most value would result from incidental economic consumption under either humans or unaligned AIs, but I think you maybe don't stand behind this.

Separately, I think the "maybe AIs/humans will be selfish and/or not morally thoughtful" argument mostly just hits both unaligned AIs and humans equally hard such that it just gets normalized out. And then the question is more about how much you care about the altruistic and morally thoughtful subset.

(E.g., the argument you make in this comment seemed to me like about 1/6 of your argument in the post and it's still only part of the way toward answering the key questions from my perspective. I think I partially misunderstood the emphasis of your argument in the post.)

I do have arguments for why I think human control is more valuable than control by AIs that seized control from humans, but I'm not going to explain them in detail in this comment. My core summary would be something like "I expect substantial convergence among morally thoughtful humans who reflect toward my utilitarian-ish views, and I expect notably less convergence between me and AIs. I expect that AIs have somewhat messed up, complex, and specific values in ways which might make them not care about things we care about as a result of current training processes, while I don't have such an argument for humans."

As far as what I think the key questions are, they are something like:

  • What preferences do humans/AIs end up with after radically longer lives, massive self-enhancement, and potentially long periods of reflection?
    • How much do values/views diverge/converge between different altruistically minded humans who've thought about it extremely long durations?
    • Even if various entities are into creating "good experiences", how much do these views diverge in what counts as best? My guess would be that even if two entities are each maximizing good experiences from their own perspective, the relative goodness per unit of compute can be much lower as judged by the other entity (e.g. easily 100x lower, maybe more).
    • How similar are my views on what is good after reflection to other humans vs AIs?
    • How much should we care about worlds where morally thoughtful humans reach radically different conclusions on reflection?
  • Structurally, what sorts of preferences do AI training processes impart on AIs conditionally on these AIs successfully seizing power? I also think this is likely despite humanity likely resisting to at least some extent.

It seems like your argument is something like "who knows about AI preferences, also, they'll probably have similar concepts as we do" and "probably humanity will just have the same observed preferences as they currently do".

But I think we can get much more specific guesses about AI preferences such that this weak indifference principle seems unimportant, and I think human preferences will change radically, e.g. preferences will change far more in the next 10 million years than in the last 2,000 years.

Note that I'm not making an argument for greater value on human control in this comment, just trying to explain why I don't think your argument is very relevant. I might try to write up something about my overall views here, but it doesn't seem like my comparative advantage and it currently seems non-urgent from my perspective. (Though embarrassing for the field as a whole.)

I interpreted your post as claiming that most value would result from incidental economic consumption under either humans or unaligned AIs, but I think you maybe don't stand behind this.

It's possible we're using these words differently, but I guess I'm not sure why you're downplaying the value of economic consumption here. I focused on economic consumption for a simple reason: economic consumption is intrinsically about satisfying the preferences of agents, including the type of preferences you seem to think matter. For example, I'd classify most human preferences as consumption, including their preference to be happy, which they try to satisfy via various means.

If either a human or an AI optimizes for their own well-being by giving themselves an extremely high intensity positive experience in the future, I don't think that would be vastly morally outweighed by someone doing something similar but for altruistic reasons. Just because the happiness arises from a selfish motive seems like no reason, by itself, to disvalue it from a utilitarian perspective.

As a consequence, I simply do not agree with the intuition that economic consumption is a rounding error compared to the much smaller fraction of resources spent on altruistic purposes.

I think the "maybe AIs/humans will be selfish and/or not morally thoughtful" argument mostly just hits both unaligned AIs and humans equally hard such that it just gets normalized out. And then the question is more about how much you care about the altruistic and morally thoughtful subset.

I disagree because I don't see why altruism will be more intense than selfishness from a total utilitarian perspective, in the sense you are describing. If an AI makes themselves happy for selfish reasons, that should matter just as much as an AI creating another AI to make them happy.

Now again, you could just think that AIs aren't likely to be conscious, or aren't likely to be motivated to make themselves happy in any sort of selfish sense. And so an unaligned world could be devoid of extremely optimized utilitarian value. But this argument was also addressed at length in my post, and I don't know what your counterargument is to it.

 It's possible we're using these words differently, but I guess I'm not sure why you're downplaying the value of economic consumption here

Ah, sorry, I was referring to the process of the AI labor being used to accomplish the economic output not having much total moral value. I thought you were arguing that aligned AIs being used to produce goods would be where most value is coming from, because of the vast numbers of such AIs laboring relative to other entities. Sorry, by "from incidental economic consumption" I actually meant "incidentally (as a side effect from) economic consumption". This is in response to things like:

Consequently, in a scenario where AIs are aligned with human preferences, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations. To put it another way, the key factor influencing whether AIs are conscious in this scenario will be the relative efficiency of creating conscious AIs compared to unconscious ones for producing the goods and services demanded by future people. As these efficiency factors are likely to be similar in both aligned and unaligned scenarios, we are led to the conclusion that, from a total utilitarian standpoint, there is little moral difference between these two outcomes.

As far as the other thing you say, I still disagree, though for different (related) reasons:

As a consequence, I simply do not agree with the intuition that economic consumption is a rounding error compared to the much smaller fraction of resources spent on altruistic purposes.

I don't agree with "much smaller", and I think "rounding error" is reasonably likely as far as either the selfish preferences of currently existing humans or those of the AIs that seize control go. (These entities might (presumably altruistically) create entities which then selfishly satisfy their preferences, but that seems pretty different.)

My main counterargument is that selfish preferences will result in wildly fewer entities if such entities aren't into (presumably altruistically) making more entities, and thus will be extremely inefficient. Of course it's possible that you have AIs with non-indexical preferences but which are de facto selfish in other ways.

E.g., for humans you have 10^10 beings which are probably radically inefficient at producing moral value. For AIs it's less clear and depends heavily on how you operationalize selfishness.

I have a general view like "in the future, the main way you'll get specific things that you might care about is via people trying specifically to make those things because optimization is extremely powerful".

I'm probably not going to keep responding as I don't think I'm comparatively advantaged in fleshing this out. And doing this in a comment section seems suboptimal. If this is anyone's crux for working on AI safety though, consider contacting me and I'll consider setting you up with someone who I think understands my views and would be willing to go through the relevant arguments with you. Same offer applies to you, Matthew, particularly if this is a crux, but I think we should use a medium other than EA forum comments.

I thought you were arguing that aligned AIs being used to produce goods would be where most value is coming from, because of the vast numbers of such AIs laboring relative to other entities.

Admittedly I worded things poorly in that part, but the paragraph you quoted was intended to convey how consciousness is most likely to come about in AIs, rather than to say that the primary source of value in the world will come from AIs laboring for human consumption.

These are very subtly different points, and I'll have to work on making my exposition here more clear in the future (including potentially re-writing that part of the essay).

E.g., for humans you have 10^10 beings which are probably radically inefficient at producing moral value. For AIs it's less clear and depends heavily on how you operationalize selfishness.

Note that a small human population size is an independent argument here for thinking that AI alignment might not be optimal from a utilitarian perspective. I didn't touch on this point in this essay because I thought it was already getting too complex and unwieldy as it was, but the idea here is pretty simple, and it seems you've already partly spelled out the argument. If AI alignment causes high per capita incomes (because it enriches humans with a small population size), then plausibly this is worse than having a far larger population of unaligned AIs who have lower per capita consumption, from a utilitarian point of view.
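To make the shape of this tradeoff concrete, here is a minimal toy calculation. The functional form and the numbers are my own illustrative assumptions, not anything from the post: a fixed stock of resources, and per-being welfare that is logarithmic in per-capita consumption.

```python
import math

# Toy total-utilitarian comparison. Assumptions (for illustration only):
# a fixed resource stock R, and per-being welfare that is logarithmic in
# per-capita consumption.
R = 1e12  # fixed resources available for consumption (arbitrary units)

def total_welfare(population: float) -> float:
    per_capita = R / population
    return population * math.log(per_capita)  # sum of individual welfares

small_rich_population = total_welfare(1e10)  # ~10^10 very wealthy beings
large_poor_population = total_welfare(1e11)  # 10x as many beings, each 10x poorer

print(f"small, rich: {small_rich_population:.2e}")  # ~4.6e10
print(f"large, poor: {large_poor_population:.2e}")  # ~2.3e11
```

Under these assumptions the larger, poorer population scores higher; with a different welfare function, or if per-capita consumption fell below the level at which lives are worth living, the comparison could flip, which is why the claim is only that the larger population is plausibly better.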

If AI alignment causes high per capita incomes (because it enriches humans with a small population size), then plausibly this is worse than having a far larger population of unaligned AIs who have lower per capita consumption, from a utilitarian point of view.

Both seem negligible relative to the expected amount of compute spent on optimized goodness in my view.

Also, I'm not sold that there will be more AIs; it depends on pretty complex details about AI preferences. I think it's likely AIs won't have preferences for their own experiences given current training methods and will instead have preferences for causing certain outcomes.

Both seem negligible relative to the expected amount of compute spent on optimized goodness in my view.

Both will presumably be forms of consumption, which could take the form of compute spent on optimized goodness. You seem to think compute will only be used for optimized goodness for non-consumption purposes (which is why you care about the small fraction of resources spent on altruism), and I'm saying I don't see a strong case for that.

why you care about the small fraction of resources spent on altruism

I'm also not sold it's that small.

Regardless, doesn't seem like we're making progress here.

Regardless, doesn't seem like we're making progress here.

You have no obligation to reply, of course, but I think we'd achieve more progress if you clarified your argument in a concise format that explicitly outlines the assumptions and conclusion.

As far as I can gather, your argument seems to be a mix of assumptions about humans being more likely to optimize for goodness (why?), partly because they're more inclined to reflect (why?), which will lead them to allocate more resources towards altruism rather than selfish consumption (why is that significant?). Without understanding how your argument connects to mine, it's challenging to move forward on resolving our mutual disagreement.

Fwiw I had a similar reaction as Ryan.

My framing would be: it seems pretty wild to think that total utilitarian values would be better served by unaligned AIs (whose values we don't know) rather than humans (where we know some are total utilitarians). In your taxonomy this would be "humans are more likely to optimize for goodness".

Let's make a toy model compatible with your position:

A short summary of my position is that unaligned AIs could be even more utilitarian than humans are, and this doesn't seem particularly unlikely either given that (1) humans are largely not utilitarians themselves, (2) consciousness doesn't seem special or rare, so it's likely that unaligned AIs could care about it too, and (3) unaligned AIs will be trained on human data, so they'll likely share our high-level concepts about morality even if not our exact preferences.

Let's say that there are a million values that one could have with "humanity's high-level concepts about morality", one of which is "Rohin's values".

For (3), we'll say that both unaligned AI values and human values are a subset sampled uniformly at random from these million values (all values in the subset weighted equally, for simplicity).

For (1), we'll say that the sampled human values include "Rohin's values", but only as one element in the set of sampled human values.

I won't make any special distinction about consciousness so (2) won't matter.

In this toy model you'd expect aligned AI to put 1/1,000 weight on "Rohin's values", whereas unaligned AI puts 1/1,000,000 weight in expectation on "Rohin's values" (if the unaligned AI has S values, then there's an S/1,000,000 probability of it containing "Rohin's values", and it is weighted 1/S if present). So aligned AI looks a lot better.
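For concreteness, here's a minimal sketch of this arithmetic, assuming (as the 1/1,000 figure implies, though it isn't stated explicitly above) that each agent holds 1,000 of the million possible values, weighted equally:

```python
# Minimal sketch of the toy model's expected-weight arithmetic. Assumption
# (implied by the 1/1,000 figure): each agent samples 1,000 of the 1,000,000
# possible values and weights them equally.
NUM_POSSIBLE_VALUES = 1_000_000  # values consistent with human moral concepts
SUBSET_SIZE = 1_000              # values any one agent ends up holding

# Aligned AI: "Rohin's values" are stipulated to be among the human values.
aligned_expected_weight = 1 / SUBSET_SIZE

# Unaligned AI: values are re-sampled from scratch, so "Rohin's values" are
# included with probability SUBSET_SIZE / NUM_POSSIBLE_VALUES, and weighted
# 1 / SUBSET_SIZE when present; the subset size cancels out.
p_included = SUBSET_SIZE / NUM_POSSIBLE_VALUES
unaligned_expected_weight = p_included * (1 / SUBSET_SIZE)

print(aligned_expected_weight)    # 0.001
print(unaligned_expected_weight)  # 1e-06 -> ~1,000x less weight in expectation
```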

More generally, ceteris paribus, keeping values intact prevents drift and so looks strongly positive from the point of view of the original values, relative to resampling values "from scratch".

(Feel free to replace "Rohin's values" with "utilitarianism" if you want to make the utilitarianism version of this argument.)

Imo basically everything that Ryan says in this comment thread is a counter-counterargument to a counterargument to this basic argument. E.g. someone might say "oh it doesn't matter which values you're optimizing for, all of the value is in the subjective experience of the AIs that are laboring to build new chips, not in the consumption of the new chips" and the rebuttal to that is "Value can be extremely dense in computation relative to the density of value from AIs used for economic activity (instead of value)."

My framing would be: it seems pretty wild to think that total utilitarian values would be better served by unaligned AIs (whose values we don't know) rather than humans (where we know some are total utilitarians).

I'm curious: Does your reaction here similarly apply to ordinary generational replacement as well?

Let me try to explain what I'm asking.

We have a set of humans who exist right now. We know that some of them are utilitarians. At least one of them shares "Rohin's values". Similar to unaligned AIs, we don't know the values of the next generation of humans, although presumably they will continue to share our high-level moral concepts since they are human and will be raised in our culture. After the current generation of humans die, the next generation could have different moral values.

As far as I can tell, the situation with regards to the next generation of humans is analogous to unaligned AI in the basic sense I've just laid out (mirroring the part of your comment I quoted). So, in light of that, would you similarly say that it's "pretty wild to think that total utilitarian values would be better served by a future generation of humans"?

One possible answer here: "I'm not very worried about generational replacement causing moral values to get worse since the next generation will still be human." But if this is your answer, then you seem to be positing that our moral values are genetic and innate, rather than cultural, which is pretty bold, and presumably merits a defense. This position is IMO largely empirically ungrounded, although it depends on what you mean by "moral values".

Another possible answer is: "No, I'm not worried about generational replacement because we've seen a lot of human generations already and we have lots of empirical data on how values change over time with humans. AI could be completely different." This would be a reasonable response, but as a matter of empirical fact, utilitarianism did not really culturally exist 500 or 1000 years ago. This indicates that it's plausibly quite fragile, in a similar way to how it might be with AI. Of course, values drift more slowly with ordinary generational replacement compared to AI, but the phenomenon still seems roughly similar. So perhaps you should care about ordinary value drift almost as much as you'd care about unaligned AIs.

If you do worry about generational value drift in the strong sense I've just described, I'd argue this should cause you to largely adopt something close to position (3) that I outlined in the post, i.e. the view that what matters is preserving the lives and preferences of people who currently exist (rather than the species of biological humans in the abstract).

To the extent that future generations would have pretty different values than me, like "the only glory is in war and it is your duty to enslave your foes", along with the ability to enact their values on the reachable universe, in fact that would seem pretty bad to me.

However, I expect the correlation between my values and future generation values is higher than the correlation between my values and unaligned AI values, because I share a lot more background with future humans than with unaligned AI. (This doesn't require values to be innate, values can be adaptive for many human cultures but not for AI cultures.) So I would be less worried about generational value drift (but not completely unworried).

In addition, this worry is tempered even more by the possibility that values/culture will be set much more deliberately in the nearish future, rather than emerging organically via culture, simply because with an intelligence explosion that becomes more possible to do than it is today.

If you do worry about generational value drift in the strong sense I've just described, I'd argue this should cause you to largely adopt something close to position (3) that I outlined in the post, i.e. the view that what matters is preserving the lives and preferences of people who currently exist (rather than the species of biological humans in the abstract).

Huh? I feel very confused about this, even if we grant the premise. Isn't the primary implication of the premise to try to prevent generational value drift? Why am I only prioritizing people with similar values, instead of prioritizing all people who aren't going to enact large-scale change? Why would the priority be on current people, instead of people with similar values (there are lots of future people who have more similar values to me than many current people)?

I expect the correlation between my values and future generation values is higher than the correlation between my values and unaligned AI values, because I share a lot more background with future humans than with unaligned AI.

To clarify, I think it's a reasonable heuristic that, if you want to preserve the values of the present generation, you should try to minimize changes to the world and enforce some sort of stasis. This could include not building AI. However, I believe you may be glossing over the distinction between: (1) the values currently held by existing humans, and (2) a more cosmopolitan, utilitarian ethical value system.

We can imagine a wide variety of changes to the world that would result in vast changes to (1) without necessarily being bad according to (2). For example:

  • We could start doing genetic engineering of humans.
  • We could upload humans onto computers.
  • A human-level, but conscious, alien species could immigrate to Earth via a portal.

In each scenario, I agree with your intuition that "the correlation between my values and future humans is higher than the correlation between my values and X-values, because I share much more background with future humans than with X", where X represents the forces at play in each scenario. However, I don't think it's clear that the resulting change to the world would be net negative from the perspective of an impartial, non-speciesist utilitarian framework.

In other words, while you're introducing something less similar to us than future human generations in each scenario, it's far from obvious whether the outcome will be relatively worse according to utilitarianism.

Based on your toy model, my guess is that your underlying intuition is something like, "The fact that a tiny fraction of humans are utilitarian is contingent. If we re-rolled the dice, and sampled from the space of all possible human values again (i.e., the set of values consistent with high-level human moral concepts), it's very likely that <<1% of the world would be utilitarian, rather than the current (say) 1%."

If this captures your view, my main response is that it seems to assume a much narrower and more fragile conception of "cosmopolitan utilitarian values" than the version I envision, and it's not a moral perspective I currently find compelling.

Conversely, if you're imagining a highly contingent, fragile form of utilitarianism that regards the world as far worse under a wide range of changes, then I'd argue we also shouldn't expect future humans to robustly hold such values. This makes it harder to claim the problem of value drift is much worse for AI compared to other forms of drift, since both are simply ways the state of the world could change, which was the point of my previous comment.

I feel very confused about this, even if we grant the premise. Isn't the primary implication of the premise to try to prevent generational value drift? Why am I only prioritizing people with similar values, instead of prioritizing all people who aren't going to enact large-scale change?

I'm not sure I understand which part of the idea you're confused about. The idea was simply:

  • Let's say that your view is that generational value drift is very risky, because future generations could have much worse values than the ones you care about (relative to the current generation)
  • In that case, you should try to do what you can to stop generational value drift
  • One way of stopping generational value drift is to try to prevent the current generation of humans from dying, and/or having their preferences die out
  • This would look quite similar to the moral view in which you're trying to protect the current generation of humans, which was the third moral view I discussed in the post.

Why would the priority be on current people, instead of people with similar values (there are lots of future people who have more similar values to me than many current people)?

The reason the priority would be on current people rather than those with similar values is that, by assumption, future generations will have different values due to value drift. Therefore, the ~best strategy to preserve current values would be to preserve existing people. This seems relatively straightforward to me, although one could certainly question the premise of the argument itself.

Let me know if any part of the simplified argument I've given remains unclear or confusing.

Based on your toy model, my guess is that your underlying intuition is something like, "The fact that a tiny fraction of humans are utilitarian is contingent. If we re-rolled the dice, and sampled from the space of all possible human values again (i.e., the set of values consistent with high-level human moral concepts), it's very likely that <<1% of the world would be utilitarian, rather than the current (say) 1%."

No, this was purely to show why, from the perspective of someone with values, re-rolling those values would seem bad, as opposed to keeping the values the same, all else equal. In any specific scenario, (a) all else won't be equal, and (b) the actual amount of worry depends on the correlation between current values and re-rolled values.

The main reason I made utilitarianism a contingent aspect of human values in the toy model is because I thought that's what you were arguing (e.g. when you say things like "humans are largely not utilitarians themselves"). I don't have a strong view on this and I don't think it really matters for the positions I take.

For example:

  • We could start doing genetic engineering of humans.
  • We could upload humans onto computers.
  • A human-level, but conscious, alien species could immigrate to Earth via a portal.

The first two seem broadly fine, because I still expect high correlation between values. (Partly because I think that cosmopolitan utilitarian-ish values aren't fragile.)

The last one seems more worrying than human-level unaligned AI (more because we have less control over them) but less worrying than unaligned AI in general (since the aliens aren't superintelligent).

Note I've barely thought about these scenarios, so I could easily imagine changing my mind significantly on these takes. (Though I'd be surprised if it got to the point where I thought it was comparable to unaligned AI, in how much the values could stop correlating with mine.)

One way of stopping generational value drift is to try to prevent the current generation of humans from dying, and/or having their preferences die out

It seems way better to simply try to spread your values? It'd be pretty wild if the EA field-builders said "the best way to build EA, taking into account the long-term future, is to prevent the current generation of humans from dying, because their preferences are most similar to ours".

The main reason I made utilitarianism a contingent aspect of human values in the toy model is because I thought that's what you were arguing (e.g. when you say things like "humans are largely not utilitarians themselves").

I think there may have been a misunderstanding regarding the main point I was trying to convey. In my post, I fairly explicitly argued that the rough level of utilitarian values exhibited by humans is likely not very contingent, in the sense of being unusually high compared to other possibilities—and this was a crucial element of my thesis. This idea was particularly important for the section discussing whether unaligned AIs will be more or less utilitarian than humans. 

When you quoted me saying "humans are largely not utilitarians themselves," I intended this point to support the idea that our current rough level of utilitarianism is not contingent, rather than the opposite claim. In other words, I meant that the fact that humans are not highly utilitarian suggests that this level of utilitarianism is not unusual or contingent upon specific circumstances, and we might expect other intelligent beings, such as aliens or AIs, to exhibit similar, or even greater, levels of utilitarianism.

Compare to the hypothetical argument: humans aren't very obsessed with building pyramids --> our current level of obsession with pyramid building is probably not unusual, in the sense that you might easily expect aliens/AIs to be similarly obsessed with building pyramids, or perhaps even more obsessed.

(This argument is analogous because pyramids are simple structures that lots of different civilizations would likely stumble upon. Similarly, I think "try to create lots of good conscious experiences" is also a fairly simple directive, if indeed aliens/AIs/whatever are actually conscious themselves.)

I don't have a strong view on this and I don't think it really matters for the positions I take.

I think the question of whether utilitarianism is contingent or not matters significantly for our disagreement, particularly if you are challenging my post or the thesis I presented in the first section. If you are very uncertain about whether utilitarianism is contingent in the sense that is relevant to this discussion, then I believe that aligns with one of the main points I made in that section of my post. 

Specifically, I argued that the degree to which utilitarianism is contingent vs. common among a wide range of intelligent beings is highly uncertain and unclear, and this uncertainty is an important consideration when thinking about the values and behaviors of advanced AI systems from a utilitarian perspective. So, if you are expressing strong uncertainty on this matter, that seems to support one of my central claims in that part of the post.

(My view, as expressed in the post, is that unaligned AIs have highly unclear utilitarian value but there's a plausible scenario where they are roughly net-neutral, and indeed I think there's a plausible scenario where they are even more valuable than humans, from a utilitarian point of view.)

It seems way better to simply try to spread your values? It'd be pretty wild if the EA field-builders said "the best way to build EA, taking into account the long-term future, is to prevent the current generation of humans from dying, because their preferences are most similar to ours".

I think this part of your comment plausibly confuses two separate points:

  1. How to best further your own values
  2. How to best further the values of the current generation.

I was arguing that trying to preserve the present generation of humans looks good according to (2), not (1). That said, to the extent that your values simply mirror the values of your generation, I don't understand your argument for why trying to spread your values would be "way better" than trying to preserve the current generation. Perhaps you can elaborate?

I was arguing that trying to preserve the present generation of humans looks good according to (2), not (1).

I was always thinking about (1), since that seems like the relevant thing. When I agreed with you that generational value drift seems worrying, that's because it seems bad by (1). I did not mean to imply that I should act to maximize (2). I agree that if you want to act to maximize (2) then you should probably focus on preserving the current generation.

In my post, I fairly explicitly argued that the rough level of utilitarian values exhibited by humans is likely not very contingent, in the sense of being unusually high compared to other possibilities—and this was a crucial element of my thesis. This idea was particularly important for the section discussing whether unaligned AIs will be more or less utilitarian than humans.

Fwiw, I reread the post again and still failed to find this idea in it, and am still pretty confused at what argument you are trying to make.

At this point I think we're clearly failing to communicate with each other, so I'm probably going to bow out, sorry.

Fwiw, I reread the post again and still failed to find this idea in it

I'm baffled by your statement here. What did you think I was arguing when I discussed whether "aligned AIs are more likely to have a preference for creating new conscious entities, furthering utilitarian objectives"? The conclusion of that section was that aligned AIs are plausibly not more likely to have such a preference, and therefore, human utilitarian preferences here are not "unusually high compared to other possibilities" (the relevant alternative possibility here being unaligned AI).

This was a central part of my post that I discussed at length. The idea that unaligned AIs might be similarly utilitarian or even more so, compared to humans, was a crucial part of my argument. If indeed unaligned AIs are very likely to be less utilitarian than humans, then much of my argument in the first section collapses, which I explicitly acknowledged. 

I consider your statement here to be a valuable data point about how clear my writing was and how likely I am to get my ideas across to others who read the post. That said, I believe I discussed this point more-or-less thoroughly.

ETA: Claude 3's summary of this argument in my post:

The post argued that the level of utilitarian values exhibited by humans is likely not unusually high compared to other possibilities, such as those of unaligned AIs. This argument was made in the context of discussing whether aligned AIs are more likely to have a preference for creating new conscious entities, thereby furthering utilitarian objectives.

The author presented several points to support this argument:

  1. Only a small fraction of humans are total utilitarians, and most humans do not regularly express strong preferences for adding new conscious entities to the universe.
  2. Some human moral intuitions directly conflict with utilitarian recommendations, such as the preference for habitat preservation over intervention to improve wild animal welfare.
  3. Unaligned AI preferences are unlikely to be completely alien or random compared to human preferences if the AIs are trained on human data. By sharing moral concepts with humans, unaligned AIs could potentially be more utilitarian than humans, given that human moral preferences are a mix of utilitarian and anti-utilitarian intuitions.
  4. Even in an aligned AI scenario, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations.

The author concluded that these points undermine the idea that unaligned AI moral preferences will be clearly less utilitarian than the moral preferences of most humans, which are already not very utilitarian. This suggests that the level of utilitarian values exhibited by humans is likely not unusually high compared to other possibilities, such as those of unaligned AIs.

I agree it's clear that you claim that unaligned AIs are plausibly comparably utilitarian as humans, maybe more.

What I didn't find was discussion of how contingent utilitarianism is in humans.

Though actually rereading your comment (which I should have done in addition to reading the post) I realize I completely misunderstood what you meant by "contingent", which explains why I didn't find it in the post (I thought of it as meaning "historically contingent"). Sorry for the misunderstanding.

Let me backtrack like 5 comments and retry again.

If I had to pick a second consideration I'd go with:

After millions of years of life (or much more) and massive amounts of cognitive enhancement, the way post-humans might act isn't clearly well predicted by just looking at their current behavior.

Again, I'd like to stress that my claim is:

Also, to be clear, none of the considerations I listed make a clear and strong case for unaligned AI being less morally valuable, but they do make the case that the relevant argument here is very different from the considerations you seem to be listing. In particular, I think value won't be coming from incidental consumption.

One additional meta-level point which I think is important: I think that existing writeups of why human control would have more moral value than unaligned AI control from a longtermist perspective are relatively weak, and specific writeups are often highly flawed. (For some discussion of flaws, see this sequence.)

I just think that this write-up misses what seem to me to be key considerations; I'm not claiming that existing work settles the question or is even robust at all.

And it's somewhat surprising and embarrassing that this is the state of the current work, given that longtermism is reasonably common and arguments for working on AI x-risk from a longtermist perspective are also common.

It seems odd to me that you don't focus almost entirely on this sort of argument when considering total utilitarian style arguments.

I feel I did consider this argument in detail, including several considerations that touch on the arguments you gave. However, I primarily wanted to survey the main points that people have previously given me, rather than focusing heavily on a small set of arguments that someone like you might consider to be the strongest ones. And I agree that I may have missed some important considerations in this post.

Regarding your specific points, I generally find your arguments underspecified because, while reading them, it is difficult for me to identify a concrete mechanism for why alignment with human preferences creates astronomically more value from a total utilitarian perspective relative to the alternative. As it is, you seem to have a lot of confidence that human values, upon reflection, would converge onto values that would be far better in expectation than the alternative. However, I'm not a moral realist, and by comparison to you, I think I don't have much faith in the value of moral reflection, absent additional arguments.

My speculative guess is that part of this argument comes from simply defining "human preferences" as aligned with utilitarian objectives. For example, you seem to think that aligning AIs would help empower the fraction of humans who are utilitarians, or at least would become utilitarians on reflection. But as I argued in the post, the vast majority of humans are not total utilitarians, and indeed, anti-total utilitarian moral intuitions are quite common among humans, which would act against the creation of large amounts of utilitarian value in an aligned scenario.

These are my general thoughts on what you wrote, although I admit I have not responded in detail to any of your specific arguments, and I think you did reveal a genuine blindspot in the arguments I gave. I may write a comment at some future point that considers your comment more thoroughly.

As it is, you seem to have a lot of confidence that human values, upon reflection, would converge onto values that would be far better in expectation than the alternative. However, I'm not a moral realist, and by comparison to you, I think I don't have much faith in the value of moral reflection, absent additional arguments.

I'm assuming some level of moral quasi-realism: I care about what I would think is good after reflecting on the situation for a long time and becoming much smarter.

For more on this perspective, consider this post by Holden. I think there is a bunch of other discussion elsewhere from Paul Christiano and Joe Carlsmith, but I can't find the posts immediately.

I think the case for being a moral quasi-realist is very strong and depends on very few claims.

My speculative guess is that part of this argument comes from simply defining "human preferences" as aligned with utilitarian objectives.

Not exactly, I'm just defining "the good" as something like "what I would think was good after following a good reflection process which doesn't go off the rails in an intuitive sense". (Aka moral quasi-realism.)

I'm not certain that after reflection I would end up at something which is that well described as utilitarian. Something vaguely in the ballpark seems plausible though.

But as I argued in the post, the vast majority of humans are not total utilitarians, and indeed, anti-total utilitarian moral intuitions are quite common among humans, which would act against the creation of large amounts of utilitarian value in an aligned scenario

A reasonable fraction of my view is that many of the moral intuitions of humans might mostly be biases which end up not being that important if people decide to thoughtfully reflect. I predict that humans converge more after reflection and becoming much, much smarter. I don't know exactly what humans converge towards, but it seems likely that I converge toward a cluster which benefits from copious amounts of resources and which has reasonable support among the things which humans think on reflection.

I'm assuming some level of moral quasi-realism: I care about what I would think is good after reflecting on the situation for a long time and becoming much smarter.

Depending on the structure of this meta-ethical view, I feel like you should be relatively happy to let unaligned AIs do the reflection for you in many plausible circumstances. The intuition here is that if you are happy to defer your reflection to other humans, such as the future humans who will replace us, then you should potentially also be open to deferring your reflection to a large range of potential other beings, including AIs who might initially not share human preferences, but would converge to the same ethical views that we'd converge to.

In other words, in contrast to a hardcore moral anti-realist (such as myself) who doesn't value moral reflection much, you seem happier to defer this reflection process to beings who don't share your consumption or current ethical preferences. But you seem to think it's OK to defer to humans but not unaligned AIs, implicitly drawing a moral distinction on the basis of species. Whereas I'm concerned that if I die and get replaced by either humans or AIs, my goals will not be furthered, including in the very long-run.

What is it about the human species exactly that makes you happy to defer your values to other members of that species?

Not exactly, I'm just defining "the good" as something like "what I would think was good after following a good reflection process which doesn't go off the rails in an intuitive sense". (Aka moral quasi-realism.)

I think I have a difficult time fully understanding your view because I think it's a little underspecified. In my view, there seem to be a vast number of different ways that one can "reflect", and intuitively I don't think all (or even most) of these processes will converge to roughly the same place. Can you give me intuitions for why you hold this meta-ethical view? Perhaps you can also be more precise about what you see as the central claims of moral quasi-realism.

Depending on the structure of this meta-ethical view, I feel like you should be relatively happy to let unaligned AIs do the reflection for you in many plausible circumstances.

I'm certainly happy if we get to the same place. I think I feel less good about the view the more contingent it is.

In other words, in contrast to a hardcore moral anti-realist (such as myself) who doesn't value moral reflection much, you seem happier to defer this reflection process to beings who don't share your consumption or current ethical preferences. But you seem to think it's OK to defer to humans but not unaligned AIs, implicitly drawing a moral distinction on the basis of species.

I mean, I certainly think you lose some value from it being other humans. My guess is that, from my perspective, with humans you lose more like 5-20x of the value rather than 1000x, and that this 5-20x loss becomes more like 20-100x for unaligned AI.

I think I have a difficult time fully understanding your view because I think it's a little underspecified. In my view, there seem to be a vast number of different ways that one can "reflect", and intuitively I don't think all (or even most) of these processes will converge to roughly the same place. Can you give me intuitions for why you hold this meta-ethical view? Perhaps you can also be more precise about what you see as the central claims of moral quasi-realism.

I think my views about what I converge to are distinct from my views on quasi-realism. I think a weak notion of quasi-realism is extremely intuitive: you would do better things if you thought more about what would be good (at least relative to the current returns; eventually returns to thinking would saturate), because, e.g., there are interesting empirical facts (where did my current biases come from evolutionarily? what are brains doing?). I'm not claiming that quasi-realism implies my conclusions, just that it's an important part of where I'm coming from.

I separately think that reflection and getting smarter are likely to cause convergence, due to a variety of broad intuitions and some vague historical analysis. I'm not hugely confident in this, but I'm confident enough to think the expected value looks pretty juicy.

Thanks for writing this.

I disagree with quite a few points in the total utilitarianism section, but zooming out slightly, I think that total utilitarians should generally still support alignment work (and potentially an AI pause/slow down) to preserve option value. If it turns out that AIs are moral patients and that it would be good for them to spread into the universe optimising for values that don't look particularly human, we can still (in principle) do that. This is compatible with thinking that alignment from a total utilitarian perspective is ~neutral - but it's not clear that you agree with this from the post.

I think the problem with this framing is that it privileges a particular way of thinking about option value that prioritizes the values of the human species in a way I find arbitrary.

In my opinion, the choice before the current generation is not whether to delay replacement by a different form of life, but rather to choose our method of replacement: we can either die from old age over decades and be replaced by the next generation of humans, or we can develop advanced AI and risk being replaced by them, but also potentially live much longer and empower our current generation's values.

Deciding to delay AI is not a neutral choice. It only really looks like we're preserving option value in the first case if you think there's something great about the values of the human species. But then if you think that the human species is special, I think these arguments are adequately considered in the first and second sections of my post.

Hmm, maybe I'll try to clarify what I think you're arguing as I predict it will be confusing to caleb and bystanders. The way I would have put this is:

It only preserves option value from your perspective to the extent that you think humanity overall[1] will have a similar perspective as you and will make reasonable choices. Matthew seems to think that humanity will use ~all of the resources on (directly worthless?) economic consumption such that the main source of value (from a longtermist, scope sensitive, utilitarian-ish perspective) will be from the minds of the laborers that produce the goods for this consumption. Thus, there isn't any option value as almost all the action is coming from indirect value rather than from people trying to produce value.

I disagree strongly with Matthew on this view about where the value will come from in expectation insofar as that is an accurate interpretation. (I elaborate on why in this comment.) I'm not certain about this being a correct interpretation of Matthew's views, but it at least seems heavily implied by:

Consequently, in a scenario where AIs are aligned with human preferences, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations. To put it another way, the key factor influencing whether AIs are conscious in this scenario will be the relative efficiency of creating conscious AIs compared to unconscious ones for producing the goods and services demanded by future people. As these efficiency factors are likely to be similar in both aligned and unaligned scenarios, we are led to the conclusion that, from a total utilitarian standpoint, there is little moral difference between these two outcomes.

  1. ^

    Really, whoever controls resources under worlds where "humanity" keeps control.

It only preserves option value from your perspective to the extent that you think humanity will have a similar perspective as you and will make reasonable choices. Matthew seems to think that humanity will use ~all of the resources on economic consumption such that the main source of value (from a longtermist, scope sensitive, utilitarian-ish perspective) will be from the minds of the laborers that produce the goods for this consumption.

I agree with your first sentence as a summary of my view. 

The second sentence is also roughly accurate [ETA: see the comment below for why I no longer endorse this], but I do not consider it to be a complete summary of the argument I gave in the post. I gave additional reasons for thinking that the values of the human species are not special from a total utilitarian perspective. This included the point that humans are largely not utilitarians, and in fact frequently have intuitions that would act against the recommendations of utilitarianism if their preferences were empowered. I elaborated substantially on this point in the post.

On second thought, regarding the second sentence, I think I want to take back my endorsement. I don't necessarily think the main source of value will come from the minds of AIs who labor, although I find this idea plausible depending on the exact scenario. I don't really think I have a strong opinion about this question, and I didn't see my argument as resting on it. And so I'd really prefer it not be seen as part of my argument (and I did not generally try to argue this in the post).

Really, my main point was that I don't actually see much of a difference between AI consumption and human consumption, from a utilitarian perspective. Yet, when thinking about what has moral value in the world, I think focusing on consumption in both cases is generally correct. This includes considerations related to incidental utility that comes as a byproduct from consumption, but the "incidental" part here is not a core part of what I'm arguing.

I think the problem with this framing is that it privileges a particular way of thinking about option value that prioritizes the values of the human species in a way I find arbitrary.

I think it's in the same category as "don't do crime for utilitarian reasons"? Like, if you don't see that (trans-)humans are preferable, you are at odds with lots of people who do see it (and, like, with me personally). Not moustache-twirling levels of villainy, but you know... you need to be careful with this stuff. You probably don't want to be the part of EA that is literally plotting the downfall of human civilization.

I feel like this goes against the principle of not leaving your footprint on the future, no?

Like, a large part of what I believe to be the danger with AI is that we don't have any reflective framework for morality. I also don't believe the standard path for AGI is one of moral reflection. To me, this suggests that we would leave the value of the future up to market dynamics, and this doesn't seem good given all the traps there are in such a situation? (Moloch, for example.)

If we want a shot at a long reflection or similar, I don't think full sending AGI is the best thing to do.

I feel like this goes against the principle of not leaving your footprint on the future, no?

A major reason that I got into longtermism in the first place is that I'm quite interested in "leaving a footprint" on the future (albeit a good one). In other words, I'm not sure I understand the intuition for why we wouldn't deliberately try to leave our footprints on the future, if we want to have an impact. But perhaps I'm misunderstanding the nature of this metaphor. Can you elaborate?

I also don't believe the standard path for AGI is one of moral reflection.

I think it's worth being more specific about why you think AGI will not do moral reflection? In the post, I carefully consider arguments about whether future AIs will be alien-like and have morally arbitrary goals, in the respect that you seem to be imagining. I think it's possible that I addressed some of the intuitions behind your argument here.

I guess I felt that a lot of the post was arguing under a utilitarian frame, which is generally fair, I think. When it comes to "not leaving a footprint on the future", what I'm referring to is epistemic humility about the correct moral theories. I'm quite uncertain myself about what is correct when it comes to morality, with extra weight on utilitarianism. Given this, we should be worried about being wrong and therefore try our best not to lock in whatever we're currently thinking. (The classic example being that if we had done this 200 years ago, we might still have slavery in the future.)

I'm a believer that virtue ethics and deontology are imperfect-information approximations of utilitarianism. Like, Kant's categorical imperative is a way of looking at the long-term future and asking: how do we optimise society to be the best that it can be?

I guess a core crux here for me is that it seems like you're arguing a bit for naive utilitarianism here. I actually don't really believe the idea that we will have the AGI follow the VNM axioms, that is, be fully rational. I think it will be an internal dynamical system weighing the different things that it wants, and that it won't fully maximise utility because it won't be internally aligned. Therefore we need to get it right, or we're going to have weird and idiosyncratic values that are not optimal for the long-term future of the world.

I hope that makes sense; I liked your post in general.

The "footprints on the future" thing could be referencing this post.

(Edit: to be clear, this link is not an endorsement.)

I see. After briefly skimming that post, I think I pretty strongly disagree with just about every major point in it (along with many of its empirical background assumptions), although admittedly I did not spend much time reading through it. If someone thinks that post provides good reasons to doubt the arguments in my post, I'd likely be happy to discuss the specific ideas within it in more detail.

Yes, I was on my phone, and you can't link things there easily; that was what I was referring to. 

Thank you for writing this. I broadly agree with the perspective and find it frustrating how often it’s dismissed based on (what seem to me) somewhat-shaky assumptions.

A few thoughts, mainly on the section on total utilitarianism:

1. Regarding why people tend to assume unaligned AIs won’t innately have any value, or won’t be conscious: my impression is this is largely due to the “intelligence as optimisation process” model that Eliezer advanced. Specifically, that in this model, the key ability humans have that enables us to be so successful is our ability to optimise for goals; whereas mind features we like, such as consciousness, joy, curiosity, friendship, and so on are largely seen as being outside this optimisation ability, and are instead the terminal values we optimise for. (Also that none of the technology we have so far built has really affected this core optimisation ability, so once we do finally build an artificial optimiser it could very well quickly become much more powerful than us, since unlike us it might be able to improve its optimisation ability.)

I think people who buy this model will tend not to be moved much by observations like consciousness having evolved multiple times, as they’d think: sure, but why should I expect that consciousness is part of the optimisation process bit of our minds, specifically? Ditto for other mind features, and also for predictions that AIs will be far more varied than humans — there just isn’t much scope for variety or detail in the process of doing optimisation. You use the phrase “AI civilisation” a few times; my sense is that most people who expect disaster from unaligned AI would say their vision of this outcome is not well-described as a “civilisation” at all.

2. I agree with you that if the above model is wrong (which I expect it is), and AIs really will be conscious, varied, and form a civilisation rather than being a unified unconscious optimiser, then there is some reason to think their consumption will amount to something like “conscious preference satisfaction”, since a big split between how they function when producing vs consuming seems unlikely (even though it’s logically possible).

I’m a bit surprised though by your focus (as you’ve elaborated on in the comments) on consumption rather than production. For one thing, I’d expect production to amount to a far greater fraction of AIs’ experience-time than consumption, I guess on the basis that production enables more subsequent production (or consumption), whereas consumption doesn’t; it just burns resources.

Also, you mentioned concerns about factory farms and wild animal suffering. These seem to me describable as “experiences during production” — do you not have similar concerns regarding AIs’ productive activities? Admittedly pain might not be very useful for AIs, as plausibly if you’re smart enough to see the effects on your survival of different actions, then you don’t need such a crude motivator — even humans trying very hard to achieve goals seem to mostly avoid pain while doing so, rather than using it to motivate themselves. But emotions like fear and stress seem to me plausibly useful for smart minds, and I’d not be surprised if they were common in an AI civilisation in a world where the “intelligence as optimisation process” model is not true. Do you disagree, or do you just think they won’t spend much time producing relative to consuming, or something else?

(To be clear, I agree this second concern has very little relation to what’s usually termed “AI alignment”, but it’s the concern re: an AI future that I find most convincing, and I’m curious on your thoughts on it in the context of the total utilitarian perspective.)

Thank you for writing this. I broadly agree with the perspective and find it frustrating how often it’s dismissed based on (what seem to me) somewhat-shaky assumptions.

Thanks. I agree with what you have to say about effective altruists dismissing this perspective based on what seem to be shaky assumptions. To be a bit blunt, I generally find that, while effective altruists are often open to many types of criticism, the community is still fairly reluctant to engage deeply with some ideas that challenge their foundational assumptions. This is one of those ideas.

But I'm happy to see this post is receiving net-positive upvotes, despite the disagreement. :)

Great post, Matthew! Misaligned AI not being clearly bad is one of the reasons why I have been moving away from AI safety to animal welfare as the most promising cause area. In my mind, advanced AI would ideally be aligned with expected total hedonistic utilitarianism.

Executive summary: From a total utilitarian perspective, the value of AI alignment work is unclear and plausibly neutral, while from a human preservationist or near-termist view, alignment is clearly valuable but significantly delaying AI is more questionable.

Key points:

  1. Unaligned AIs may be just as likely to be conscious and create moral value as aligned AIs, so alignment work is not clearly valuable from a total utilitarian view.
  2. Human moral preferences are a mix of utilitarian and anti-utilitarian intuitions, so empowering them may not be better than an unaligned AI scenario by utilitarian lights.
  3. From a human preservationist view, alignment is clearly valuable since it would help ensure human survival, but this view rests on speciesist foundations.
  4. A near-termist view focused on benefits to people alive today would value alignment but not significantly delaying AI, since that could deprive people of potentially massive gains in wealth and longevity.
  5. Arguments for delaying AI to reduce existential risk often conflate the risk of human extinction with the risk of human replacement by AIs, which are distinct from a utilitarian perspective.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

This is really useful in that it examines critically what I think of as the 'orthodox view': alignment is good because it 'allows humans to preserve control over the future'. This view feels fundamental but underexamined in much of the EA/alignment world (with notable exceptions: Rich Sutton, Robin Hanson, and Joscha Bach seem species-agnostic; Paul Christiano has also fleshed out his position, e.g. this part of a Dwarkesh Patel podcast).

A couple of points I wasn't sure I understood/agreed with FWIW:

a) A relatively minor one is

To the extent you think that future AIs would not be capable of creating massive wealth for humans, or extending their lifespans, this largely implies that you think future AIs will not be very powerful, smart, or productive. Thus, by the same argument, we should also not think future AIs will be capable of making humanity go extinct.

I'm not sure about this symmetry - I can imagine an LLM (~GPT-5 class) integrated into a nuclear/military decision-making system that could cause catastrophic death/suffering (millions/billions of immediate/secondary deaths, massive technological setback, albeit not literal extinction). I'm assuming the point doesn't hinge on literal extinction.

b) Regarding calebp's comment on option value: I agree most option value discussion (there doesn't seem to be much outside Bostrom and the s-risk discourse) assumes continuation of the human species, but I wonder if there is room for a more cosmopolitan framing: 'Humans are our only example of an advanced technological civilisation, one that might be on the verge of a step change in its evolution. The impact of this evolutionary step-change on the future can arguably be (on balance) good (definition of "good" tbd). The "option value" we are trying to preserve is less the existence of humans per se than the possibility of such an evolution happening at all. Put another way, we don't want to prematurely introduce an unaligned or misaligned AI (perhaps a weak one) that causes extinction, a bad lock-in, or prevents the emergence of more capable AIs that could have achieved this evolutionary transition.'

In other words, the option value is not over the number of human lives (or economic value) but rather over the possible trajectories of the future...this does not seem particularly species-specific.  It just says that we should be careful not to throw these futures away.

c) point (b) hinges on why human evolution is 'good' in any broad or inclusive sense (outside of letting current and near-current generations live wealthier, longer lives, if indeed those are good things). 

In order to answer this, it feels like we need some way of defining value 'from the point of view of the universe'. That particular phrase is a Sidgwick/Singer thing, and I'm not sure it is directly applicable in this context (like similar phrases, e.g. Nagel's 'view from nowhere'), but without this it is very hard to talk about non-species-based notions of value (i.e. standard utilitarianism and deontological/virtue approaches all basically rely on human or animal beings).

My candidate for this 'cosmic value' is something like created complexity (which can be physical or not, and can include things that are not obviously economically/militarily/reproductively valuable like art).  This includes having trillions of diverse computing entities (human or otherwise).

This is obviously pretty hand-wavey, but I'd be interested in talking to anyone with views (it's basically my PhD :-) 
