All of the advice on getting into AI safety research that I've seen recommends studying computer science and mathematics: for example, the 80,000 hours AI safety syllabus provides a computer science-focused reading list, and mentions that "Ideally your undergraduate degree would be mathematics and computer science".
There are obvious good reasons for recommending these two fields, and I agree that anyone wishing to make an impact in AI safety should have at least a basic proficiency in them. However, I find it a little concerning that cognitive science/psychology are rarely even mentioned in these guides. I believe that it would be valuable to have more people working in AI safety whose primary background is from one of cogsci/psych, or who have at least done a minor in them.
Here are examples of four lines of research into AI safety which I think could benefit from such a background:
- The psychology of developing an AI safety culture. Besides the technical problem of "how can we create safe AI", there is the social problem of "how can we ensure that the AI research community develops a culture where safety concerns are taken seriously". At least two existing papers draw on psychology to consider this problem: Eliezer Yudkowsky's "Cognitive Biases Potentially Affecting Judgment of Global Risks" uses cognitive psychology to discuss why people might misjudge the probability of risks in general, and Seth Baum's "On the promotion of safe and socially beneficial artificial intelligence" uses social psychology to discuss the specific challenge of motivating AI researchers to choose beneficial AI designs.
- Developing better analyses of "AI takeoff" scenarios. Currently humans are the only general intelligence we know of, so any analyzes of what "expertise" consists of and how it can be acquired would benefit from the study of humans. Eliezer Yudkowsky's "Intelligence Explosion Microeconomics" draws on a number of fields to analyze the possibility of a hard takeoff, including some knowledge of human intelligence differences as well as the history of human evolution, whereas my "How Feasible is the Rapid Development of Artificial Superintelligence?" draws extensively on the work of a number of psychologists to make the case that based on what we know of human expertise, scenarios with AI systems becoming major actors within timescales on the order of mere days or weeks seem to remain within the range of plausibility.
- Defining just what it is that human values are. The project of AI safety can roughly be defined as "the challenge of ensuring that AIs remain aligned with human values", but it's also widely acknowledged that nobody really knows what exactly human values are - or at least, not to a sufficient extent that they could be given a formal definition and programmed into an AI. This seems like one of the core problems of AI safety, and one which can only be understood with a psychology-focused research program. Luke Muehlhauser's article "A Crash Course in the Neuroscience of Human Motivation" took one look at human values from the perspective of neuroscience, and my "Defining Human Values for Value Learners" sought to provide a preliminary definition of human values in a computational language, drawing from the intersection of artificial intelligence, moral psychology, and emotion research. Both of these are very preliminary papers, and it would take a full research program to pursue this question in more detail.
- Better understanding multi-level world-models. MIRI defines the technical problem of "multi-level world-models" as "How can multi-level world-models be constructed from sense data in a manner amenable to ontology identification?". In other words, suppose that we had built an AI to make diamonds (or anything else we care about) for us. How should that AI be programmed so that it could still accurately estimate the number of diamonds in the world after it had learned more about physics, and after it had learned that the things it calls "diamonds" are actually composed of protons, neutrons, and electrons? While I haven't seen any papers that would explicitly tackle this question yet, a reasonable starting point would seem to be the question of "well, how do humans do it?". There, psych/cogsci may offer some clues. For instance, in the book Cognitive Pluralism, the philosopher Steven Horst offers an argument for believing that humans have multiple different, mutually incompatible mental models / reasoning systems - ranging from core knowledge systems to scientific theories - that they flexibly switch between depending on the situation. (Unfortunately, Horst approaches this as a philosopher, so he's mostly content at making the argument for this being the case in general, leaving it up to actual cognitive scientists to work out how exactly this works.) I previously also offered a general argument along these lines in my article World-models as tools, suggesting that at least part of the choice of a mental model may be driven by reinforcement learning in the basal ganglia. But this isn't saying much, given that all human thought and behavior seems to be in at least part driven by reinforcement learning in the basal ganglia. Again, this would take a dedicated research program.
From these four special cases, you could derive more general use cases for psychology and cognitive science within AI safety:
- Psychology as the study and understanding of human thought and behavior, helps guide actions that are aimed at understanding and influencing people's behavior in a more safety-aligned direction (related example: the psychology of developing an AI safety culture)
- The study of the only general intelligence we know about, may provide information about the properties of other general intelligences (related example: developing better analyzes of "AI takeoff" scenarios)
- A better understanding of how human minds work, may help figure out how we want the cognitive processes of AIs to work so that they end up aligned with our values (related examples: defining human values, better understanding multi-level world-models)
Here I would ideally offer reading recommendations, but the fields are so broad that any given book can only give a rough idea of the basics; and for instance, the topic of world-models that human brains use is just one of many, many subquestions that the fields cover. Thus my suggestion to have some safety-interested people who'd actually study these fields as a major or at least a minor.
Still, if I'd have to suggest a couple of books, with the main idea of getting a basic grounding in the mindsets and theories of the fields so that it would be easier to read more specialized research... on the cognitive psychology/cognitive science side I'd suggest Cognitive Science by Jose Luis Bermudez (haven't read it, but Luke Muehlhauser recommends it and it looked good to me based on the table of contents; see also Luke's follow-up recommendations behind that link); Cognitive Psychology: A Student's Handbook by Michael W. Eysenck & Mark T. Keane; and maybe Sensation and Perception by E. Bruce Goldstein. I'm afraid that I don't know of any good introductory textbooks on the social psychology side.
It took me a while to respond to this because I wanted to take the time to read "The Normative Insignificance of Neuroscience" first. Having now read it, I'd say that I agree with its claims with regard to criticism of Greene's approach. I don't think it disproves the notion of psychology being useful for defining human values, though, for I think there's an argument for psychology's usefulness that's entirely distinct from the specific approach that Greene is taking.
I start from the premise that the goal of moral philosophy is to develop a set of explicit principles that would tell us what is good. Now this is particularly relevant for designing AI, because we also want our AIs to follow those principles. But it's noteworthy that at their current state, none of the existing ethical theories are up to the task of giving us such a set of principles that, when programmed into an AI, would actually give results that could be considered "good". E.g. Muehlhauser & Helm 2012:
Yet the problem remains: the AI has to be programmed with some definition of what is good.
Now this alone isn't yet sufficient to show that philosophy wouldn't be up to the task. But philosophy has been trying to solve ethics for at least the last 2500 years, and it doesn't look like there would have been any major progress towards solving it. The PhilPapers survey didn't show any of the three major ethical schools (consequentialism, deontology, virtue ethics) being significantly more favored by professional philosophers than the others, nor does anyone - to my knowledge - even know what a decisive theoretical argument in favor of one of them could be.
And at this point, we have pretty good theoretical reasons for believing that the traditional goal of moral philosophy - "developing a set of explicit principles for telling us what is good" - is in fact impossible. Or at least, it's impossible to develop a set of principles that would be simple and clear enough to write down in human-understandable form and which would give us clear answers to every situation, because morality is too complicated for that.
We've already seen this in trying to define concepts: as philosophy noted a long time ago, you can't come up with a set of explicit rules that would define even any concept even as simple as "man" in such a way that nobody could develop a counterexample. "The Normative Insignificance of Neuroscience" also notes that the situation in ethics looks similar to the situation with trying to define many other concepts:
Yet human brains do manage to successfully reason with concepts, despite it being impossible to develop a set of explicit necessary and sufficient criteria. The evidence from both psychology and artificial intelligence (where we've managed to train neural nets capable of reasonably good object recognition) is that a big part of how they do it is by building up complicated statistical models of what counts as a "man" or "philosopher" or whatever.
So given that
and
it would seem reasonable to assume that
But here it looks likely that we need information from psychology to narrow down what those models should be. What humans consider to be good has likely been influenced by a number of evolutionary idiosyncrasies, so if we want to come up with a model of morality that most humans would agree with, then our AI's reasoning process should take into account those considerations. And we've already established that defining those considerations on a verbal level looks insufficient - they have to be established on a deeper level, of "what are the actual computational processes that are involved when the brain computes morality".
Yes, I am here assuming "what is good" to equate to "what do human brains consider good", in a way that may be seen as reducing to "what would human brains accept as a persuasive argument for what is good". You could argue that this is flawed, because it's getting dangerously close to defining "good" by social consensus. But then again, the way the field of ethics itself proceeds is basically the same: a philosopher presents an argument for what is good, another attacks it, if the argument survives attacks and is compelling then it is eventually accepted. For empirical facts we can come up with objective tests, but for moral truths it looks to me unavoidable - due to the is-ought gap - that some degree of "truth by social consensus" is the only way of figuring out what the truth is, even in principle.
Great!
... (read more)