This post represents my own preliminary thoughts. It does not represent Yale, CHAI, or anyone else. At times, I've attempted to summarize the opinions of others, but a summary necessarily leaves out context, and there might be mistakes in my interpretation. Think of this as simply my thoughts and current best guess. Thank you to Evan Hubinger, Scott Emmons and David Africa for conversations that contributed to this (this does not represent any of them, either).

I've spent the last week or so thinking about language models and their relevance to AI safety. Language models like GPT-3 have been getting much more attention recently within the AI community in general, mainly because they seem to have plausible real-world applications. For instance, Ought has recently shifted their focus to developing GPT-3 as a research tool. There is also some GPT-3 research that is more alignment-specific, which I'll get to later.

Summary

Recently, what seems like a fairly influential post came out from Ajeya Cotra: "The case for aligning narrowly superhuman models." Her work at Open Phil, from what I know of it, is fairly broad, and like her AI Timelines report it takes a wide-angle view of the alignment landscape. Specifically, the post calls for more work on aligning "narrowly superhuman" models, which it defines as models that are already better than humans at some particular domain. It argues that this could serve as "practice" for aligning future, more powerful models, and that it could also build general capacity that contributes to field-building.

The post argues that GPT-3 is an example of the kind of narrowly superhuman model that could be aligned. It is narrowly superhuman in the sense that it seems capable of providing information (e.g. medical information) more accurately than the average human. It usually doesn't, though, because it prefers to imitate the average internet user whose text made up its training data. "Aligning" GPT-3 here means something like "get it to do something useful, and resolve any problems along the way"; perhaps we would even run into a small-scale version of a treacherous turn, the kind of problem that could cause much larger issues later.

Reactions to the post generally branch over what kind of work within this frame is most important. Ajeya Cotra suggests research like Paul Christiano and colleagues' Learning to Summarize from Human Feedback, which used human pairwise comparisons to train models that summarize reddit posts better than raw language models do. Essentially, this boils down to "get the model to do something useful that should be hard for it to do".
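To make the mechanics of that line of work concrete, here is a minimal sketch of its core training signal: a reward model is trained on human pairwise comparisons so that it assigns higher reward to whichever summary the labeler preferred. The model, embeddings, and data below are hypothetical toy stand-ins, not the paper's actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a reward model: in the real work this is a language model
# with a scalar head; here it is just a linear map over fixed-size embeddings.
class ToyRewardModel(nn.Module):
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, summary_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(summary_embedding).squeeze(-1)  # one scalar reward per summary

reward_model = ToyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical batch of human comparisons: the first tensor holds embeddings of
# the summaries the labelers preferred, the second holds the rejected ones.
preferred = torch.randn(32, 64)
rejected = torch.randn(32, 64)

# Pairwise preference loss: push reward(preferred) above reward(rejected).
loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
```

The trained reward model then serves as the objective for fine-tuning the language model itself, which is where the "useful but hard for it to do" behavior comes from.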

Meanwhile, folks at MIRI like Eliezer Yudkowsky and Evan Hubinger don't seem to see as much value in that kind of research. They are interested in many of the same broad ideas, but are more concerned with figuring out what is actually going on under the hood of GPT-3 and similar models. They reason that forward progress with black-box ML systems is bound to end in an alignment disaster, so it is important for alignment research to focus on understanding the narrowly superhuman models that exist right now. They seem to favor pushing this type of research toward things like Chris Olah's transparency work.

Thoughts

I see value in the original approach outlined in the post. If alignment researchers do not actively work on making models do useful things before those models are very powerful, they may miss practical details about how their alignment techniques hold up in application. I am not sure how important this is right now, but it seems (mostly from an outside view) that we would definitely want to align systems where failure is not a big deal before moving on to systems where it is. And when I say align, I mean practically align: make the model do useful things for us, not just avoid known safety problems.

I do think feedback work is important, because ultimately training signals should come from humans (rather than from direct specification). But with human feedback I worry about scaling, in several senses of the word. First, research on scaling laws seems to suggest that the amount of training data needed grows close to linearly with the number of parameters. If human feedback follows similar scaling laws, it will prove to be very expensive and very limited. I worry that for advanced AI corporations it will simply cost too much, particularly if there are "shortcuts" that seem to produce good results but may not be as robust. Of course, this concern would be resolved if human feedback can reliably be scaled sublinearly.
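As a rough back-of-the-envelope illustration of why the scaling exponent matters, here is a toy calculation. Every number in it (the baseline label count, the scale-up factor, and the exponents themselves) is made up purely for illustration; the only point is how quickly the label requirement grows under different assumed exponents when parameters increase 1000x.

```python
# Toy illustration: if the number of human labels needed scales like
# labels ∝ (parameters)^alpha, how much does a 1000x parameter scale-up cost?
# All numbers below are hypothetical, chosen only to show the shape of the problem.

baseline_labels = 100_000  # hypothetical number of human comparisons a small model needed
scale_up = 1000            # hypothetical scale-up, e.g. from a 1B- to a 1T-parameter model

for alpha in (0.5, 0.75, 1.0):  # assumed scaling exponents
    labels_needed = baseline_labels * scale_up ** alpha
    print(f"alpha={alpha:.2f}: ~{labels_needed:,.0f} labels "
          f"({labels_needed / baseline_labels:.0f}x the baseline)")
```

With an exponent near 1, the hypothetical label count jumps from a hundred thousand to a hundred million, which is the "simply costs too much" scenario; with a clearly sublinear exponent the growth is far more manageable.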

The other "scaling" problem is that it seems quite difficult to get humans to give feedback to models that are vastly more capable than them, especially given the possibility of deception. There will likely need to be some kind of evaluation scheme that relies on other AI (perhaps amplification, perhaps something else). Work on human feedback could be helpful for AI-provided feedback, but I could easily see AI-to-AI feedback being very different. For instance, even not very capable AI could provide orders of magnitude more feedback than humans, which might mean that different training processes would dominate.

Transparency seems more plausibly scalable than human feedback, if only because AIs could use transparency tools on each other in the same way that humans would. If we could figure out, as humans, how to robustly understand what is going on inside GPT-3, it would be much easier to train an overseer AI to get GPT-3 to do what we want, or at least to tell whether it is trying to do what we want. Crucially, this would also allow weaker humans and AI models to more easily detect deception in stronger models, which is just as important as getting them to do useful things.
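For a sense of the kind of low-level access this rests on, here is a minimal sketch of inspecting a model's intermediate activations with a forward hook, so that an overseer (human or AI) sees more than the final output. The tiny network is a toy stand-in rather than GPT-3, and real transparency research involves far more than dumping activations; this is only meant to make "look inside the model" concrete.

```python
import torch
import torch.nn as nn

# Toy stand-in model; a real target would be a large transformer like GPT-3.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()  # record what this layer computed
    return hook

# Attach a hook to the hidden layer so its activations are visible to an overseer.
model[1].register_forward_hook(save_activation("hidden_relu"))

with torch.no_grad():
    model(torch.randn(8, 16))

# The overseer can now examine internal computation, not just the output.
print(captured["hidden_relu"].shape)  # torch.Size([8, 32])
```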

Overall, I think the idea of aligning narrow models seems promising, though there is a lot of room for interpretation about what exactly it means in practice. Making models actually useful is, in my view, extremely important. But I also see robust interpretability tools as probably more beneficial in the long run than robust training procedures, because I think they can scale better.

Culture and Reputational Considerations

The post also made several points about field-building, which I will briefly cover here. It essentially argues that this kind of alignment work might earn the field more soft power in the broader AI community, because people plausibly care more about work that demonstrates real capabilities than about theoretical work that might take longer to show results. This could strengthen AI alignment considerations overall and bring more people into the movement.

I am lukewarm on this. On the one hand, I do think that this sort of work could make AI alignment researchers more credible in the community at large. But I have seen examples of leading AI researchers who are considered very credible, and taken very seriously, right up until they say "alignment" or "risk". Stuart Russell literally wrote the book on AI, and he has said publicly that AI risk is not being taken seriously enough; on the 80,000 Hours podcast he suggested this is likely related to the fact that researchers do not want to think that their work could do harm. I think this kind of alignment research, if it generated truly useful results, could potentially mint more Stuart Russells, and the more there are, the better. But I don't think that credibility at the individual level necessarily translates into credibility for research directions, and especially not for their motivations. This is also why I worry about presenting this work as being about something other than AI alignment in order to recruit people to it; once it is framed that way, the reputational benefit of creating something useful essentially goes away.

I also worry about organizational value drift here. If organizations bring on lots of new researchers eager to make narrow AI do useful things, without the lens of alignment somewhere close to the forefront of their minds, those organizations are likely to drift toward goals that aren't as... aligned. The field needs to be expanded, and this is one way that could help, but the work needs to be carefully portrayed as first and foremost alignment work.

Overall, I am optimistic about this kind of work in the future. Human feedback seems tricky to scale, but could still be extremely fruitful to work on. Transparency tools seem perhaps trickier to implement, but would scale more easily. I also think getting models to do useful things is important for practical alignment field-building, in the sense that alignment research is not complete without discussion and practice of deployment.