I've been doing some reading on AI Alignment topics recently. I thought I would note down the latest things I've read, with a brief summary of each and my opinion (following the model of Rohin Shah). This is very preliminary and certainly isn't any kind of unified curriculum.

Learning To Summarize From Human Feedback (Stiennon et al.)

Summary

The goal of the paper is to use human feedback to teach language models to produce better output than they do by default, with summarization of Reddit posts as the main task. The starting point is a pretrained language model similar to GPT-3. First, supervised learning is used to create a model specifically trained for summarization. The supervised model is then used to initialize a reward model, which takes two summaries of the same post and predicts which one a human will prefer; the reward model is trained on real human comparisons. Finally, reinforcement learning with the PPO algorithm is used to train a generation policy, with reward calculated by the reward model. Performance was good (though regularization keeping the policy close to the supervised model was necessary), and the model also transferred well to summarizing news articles. However, the process was highly time- and compute-intensive.
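To make the reward-model step concrete, here's a minimal sketch of the standard pairwise comparison loss: the model assigns a scalar reward to each summary and is trained so that the human-preferred summary scores higher. This is my own simplified code, not the paper's; the class and function names are invented and the backbone call is schematic.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a (post, summary) token sequence with a single scalar reward."""
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # pretrained transformer, initialized from the supervised model
        self.value_head = nn.Linear(hidden_size, 1)   # maps final hidden state to a scalar reward

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)             # assumed output shape: (batch, seq_len, hidden_size)
        return self.value_head(hidden[:, -1]).squeeze(-1)  # read the reward off the final token

def comparison_loss(reward_model, preferred_ids, rejected_ids):
    """Push the human-preferred summary's reward above the rejected one's."""
    r_pref = reward_model(preferred_ids)
    r_rej = reward_model(rejected_ids)
    # -log sigmoid(r_pref - r_rej): the preferred summary should win the comparison
    return -F.logsigmoid(r_pref - r_rej).mean()
```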

Opinion

Thus far, I'm a big fan of pairwise comparisons, since I think they can scale more easily than other kinds of feedback. This paper is interesting in that it is able to actively guide summaries in a productive direction using this sort of feedback. There were some problems with the approach that, while reasonable to set aside for now, I think will become more and more important. First, the people being paid to make comparisons disagreed with each other fairly often about which summaries were good. Ideally, we would want to be able to train on collective preferences, or perhaps individual preferences, rather than forcing everyone to conform to a single standard. Secondly, this was an astonishingly long training process to run. RL is slow, and 320 GPU-days is a very long time. I also don't expect much of this training to be "one and done", because different people might have different summary preferences (such as wanting summaries of different lengths). As such, I think the training time here is a big problem. Perhaps it would be possible to generate several summaries and choose the best one using the reward model (sketched below); this could reduce training time. The paper is a good proof of concept that it's possible to use pairwise comparisons, but it needs to be made more practical.
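The reranking idea I have in mind is just best-of-N sampling with the reward model. A minimal sketch, where generate_summary and score_summary are hypothetical helpers standing in for a sampling call to the policy and a forward pass of the reward model:

```python
def best_of_n(post, generate_summary, score_summary, n=16):
    """Sample n candidate summaries and return the one the reward model likes best."""
    candidates = [generate_summary(post) for _ in range(n)]    # sample with temperature > 0 for diversity
    scores = [score_summary(post, c) for c in candidates]      # reward model scores each candidate
    best_score, best_summary = max(zip(scores, candidates), key=lambda pair: pair[0])
    return best_summary
```

This trades extra inference-time sampling for avoiding (or shortening) the expensive RL loop, which seems like a reasonable deal if the reward model is already trained.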

Aligning AI With Shared Human Values (Hendrycks et al.)

Summary

This paper creates a dataset that attempts to collect human judgements about different kinds of ethical scenarios. It identifies five different ethical theories (justice, virtue ethics, deontology, utilitarianism, and commonsense morality). Using a combination of MTurk and Reddit, the authors generate unambiguous ethical scenarios and then ask humans to make judgements about them. Depending on the theory, several different approaches are used to elicit the relevant judgements. Language models are then trained to predict these human judgements; they do badly, though better than chance.
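To make the prediction step concrete, here's a rough sketch of the kind of setup involved: a pretrained transformer fine-tuned as a binary classifier over scenarios. The model choice, example scenario, and label convention below are my own assumptions, not necessarily the paper's exact configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Binary classifier over ethical scenarios; would be fine-tuned on the dataset's labels.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

scenario = "I borrowed my roommate's car and returned it with an empty tank."  # invented example
inputs = tokenizer(scenario, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
judgement = logits.argmax(dim=-1).item()  # e.g. 1 = unacceptable, 0 = acceptable (after fine-tuning)
```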

Opinion

I think this was a very crisp example of attempting to get a model to predict judgements under actual ethical theories, rather than entire models of intent and agency. The methods of eliciting the scenarios and judgements were very clever given the differences between the theories, and I thought the work was very well done. It also showed that while language models can do OK on these tasks, there is still much room for improvement.