The Fall 2022 AIGSA reading group is about AGI (Artificial General Intelligence) Safety Fundamentals. Join us for the second meeting, even if you missed the first! Lunch will be provided.
Please RSVP here.
This week’s topic is Goals and misalignment. Here’s an introduction from Richard Ngo, the developer of the curriculum:
This week we’ll focus on how and why AGIs might develop goals that are misaligned with those of humans, in particular when they’ve been trained using machine learning. We cover three core ideas. Firstly, it’s difficult to create reward functions which specify the desired outcomes for complex tasks - the problem of reward misspecification (also known as outer misalignment). Krakovna et al. (2020) help build intuitions about the difficulty of outer alignment by showcasing examples of misbehavior on toy problems.
Secondly, however, it’s important to distinguish between the reward function which is used to train a reinforcement learning policy, versus the goals which that policy learns to pursue. Langosco et al. (2022) argue that even an agent trained on the “right” reward function might acquire undesirable goals - the problem of goal misgeneralization (also known as inner misalignment).
Why is it likely that agents will generalize their goals in undesirable ways? Bostrom (2014) argues that almost all goals which an AGI might develop would incentivise it to misbehave badly (e.g. by trying to seek power over the world), due to the phenomenon of instrumental convergence. Ngo (2022) explores in more detail how these problems might arise and play out in a deep learning context.
To prepare, please spend ~1 hour reading these short pieces:
- Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020) (15 mins)
- Goal misgeneralization in deep reinforcement learning (Langosco et al., 2022) (ending after section 3.3) (25 mins)
  - Those with less background in reinforcement learning can skip the parts of section 2.1 focused on formal definitions.
- What misalignment looks like as capabilities scale (Ngo, 2022) (only the section titled “Realistic training processes lead to the development of misaligned goals”, including phases 1, 2 and 3) (30 mins)