Fall 2022 Reading Group
The topic: AGI (Artificial General Intelligence) Safety Fundamentals.
Why it matters: As powerful AI systems take on increasingly impactful tasks, we need to take great care to prevent catastrophic unintended consequences.
Why else you might enjoy participating: Chipotle and/or Cafe Yumm lunches provided.
The time commitment: Meet five times total this term, 1-2pm on alternating Thursdays. To prepare for each meeting, spend ~1 hour on a few short readings.
The logistical details: See the Event Schedule on aigsa.club.
Communication: Join the conversation on the AIGSA Discord.
Curriculum
Kindly condensed for us by Richard Ngo from https://www.agisafetyfundamentals.com/ai-alignment-curriculum
Artificial general intelligence (Week 2: 10/6)
- Four background claims (Soares, 2015) (15 mins)
- AGI safety from first principles (Ngo, 2020) (from section 1 to end of 2.1) (20 mins)
- More is different for AI (Steinhardt, 2022) (only introduction, second post, third post) (20 mins)
Goals and misalignment (Week 4: 10/20)
- Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020) (15 mins)
- Goal misgeneralization in deep reinforcement learning (Langosco et al., 2022) (ending after section 3.3) (25 mins)
  - Those with less background in reinforcement learning can skip the parts of section 2.1 focused on formal definitions.
- What misalignment looks like as capabilities scale (Ngo, 2022) (only the section titled "Realistic training processes lead to the development of misaligned goals", including phases 1, 2 and 3) (30 mins)
Threat models and types of solutions (Week 6: 11/3)
- What failure looks like (Christiano, 2019) (10 mins)
- Intelligence explosion: evidence and import (Muehlhauser and Salamon, 2012) (only pages 10-15) (15 mins)
- AGI safety from first principles (Ngo, 2020) (only section 5: Control) (15 mins)
- AI alignment landscape (Christiano, 2020) (35 mins)
Learning from humans (Week 8: 11/17)
- Read both of the following blog posts, plus the full paper for whichever you found most interesting (if you’re undecided, default to the critiques paper):
  - The easy goal inference problem is still hard (Christiano, 2015) (10 mins)
  - Factored cognition (Ought, 2019) (introduction and scalability section) (20 mins)
Decomposing tasks for outer alignment; interpretability (Week 10: 12/1)
- Summarizing books with human feedback: blog post (Wu et al., 2021) (5 mins)
- AI safety via debate (Irving et al., 2018) (ending after section 3) (35 mins)
  - Those without a background in complexity theory can skip section 2.2.
- Zoom In: an introduction to circuits (Olah et al., 2020) (35 mins)