It’s a new technique. Instead of asking people in Kenya to rate toxic versus non-toxic reply, we can just ask the language model to… It’s like a small constitutional convention that people blend their preferences and then simply steer the AI to fit the matrix as a norm. This is called “Alignment Assemblies” – maybe New Zealand would like to run one.

Keyboard shortcuts

j previous speech k next speech