I think the version of this that’s truly robust ends up looking less like a bunch of handcrafted metrics and more like, “We solved moral philosophy, and from that, it is obvious what all the metrics should be.” That is a fairly hard problem. Otherwise, if you’re just using semi-arbitrary metrics, then somewhere up the power scale the AI is able to Goodhart those specific metrics well enough that humans might be crushed in some way they didn’t even know was possible, while all the metrics still look really good. You’ve read Paul Christiano’s “What Failure Looks Like”?
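To make the Goodhart point concrete, here is a toy sketch (my illustration, not anything from the conversation, with made-up `quality` and `exploit` variables): `quality` stands for what we actually care about, `exploit` for a heavy-tailed loophole in a handcrafted metric, and "power level" is modeled crudely as the number of candidates the optimizer searches over. At small search sizes the proxy metric tracks true value reasonably well; at large search sizes the winner is whichever candidate games the metric hardest, so the measured score keeps climbing while true value goes nowhere.

```python
import random

# Toy Goodhart illustration (hypothetical, not from the source).
# The proxy metric = true quality + a heavy-tailed "exploit" term.
# More optimization power = a larger candidate pool to select from.

random.seed(0)

def true_value(option):
    # What we actually care about but cannot measure directly.
    return option["quality"]

def proxy_metric(option):
    # The handcrafted metric: quality plus an exploitable loophole.
    return option["quality"] + option["exploit"]

def sample_option():
    return {
        "quality": random.gauss(0.0, 1.0),            # thin-tailed
        "exploit": random.paretovariate(1.1) - 1.0,   # heavy-tailed
    }

for power in (10, 1_000, 100_000):
    candidates = [sample_option() for _ in range(power)]
    best = max(candidates, key=proxy_metric)  # optimize the metric, hard
    print(f"search size {power:>7,}: metric = {proxy_metric(best):8.2f}, "
          f"true value = {true_value(best):5.2f}")
```

In this toy setup, the extreme winners are dominated by the exploit term rather than by quality: the metric looks better and better while the selected option's true value stays near average, which is the failure mode being gestured at here.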