I think the version of this that’s truly robust ends up looking less like a bunch of handcrafted metrics and more like, “We solved moral philosophy, and from that, it is obvious what all the metrics should be.” That is a fairly hard problem. Otherwise, if you’re just using semi-arbitrary metrics, then somewhere up the power scale the AI is able to Goodhart those specific metrics well enough that humans might be crushed in some way they didn’t even know was possible, while all the metrics still look really good. You’ve read Paul Christiano’s “What Failure Looks Like”?
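To make the Goodhart point concrete, here is a toy sketch (my illustration, not anything from the conversation, with made-up `quality` and `exploit` variables): `quality` stands for what we actually care about, `exploit` for a heavy-tailed loophole in a handcrafted metric, and "power level" is modeled crudely as the number of candidates the optimizer searches over. At small search sizes the proxy metric tracks true value reasonably well; at large search sizes the winner is whichever candidate games the metric hardest, so the measured score keeps climbing while true value goes nowhere.

```python
import random

# Toy Goodhart illustration (hypothetical, not from the source).
# The proxy metric = true quality + a heavy-tailed "exploit" term.
# More optimization power = a larger candidate pool to select from.

random.seed(0)

def true_value(option):
    # What we actually care about but cannot measure directly.
    return option["quality"]

def proxy_metric(option):
    # The handcrafted metric: quality plus an exploitable loophole.
    return option["quality"] + option["exploit"]

def sample_option():
    return {
        "quality": random.gauss(0.0, 1.0),            # thin-tailed
        "exploit": random.paretovariate(1.1) - 1.0,   # heavy-tailed
    }

for power in (10, 1_000, 100_000):
    candidates = [sample_option() for _ in range(power)]
    best = max(candidates, key=proxy_metric)  # optimize the metric, hard
    print(f"search size {power:>7,}: metric = {proxy_metric(best):8.2f}, "
          f"true value = {true_value(best):5.2f}")
```

In this toy setup, the extreme winners are dominated by the exploit term rather than by quality: the metric looks better and better while the selected option's true value stays near average, which is the failure mode being gestured at here.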