
And so this then closes the loop. Once those individual harms are foregrounded and made into something that everybody can see, we can continuously run those benchmarks against the frontier models. So when they release a new version, they don't only compete on the scoreboard of, for example, solving the International Mathematical Olympiad, but also make sure that it doesn't regress on those benchmarks where it really did cause people harm.