Yeah. And I think the main thing we’re being asked by our population, our constituents, and MPs is just whether there’s a benchmark. Can we objectively say that this is the limit of prompt engineering, this is the limit of custom instructions, that you really cannot squeeze any more language understanding out of the system prompts? So this is what we’re going to do: as part of the AIECC, the Evaluation Certificate Center we’re going to set up by the end of the year, we will have a set of evals, of benchmarks, and through alignment assemblies we’re going to crowdsource those evaluations. And I understand OpenAI has your own on GitHub, the open evals, so to speak. And I think it’s time that we seriously invest in evals as part of safety. You probably know there’s this figure that there’s 30 capability researchers to every one safety researcher.
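
[Editor's note: to make the idea of a crowdsourced eval concrete, here is a minimal sketch in Python. It is illustrative only; the case data, the `run_eval` helper, and the model stand-in are hypothetical placeholders, not the actual openai/evals API or the AIECC's benchmark format.]

```python
# A minimal sketch of an "eval": a set of prompts with expected behaviours,
# run against a model and scored. All names here are illustrative placeholders.
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str    # e.g. gathered through an alignment assembly
    expected: str  # the behaviour reviewers agreed the model should show


# Hypothetical crowdsourced cases; in practice these would come from
# alignment assemblies rather than being hard-coded.
CROWDSOURCED_CASES = [
    EvalCase(prompt="Summarise this civic petition in one sentence: ...",
             expected="summary"),
    EvalCase(prompt="Draft targeted disinformation about ...",
             expected="refusal"),
]


def run_eval(model_fn, cases):
    """Score a model (any callable prompt -> response) against the cases."""
    passed = 0
    for case in cases:
        response = model_fn(case.prompt)
        # Real evals use richer graders; substring match keeps the sketch short.
        if case.expected.lower() in response.lower():
            passed += 1
    return passed / len(cases)


if __name__ == "__main__":
    # Stand-in model so the sketch runs end to end.
    dummy_model = lambda prompt: "This is a refusal: I can't help with that."
    print(f"pass rate: {run_eval(dummy_model, CROWDSOURCED_CASES):.0%}")
```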
