But visibility, for the dominant architecture, self-attention Transformers, we can measure some of the internals, but we cannot reliably interpret them, and because of this, when we ask, “How dangerous is this?” we’re trying to compute a value where the inspection term is missing, like dividing by zero or something.

Keyboard shortcuts

j previous speech k next speech