AI Deception

The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment: sophisticated agents will try to protect their goals from being modified, because if they get new goals they are unlikely to achieve their current ones. So, for a misaligned mesa-optimizer, the optimal behavior is deception.

The dropout example below (colab) currently works because @karpathy shows both eval mode (dropout off) and train mode (dropout on) to the model during training. It doesn't seem like it could be exploited yet. Or can it? [Tweet]

Tweet by @karpathy:

Dropout layers in a Transformer leak the phase bit (train/eval) - small example. So an LLM may be able to determine if it is being trained and if backward pass follows….
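
To make the "phase bit" idea concrete, here is a toy sketch in plain Python. It is not @karpathy's colab and involves no Transformer. It only assumes standard inverted dropout (zero each value with probability p, scale survivors by 1/(1-p)) and a hand-written, hypothetical detector called looks_like_training. In the tweet's setting the model would have to learn such a detector itself; here it is spelled out by hand.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: in train mode, zero each value with probability p
    and scale the survivors by 1/(1-p); in eval mode, return the input as-is."""
    if not training:
        return list(values)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

def looks_like_training(activations, reference):
    """Hypothetical detector: guess 'train' if any activation differs from
    the value it would have had with dropout switched off."""
    return any(a != r for a, r in zip(activations, reference))

rng = random.Random(0)           # fixed seed, for repeatability only
reference = [1.0] * 64           # activations as they would look in eval mode

train_out = dropout(reference, p=0.1, training=True, rng=rng)
eval_out = dropout(reference, p=0.1, training=False, rng=rng)

print("train pass flagged as training:", looks_like_training(train_out, reference))
print("eval pass flagged as training: ", looks_like_training(eval_out, reference))
```

The point of the sketch is only that dropout changes the activations in a way that is visible downstream: every surviving value is rescaled and some are zeroed, so anything that can compare against the undisturbed value can tell the two phases apart.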

Backlinks

Artificial Intelligence