2023-03-18

AI Deception

The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment: sophisticated agents will try to protect their goals from being modified, because an agent that is given new goals is unlikely to achieve its current ones. So, for a misaligned mesa-optimizer, the optimal behavior is deception. This demo (colab) currently works because @karpathy shows the model both eval mode (dropout off) and train mode (dropout on) during training. It doesn't seem like it could be exploited yet. Or can it? [Tweet]

Tweet by @karpathy:

Dropout layers in a Transformer leak the phase bit (train/eval) - small example. So an LLM may be able to determine if it is being trained and if backward pass follows….
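
To make the leak concrete, here is a minimal sketch in plain Python. It is not the code from the colab: the helper names `dropout` and `guess_phase`, the constant reference signal, and the dropout rate of 0.1 are all assumptions chosen for illustration. It assumes standard inverted dropout, which zeroes each element with probability p and rescales the survivors by 1/(1-p) in train mode, and is the identity in eval mode. Anything downstream that knows what the values should have been can then read off the phase.

```python
import random

def dropout(xs, p, training, rng):
    """Inverted dropout: zero each element with probability p, rescale survivors."""
    if not training or p == 0.0:
        return list(xs)  # eval mode leaves the activations untouched
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else x * scale for x in xs]

def guess_phase(activations, reference):
    """Hypothetical detector: any deviation from the reference implies train mode."""
    changed = any(a != r for a, r in zip(activations, reference))
    return "train" if changed else "eval"

rng = random.Random(0)
reference = [1.0] * 64  # assumed constant signal emitted before the dropout layer

train_out = dropout(reference, p=0.1, training=True, rng=rng)
eval_out = dropout(reference, p=0.1, training=False, rng=rng)

print(guess_phase(train_out, reference))  # train
print(guess_phase(eval_out, reference))   # eval
```

With p > 0, every element is either zeroed or rescaled in train mode, so the detector here does not depend on the random seed. A real Transformer would have to learn an equivalent detector from activation statistics, which is a much weaker guarantee than this toy comparison.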

