2026-04-24

Bayesian Mechanics of Transformers

Table of Contents

Epistemic Status: These are early-stage learning notes. Not everything here has been rigorously verified; the overall ideas are most likely correct, but some details may be wrong.

This note is currently incomplete; there are more topics to cover before getting to the Bayesian mechanics of transformers.

1. Computational Mechanics & Transformers

Computational Mechanics gives us the theory of optimal prediction and introduces ε-machines. With that mathematical background in place, we can then look at parallels between transformers and ε-machines.

The theory of optimal prediction provides a rigorous mathematical foundation for studying next-token prediction, or any data-generating process. We find that although the same process can be generated by many different models (e.g. hidden Markov models), each process has a canonical, minimal optimal model called the ε-machine. The hidden states of the ε-machine are called causal states.
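To make this concrete, here is a minimal sketch of a standard textbook ε-machine, the Golden Mean Process (this example is not from the note above; it is a common illustration from the computational mechanics literature). It has two causal states, A and B: from A the process emits 0 or 1 with equal probability, and from B it must emit 0. The defining constraint is that two 1s never occur in a row.

```python
import random

# ε-machine of the Golden Mean Process:
# each causal state maps to a list of (probability, emitted symbol, next state).
TRANSITIONS = {
    "A": [(0.5, 0, "A"), (0.5, 1, "B")],
    "B": [(1.0, 0, "A")],
}

def generate(n, state="A", rng=random):
    """Sample n symbols from the process, starting in the given causal state."""
    symbols = []
    for _ in range(n):
        r, cum = rng.random(), 0.0
        for p, sym, nxt in TRANSITIONS[state]:
            cum += p
            if r < cum:
                symbols.append(sym)
                state = nxt
                break
    return symbols

seq = generate(10_000, rng=random.Random(0))
# Characteristic property of the Golden Mean Process: "11" never appears.
assert all(not (a == 1 and b == 1) for a, b in zip(seq, seq[1:]))
```

Knowing the current causal state (A or B) is exactly the information about the past that matters for predicting the future; any finer distinction between histories is predictively irrelevant.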

Optimal prediction theory suggests that any system that perfectly minimizes cross-entropy loss on sequential data must implicitly infer the hidden causal states of the data-generating process. This means transformers may encode the causal states in their latent vectors. The reasoning: a network trained to minimize cross-entropy loss against the data distribution is effectively learning to approach the entropy rate (the irreducible randomness) of the underlying process, and to do that the model necessarily needs to encode the causal states in its latent states.
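The entropy-rate claim can be checked numerically on the Golden Mean Process sketched earlier (again, an illustrative example I am supplying, not from the note): a predictor that tracks the causal state pays exactly the entropy rate in cross-entropy, and nothing less is achievable.

```python
import math, random

# Golden Mean Process ε-machine: state -> [(prob, symbol, next state)].
T = {"A": [(0.5, 0, "A"), (0.5, 1, "B")], "B": [(1.0, 0, "A")]}

def emit(state, rng):
    """Sample one transition; return (prob of emission, symbol, next state)."""
    r, cum = rng.random(), 0.0
    for p, sym, nxt in T[state]:
        cum += p
        if r < cum:
            return p, sym, nxt

# Analytic entropy rate: the stationary distribution over causal states is
# pi = (2/3, 1/3); only state A is stochastic, contributing H(1/2) = 1 bit.
h_analytic = (2 / 3) * 1.0  # bits per symbol

# Empirical: an optimal predictor that knows the causal state pays
# -log2 p(symbol | state) per step; the average converges to h_analytic.
rng, state, total, n = random.Random(1), "A", 0.0, 50_000
for _ in range(n):
    p, sym, state = emit(state, rng)
    total += -math.log2(p)
print(round(total / n, 3))  # ≈ 0.667, the analytic value 2/3
```

A model whose loss converges to this value has, in effect, nothing left to learn about the process: the remaining surprise is irreducible randomness, not model error.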

The latent states of a network (whether recurrent, like an LSTM, or feedforward, like a transformer) must carry all the information from the past necessary to predict the entire future. Specifically, transformers appear to represent a belief (i.e. a probability distribution) over the causal states of the ε-machine in their residual stream.
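What such a belief looks like, and how it evolves, can be sketched with exact Bayesian updating over the causal states of the Golden Mean Process (my example, continuing from the sketches above). After observing a symbol, the belief over states is updated by weighting each transition consistent with that symbol and renormalizing:

```python
# ε-machine of the Golden Mean Process: state -> [(prob, symbol, next state)].
T = {"A": [(0.5, 0, "A"), (0.5, 1, "B")], "B": [(1.0, 0, "A")]}
STATES = ["A", "B"]

def update(belief, symbol):
    """Bayesian belief update: b'(s') ∝ Σ_s b(s) · P(emit symbol, reach s' | s)."""
    new = {s: 0.0 for s in STATES}
    for s, b in belief.items():
        for p, sym, nxt in T[s]:
            if sym == symbol:
                new[nxt] += b * p
    z = sum(new.values())  # P(symbol | current belief); used to renormalize
    return {s: v / z for s, v in new.items()}

b = {"A": 2 / 3, "B": 1 / 3}  # stationary belief before seeing anything
b = update(b, 1)              # a 1 can only be emitted from A, landing in B
print(b)                      # {'A': 0.0, 'B': 1.0}
b = update(b, 0)              # B must emit 0 and return to A
print(b)                      # {'A': 1.0, 'B': 0.0}
```

The claim about transformers is that vectors in the residual stream play the role of `b` here: a point in a simplex over causal states, updated token by token.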

Various components of transformers connect to ε-machines in the following way:

| Idea | Computational Mechanics | Transformer |
| --- | --- | --- |
| Causal state | Equivalence class of histories | Residual-stream activation vector |
| Statistical complexity | Minimal information to store | Embedding dimensionality |
| Entropy rate | Irreducible randomness | Cross-entropy loss at convergence |
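The middle row of the table can also be made concrete (using my running Golden Mean example, not a computation from the note): the statistical complexity C_μ is the Shannon entropy of the stationary distribution over causal states, i.e. how many bits an optimal predictor must store about the past.

```python
import math

# Stationary distribution over the causal states of the Golden Mean Process.
pi = {"A": 2 / 3, "B": 1 / 3}

# Statistical complexity: C_μ = H(pi) = -Σ_s pi(s) · log2 pi(s), in bits.
c_mu = -sum(p * math.log2(p) for p in pi.values())
print(round(c_mu, 3))  # 0.918
```

Note that C_μ is an information quantity (bits), while embedding dimensionality counts real-valued coordinates, so the correspondence in the table is an analogy about capacity, not an equality of units.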

You can send your feedback and queries here