Model Based Reinforcement Learning
Two successful approaches to solving sequential decision-making problems formalized as Markov Decision Processes (MDPs) are planning and reinforcement learning. Model-based RL is a field where planning and learning are combined. It is defined as any MDP approach that
- uses a model (known or learned) and
- uses learning to approximate a global value or policy function
A model is a form of reversible access to the MDP dynamics. This allows an agent to plan forward from the same state repeatedly.
In RL, we learn a solution for the entire state space. In planning, we have access to a model but only find a local solution. Model-based RL has both: along with access to a model, we learn a global solution.
1. Model Learning
Dynamics model learning is the crucial first step for model-based RL algorithms that use a learned model. If the model is already known, this step can be omitted and we can do RL based on that model directly.
1.1. General Classification
Type of Model:
- Forward model: \((s_t, a_t) \rightarrow s_{t+1}\). Predicts the next state given the current state and action. This is the most common type and is used for lookahead planning.
- Backward/Reverse model: \(s_{t+1} \rightarrow (s_t, a_t)\). Predicts precursor states and actions leading to a particular state. Used in methods like prioritized sweeping.
- Inverse model: \((s_t, s_{t+1}) \rightarrow a_t\). Predicts the action needed to transition between two states. Useful for representation learning.
The reward function is also part of the model but is generally easier to learn than the dynamics model.
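As a minimal illustration of these interfaces (the type aliases and vector representations are assumptions made for this sketch, not taken from the text), the three model types plus a reward model can be written as simple callables:

```python
from typing import Callable, Tuple
import numpy as np

State = np.ndarray   # assumed: states are feature vectors
Action = np.ndarray  # assumed: actions are (possibly one-hot) vectors

# Forward model: (s_t, a_t) -> s_{t+1}, used for lookahead planning.
ForwardModel = Callable[[State, Action], State]

# Backward/reverse model: s_{t+1} -> (s_t, a_t), e.g. for prioritized sweeping.
BackwardModel = Callable[[State], Tuple[State, Action]]

# Inverse model: (s_t, s_{t+1}) -> a_t, useful for representation learning.
InverseModel = Callable[[State, State], Action]

# Reward model: (s_t, a_t) -> r_t, typically easier to learn than the dynamics.
RewardModel = Callable[[State, Action], float]
```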
Estimation Method:
- Parametric
- Exact/Tabular: Maintains a separate entry for every possible transition in a discrete MDP. Estimates probabilities based on counts (Maximum Likelihood Estimation).
- Approximate: Function approximation using Neural Networks.
- Non-parametric: Directly stores and uses the data. Computational complexity grows with the dataset size, which makes these methods less applicable to high-dimensional problems.
- Exact: Replay buffers can be seen as non-parametric tabular models.
- Approximate: Generalizes information between states, e.g., Gaussian processes. Can provide uncertainty estimates.
Most model learning methods focus on a forward model, with parametric function approximation, and global coverage.
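For the exact/tabular case, a minimal count-based sketch is given below (assuming a small discrete MDP with known numbers of states and actions): transition probabilities and expected rewards are estimated by maximum likelihood from visit counts.

```python
import numpy as np

class TabularModel:
    """Count-based (maximum likelihood) model of a discrete MDP."""

    def __init__(self, n_states: int, n_actions: int):
        # Transition counts N(s, a, s') and accumulated rewards R(s, a).
        self.counts = np.zeros((n_states, n_actions, n_states))
        self.reward_sum = np.zeros((n_states, n_actions))

    def update(self, s: int, a: int, r: float, s_next: int) -> None:
        self.counts[s, a, s_next] += 1
        self.reward_sum[s, a] += r

    def transition_probs(self, s: int, a: int) -> np.ndarray:
        """MLE estimate P(s' | s, a) = N(s, a, s') / N(s, a)."""
        total = self.counts[s, a].sum()
        if total == 0:
            # No data yet: fall back to a uniform distribution.
            return np.full(self.counts.shape[-1], 1.0 / self.counts.shape[-1])
        return self.counts[s, a] / total

    def expected_reward(self, s: int, a: int) -> float:
        total = self.counts[s, a].sum()
        return self.reward_sum[s, a] / total if total > 0 else 0.0

    def sample_next_state(self, s: int, a: int, rng=np.random) -> int:
        return rng.choice(self.counts.shape[-1], p=self.transition_probs(s, a))
```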
1.2. Challenges
1.2.1. Stochasticity
In stochastic MDPs, the transition function specifies a distribution over next states. Models should approximate entire distributions, not just the conditional mean.
- Descriptive models: Approximate the entire next state distribution (feasible in small spaces).
- Generative models: Approximate a model from which samples can be drawn.
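As a sketch of the generative route (the architecture, layer sizes, and diagonal-Gaussian assumption are choices made for illustration), a network can output the parameters of a distribution over the next state, be trained by maximizing log-likelihood, and then be sampled from during planning:

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """Predicts a diagonal-Gaussian distribution over s_{t+1} given (s_t, a_t)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, state_dim)
        self.log_std_head = nn.Linear(hidden, state_dim)

    def distribution(self, s, a):
        h = self.net(torch.cat([s, a], dim=-1))
        mean = self.mean_head(h)
        log_std = self.log_std_head(h).clamp(-5.0, 2.0)  # keep the variance in a sane range
        return torch.distributions.Normal(mean, log_std.exp())

    def loss(self, s, a, s_next):
        # Negative log-likelihood of the observed transitions.
        return -self.distribution(s, a).log_prob(s_next).sum(-1).mean()

    @torch.no_grad()
    def sample(self, s, a):
        return self.distribution(s, a).sample()
```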
1.2.2. Uncertainty Due to Limited Data
Model estimates have uncertainty, which can be epistemic (reducible with more data) or aleatoric (inherent, irreducible randomness). Models can quantify uncertainty using methods like Gaussian processes or bootstrap ensembles. Planning should take this uncertainty into account to avoid relying on inaccurate model predictions.
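A minimal sketch of the bootstrap-ensemble idea (linear models are used here purely for illustration): each member is fit on a resampled dataset, and disagreement between members serves as an estimate of epistemic uncertainty.

```python
import numpy as np

def fit_bootstrap_ensemble(X, Y, n_members: int = 5, rng=None):
    """Fit linear forward models on bootstrap resamples of (X, Y)."""
    rng = rng or np.random.default_rng(0)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))   # resample with replacement
        W, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
        members.append(W)
    return members

def predict_with_uncertainty(members, x):
    """Mean prediction plus member disagreement (epistemic uncertainty)."""
    preds = np.stack([x @ W for W in members])
    return preds.mean(axis=0), preds.std(axis=0)

# Usage idea: X holds [s_t, a_t] features and Y holds s_{t+1}; a large std flags
# regions where the planner should not trust the model's predictions.
```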
1.2.3. Partial Observability
When the agent does not observe the full state, previous observations must be incorporated. Approaches include:
- Windowing: Concatenate a fixed number of recent observations into the state (see the sketch after this list)
- Belief states: Explicitly partition the model into an observation model and a latent transition model (similar to HMMs)
- Recurrency and External memory
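A minimal sketch of the windowing approach (the window length and observation shapes are assumptions): keep a fixed-length buffer of recent observations and concatenate it into the agent's state.

```python
from collections import deque
import numpy as np

class ObservationWindow:
    """Approximates the state by stacking the last `n` observations."""

    def __init__(self, n: int, obs_dim: int):
        self.buffer = deque([np.zeros(obs_dim) for _ in range(n)], maxlen=n)

    def reset(self, obs: np.ndarray) -> np.ndarray:
        for _ in range(self.buffer.maxlen):
            self.buffer.append(obs)            # fill the window with the first observation
        return self.state()

    def step(self, obs: np.ndarray) -> np.ndarray:
        self.buffer.append(obs)                # the oldest observation drops out automatically
        return self.state()

    def state(self) -> np.ndarray:
        return np.concatenate(list(self.buffer))   # stacked observations act as the "state"
```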
1.2.4. Non-stationarity
The environment dynamics can change over time. Methods need to adapt to these changes.
1.2.5. Multi-step Prediction
Predicting the state multiple steps into the future, for example by recursively feeding one-step predictions back into the model. Prediction errors compound over the horizon, so models can also be trained directly on multi-step objectives.
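A sketch of the naive recursive approach (the `model` and `policy` callables are stand-ins): the one-step model is applied to its own predictions, which is exactly where errors compound.

```python
import numpy as np

def rollout(model, policy, s0: np.ndarray, horizon: int):
    """Roll a learned one-step forward model out for `horizon` steps.

    `model(s, a)` returns a predicted next state and `policy(s)` returns an
    action; each prediction is fed back in, so model errors accumulate.
    """
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = policy(s)
        s = model(s, a)          # the prediction becomes the next input
        actions.append(a)
        states.append(s)
    return np.stack(states), np.stack(actions)
```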
1.2.6. State Abstraction
Compressing the state into a compact representation (latent space) and modeling transitions in this space. Allows faster planning. Can use autoencoder-like structures and loss functions like inverse dynamics loss to emphasize controllable aspects. Related to grey-box system identification.
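A rough sketch of this idea (module sizes, the loss weighting, and the continuous-action assumption are choices made for illustration; the reconstruction/decoder term of an autoencoder is omitted for brevity): an encoder maps observations to a latent code, a latent forward model predicts the next code, and an inverse-dynamics head predicts the action from consecutive codes to emphasize controllable features.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Encoder + latent forward model + inverse-dynamics head."""

    def __init__(self, obs_dim: int, action_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.forward_model = nn.Sequential(nn.Linear(latent_dim + action_dim, 128),
                                           nn.ReLU(), nn.Linear(128, latent_dim))
        self.inverse_model = nn.Sequential(nn.Linear(2 * latent_dim, 128),
                                           nn.ReLU(), nn.Linear(128, action_dim))

    def loss(self, obs, action, next_obs):
        z, z_next = self.encoder(obs), self.encoder(next_obs)
        # Latent forward loss: predict the next latent code
        # (gradients through the target are stopped, a common design choice).
        z_pred = self.forward_model(torch.cat([z, action], dim=-1))
        forward_loss = (z_pred - z_next.detach()).pow(2).mean()
        # Inverse-dynamics loss: recover the action, emphasizing controllable factors.
        a_pred = self.inverse_model(torch.cat([z, z_next], dim=-1))
        inverse_loss = (a_pred - action).pow(2).mean()
        return forward_loss + inverse_loss
```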
1.2.7. Temporal Abstraction (Hierarchical RL)
Defining a high-level action space that extends over multiple timesteps. Can reduce sample and computational complexity.
- Options framework: A popular framework for discrete high-level actions (a minimal sketch follows after this list).
- Discovery of sub-routines/abstract actions:
  - Graph structure: Identify "bottleneck" states as end-points.
  - State-space coverage: Cluster the state space and learn transitions between cluster centers.
  - Compression (information-theoretic): Compress the space of possible end-points.
  - Reward relevancy: Sub-routines emerge from end-to-end optimization for obtaining reward.
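A minimal sketch in the spirit of the options framework (the field names, the gym-like `env.step` API returning `(obs, reward, done, info)`, and the step cap are assumptions): an option bundles an initiation set, an intra-option policy, and a termination condition, and is executed until it terminates.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    """A temporally extended action: initiation set, policy, termination."""
    can_initiate: Callable[[object], bool]   # I(s): can the option start in s?
    policy: Callable[[object], object]       # pi(s): low-level action to take
    terminates: Callable[[object], bool]     # beta(s): should the option stop in s?

def run_option(env, state, option, max_steps: int = 100):
    """Execute one option in `env` (assumed to follow a gym-like step API)."""
    assert option.can_initiate(state)
    total_reward, t, done = 0.0, 0, False
    while t < max_steps and not done:
        action = option.policy(state)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        t += 1
        if option.terminates(state):
            break
    return state, total_reward, done
```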