Agents¶
CrossEntropyAgent
¶

class
numpy_ml.rl_models.agents.
CrossEntropyAgent
(env, n_samples_per_episode=500, retain_prcnt=0.2)[source]¶ A crossentropy method agent.
Notes
The crossentropy method [1] [2] agent only operates on
envs
with discrete action spaces.On each episode the agent generates n_theta_samples of the parameters (\(\theta\)) for its behavior policy. The i’th sample at timestep t is:
\[\begin{split}\theta_i &= \{\mathbf{W}_i^{(t)}, \mathbf{b}_i^{(t)} \} \\ \theta_i &\sim \mathcal{N}(\mu^{(t)}, \Sigma^{(t)})\end{split}\]Weights (\(\mathbf{W}_i\)) and bias (\(\mathbf{b}_i\)) are the parameters of the softmax policy:
\[\begin{split}\mathbf{z}_i &= \text{obs} \cdot \mathbf{W}_i + \mathbf{b}_i \\ p(a_i^{(t + 1)}) &= \frac{e^{\mathbf{z}_i}}{\sum_j e^{z_{ij}}} \\ a^{(t + 1)} &= \arg \max_j p(a_j^{(t+1)})\end{split}\]At the end of each episode, the agent takes the top retain_prcnt highest scoring \(\theta\) samples and combines them to generate the mean and variance of the distribution of \(\theta\) for the next episode:
\[\begin{split}\mu^{(t+1)} &= \text{avg}(\texttt{best_thetas}^{(t)}) \\ \Sigma^{(t+1)} &= \text{var}(\texttt{best_thetas}^{(t)})\end{split}\]References
[1] Mannor, S., Rubinstein, R., & Gat, Y. (2003). The cross entropy method for fast policy search. In Proceedings of the 20th Annual ICML, 20. [2] Rubinstein, R. (1997). optimization of computer simulation models with rare events, European Journal of Operational Research, 99, 89–112. Parameters:  env (
gym.wrappers()
orgym.envs()
instance) – The environment to run the agent on.  n_samples_per_episode (int) – The number of theta samples to evaluate on each episode. Default is 500.
 retain_prcnt (float) – The percentage of n_samples_per_episode to use when calculating the parameter update at the end of the episode. Default is 0.2.

act
(obs)[source]¶ Generate actions according to a softmax policy.
Notes
The softmax policy assumes that the pmf over actions in state \(x_t\) is given by:
\[\pi(a  x^{(t)}) = \text{softmax}( \text{obs}^{(t)} \cdot \mathbf{W}_i^{(t)} + \mathbf{b}_i^{(t)} )\]where \(\mathbf{W}\) is a learned weight matrix, obs is the observation at timestep t, and b is a learned bias vector.
Parameters: obs (int or ndarray
) – An observation from the environment.Returns: action (int, float, or ndarray
) – An action sampled from the distribution over actions defined by the softmax policy.

run_episode
(max_steps, render=False)[source]¶ Run the agent on a single episode.
Parameters: Returns:  reward (float) – The total reward on the episode, averaged over the theta samples.
 steps (float) – The total number of steps taken on the episode, averaged over the theta samples.

update
()[source]¶ Update \(\mu\) and \(\Sigma\) according to the rewards accrued on the current episode.
Returns: avg_reward (float) – The average reward earned by the best retain_prcnt theta samples on the current episode.
 env (
DynaAgent
¶

class
numpy_ml.rl_models.agents.
DynaAgent
(env, lr=0.4, epsilon=0.1, n_tilings=8, obs_max=None, obs_min=None, q_plus=False, grid_dims=[8, 8], explore_weight=0.05, temporal_discount=0.9, n_simulated_actions=50)[source]¶ A DynaQ / DynaQ+ agent [5] with full TD(0) Qlearning updates via prioritizedsweeping [6] .
Notes
This approach consists of three components: a planning method involving simulated actions, a direct RL method where the agent directly interacts with the environment, and a modellearning method where the agent learns to better represent the environment during planning.
During planning, the agent performs randomsample onestep tabular Qplanning with prioritized sweeping. This entails using a priority queue to retrieve the stateaction pairs from the agent’s history which would stand to have the largest change to their Qvalues if backed up. Specifically, for state action pair (s, a) the priority value is:
\[P = \sum_{s'} p(s')  r + \gamma \max_a \{Q(s', a) \}  Q(s, a) \]which corresponds to the absolute magnitude of the TD(0) Qlearning backup for the pair.
When the first pair in the queue is backed up, the effect on each of its predecessor pairs is computed. If the predecessor’s priority is greater than a small threshold the pair is added to the queue and the process is repeated until either the queue is empty or we have exceeded n_simulated_actions updates. These backups occur without the agent taking any action in the environment and thus constitute simulations based on the agent’s current model of the environment (i.e., its tabular stateaction history).
During the direct RL phase, the agent takes an action based on its current behavior policy and Q function and receives a reward from the environment. The agent logs this stateactionrewardnew state tuple in its interaction table (i.e., environment model) and updates its Q function using a fullbackup version of the Qlearning update:
\[Q(s, a) \leftarrow Q(s, a) + \eta \sum_{r, s'} p(r, s' \mid s, a) \left(r + \gamma \max_a \{ Q(s', a) \}  Q(s, a) \right)\]References
[5] Sutton, R. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th Annual ICML, 216224. [6] Moore, A. & Atkeson, C. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1), 103130. Parameters:  env (
gym.wrappers
orgym.envs
instance) – The environment to run the agent on  lr (float) – Learning rate for the Q function updates. Default is 0.05.
 epsilon (float between [0, 1]) – The epsilon value in the epsilonsoft policy. Larger values encourage greater exploration during training. Default is 0.1.
 n_tilings (int) – The number of overlapping tilings to use if the env observation space is continuous. Unused if observation space is discrete. Default is 8.
 obs_max (float or
ndarray
or None) – The value to treat as the max value of the observation space when calculating the grid widths if the observation space is continuous. If None, useenv.observation_space.high()
. Unused if observation space is discrete. Default is None.  obs_min (float or
ndarray
or None) – The value to treat as the min value of the observation space when calculating grid widths if the observation space is continuous. If None, useenv.observation_space.low()
. Unused if observation space is discrete. Default is None.  grid_dims (list) – The number of rows and columns in each tiling grid if the env observation space is continuous. Unused if observation space is discrete. Default is [8, 8].
 q_plus (bool) – Whether to add incentives for visiting states that the agent hasn’t encountered recently. Default is False.
 explore_weight (float) – Amount to incentivize exploring states that the agent hasn’t recently visited. Only used if q_plus is True. Default is 0.05.
 temporal_discount (float between [0, 1]) – The discount factor used for downweighting future rewards. Smaller values result in greater discounting of future rewards. Default is 0.9.
 n_simulated_actions (int) – THe number of simulated actions to perform for each “real” action. Default is 50.

act
(obs)[source]¶ Execute the behavior policy–an \(\epsilon\)soft policy used to generate actions during training.
Parameters: obs (int, float, or ndarray
as returned byenv.step(action)
) – An observation from the environment.Returns: action (int, float, or ndarray
) – An action sampled from the distribution over actions defined by the epsilonsoft policy.

update
()[source]¶ Update the priority queue with the most recent (state, action) pair and perform randomsample onestep tabular Qplanning.
Notes
The planning algorithm uses a priority queue to retrieve the stateaction pairs from the agent’s history which will result in the largest change to its Qvalue if backed up. When the first pair in the queue is backed up, the effect on each of its predecessor pairs is computed. If the predecessor’s priority is greater than a small threshold the pair is added to the queue and the process is repeated until either the queue is empty or we exceed n_simulated_actions updates.

run_episode
(max_steps, render=False)[source]¶ Run the agent on a single episode without performing Qfunction backups.
Parameters: Returns:  reward (float) – The total reward on the episode.
 steps (float) – The number of steps taken on the episode.

train_episode
(max_steps, render=False)[source]¶ Train the agent on a single episode.
Parameters: Returns:  reward (float) – The total reward on the episode.
 steps (float) – The number of steps taken on the episode.
 env (
MonteCarloAgent
¶
Monte Carlo methods are ways of solving RL problems based on averaging sample returns for each stateaction pair. Parameters are updated only at the completion of an episode.
In onpolicy learning, the agent maintains a single policy that it updates over the course of training. In order to ensure the policy converges to a (near) optimal policy, the agent must maintain that the policy assigns nonzero probability to ALL stateaction pairs during training to ensure continual exploration.
 Thus onpolicy learning is a compromise–it learns action values not for the optimal policy, but for a nearoptimal policy that still explores.
In offpolicy learning, the agent maintains two separate policies:
 Target policy: The policy that is learned during training and that will eventually become the optimal policy.
 Behavior policy: A policy that is more exploratory and is used to generate behavior during training.
Offpolicy methods are often of greater variance and are slower to converge. On the other hand, offpolicy methods are more powerful and general than onpolicy methods.

class
numpy_ml.rl_models.agents.
MonteCarloAgent
(env, off_policy=False, temporal_discount=0.9, epsilon=0.1)[source]¶ A MonteCarlo learning agent trained using either firstvisit Monte Carlo updates (onpolicy) or incremental weighted importance sampling (offpolicy).
Parameters:  env (
gym.wrappers
orgym.envs
instance) – The environment to run the agent on.  off_policy (bool) – Whether to use a behavior policy separate from the target policy during training. If False, use the same epsilonsoft policy for both behavior and target policies. Default is False.
 temporal_discount (float between [0, 1]) – The discount factor used for downweighting future rewards. Smaller values result in greater discounting of future rewards. Default is 0.9.
 epsilon (float between [0, 1]) – The epsilon value in the epsilonsoft policy. Larger values encourage greater exploration during training. Default is 0.1.

act
(obs)[source]¶ Execute the behavior policy–an \(\epsilon\)soft policy used to generate actions during training.
Parameters: obs (int, float, or ndarray
as returned byenv.step(action)
) – An observation from the environment.Returns: action (int, float, or ndarray
) – An action sampled from the distribution over actions defined by the epsilonsoft policy.

run_episode
(max_steps, render=False)[source]¶ Run the agent on a single episode.
Parameters: Returns:  reward (float) – The total reward on the episode.
 steps (float) – The number of steps taken on the episode.

update
()[source]¶ Update the parameters of the model following the completion of an episode. Flush the episode history after the update is complete.
 env (
TemporalDifferenceAgent
¶
Temporal difference methods are examples of bootstrapping in that they update their estimate for the value of state s on the basis of a previous estimate.
Advantages of TD algorithms:
 They do not require a model of the environment, its reward, or its nextstate probability distributions.
 They are implemented in an online, fully incremental fashion. This allows them to be used with infinitehorizons / when episodes take prohibitively long to finish.
 TD algorithms learn from each transition regardless of what subsequent actions are taken.
 In practice, TD methods have usually been found to converge faster than constant\(\alpha\) Monte Carlo methods on stochastic tasks.

class
numpy_ml.rl_models.agents.
TemporalDifferenceAgent
(env, lr=0.4, epsilon=0.1, n_tilings=8, obs_max=None, obs_min=None, grid_dims=[8, 8], off_policy=False, temporal_discount=0.99)[source]¶ A temporal difference learning agent with expected SARSA (onpolicy) [3] or TD(0) Qlearning (offpolicy) [4] updates.
Notes
The expected SARSA onpolicy TD(0) update is:
\[Q(s, a) \leftarrow Q(s, a) + \eta \left( r + \gamma \mathbb{E}_\pi[Q(s', a') \mid s']  Q(s, a) \right)\]and the TD(0) offpolicy Qlearning upate is:
\[Q(s, a) \leftarrow Q(s, a) + \eta ( r + \gamma \max_a \left\{ Q(s', a) \right\}  Q(s, a) )\]where in each case we have taken action a in state s, received reward r, and transitioned into state \(s'\). In the above equations, \(\eta\) is a learning rate parameter, \(\gamma\) is a temporal discount factor, and \(\mathbb{E}_\pi[ Q[s', a'] \mid s']\) is the expected value under the current policy \(\pi\) of the Q function conditioned that we are in state \(s'\).
Observe that the expected SARSA update can be used for both on and offpolicy methods. In an offpolicy context, if the target policy is greedy and the expectation is taken wrt. the target policy then the expected SARSA update is exactly Qlearning.
NB. For this implementation the agent requires a discrete action space, but will try to discretize the observation space via tiling if it is continuous.
References
[3] Rummery, G. & Niranjan, M. (1994). OnLine Qlearning Using Connectionist Systems. Tech Report 166. Cambridge University Department of Engineering. [4] Watkins, C. (1989). Learning from delayed rewards. PhD thesis, Cambridge University. Parameters:  env (gym.wrappers or gym.envs instance) – The environment to run the agent on.
 lr (float) – Learning rate for the Q function updates. Default is 0.05.
 epsilon (float between [0, 1]) – The epsilon value in the epsilonsoft policy. Larger values encourage greater exploration during training. Default is 0.1.
 n_tilings (int) – The number of overlapping tilings to use if the
env
observation space is continuous. Unused if observation space is discrete. Default is 8.  obs_max (float or
ndarray
) – The value to treat as the max value of the observation space when calculating the grid widths if the observation space is continuous. If None, useenv.observation_space.high
. Unused if observation space is discrete. Default is None.  obs_min (float or
ndarray
) – The value to treat as the min value of the observation space when calculating grid widths if the observation space is continuous. If None, useenv.observation_space.low
. Unused if observation space is discrete. Default is None.  grid_dims (list) – The number of rows and columns in each tiling grid if the env observation space is continuous. Unused if observation space is discrete. Default is [8, 8].
 off_policy (bool) – Whether to use a behavior policy separate from the target policy during training. If False, use the same epsilonsoft policy for both behavior and target policies. Default is False.
 temporal_discount (float between [0, 1]) – The discount factor used for downweighting future rewards. Smaller values result in greater discounting of future rewards. Default is 0.9.

run_episode
(max_steps, render=False)[source]¶ Run the agent on a single episode without updating the priority queue or performing backups.
Parameters: Returns:  reward (float) – The total reward on the episode, averaged over the theta samples.
 steps (float) – The total number of steps taken on the episode, averaged over the theta samples.

train_episode
(max_steps, render=False)[source]¶ Train the agent on a single episode.
Parameters: Returns:  reward (float) – The total reward on the episode.
 steps (float) – The number of steps taken on the episode.

act
(obs)[source]¶ Execute the behavior policy–an \(\epsilon\)soft policy used to generate actions during training.
Parameters: obs (int, float, or ndarray
as returned byenv.step(action)
) – An observation from the environment.Returns: action (int, float, or ndarray
) – An action sampled from the distribution over actions defined by the epsilonsoft policy.