Neural Network Policies

Neural networks are highly effective function approximators for reinforcement learning. As with other neural network examples, the optimised network can be hard to interpret. Let's build a network to train on the LunarLander-v3 environment. Our policy network will be a simple network with three linear layers (two hidden layers and an output layer).

import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """
    A simple feedforward neural network with 2 hidden layers.
    
    :param obs_dim: The dimension of the input obs.
    :param action_dim: The dimension of the output action.
    """
    def __init__(self, obs_dim, action_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(obs_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)
    
    def forward(self, x):
        """
        Forward pass of the network.
        
        :param x: The input obs.
        :return: The output action.
        """
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
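
As a quick check that the network has the expected shape, we can pass a dummy observation through it. The sketch below assumes the LunarLander-v3 dimensions of an 8-dimensional observation and 4 discrete actions; it is only illustrative and not part of the training code.

import torch

# LunarLander-v3 observations have 8 dimensions and there are 4 discrete actions
q_net = QNetwork(obs_dim=8, action_dim=4)

# A single dummy observation with a batch dimension of 1
dummy_obs = torch.zeros(1, 8)
q_values = q_net(dummy_obs)

print(q_values.shape)            # torch.Size([1, 4]); one Q-value per action
print(q_values.argmax().item())  # index of the greedy action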

However, the broader training strategy will involve a deep Q-network. This approach uses a replay buffer, which reduces the correlation between the samples used for training while enabling the reuse of past experiences. For more on this structure, there is a good Stack Exchange answer on the subject. We construct the ReplayBuffer class below.

import random
from collections import deque

class ReplayBuffer:
    """
    A simple replay buffer for storing experiences.
    
    :param capacity: The maximum capacity of the buffer.
    """
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, obs, action, reward, next_obs, done):
        """
        Push an experience to the buffer.
        
        :param obs: The current obs.
        :param action: The action taken.
        :param reward: The reward received.
        :param next_obs: The next obs.
        :param done: Whether the episode is done.
        """
        self.buffer.append((obs, action, reward, next_obs, done))
    
    def sample(self, batch_size):
        """
        Sample a batch of experiences from the buffer.
        
        :param batch_size: The size of the batch.
        :return: The batch of experiences.
        """
        return random.sample(self.buffer, batch_size)
    
    def __len__(self):
        """
        :return: The length of the buffer.
        """
        return len(self.buffer)
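
As a small illustration of how the buffer behaves, the sketch below fills it past its capacity with made-up transitions (not real environment data) and draws a sample.

buffer = ReplayBuffer(capacity=3)

# Push four made-up transitions; the deque silently drops the oldest
# once the capacity of three is exceeded
for i in range(4):
    buffer.push(obs=i, action=0, reward=1.0, next_obs=i + 1, done=False)

print(len(buffer))       # 3: the first transition has been discarded
print(buffer.sample(2))  # two transitions drawn uniformly at random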

The final stage is to build the training loop. This is where the Q-learning component comes in. We can think of Q-learning as a table of Q-values, where each cell estimates how good a given action is in a given state of the environment. The Q-values are updated iteratively using the Bellman equation,

\[ Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right], \]

where \(s\) and \(s'\) are the current and next observations, \(a\) and \(a'\) are the current and next actions, \(r\) is the reward received for taking action \(a\) in state \(s\), \(\alpha\) is the learning rate, and \(\gamma\) is the discount factor, which controls how much future rewards matter. Here, we use a deep Q-network, so a neural network estimates the Q-values rather than storing them explicitly in a table. To enable this, we create two networks, the policy_net and the target_net; the target network provides the Q-values for the next observation.
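
To make the tabular update concrete, here is a single Bellman update for one transition using purely illustrative numbers; none of these values come from the environment.

# Purely illustrative numbers for a single tabular Bellman update
alpha, gamma = 0.1, 0.99  # learning rate and discount factor

q_current = 0.5           # current estimate of Q(s, a)
reward = 1.0              # reward r received for taking a in s
max_q_next = 0.8          # max over a' of Q(s', a')

td_target = reward + gamma * max_q_next                  # 1.792
q_updated = q_current + alpha * (td_target - q_current)
print(q_updated)          # approximately 0.6292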

To add some randomness to the action selection, we have a Monte Carlo step used in a simulated-annealing fashion. This means the likelihood that the Monte Carlo (randomly selected) action is used decreases as training progresses. The EPSILON_DECAY hyperparameter controls the rate at which this probability decreases.
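
For a sense of scale, with the hyperparameter values used below (EPSILON_DECAY = 0.995 and MIN_EPSILON = 0.01), epsilon decays geometrically from 1 and only reaches its floor after roughly 900 episodes. The short check below is illustrative only.

import numpy as np

EPSILON_DECAY = 0.995
MIN_EPSILON = 0.01

# Episodes needed for a geometric decay from 1 to reach the floor
n_floor = np.log(MIN_EPSILON) / np.log(EPSILON_DECAY)
print(round(n_floor))  # roughly 919 of the 1000 training episodes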

import numpy as np
import torch
import torch.optim as optim

GAMMA = 0.99
LEARNING_RATE = 1e-3
BATCH_SIZE = 64
BUFFER_SIZE = 10000
EPSILON_DECAY = 0.995
MIN_EPSILON = 0.01
TARGET_UPDATE = 10

def train(env, episodes=1000):
    obs_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    # The policy network is trained directly; the target network is a
    # periodically updated copy used to compute the Bellman targets
    policy_net = QNetwork(obs_dim, action_dim)
    target_net = QNetwork(obs_dim, action_dim)
    target_net.load_state_dict(policy_net.state_dict())
    target_net.eval()
    
    optimizer = optim.Adam(policy_net.parameters(), lr=LEARNING_RATE)
    replay_buffer = ReplayBuffer(BUFFER_SIZE)
    epsilon = np.ones(episodes)
    total_reward = np.zeros(episodes)
    
    for episode in range(episodes):
        obs, _ = env.reset()
        obs = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
        done = False
        
        while not done:
            # Epsilon-greedy action selection: explore with probability epsilon
            if random.random() < epsilon[episode]:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    action = policy_net(obs).argmax().item()
            
            next_obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_obs = torch.tensor(next_obs, dtype=torch.float32).unsqueeze(0)
            replay_buffer.push(obs, action, reward, next_obs, done)
            
            obs = next_obs
            total_reward[episode] += reward
            
            # Once enough transitions are stored, train on a random minibatch
            if len(replay_buffer) >= BATCH_SIZE:
                batch = replay_buffer.sample(BATCH_SIZE)
                obss, actions, rewards, next_obss, dones = zip(*batch)

                obss = torch.cat(obss)
                actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
                rewards = torch.tensor(rewards, dtype=torch.float32).unsqueeze(1)
                next_obss = torch.cat(next_obss)
                dones = torch.tensor(dones, dtype=torch.float32).unsqueeze(1)

                # Q-values of the actions that were actually taken
                q_values = policy_net(obss).gather(1, actions)
                # Bellman target computed with the frozen target network
                next_q_values = target_net(next_obss).max(1, keepdim=True)[0]
                target_q_values = rewards + GAMMA * next_q_values * (1 - dones)

                loss = F.mse_loss(q_values, target_q_values.detach())
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        
        # Periodically sync the target network with the policy network
        if episode % TARGET_UPDATE == 0:
            target_net.load_state_dict(policy_net.state_dict())

        # Decay epsilon towards its floor for the next episode
        if episode < episodes - 1:
            epsilon[episode+1] = max(MIN_EPSILON, epsilon[episode] * EPSILON_DECAY)
    
    return policy_net, total_reward, epsilon

We can now train the network over 1000 episodes.

import gymnasium as gym

env = gym.make('LunarLander-v3', render_mode='rgb_array')
trained_policy, nn_training_rewards, epsilon = train(env)

And plot the total reward as a function of episode number. Note the slight upward trend over the course of training.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

ax.plot(nn_training_rewards)
ax.set_xlabel('Episode')
ax.set_ylabel('Total Reward')
plt.show()
[Figure: total reward per episode over the course of training.]

We can now use this trained policy over another 500 random episodes to see how it compares to the modulo and logical policies.

nn_rewards = np.zeros((500))
render = None
for repeat in range(nn_rewards.shape[0]):
    current_rewards = 0
    obs = env.reset()[0]
    current_render = []
    for step in range(env.spec.max_episode_steps):
        # Greedy action from the trained policy network
        action = trained_policy(torch.tensor(obs, dtype=torch.float32).unsqueeze(0)).argmax().item()
        obs, reward, terminated, truncated, info = env.step(action)
        current_rewards += reward
        current_render.append(env.render())
        if terminated or truncated:
            break
    # Keep the frames from the best-scoring episode so far
    if current_rewards > nn_rewards.max():
        render = current_render
    nn_rewards[repeat] = current_rewards

env.close()
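
Before comparing against the other policies, it can be useful to glance at simple summary statistics of the 500 evaluation episodes; this quick check is not part of the original comparison.

print(f'Mean reward: {nn_rewards.mean():.1f}')
print(f'Std. dev.:   {nn_rewards.std():.1f}')
print(f'Best reward: {nn_rewards.max():.1f}')
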
# Load the stored rewards for the modulo and logical policies for comparison
total_rewards = np.loadtxt('total_rewards.txt')

fig, ax = plt.subplots()

ax.hist(total_rewards[0], label='Modulo Policy', density=True)
ax.hist(total_rewards[1], label='Logical Policy', density=True)
ax.hist(nn_rewards, label='Neural Network Policy', density=True)
ax.legend()
ax.set_xlabel('Reward')
ax.set_ylabel('p(Reward)')
plt.show()

We can see that, even with no hyperparameter optimisation, the neural network policy on average outperforms the other two approaches.