Neural Network Policies
Neural networks are highly effective for learning reinforcement learning policies. As with other neural network examples, the optimised network can be hard to interpret. Let's build a network and train it on the LunarLander-v3 environment.
Our policy network will be a simple feedforward network with three linear layers (two hidden layers).
import torch.nn as nn
import torch.nn.functional as F


class QNetwork(nn.Module):
    """
    A simple feedforward neural network with 2 hidden layers.

    :param obs_dim: The dimension of the input obs.
    :param action_dim: The dimension of the output action.
    """
    def __init__(self, obs_dim, action_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(obs_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, x):
        """
        Forward pass of the network.

        :param x: The input obs.
        :return: The estimated value of each action.
        """
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
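As a quick sanity check, here is a minimal sketch that passes a dummy observation through the network; it assumes the LunarLander-v3 dimensions of an 8-dimensional observation space and 4 discrete actions.

import torch

# Hypothetical sanity check, not part of the training pipeline.
net = QNetwork(obs_dim=8, action_dim=4)
dummy_obs = torch.zeros(1, 8)   # batch of one observation
values = net(dummy_obs)         # one estimated value per action
print(values.shape)             # torch.Size([1, 4])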
The broader training strategy will involve a deep Q-network (DQN). This approach uses a replay buffer, which reduces the correlation between consecutive experiences while enabling the reuse of past experiences. For more on this structure, there is a good Stack Exchange answer on the subject. Below, we construct the ReplayBuffer class.
import random
from collections import deque


class ReplayBuffer:
    """
    A simple replay buffer for storing experiences.

    :param capacity: The maximum capacity of the buffer.
    """
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, obs, action, reward, next_obs, done):
        """
        Push an experience to the buffer.

        :param obs: The current obs.
        :param action: The action taken.
        :param reward: The reward received.
        :param next_obs: The next obs.
        :param done: Whether the episode is done.
        """
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        """
        Sample a batch of experiences from the buffer.

        :param batch_size: The size of the batch.
        :return: The batch of experiences.
        """
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        """
        :return: The length of the buffer.
        """
        return len(self.buffer)
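As a brief illustration (a sketch with made-up transition values), the buffer can be filled and sampled as follows; once capacity is reached, the oldest experience is discarded automatically by the deque.

# Hypothetical usage with dummy transitions.
buffer = ReplayBuffer(capacity=3)
buffer.push(obs=[0.0], action=1, reward=0.5, next_obs=[0.1], done=False)
buffer.push(obs=[0.1], action=0, reward=-1.0, next_obs=[0.2], done=True)
buffer.push(obs=[0.2], action=1, reward=0.0, next_obs=[0.3], done=False)
buffer.push(obs=[0.3], action=2, reward=1.0, next_obs=[0.4], done=False)  # evicts the oldest entry

print(len(buffer))          # 3, capped by the capacity
batch = buffer.sample(2)    # two randomly chosen experiences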
The final stage is to build the training loop. This is where the Q-learning component comes in. We can think of tabular Q-learning as a table with one cell per state-action pair, where each cell holds a Q-value that estimates how good that action is in a given state of the environment. The Q-values are updated iteratively using the Bellman equation,
\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right], \]
where \(s\) and \(s'\) are the current and next observations, \(a\) and \(a'\) are the current and next actions, \(r\) is the reward received for taking action \(a\), \(\alpha\) is the learning rate, and \(\gamma\) is the discount factor, which controls how much future rewards matter.
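To make the update concrete, here is a minimal sketch of a single tabular update; the table size, states, and transition values are hypothetical and purely for illustration.

import numpy as np

ALPHA, GAMMA = 0.1, 0.99   # hypothetical learning rate and discount factor

# Hypothetical table: 2 states x 2 actions, initialised to zero.
Q = np.zeros((2, 2))

# One imagined transition: in state 0 we take action 1, receive reward 1.0,
# and land in state 1.
s, a, r, s_next = 0, 1, 1.0, 1

# Bellman update: move Q(s, a) towards r + gamma * max_a' Q(s', a').
Q[s, a] = Q[s, a] + ALPHA * (r + GAMMA * Q[s_next].max() - Q[s, a])
print(Q[s, a])  # 0.1 after one update from zero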
Here, we use a deep Q-network, so the Q-values are estimated by the neural network rather than stored in an explicit table. To enable this, we create two networks: the policy_net, which is optimised and used to select actions, and the target_net, which provides the Q-values for the next observation. To add some randomness to the training, we include a Monte Carlo step used in a simulated annealing fashion: the likelihood that the Monte Carlo (randomly selected) action is taken decreases as training progresses. The EPSILON_DECAY hyperparameter controls how quickly this likelihood decreases.
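As a rough illustration of this schedule (a sketch using the hyperparameter values defined in the training code below), the exploration probability decays geometrically from 1.0 and is clipped at a floor of MIN_EPSILON, which it reaches after a little over 900 episodes.

import numpy as np

EPSILON_DECAY = 0.995
MIN_EPSILON = 0.01

# Exploration probability at the start of each episode, beginning at 1.0.
episodes = np.arange(1000)
epsilon_schedule = np.maximum(MIN_EPSILON, EPSILON_DECAY ** episodes)

print(epsilon_schedule[100])   # ~0.61: still mostly random actions
print(epsilon_schedule[500])   # ~0.08: mostly greedy actions
print(epsilon_schedule[-1])    # 0.01: the floor set by MIN_EPSILON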
import numpy as np
import torch
import torch.optim as optim

GAMMA = 0.99
LEARNING_RATE = 1e-3
BATCH_SIZE = 64
BUFFER_SIZE = 10000
EPSILON_DECAY = 0.995
MIN_EPSILON = 0.01
TARGET_UPDATE = 10


def train(env, episodes=1000):
    obs_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    policy_net = QNetwork(obs_dim, action_dim)
    target_net = QNetwork(obs_dim, action_dim)
    target_net.load_state_dict(policy_net.state_dict())
    target_net.eval()

    optimizer = optim.Adam(policy_net.parameters(), lr=LEARNING_RATE)
    replay_buffer = ReplayBuffer(BUFFER_SIZE)

    epsilon = np.ones(episodes)
    total_reward = np.zeros(episodes)

    for episode in range(episodes):
        obs, _ = env.reset()
        obs = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
        done = False

        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon[episode]:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    action = policy_net(obs).argmax().item()

            next_obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_obs = torch.tensor(next_obs, dtype=torch.float32).unsqueeze(0)

            replay_buffer.push(obs, action, reward, next_obs, done)
            obs = next_obs
            total_reward[episode] += reward

            if len(replay_buffer) >= BATCH_SIZE:
                batch = replay_buffer.sample(BATCH_SIZE)
                obss, actions, rewards, next_obss, dones = zip(*batch)

                obss = torch.cat(obss)
                actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
                rewards = torch.tensor(rewards, dtype=torch.float32).unsqueeze(1)
                next_obss = torch.cat(next_obss)
                dones = torch.tensor(dones, dtype=torch.float32).unsqueeze(1)

                # Bellman target built from the target network's estimates.
                q_values = policy_net(obss).gather(1, actions)
                next_q_values = target_net(next_obss).max(1, keepdim=True)[0]
                target_q_values = rewards + GAMMA * next_q_values * (1 - dones)

                loss = F.mse_loss(q_values, target_q_values.detach())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        # Periodically synchronise the target network with the policy network.
        if episode % TARGET_UPDATE == 0:
            target_net.load_state_dict(policy_net.state_dict())

        # Decay the exploration probability for the next episode.
        if episode < episodes - 1:
            epsilon[episode + 1] = max(MIN_EPSILON, epsilon[episode] * EPSILON_DECAY)

    return policy_net, total_reward, epsilon
We can now train the network over 1000 episodes.
import gymnasium as gym
env = gym.make('LunarLander-v3', render_mode='rgb_array')
trained_policy, nn_training_rewards, epsilon = train(env)
We can then plot the total reward as a function of episode. Note the slight upward trend over the course of training.
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(nn_training_rewards)
ax.set_xlabel('Episode')
ax.set_ylabel('Total Reward')
plt.show()

We can now run the trained policy for another 500 random episodes to see how it compares to the modulo and logical policies.
nn_rewards = np.zeros((500))
render = None

for repeat in range(nn_rewards.shape[0]):
    current_rewards = 0
    obs = env.reset()[0]
    current_render = []
    for step in range(env.spec.max_episode_steps):
        action = trained_policy(torch.tensor(obs, dtype=torch.float32).unsqueeze(0)).argmax().item()
        obs, reward, terminated, truncated, info = env.step(action)
        current_rewards += reward
        current_render.append(env.render())
        if terminated:
            break
    if current_rewards > nn_rewards.max():
        render = current_render
    nn_rewards[repeat] = current_rewards

env.close()
total_rewards = np.loadtxt('total_rewards.txt')
fig, ax = plt.subplots()
ax.hist(total_rewards[0], label='Modulo Policy', density=True)
ax.hist(total_rewards[1], label='Logical Policy', density=True)
ax.hist(nn_rewards, label='Neural Network Policy', density=True)
ax.legend()
ax.set_xlabel('Reward')
ax.set_ylabel('p(Reward)')
plt.show()
We can see that, even with no hyperparameter optimisation, the neural network policy outperforms the other two approaches on average.
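To quantify this, we could compare the mean rewards directly; the sketch below assumes total_rewards.txt has been loaded as above, with the modulo and logical policy rewards in its first two rows.

# Hypothetical summary, assuming total_rewards and nn_rewards are available.
print(f"Modulo policy mean reward:  {total_rewards[0].mean():.1f}")
print(f"Logical policy mean reward: {total_rewards[1].mean():.1f}")
print(f"Neural network mean reward: {nn_rewards.mean():.1f}")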