Long Short-Term Memory Networks#
Another popular network type for sequential data is the long short-term memory (LSTM) network. These networks were explicitly designed to address some of the limitations of recurrent neural networks.
Traditional recurrent neural networks suffer from what is known as the vanishing gradient problem: long-term dependencies fade over the span of the data because, during backpropagation through the sequence, repeated application of the activation function causes the gradient to shrink exponentially. As a result, recurrent neural networks are often described as having short-term memories.
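To see why this happens, consider a gradient that is scaled by a factor smaller than one at every step as it is propagated back through the sequence. A minimal, purely illustrative sketch (the factor of 0.5 stands in for the derivative of a saturating activation) is:
gradient = 1.0
factor = 0.5  # an assumed, purely illustrative per-step shrinkage factor
for step in range(20):
    gradient *= factor
print(f"Gradient after 20 steps: {gradient:.1e}")  # prints 9.5e-07
After only 20 steps the gradient is effectively zero, so updates driven by early parts of the sequence vanish.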
Long short-term memory networks include what are known as memory cells, which use gates to selectively store, update, and forget information:
Forget Gate: controls what information should be discarded from the cell state.
Input Gate: controls what new information should be stored in the cell state.
Output Gate: controls the output based on the cell state and the hidden state.
Together, these gates prevent information from being lost over long sequences of data.
A Single Perceptron#
A single LSTM perceptron can be built from the constituent gates. For each data point, the LSTM updates the hidden state, \(h_i\), and the cell state, \(c_i\), as follows:
First, a forget gate is used to determine what should be discarded,
\[ f_i = f(W_f x_i + U_f h_{i-1} + b_f), \]where \(f\) is the logistic function, and \(W_f\), \(U_f\), and \(b_f\) are parameters of the forget gate (hence the \(f\) subscript).
An input gate is concurrently used to select what should be stored,
\[ n_i = f(W_nx_i + U_nh_{i-1} + b_n). \]These are both used to update the cell state, \(c_i\),
\[ \tilde{c}_i = \tanh(W_cx_i + U_ch_{i-1} + b_c), \]\[ c_i = f_i \odot c_{i-1} + n_i \odot \tilde{c}_i, \]where \(\odot\) indicates elementwise multiplication, known as the Hadamard product.
The output gate is then used to determine the final output, \(o_i\), and update the hidden state,
\[ o_i = f(W_ox_i + U_o h_{i-1} + b_o), \]\[ h_i = o_i \odot \tanh(c_i). \]
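As a brief aside, the Hadamard product used in these updates is simply elementwise multiplication, which NumPy performs with the * operator (the values below are arbitrary):
import numpy as np

# Arbitrary vectors standing in for a forget gate activation and a cell state
f_i = np.array([0.9, 0.1, 0.5])
c_prev = np.array([1.0, 2.0, -3.0])

print(f_i * c_prev)  # [ 0.9  0.2 -1.5]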
Let’s put this together in a Python class.
import numpy as np


def logistic(x):
    """
    Computes the logistic sigmoid function element-wise.
    :param x: Input to the sigmoid function.
    :return: Element-wise sigmoid of x.
    """
    return 1 / (1 + np.exp(-x))


class LSTMPerceptron:
    """
    A simple LSTM perceptron that processes a sequence of inputs.
    :param input_size: The size of the input vectors.
    :param hidden_size: The size of the hidden state vectors.
    """
    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        # Forget gate parameters
        self.W_f = np.random.randn(hidden_size, input_size) * 0.1
        self.U_f = np.random.randn(hidden_size, hidden_size) * 0.1
        self.b_f = np.zeros((hidden_size, 1))
        # Input gate parameters
        self.W_n = np.random.randn(hidden_size, input_size) * 0.1
        self.U_n = np.random.randn(hidden_size, hidden_size) * 0.1
        self.b_n = np.zeros((hidden_size, 1))
        # Candidate cell state parameters
        self.W_c = np.random.randn(hidden_size, input_size) * 0.1
        self.U_c = np.random.randn(hidden_size, hidden_size) * 0.1
        self.b_c = np.zeros((hidden_size, 1))
        # Output gate parameters
        self.W_o = np.random.randn(hidden_size, input_size) * 0.1
        self.U_o = np.random.randn(hidden_size, hidden_size) * 0.1
        self.b_o = np.zeros((hidden_size, 1))
        # Hidden state and cell state, initialised to zero
        self.h_i = np.zeros((hidden_size, 1))
        self.c_i = np.zeros((hidden_size, 1))

    def step(self, x_i):
        """
        Processes a single time step of input x_i.
        :param x_i: Input vector at time step i.
        :return: Hidden state vector at time step i.
        """
        x_i = x_i.reshape(-1, 1)
        # Forget, input, and output gate activations
        f_i = logistic(np.dot(self.W_f, x_i) + np.dot(self.U_f, self.h_i) + self.b_f)
        n_i = logistic(np.dot(self.W_n, x_i) + np.dot(self.U_n, self.h_i) + self.b_n)
        o_i = logistic(np.dot(self.W_o, x_i) + np.dot(self.U_o, self.h_i) + self.b_o)
        # Candidate cell state, then the cell state and hidden state updates
        c_tilde_i = np.tanh(np.dot(self.W_c, x_i) + np.dot(self.U_c, self.h_i) + self.b_c)
        self.c_i = f_i * self.c_i + n_i * c_tilde_i
        self.h_i = o_i * np.tanh(self.c_i)
        return self.h_i
lstm = LSTMPerceptron(input_size=1, hidden_size=5)
We will apply it to the same data used for the recurrent neural network.
import pandas as pd

# Load the electricity production data, keep the 2017 values,
# and normalise them by their maximum
timeseries = pd.read_csv('../data/electric.csv')
timeseries['date'] = pd.to_datetime(timeseries['date'])
timeseries_2017 = timeseries['elec-prod'][timeseries['date'].dt.year == 2017]
timeseries_2017 /= timeseries_2017.max()

for i, x_i in enumerate(timeseries_2017):
    h_i = lstm.step(np.array([x_i]))
    print(f"Step {i+1}: Hidden State = {h_i.ravel()}")
Step 1: Hidden State = [-0.0092979 0.00460023 -0.0060376 0.01460878 0.04731623]
Step 2: Hidden State = [-0.01509705 0.00726388 -0.0071582 0.02177839 0.06277836]
Step 3: Hidden State = [-0.01867985 0.0089761 -0.00744923 0.02647005 0.07067719]
Step 4: Hidden State = [-0.01997683 0.00949786 -0.00680705 0.02766413 0.06918647]
Step 5: Hidden State = [-0.0206425 0.00988484 -0.00670873 0.02888141 0.07018736]
Step 6: Hidden State = [-0.02171067 0.01051579 -0.00711642 0.03090982 0.0746179 ]
Step 7: Hidden State = [-0.02318524 0.01133451 -0.00768074 0.03343673 0.08046245]
Step 8: Hidden State = [-0.02411137 0.01174446 -0.0076631 0.03452519 0.08169245]
Step 9: Hidden State = [-0.02393302 0.01156288 -0.00713726 0.03382937 0.07821083]
Step 10: Hidden State = [-0.0232123 0.01119903 -0.00672175 0.03275804 0.07478593]
Step 11: Hidden State = [-0.02287214 0.01109847 -0.00677561 0.03258892 0.07477654]
Step 12: Hidden State = [-0.02384609 0.01174858 -0.00762672 0.0347723 0.08161606]
You will notice that the LSTM perceptron has significantly more parameters than a recurrent neural network, which increases the computational cost of using an LSTM network.
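A rough count makes this concrete: the LSTM perceptron holds a weight matrix pair and a bias for each of the forget, input, output, and candidate cell-state computations, roughly four times the parameters of a simple recurrent cell. A minimal sketch of that count, using the sizes from the example above, is:
input_size, hidden_size = 1, 5

# One set of parameters: W (hidden x input), U (hidden x hidden), and a bias
per_set = hidden_size * input_size + hidden_size * hidden_size + hidden_size

# A simple recurrent cell has one such set; the LSTM perceptron has four
rnn_params = per_set
lstm_params = 4 * per_set

print(rnn_params, lstm_params)  # 35 140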
Implementation with pytorch#
Like the recurrent neural network, the LSTM network is implemented in pytorch.
Again, you should study the documentation for the implementation.
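As a minimal sketch, an LSTM layer with the same sizes as the class above can be created and applied to a dummy batch as follows; consult the documentation for the full set of options:
import torch
import torch.nn as nn

# An LSTM layer matching the sizes used above
lstm = nn.LSTM(input_size=1, hidden_size=5, batch_first=True)

# A dummy batch of one sequence with 12 time steps of a single feature
x = torch.randn(1, 12, 1)

# output contains the hidden state at every time step;
# h_n and c_n are the final hidden and cell states
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([1, 12, 5])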