An Empirical Exploration of Recurrent Network Architectures: Paper Notes

This post contains my notes about the paper titled “An Empirical Exploration of Recurrent Network Architectures”. You can read the paper here – http://proceedings.mlr.press/v37/jozefowicz15.pdf

Abstract of the paper – 

The Recurrent Neural Network (RNN) is an extremely powerful sequence model that is often difficult to train. The Long Short-Term Memory (LSTM) is a specific RNN architecture whose design makes it much easier to train. While wildly successful in practice, the LSTM’s architecture appears to be ad-hoc so it is not clear if it is optimal, and the significance of its individual components is unclear.

In this work, we aim to determine whether the LSTM architecture is optimal or whether much better architectures exist. We conducted a thorough architecture search where we evaluated over ten thousand different RNN architectures, and identified an architecture that outperforms both the LSTM and the recently-introduced Gated Recurrent Unit (GRU) on some but not all tasks. We found that adding a bias of 1 to the LSTM’s forget gate closes the gap between the LSTM and the GRU.

The abstract above is quoted directly from the paper.

Introduction –

The LSTM is resistant to the vanishing gradient problem. Second-order optimization techniques, regularization of the RNN weights, and careful RNN initialization are other ways to handle the vanishing gradient problem in RNNs. However, because LSTMs are easy to use, they have become the standard way of dealing with vanishing gradients.

A criticism of the LSTM architecture is that it is ad-hoc and that it has a substantial number of components whose purpose is not immediately apparent. As a result, it is also not clear that the LSTM is an optimal architecture, and it is possible that better architectures exist.

Compared to the GRU (and similar architectures) and the vanilla LSTM, an LSTM variant achieved the best results whenever dropout was used. In addition, adding a bias of 1 to the forget gate closes the gap between the LSTM and the better architectures. Thus, the authors recommend increasing the forget gate bias before attempting more sophisticated approaches.
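As a concrete illustration of this recommendation, here is a minimal sketch of initializing the forget gate bias to 1, assuming PyTorch's nn.LSTM (the layer sizes are arbitrary choices of mine, not the paper's):

```python
import torch
import torch.nn as nn

# A minimal sketch, assuming PyTorch's nn.LSTM, whose bias vectors are stacked
# as four chunks of size hidden_size in the order [input, forget, cell, output].
hidden_size = 128
lstm = nn.LSTM(input_size=64, hidden_size=hidden_size, num_layers=1)

with torch.no_grad():
    for name, bias in lstm.named_parameters():
        if name.startswith("bias_ih"):
            # Set the forget-gate chunk (second quarter of the bias) to 1.
            # Note: nn.LSTM also has a bias_hh that is added to the same gate,
            # so you may prefer to split the 1.0 between the two bias vectors.
            bias[hidden_size:2 * hidden_size].fill_(1.0)
```

With this initialization the forget gate starts out mostly open, so the cell state is carried forward by default and gradients can flow across many time steps from the start of training.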

By performing ablative experiments, it was discovered that the input gate is important, that the output gate is unimportant, and that the forget gate is extremely significant on all problems except language modelling.

Section 2 –

Because of their iterative nature, RNNs suffer from both exploding and vanishing gradients.

The exploding gradient problem can be handled by gradient clipping. However, if clipping has to be applied too frequently, or the gradient has to be shrunk by a massive factor, learning suffers. Gradient clipping is extremely effective whenever the gradient has a small norm the majority of the time.
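For reference, here is a hedged sketch of what gradient-norm clipping looks like in a training loop, assuming PyTorch; the dummy model, dummy loss, and the threshold of 5.0 are illustrative choices, not the paper's settings:

```python
import torch
import torch.nn as nn

# Minimal sketch of gradient clipping by global norm, assuming PyTorch.
model = nn.RNN(input_size=8, hidden_size=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 4, 8)            # (seq_len, batch, features) dummy input
output, _ = model(x)
loss = output.pow(2).mean()          # dummy loss for illustration

optimizer.zero_grad()
loss.backward()
# Rescale the whole gradient vector if its global norm exceeds the threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```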

The vanishing gradient problem is more challenging because it does not necessarily make the gradient itself small: the gradient's component in directions that correspond to long-term dependencies is small, while its component in directions that correspond to short-term dependencies is large. As a result, RNNs easily learn the short-term but not the long-term dependencies.

The LSTM addresses the vanishing gradient problem by reparameterizing the RNN. Thus, while the LSTM does not have a representational advantage, its gradient cannot vanish.

Just as a tanh-based network has better-behaved gradients than a sigmoid-based network, the gradients of an RNN that computes ∆S_t (an additive change to its state) are nicer as well, since they cannot vanish.
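To make the additive reparameterization concrete, below is a minimal NumPy sketch of a single LSTM-style step in which the cell state is updated by adding a gated increment ∆S_t rather than being recomputed from scratch (the variable names and weight layout are my own, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b, hidden_size):
    """One LSTM step written to highlight the additive cell-state update.

    W: (4*hidden, input) input weights, U: (4*hidden, hidden) recurrent weights,
    b: (4*hidden,) biases, stacked in the order [input, forget, candidate, output].
    """
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * hidden_size:1 * hidden_size])   # input gate
    f = sigmoid(z[1 * hidden_size:2 * hidden_size])   # forget gate
    g = np.tanh(z[2 * hidden_size:3 * hidden_size])   # candidate increment
    o = sigmoid(z[3 * hidden_size:4 * hidden_size])   # output gate

    delta_c = i * g                   # the additive change the network computes
    c_t = f * c_prev + delta_c        # old state + gated increment
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

Because c_t depends on c_prev only through the forget-gate scaling plus addition, the backward signal through the cell state is not repeatedly squashed by a nonlinearity, which is also why the forget gate bias matters so much.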

The GRU outperformed the LSTM on nearly all tasks except language modelling with the naive initialization, but the LSTM nearly matched the GRU's performance once its forget gate bias was initialized to 1.

Summary –

We report the performance of a baseline Tanh RNN, the GRU, the LSTM, the LSTM where the forget gate bias is set to 1 (LSTM-b), and the three best architectures discovered by the search procedure (named MUT1, MUT2, and MUT3). We also evaluated an LSTM without input gates (LSTM-i), an LSTM without output gates (LSTM-o), and an LSTM without forget gates (LSTM-f).

Our findings are summarized below:

  • The GRU outperformed the LSTM on all tasks with the exception of language modelling
  • MUT1 matched the GRU’s performance on language modelling and outperformed it on all other tasks
  • The LSTM significantly outperformed all other architectures on PTB language modelling when dropout was allowed
  • The LSTM with the large forget bias outperformed both the LSTM and the GRU on almost all tasks
  • While MUT1 is the best performer on two of the music datasets, it is notable that the LSTM-i and LSTM-o achieve the best results on the music dataset when dropout is used.

 
