Comprehensive Guide on LSTM

Machine Learning › Neural Networks

Last updated: Mar 10, 2022

(Figure: a comparison between an RNN layer and an LSTM layer.)

The relationship between $\mathbf{c}_t$ and $\mathbf{h}_t$ is as follows:

$$\mathbf{h}_t=\mathrm{tanh}(\mathbf{c}_t)$$

Here, we are applying $\mathrm{tanh}$ to each element in vector $\mathbf{c}_t$. This means that $\mathbf{c}_t$ and $\mathbf{h}_t$ have the same size; for instance, if $\mathbf{c}_t$ has 100 elements, then $\mathbf{h}_t$ also has 100 elements.
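As a quick sanity check, the element-wise $\mathrm{tanh}$ can be sketched in NumPy (the variable names here are illustrative, not from any particular library):

```python
import numpy as np

c_t = np.random.randn(100)   # cell state vector with 100 elements
h_t = np.tanh(c_t)           # tanh applied to each element independently

# h_t has exactly the same shape as c_t, and every value lies in (-1, 1)
print(c_t.shape, h_t.shape)
```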

Anatomy of LSTM layer

Output gate

The output gate governs how much information is passed on to the next layer as $\mathbf{h}_t$.

$$\mathbf{o}=\sigma(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{o})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{o})}+\mathbf{b}^{(\mathbf{o})})$$

Where:

  • $\sigma$ is the sigmoid function

  • $\mathbf{W}_\mathbf{x}^{(\mathbf{o})}$ and $\mathbf{W}_\mathbf{h}^{(\mathbf{o})}$ are the weight matrices dedicated to the output gate, and $\mathbf{b}^{(\mathbf{o})}$ is its bias vector

Since all values in the vector pass through the sigmoid function $\sigma$, the output vector $\mathbf{o}$ holds values between 0 and 1, where 0 implies that no information should be passed, whereas 1 implies that all information should be passed.

The hidden state vector $\mathbf{h}_t$ will be computed as follows:

$$\mathbf{h}_t=\mathbf{o}\;\odot\;\mathrm{tanh}(\mathbf{c}_t)$$

Here, the operation $\odot$ represents element-wise multiplication, which is often known as the Hadamard product.
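The output gate and the resulting hidden state can be sketched in NumPy as follows. The dimensions, weight names (`W_xo`, `W_ho`, `b_o`), and the `sigmoid` helper are all illustrative choices, not part of any library API:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 3, 4                            # input size and hidden size (arbitrary)

x_t = rng.standard_normal((1, D))      # current input (row vector)
h_prev = rng.standard_normal((1, H))   # previous hidden state h_{t-1}
c_t = rng.standard_normal((1, H))      # current cell state

W_xo = rng.standard_normal((D, H))     # input-to-output-gate weights
W_ho = rng.standard_normal((H, H))     # hidden-to-output-gate weights
b_o = np.zeros((1, H))                 # output-gate bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

o = sigmoid(x_t @ W_xo + h_prev @ W_ho + b_o)  # every entry lies in (0, 1)
h_t = o * np.tanh(c_t)                         # element-wise (Hadamard) product
```

Note that `*` on NumPy arrays of the same shape is exactly the Hadamard product $\odot$.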

Forget gate

The forget gate is used to "forget" or discard unneeded information from $\mathbf{c}_{t-1}$. We can obtain the output of the forget gate, say $\mathbf{f}$, like so:

$$\mathbf{f}=\sigma(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{f})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{f})}+\mathbf{b}^{(\mathbf{f})})$$
$$\mathbf{c}_t=\mathbf{f}\odot\mathbf{c}_{t-1}$$

Just like for the other gates, the value of $\mathbf{f}$ changes at each time step. This is important to keep in mind when we explore how back-propagation works in LSTM.
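A minimal NumPy sketch of the forget gate, again with illustrative names and sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 3, 4                            # input size and hidden size (arbitrary)

x_t = rng.standard_normal((1, D))      # current input
h_prev = rng.standard_normal((1, H))   # previous hidden state h_{t-1}
c_prev = rng.standard_normal((1, H))   # previous cell state c_{t-1}

W_xf = rng.standard_normal((D, H))     # input-to-forget-gate weights
W_hf = rng.standard_normal((H, H))     # hidden-to-forget-gate weights
b_f = np.zeros((1, H))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f = sigmoid(x_t @ W_xf + h_prev @ W_hf + b_f)  # forget gate, values in (0, 1)
c_t = f * c_prev                               # scale down (forget) old cell state
```

Because every entry of `f` is strictly between 0 and 1, each element of the old cell state can only shrink in magnitude here.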

Memory cell

If we only have the forget gate, then the network is only capable of forgetting information. In order for the network to remember new information, we introduce the memory cell.

$$\mathbf{g}=\mathrm{tanh}(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{g})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{g})}+\mathbf{b}^{(\mathbf{g})})$$

This is not a gate, because we are not using a sigmoid function here; instead, we use the $\mathrm{tanh}$ curve in order to encode new information.

Input gate

The input gate governs whether the to-be-added information is valuable or not. Without the input gate, the network would add any new information indiscriminately; the input gate allows us to keep only the new information that benefits our task.

$$\mathbf{i}=\sigma(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{i})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{i})}+\mathbf{b}^{(\mathbf{i})})$$

Again, since this is a gate, we use the sigmoid function.

Summary

The gates and new-memory cell of our LSTM layer are as follows:

$$\begin{align*} \mathbf{f}&=\sigma(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{f})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{f})}+\mathbf{b}^{(\mathbf{f})})\\ \mathbf{g}&=\mathrm{tanh}(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{g})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{g})}+\mathbf{b}^{(\mathbf{g})})\\ \mathbf{i}&=\sigma(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{i})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{i})}+\mathbf{b}^{(\mathbf{i})})\\ \mathbf{o}&=\sigma(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{o})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{o})}+\mathbf{b}^{(\mathbf{o})})\\ \end{align*}$$

The new cell vector and hidden state vector are as follows:

$$\begin{align*} \mathbf{c}_t&=\mathbf{f}\odot\mathbf{c}_{t-1}+\mathbf{g}\odot\mathbf{i}\\ \mathbf{h}_t&=\mathbf{o}\odot\mathrm{tanh}(\mathbf{c}_t) \end{align*}$$
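The full forward step above can be collected into one function. This is a plain NumPy sketch of the equations (the parameter dictionary and function name are my own choices, not a library API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the equations above (row-vector convention)."""
    f = sigmoid(x_t @ params["W_xf"] + h_prev @ params["W_hf"] + params["b_f"])
    g = np.tanh(x_t @ params["W_xg"] + h_prev @ params["W_hg"] + params["b_g"])
    i = sigmoid(x_t @ params["W_xi"] + h_prev @ params["W_hi"] + params["b_i"])
    o = sigmoid(x_t @ params["W_xo"] + h_prev @ params["W_ho"] + params["b_o"])
    c_t = f * c_prev + g * i          # forget old memory, add gated new memory
    h_t = o * np.tanh(c_t)            # expose a gated view of the cell state
    return h_t, c_t

rng = np.random.default_rng(42)
D, H = 3, 4
params = {}
for gate in "fgio":
    params[f"W_x{gate}"] = rng.standard_normal((D, H)) * 0.1
    params[f"W_h{gate}"] = rng.standard_normal((H, H)) * 0.1
    params[f"b_{gate}"] = np.zeros((1, H))

x_t = rng.standard_normal((1, D))
h_prev = np.zeros((1, H))
c_prev = np.zeros((1, H))
h_t, c_t = lstm_step(x_t, h_prev, c_prev, params)
```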

Speeding up computation

To speed up computation, we can bundle the four affine transformations above. We do this by combining the weight matrices to create one giant weight matrix:

$$\mathbf{x}_t\mathbf{W}_\mathbf{x}+\mathbf{h}_{t-1}\mathbf{W}_h+\mathbf{b}$$

Where:

$$\begin{align*} \mathbf{W}_\mathbf{x}&=\begin{pmatrix} \mathbf{W}_\mathbf{x}^{(\mathbf{f})}& \mathbf{W}_\mathbf{x}^{(\mathbf{g})}& \mathbf{W}_\mathbf{x}^{(\mathbf{i})}& \mathbf{W}_\mathbf{x}^{(\mathbf{o})} \end{pmatrix}\\ \mathbf{W}_\mathbf{h}&=\begin{pmatrix} \mathbf{W}_\mathbf{h}^{(\mathbf{f})}& \mathbf{W}_\mathbf{h}^{(\mathbf{g})}& \mathbf{W}_\mathbf{h}^{(\mathbf{i})}& \mathbf{W}_\mathbf{h}^{(\mathbf{o})} \end{pmatrix}\\ \mathbf{b}&=\begin{pmatrix} \mathbf{b}^{(\mathbf{f})}& \mathbf{b}^{(\mathbf{g})}& \mathbf{b}^{(\mathbf{i})}& \mathbf{b}^{(\mathbf{o})} \end{pmatrix} \end{align*}$$

The reason why this is faster than performing four separate matrix operations is that CPUs and GPUs are better at computing one large matrix operation in a single go.
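The bundling trick can be sketched like so: compute one affine transformation of shape $(1, 4H)$, then slice it back into the four gate pre-activations. The slice ordering (f, g, i, o) matches the matrix layout above; real libraries may order the gates differently:

```python
import numpy as np

rng = np.random.default_rng(7)
D, H = 3, 4

# One big matrix holding the four per-gate matrices side by side: shape (D, 4H)
W_x = rng.standard_normal((D, 4 * H))
W_h = rng.standard_normal((H, 4 * H))
b = np.zeros((1, 4 * H))

x_t = rng.standard_normal((1, D))
h_prev = rng.standard_normal((1, H))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single affine transformation instead of four
A = x_t @ W_x + h_prev @ W_h + b            # shape (1, 4H)

# Slice the result back out into the four gates
f = sigmoid(A[:, 0*H:1*H])
g = np.tanh(A[:, 1*H:2*H])
i = sigmoid(A[:, 2*H:3*H])
o = sigmoid(A[:, 3*H:4*H])
```

Slicing the bundled result gives exactly the same values as multiplying by each per-gate block of `W_x` and `W_h` separately.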

Deep LSTM

One prominent way of increasing the performance of LSTM is to use multiple LSTM layers:

Here, each LSTM layer has its own memory cell $\mathbf{c}$, which is kept within that layer.
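Stacking can be sketched by feeding the hidden state of one layer as the input of the next, while each layer holds on to its own cell state. Everything below (the bundled-weight `lstm_step`, the layer dictionaries) is an illustrative assumption, not a library API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One time step with bundled weights; W_x: (D, 4H), W_h: (H, 4H)."""
    H = h_prev.shape[1]
    A = x_t @ W_x + h_prev @ W_h + b
    f = sigmoid(A[:, :H])
    g = np.tanh(A[:, H:2*H])
    i = sigmoid(A[:, 2*H:3*H])
    o = sigmoid(A[:, 3*H:])
    c_t = f * c_prev + g * i
    return o * np.tanh(c_t), c_t

rng = np.random.default_rng(0)
D = H = 4                      # equal sizes so layers can be chained directly
num_layers = 2

layers = [
    dict(W_x=rng.standard_normal((D, 4*H)) * 0.1,
         W_h=rng.standard_normal((H, 4*H)) * 0.1,
         b=np.zeros((1, 4*H)),
         h=np.zeros((1, H)),
         c=np.zeros((1, H)))   # each layer keeps its own cell state c
    for _ in range(num_layers)
]

x_t = rng.standard_normal((1, D))
inp = x_t
for layer in layers:
    layer["h"], layer["c"] = lstm_step(inp, layer["h"], layer["c"],
                                       layer["W_x"], layer["W_h"], layer["b"])
    inp = layer["h"]           # hidden state feeds the next layer up
```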

Published by Isshin Inada