Comprehensive Guide on LSTM

Last updated: Aug 12, 2023

A comparison between an RNN layer and an LSTM layer:

The relationship between $\mathbf{c}_t$ and $\mathbf{h}_t$ is as follows:

$$\mathbf{h}_t=\mathrm{tanh}(\mathbf{c}_t)$$

Here, we apply $\mathrm{tanh}$ to each element in the vector $\mathbf{c}_t$. This means that $\mathbf{c}_t$ and $\mathbf{h}_t$ have the same size; for instance, if $\mathbf{c}_t$ has 100 elements, then $\mathbf{h}_t$ will also have 100 elements.
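
As a quick illustration, here is a minimal NumPy sketch (the vector size of 100 is just for illustration) showing that applying $\mathrm{tanh}$ element-wise leaves the size unchanged:

```python
import numpy as np

c_t = np.random.randn(100)   # cell state with 100 elements
h_t = np.tanh(c_t)           # tanh applied to each element separately

print(c_t.shape, h_t.shape)  # (100,) (100,) -- same size
```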

Anatomy of an LSTM layer

Output gate

The output gate governs how much information is passed on as the hidden state $\mathbf{h}_t$.

$$\mathbf{o}=\sigma(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{o})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{o})}+\mathbf{b}^{(\mathbf{o})})$$

Where:

  • $\sigma$ is the sigmoid function

  • $\mathbf{W}_\mathbf{x}^{(\mathbf{o})}$ and $\mathbf{W}_\mathbf{h}^{(\mathbf{o})}$ are the weight matrices dedicated to the output gate, and $\mathbf{b}^{(\mathbf{o})}$ is its bias vector

Since every value in the vector passes through the sigmoid function $\sigma$, the output vector $\mathbf{o}$ holds values between 0 and 1, where 0 implies that no information should be passed and 1 implies that all information should be passed.

The hidden state vector $\mathbf{h}_t$ will be computed as follows:

$$\mathbf{h}_t=\mathbf{o}\;\odot\;\mathrm{tanh}(\mathbf{c}_t)$$

Here, the operation $\odot$ represents element-wise multiplication, which is often known as the Hadamard product.
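
To make this concrete, below is a minimal NumPy sketch of the output gate and the resulting hidden state. The input size D, hidden size H, the random weights, and the sigmoid helper are illustrative assumptions, not values from this guide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D, H = 8, 4                        # assumed input and hidden sizes
x_t    = np.random.randn(1, D)     # current input x_t
h_prev = np.random.randn(1, H)     # previous hidden state h_{t-1}
c_t    = np.random.randn(1, H)     # current cell state c_t

Wx_o = np.random.randn(D, H)       # W_x^(o)
Wh_o = np.random.randn(H, H)       # W_h^(o)
b_o  = np.zeros(H)                 # b^(o)

o   = sigmoid(x_t @ Wx_o + h_prev @ Wh_o + b_o)   # output gate, values in (0, 1)
h_t = o * np.tanh(c_t)                            # element-wise (Hadamard) product
```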

Forget gate

The forget gate is used to "forget" or discard unneeded information from $\mathbf{c}_{t-1}$. We can obtain the output of the forget gate, say $\mathbf{f}$, like so:

$$\mathbf{f}=\sigma(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{f})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{f})}+\mathbf{b}^{(\mathbf{f})})$$
$$\mathbf{c}_t=\mathbf{f}\odot\mathbf{c}_{t-1}$$

Note that this is only the forgetting half of the cell update; the full update, which also adds new information, appears in the summary below. Just like for the other gates, the value of $\mathbf{f}$ changes at each time step. This is important to keep in mind when we explore how back-propagation works in LSTMs.
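
Here is an analogous sketch of the forget gate (again with illustrative NumPy shapes and random weights). Entries of $\mathbf{f}$ close to 0 erase the corresponding entries of $\mathbf{c}_{t-1}$, while entries close to 1 keep them:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D, H = 8, 4
x_t    = np.random.randn(1, D)     # current input
h_prev = np.random.randn(1, H)     # previous hidden state
c_prev = np.random.randn(1, H)     # previous cell state c_{t-1}

Wx_f, Wh_f, b_f = np.random.randn(D, H), np.random.randn(H, H), np.zeros(H)

f   = sigmoid(x_t @ Wx_f + h_prev @ Wh_f + b_f)   # forget gate, values in (0, 1)
c_t = f * c_prev                                  # discard part of the old cell state
```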

Memory cell

If we only have the forget gate, then the network is only capable of forgetting information. In order for the network to remember new information, we introduce the new-memory cell $\mathbf{g}$:

$$\mathbf{g}=\mathrm{tanh}(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{g})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{g})}+\mathbf{b}^{(\mathbf{g})})$$

This is not a gate because we do not use the sigmoid function here; instead, we use the $\mathrm{tanh}$ curve to encode new information.

Input gate

The input gate governs whether the to-be-added information is valuable or not. Without the input gate, the network would add any new information indiscriminately; the input gate allows us to take in only the new information that will benefit our cause.

$$\mathbf{i}=\sigma(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{i})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{i})}+\mathbf{b}^{(\mathbf{i})})$$

Again, since this is a gate, we use the sigmoid function.
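
The following sketch combines the new-memory cell and the input gate (NumPy assumed; sizes and random weights are illustrative). The candidate $\mathbf{g}$ encodes new information, and $\mathbf{i}$ scales it element-wise before it is added to the cell state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D, H = 8, 4
x_t, h_prev = np.random.randn(1, D), np.random.randn(1, H)

Wx_g, Wh_g, b_g = np.random.randn(D, H), np.random.randn(H, H), np.zeros(H)
Wx_i, Wh_i, b_i = np.random.randn(D, H), np.random.randn(H, H), np.zeros(H)

g = np.tanh(x_t @ Wx_g + h_prev @ Wh_g + b_g)   # new-memory candidate, values in (-1, 1)
i = sigmoid(x_t @ Wx_i + h_prev @ Wh_i + b_i)   # input gate, values in (0, 1)
new_information = g * i                         # what actually gets added to c_t
```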

Summary

The gates and new-memory cell of our LSTM layer are as follows:

$$\begin{align*} \mathbf{f}&=\sigma(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{f})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{f})}+\mathbf{b}^{(\mathbf{f})})\\ \mathbf{g}&=\mathrm{tanh}(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{g})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{g})}+\mathbf{b}^{(\mathbf{g})})\\ \mathbf{i}&=\sigma(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{i})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{i})}+\mathbf{b}^{(\mathbf{i})})\\ \mathbf{o}&=\sigma(\mathbf{x}_t\mathbf{W}_\mathbf{x}^{(\mathbf{o})}+\mathbf{h}_{t-1}\mathbf{W}_\mathbf{h}^{(\mathbf{o})}+\mathbf{b}^{(\mathbf{o})})\\ \end{align*}$$

The new cell vector and hidden state vector are as follows:

$$\begin{align*} \mathbf{c}_t&=\mathbf{f}\odot\mathbf{c}_{t-1}+\mathbf{g}\odot\mathbf{i}\\ \mathbf{h}_t&=\mathbf{o}\odot\mathrm{tanh}(\mathbf{c}_t) \end{align*}$$
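
Putting the four equations together, a minimal single-time-step LSTM forward pass might look like the following NumPy sketch. The function name lstm_step, the parameter layout, and the sizes are assumptions made for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    Wx, Wh, b = params  # dicts keyed by gate name: 'f', 'g', 'i', 'o'
    f = sigmoid(x_t @ Wx['f'] + h_prev @ Wh['f'] + b['f'])   # forget gate
    g = np.tanh(x_t @ Wx['g'] + h_prev @ Wh['g'] + b['g'])   # new-memory candidate
    i = sigmoid(x_t @ Wx['i'] + h_prev @ Wh['i'] + b['i'])   # input gate
    o = sigmoid(x_t @ Wx['o'] + h_prev @ Wh['o'] + b['o'])   # output gate
    c_t = f * c_prev + g * i                                 # new cell state
    h_t = o * np.tanh(c_t)                                   # new hidden state
    return h_t, c_t

# Toy usage with D = 8 inputs and H = 4 hidden units (assumed sizes).
D, H = 8, 4
Wx = {k: np.random.randn(D, H) for k in 'fgio'}
Wh = {k: np.random.randn(H, H) for k in 'fgio'}
b  = {k: np.zeros(H) for k in 'fgio'}
h_t, c_t = lstm_step(np.random.randn(1, D), np.zeros((1, H)), np.zeros((1, H)), (Wx, Wh, b))
```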

Speeding up computation

To speed up computation, we can bundle the four affine transformations above into a single one. We do this by combining the weight matrices to create one giant weight matrix:

$$\mathbf{x}_t\mathbf{W}_\mathbf{x}+\mathbf{h}_{t-1}\mathbf{W}_h+\mathbf{b}$$

Where:

$$\begin{align*} \mathbf{W}_\mathbf{x}&=\begin{pmatrix} \mathbf{W}_\mathbf{x}^{(\mathbf{f})}& \mathbf{W}_\mathbf{x}^{(\mathbf{g})}& \mathbf{W}_\mathbf{x}^{(\mathbf{i})}& \mathbf{W}_\mathbf{x}^{(\mathbf{o})} \end{pmatrix}\\ \mathbf{W}_\mathbf{h}&=\begin{pmatrix} \mathbf{W}_\mathbf{h}^{(\mathbf{f})}& \mathbf{W}_\mathbf{h}^{(\mathbf{g})}& \mathbf{W}_\mathbf{h}^{(\mathbf{i})}& \mathbf{W}_\mathbf{h}^{(\mathbf{o})} \end{pmatrix}\\ \mathbf{b}&=\begin{pmatrix} \mathbf{b}^{(\mathbf{f})}& \mathbf{b}^{(\mathbf{g})}& \mathbf{b}^{(\mathbf{i})}& \mathbf{b}^{(\mathbf{o})} \end{pmatrix} \end{align*}$$

The reason this is faster than performing four separate matrix operations is that CPUs and GPUs are better at computing one large matrix operation in one go than many small ones.
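
As a sketch of what this fusion looks like in code (NumPy assumed; sizes and random weights are illustrative), the four weight matrices are concatenated column-wise so that a single affine transform produces all gate pre-activations at once, which are then sliced back apart:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D, H = 8, 4
Wx = np.random.randn(D, 4 * H)   # [ W_x^(f) | W_x^(g) | W_x^(i) | W_x^(o) ]
Wh = np.random.randn(H, 4 * H)   # [ W_h^(f) | W_h^(g) | W_h^(i) | W_h^(o) ]
b  = np.zeros(4 * H)             # [ b^(f)   | b^(g)   | b^(i)   | b^(o)   ]

x_t, h_prev, c_prev = np.random.randn(1, D), np.zeros((1, H)), np.zeros((1, H))

A = x_t @ Wx + h_prev @ Wh + b   # one big affine transform, shape (1, 4H)

f = sigmoid(A[:, 0*H:1*H])       # slice the pre-activations back into the four parts
g = np.tanh(A[:, 1*H:2*H])
i = sigmoid(A[:, 2*H:3*H])
o = sigmoid(A[:, 3*H:4*H])

c_t = f * c_prev + g * i
h_t = o * np.tanh(c_t)
```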

Deep LSTM

One prominent way of increasing the performance of LSTMs is to use multiple LSTM layers:

Here, each LSTM layer has its own memory cell $\mathbf{c}$, which is kept within that layer; only the hidden state $\mathbf{h}$ is passed up to the next layer.
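
A rough NumPy sketch of this stacking is given below: the hidden state of one layer becomes the input of the layer above, while each layer keeps its own cell state. It repeats the illustrative lstm_step function from the Summary sketch so the snippet is self-contained, and make_params is a hypothetical helper for creating random weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    Wx, Wh, b = params
    f = sigmoid(x_t @ Wx['f'] + h_prev @ Wh['f'] + b['f'])
    g = np.tanh(x_t @ Wx['g'] + h_prev @ Wh['g'] + b['g'])
    i = sigmoid(x_t @ Wx['i'] + h_prev @ Wh['i'] + b['i'])
    o = sigmoid(x_t @ Wx['o'] + h_prev @ Wh['o'] + b['o'])
    c_t = f * c_prev + g * i
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def make_params(in_dim, hid_dim):
    Wx = {k: np.random.randn(in_dim, hid_dim) for k in 'fgio'}
    Wh = {k: np.random.randn(hid_dim, hid_dim) for k in 'fgio'}
    b  = {k: np.zeros(hid_dim) for k in 'fgio'}
    return (Wx, Wh, b)

num_layers, D, H = 2, 8, 4                               # assumed sizes
layers = [make_params(D if l == 0 else H, H) for l in range(num_layers)]
h = [np.zeros((1, H)) for _ in range(num_layers)]        # one hidden state per layer
c = [np.zeros((1, H)) for _ in range(num_layers)]        # one cell state per layer

inp = np.random.randn(1, D)                              # x_t for the bottom layer
for l in range(num_layers):
    h[l], c[l] = lstm_step(inp, h[l], c[l], layers[l])   # each layer updates its own c
    inp = h[l]                                           # h of layer l feeds layer l + 1
```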

Published by Isshin Inada