If we see a scene of a beach, we should boost beach activities in future frames. If we remember that Bob just arrived at a supermarket, then even without any distinctive supermarket features, an image of Bob holding a slab of bacon should probably be categorized as shopping instead of cooking.

So what we'd like is to let our model track the state of the world: after seeing each image, the model outputs a label and also updates the knowledge it's been learning. For example, the model might learn to automatically discover and track information like location (are scenes currently in a house or at a beach?).
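To make the "output a label and update the knowledge" loop concrete, here is a toy sketch. The tags, the `location` key, and the function name `classify_frame` are all hypothetical and the rules are hand-written, purely for illustration; the model described in this post is meant to learn such knowledge rather than have it coded in.

```python
def classify_frame(frame, state):
    """Return (label, new_state) for one frame.

    `frame` is a set of hypothetical tags detected in the image; `state` is
    the knowledge carried over from earlier frames. Both are hand-coded
    assumptions here, not something a real model would be given.
    """
    new_state = dict(state)
    # Update the tracked knowledge: remember where the scene takes place.
    if "supermarket" in frame:
        new_state["location"] = "supermarket"
    elif "kitchen" in frame:
        new_state["location"] = "kitchen"

    # Use the tracked knowledge to disambiguate the current frame.
    if "bacon" in frame:
        in_store = new_state.get("location") == "supermarket"
        label = "shopping" if in_store else "cooking"
    else:
        label = "unknown"
    return label, new_state


state = {}
labels = []
for frame in [{"bob", "supermarket"}, {"bob", "bacon"}]:
    label, state = classify_frame(frame, state)
    labels.append(label)
```

The second frame contains no supermarket features at all, yet it is labeled as shopping because the location was remembered from the first frame.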

Importantly, just as a neural network automatically discovers hidden patterns like edges, shapes, and faces without being fed them, our model should automatically discover useful information by itself. When given a new image, the model should incorporate the knowledge it's gathered to do a better job.

This, then, is a recurrent neural network. Instead of simply taking an image and returning an activity, an RNN also maintains internal memories about the world (weights assigned to different pieces of information) to help perform its classifications.

Mathematically

So let's add the notion of internal knowledge to our equations, which we can think of as pieces of information that the network maintains over time. But this is easy: we feed the knowledge from the previous time step back in alongside each new input, and that is what the equations of an RNN express.

So far, we've placed no constraints on this update, so its knowledge can change pretty chaotically. Perhaps it has a wealth of information to suggest that Alice is an investment analyst, but decides she's a professional assassin after seeing her cook.

This chaos means information quickly transforms and vanishes, and it's difficult for the model to keep a long-term memory.
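A small numeric sketch shows how quickly information fades under an unconstrained update. It assumes a one-dimensional memory, the common `tanh` form of the RNN update, and hand-picked weights `w` and `u`; none of these values come from the text, and a trained network would learn its own.

```python
import math


def rnn_step(h_prev, x, w=1.0, u=0.5):
    """One unconstrained update: h_t = tanh(w * x_t + u * h_{t-1}).

    The whole memory is rewritten at every step. The weights w and u are
    illustrative assumptions, not learned values.
    """
    return math.tanh(w * x + u * h_prev)


def run(inputs, h0=0.0):
    """Feed a sequence through the update and record the memory at each step."""
    h = h0
    history = []
    for x in inputs:
        h = rnn_step(h, x)
        history.append(h)
    return history


# A strong signal at the first step, followed by uninformative inputs.
with_signal = run([2.0] + [0.0] * 9)
without_signal = run([0.0] * 10)

# How much of the first input is still visible in the memory at each step.
trace = [abs(a - b) for a, b in zip(with_signal, without_signal)]
```

With these weights the trace of the first input shrinks by more than half at every step, so after ten steps almost nothing of it is left in the memory.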

So what we'd like is for the network to learn how to update its beliefs (scenes without Bob shouldn't change Bob-related information, scenes with Alice should focus on gathering details about her), in a way that its knowledge of the world evolves more gently.

This is how we do it.

Adding a forgetting mechanism. If a scene ends, for example, the model should forget the current scene location, the time of day, and reset any scene-specific information; however, if a character dies in the scene, it should continue remembering that he's no longer alive.

Adding a saving mechanism. When the model sees a new image, it needs to learn whether any information about the image is worth using and saving. Maybe your mom sent you an article about the Kardashians, but who cares?

So when a new input comes in, the model first forgets any long-term information it decides it no longer needs. Then it learns which parts of the new input are worth using, and saves them into its long-term memory.

Focusing long-term memory into working memory. Finally, the model needs to learn which parts of its long-term memory are immediately useful. For example, Bob's age may be a useful piece of information to keep in the long term (children are more likely to be crawling, adults are more likely to be working), but is probably irrelevant if he's not in the current scene.

So instead of using the full long-term memory all the time, the model learns which parts to focus on.
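The three mechanisms above can be sketched as element-wise operations on a memory vector. This is a minimal illustration under stated assumptions: the gate values are set by hand (in a trained network they would be computed by learned layers from the current input and the previous working memory), the two memory slots and their meanings are made up, and squashing the memory with `tanh` before focusing is one common choice rather than something the text specifies.

```python
import math


def gated_step(long_term, candidate, forget, save, focus):
    """Apply forget, save, and focus to one memory vector, slot by slot.

    forget, save, and focus hold values in [0, 1], one per memory slot.
    They are passed in by hand here (an assumption for illustration).
    """
    # 1. Forget: scale down the parts of the old long-term memory to drop.
    # 2. Save: add the parts of the new candidate information worth keeping.
    new_long_term = [
        f * m + s * c
        for f, m, s, c in zip(forget, long_term, save, candidate)
    ]
    # 3. Focus: expose only the currently useful slots as working memory.
    working = [o * math.tanh(m) for o, m in zip(focus, new_long_term)]
    return new_long_term, working


# Hypothetical slots: 0 = current scene location, 1 = "a character has died".
long_term = [0.8, 1.0]
candidate = [-0.6, 0.0]  # the new scene suggests a different location

new_long_term, working = gated_step(
    long_term,
    candidate,
    forget=[0.0, 1.0],  # drop the old location, keep remembering the death
    save=[1.0, 0.0],    # store the new location, nothing else
    focus=[1.0, 0.0],   # only the location matters for this frame
)
```

Here the old location is replaced, the death is carried forward untouched in long-term memory, and only the location slot shows up in the working memory used for the current frame.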

This, then, is a long short-term memory network. Whereas an RNN can overwrite its memory at each time step in a fairly uncontrolled fashion, an LSTM transforms its memory in a very precise way: it uses specific learned mechanisms for which pieces of information to forget, which to save, and which to focus on. This helps it keep track of information over longer periods of time.

We'll start with our long-term memory.

