Learning to learn gradient descent by gradient descent


Now, we will look at one of the interesting meta learning algorithms, called learning to learn gradient descent by gradient descent. Isn't the name kind of daunting? Well, in fact, it is one of the simplest meta learning algorithms. We know that, in meta learning, our goal is to learn the learning process. In general, how do we train our neural networks? We train our network by computing the loss and minimizing it through gradient descent. So, we optimize our model using gradient descent. Instead of using gradient descent, can we learn this optimization process automatically?

But how can we learn this? We replace our traditional optimizer (gradient descent) with a Recurrent Neural Network (RNN). But how does this work? How can we replace gradient descent with an RNN? If you examine it closely, what are we really doing in gradient descent? It is essentially a sequence of updates from the output layer to the input layer, and we store these updates in a state. So, we can use an RNN and store the updates in an RNN cell.

So, the main idea of this algorithm is to replace gradient descent with an RNN. But the question is, how does the RNN learn? How do we optimize the RNN? To optimize the RNN, we use gradient descent. So, in a nutshell, we are learning to perform gradient descent through an RNN, and that RNN is itself optimized by gradient descent; that's what is meant by the name learning to learn gradient descent by gradient descent.
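
To make the idea concrete, here is a minimal sketch in plain Python contrasting the hand-designed gradient descent update with a learned update produced by an RNN (the function names and the rnn_optimizer placeholder are illustrative, not from the book):

```python
# Hand-designed optimizer: the update rule is fixed by us.
def gradient_descent_step(theta, grad, lr=0.01):
    return theta - lr * grad

# Learned optimizer: the update is produced by an RNN with its own
# parameters and hidden state (rnn_optimizer is a placeholder here).
def learned_step(theta, grad, h, rnn_optimizer):
    g, h = rnn_optimizer(grad, h)   # the RNN decides how to change theta
    return theta + g, h
```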

We call our RNN an optimizer and our base network an optimizee. Let's say we have a model $f$ parameterized by some parameter $\theta$. We need to find the optimal parameter $\theta$ so that we can minimize the loss. In general, we find this optimal parameter through gradient descent, but now we use the RNN to find it. The RNN (optimizer) finds the optimal parameter and sends it to the optimizee (base network); the optimizee uses this parameter, computes the loss, and sends the loss back to the RNN. Based on the loss, the RNN optimizes itself through gradient descent and updates the model parameter $\theta$.

Confusing? Look at the following diagram: our optimizee (base network) is optimized through our optimizer (RNN). The optimizer sends the updated parameters, that is, the weights, to the optimizee, and the optimizee uses these weights, calculates the loss, and sends the loss back to the optimizer. Based on the loss, the optimizer improves itself through gradient descent.

Let's say our base network (optimizee) is parameterized by $\theta$ and our RNN (optimizer) is parameterized by $\phi$. What is the loss function of the optimizer? We know that the role of the optimizer (RNN) is to reduce the loss of the optimizee (base network). So the loss of our optimizer is the average loss of the optimizee, and it can be represented as follows:

$$L(\phi) = \mathbb{E}_f\left[ f\big(\theta^*(f, \phi)\big) \right]$$

Here, $\theta^*(f, \phi)$ is the final optimizee parameter produced by the optimizer. How do we minimize this loss? We minimize it through gradient descent by finding the right $\phi$. Okay, what does the RNN take as input, and what output does it return? Our optimizer, that is, our RNN, takes as input the gradient of the optimizee, $\nabla_t$, as well as its previous state, $h_t$, and returns as output an update, $g_t$, that minimizes the loss of our optimizee. Let's denote our RNN by a function $m$:

$$(g_t, h_{t+1}) = m(\nabla_t, h_t, \phi)$$
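
As a rough illustration, the function $m$ can be sketched as a small LSTM applied coordinate-wise to the gradient. This is only a sketch under my own assumptions, written in PyTorch; the class name RNNOptimizer and the hidden size are not from the book:

```python
import torch
import torch.nn as nn

class RNNOptimizer(nn.Module):
    """Maps (gradient, previous state) to (update g_t, next state h_{t+1})."""
    def __init__(self, hidden_size=20):
        super().__init__()
        # Each optimizee parameter's gradient is a 1-D input feature, so one
        # small LSTM cell is shared across all parameters (coordinate-wise).
        self.cell = nn.LSTMCell(input_size=1, hidden_size=hidden_size)
        self.out = nn.Linear(hidden_size, 1)   # produces the update g_t

    def forward(self, grad, state):
        # grad: (num_params, 1); state: tuple of two (num_params, hidden_size) tensors
        h, c = self.cell(grad, state)
        g = self.out(h)                        # proposed update for each parameter
        return g, (h, c)
```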

In the previous equation, the following applies:

  • $\nabla_t$ is the gradient of our model (optimizee) $f$, that is, $\nabla_t = \nabla_{\theta} f(\theta_t)$
  • $h_t$ is the hidden state of the RNN
  • $\phi$ is the parameter of the RNN
  • The outputs $g_t$ and $h_{t+1}$ are the update and the next state of the RNN, respectively

So, we update our model parameter values using $\theta_{t+1} = \theta_t + g_t$.
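
Continuing the earlier sketch (rnn_opt is an instance of the hypothetical RNNOptimizer, and theta and loss are assumed to already exist), a single update step would look roughly like this:

```python
# Initial hidden state of the optimizer RNN: one row per optimizee parameter.
state = (torch.zeros(theta.numel(), 20), torch.zeros(theta.numel(), 20))

# One step of the learned rule theta_{t+1} = theta_t + g_t:
grad, = torch.autograd.grad(loss, theta, create_graph=True)  # gradient of the optimizee
g, state = rnn_opt(grad.view(-1, 1), state)                  # RNN proposes the update g_t
theta = theta + g.view_as(theta)                             # apply the update
```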

As you can see in the following diagram, our optimizer $m$, at a time step $t$, takes in a hidden state $h_t$ and the gradient of $f(\theta_t)$, namely $\nabla_t$, as inputs, computes the update $g_t$, and sends it to our optimizee, where it is added to $\theta_t$ to become $\theta_{t+1}$ for the update at the next time step.

So, in this way, we learn to perform gradient descent optimization through gradient descent.
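
Putting the pieces together, the following is a minimal end-to-end sketch of this loop, under several assumptions of mine (a toy quadratic optimizee, the hypothetical RNNOptimizer class sketched earlier, a fixed unroll length, and Adam as the outer optimizer). It is not the author's implementation, but it shows the nesting: the inner loop updates $\theta$ with the RNN's output $g_t$, while the outer loop updates the RNN's parameters $\phi$ by ordinary gradient descent on the summed optimizee loss:

```python
import torch

# Toy optimizee: minimize f(theta) = mean((W @ theta - y)^2) for fixed W, y.
W, y = torch.randn(10, 3), torch.randn(10)

def optimizee_loss(theta):
    return ((W @ theta - y) ** 2).mean()

rnn_opt = RNNOptimizer(hidden_size=20)                  # optimizer m, parameters phi
meta_opt = torch.optim.Adam(rnn_opt.parameters(), lr=1e-3)

for meta_step in range(100):                            # outer loop: train phi
    theta = torch.zeros(3, requires_grad=True)          # fresh optimizee parameters
    state = (torch.zeros(3, 20), torch.zeros(3, 20))    # initial hidden state h_0
    meta_loss = 0.0

    for t in range(20):                                 # inner loop: unroll optimization
        loss = optimizee_loss(theta)
        meta_loss = meta_loss + loss                    # optimizer's loss: sum over steps
        grad, = torch.autograd.grad(loss, theta, create_graph=True)
        g, state = rnn_opt(grad.view(-1, 1), state)     # (g_t, h_{t+1}) = m(grad_t, h_t)
        theta = theta + g.view(-1)                      # theta_{t+1} = theta_t + g_t

    meta_opt.zero_grad()
    meta_loss.backward()                                # gradient descent on phi
    meta_opt.step()
```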
