Transfer Learning

Graham Annett

October 07th, 2016

The premise of transfer learning is that a model trained on one dataset can be reused and applied to a different dataset. While the notion has been around for quite some time, it has recently become useful, along with domain adaptation, as a way to use pre-trained neural networks for highly specific tasks (such as in Kaggle competitions) and in various fields.

Prerequisites

For this post, I will be using Keras 1.0.3 configured with TensorFlow 0.8.0.

Simple Overview and Example

Before using VGG-16 with pre-trained weights, let's first work through a simple example on our own small net to see how it works. For this example we will set up a small net on CIFAR-10 images (converted to grayscale and resized to 28x28) and then fine-tune its last layers on MNIST.

import numpy as np

from keras.datasets import cifar10, mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils.np_utils import to_categorical
from scipy.misc import imresize


def rgb_g(img):
   grayscaled = 0.2989 * img[:,0,:,:] + 0.5870 * img[:,1,:,:] + 0.1140 * img[:,2,:,:]
   return grayscaled

(X, Y), (_, _) = cifar10.load_data()

nb_classes = len(np.unique(Y))

Y = to_categorical(Y, nb_classes)
X = X.astype('float32')/255.

# converts 3 channels to 1 and resizes image
X = rgb_g(X)
X_tmp = []
for i in range(X.shape[0]):
   X_tmp.append(imresize(X[i], (28,28)))
X = np.array(X_tmp)
X = X.reshape(-1,1,28,28)


model = Sequential()
model.add(Convolution2D(32,3,3, border_mode='same', input_shape=(1,28,28)))
model.add(Activation('relu'))
model.add(Convolution2D(32,3,3, border_mode='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D((2,2)))
model.add(Dropout(.25))

model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
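model.compile(loss='categorical_crossentropy', optimizer='adadelta')

# initial training on the converted CIFAR-10 images; these settings mirror the
# MNIST fit used later and are only illustrative
model.fit(X, Y, batch_size=32, nb_epoch=5, validation_split=.2)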

One thing to notice is that the input to our neural net is 1x28x28. This is important because the data we feed in must match this shape, and the MNIST and CIFAR datasets do not share the same image size or number of color channels (MNIST is 1x28x28 while CIFAR10 is 3x32x32, where the first dimension is the number of channels in the image). There are a few ways to accommodate this, but generally you are bound by what the prior weights and model were trained on and must resize and adjust your input accordingly (for instance, grayscale images can be repeated from 1 channel into 3 channels for use with RGB-trained models).
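As a minimal sketch of the channel-repetition trick mentioned above, the snippet below copies a single grayscale channel three times so the array has the shape an RGB-trained model would expect. The batch is random placeholder data and the variable names are hypothetical; it assumes the channels-first layout (N, channels, height, width) used throughout this post.

```python
import numpy as np

# Hypothetical batch of 4 grayscale images in channels-first layout: (N, 1, H, W)
gray_batch = np.random.rand(4, 1, 28, 28).astype('float32')

# Copy the single channel three times along the channel axis -> (N, 3, H, W)
rgb_like = np.repeat(gray_batch, 3, axis=1)

print(rgb_like.shape)  # (4, 3, 28, 28)
```

All three channels hold identical values, so no color information is created; the array simply matches the expected input shape.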

With this model we will now load data from MNIST and fit again, but fine-tune only the last few layers. First, let's look at the model and some of its features.

> model.layers
[<keras.layers.convolutional.Convolution2D at 0x1368fe358>,
<keras.layers.core.Activation at 0x1368fe3c8>,
<keras.layers.convolutional.Convolution2D at 0x136905ba8>,
<keras.layers.core.Activation at 0x136905898>,
<keras.layers.convolutional.MaxPooling2D at 0x136930828>,
<keras.layers.core.Dropout at 0x136930860>,
<keras.layers.core.Flatten at 0x136947550>,
<keras.layers.core.Dense at 0x136973240>,
<keras.layers.core.Activation at 0x136973780>,
<keras.layers.core.Dropout at 0x13697ef98>,
<keras.layers.core.Dense at 0x136988a20>,
<keras.layers.core.Activation at 0x136e29ef0>]

> model.layers[0].trainable
True

With Keras, we can specify whether we want a layer to be trainable or not. A trainable layer is one whose weights are updated when the model is fit. For this experiment we will be doing what is called fine-tuning on only the last layer, without changing the number of classes. We want the earlier layers to keep what they have learned, so we will set all the layers but the last 2 to be non-trainable such that their learned weights stay the same:

for l in range(len(model.layers)-2):
   model.layers[l].trainable=False
model.compile(loss='categorical_crossentropy', optimizer='adadelta')

Note: we must also recompile every time we adjust the model's layers. Compiling is often a slow, tedious step with Theano, so it can be useful to use TensorFlow when initially experimenting.
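To make the idea of a frozen layer concrete, here is a toy NumPy sketch of a gradient step that skips non-trainable layers. It is not how Keras implements this internally; the dictionary structure, the `sgd_step` helper, and the placeholder gradients are all hypothetical and exist only for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Two toy "layers", each with a weight matrix and a trainable flag
layers = [
    {"W": rng.randn(4, 3), "trainable": False},  # frozen feature extractor
    {"W": rng.randn(3, 2), "trainable": True},   # head being fine-tuned
]

def sgd_step(layers, grads, lr=0.1):
    """Apply one gradient step, skipping layers marked as non-trainable."""
    for layer, grad in zip(layers, grads):
        if layer["trainable"]:
            layer["W"] -= lr * grad

before = [layer["W"].copy() for layer in layers]
grads = [np.ones_like(layer["W"]) for layer in layers]  # placeholder gradients
sgd_step(layers, grads)

frozen_unchanged = np.array_equal(before[0], layers[0]["W"])
head_changed = not np.array_equal(before[1], layers[1]["W"])
print(frozen_unchanged, head_changed)  # True True
```

The frozen layer still takes part in the forward pass; only its weight update is skipped.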

Now we can train for a few epochs on the MNIST dataset and see how well its previously learned weights work.

(X_mnist, y_mnist), (_, _) = mnist.load_data()
y_mnist = to_categorical(y_mnist)
X_mnist = X_mnist.reshape(-1,1,28,28)

model.fit(X_mnist, y_mnist, batch_size=32, nb_epoch=5, validation_split=.2)
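The `validation_split=.2` argument holds out part of the training data for validation. Keras documents that the held-out samples are taken from the end of the arrays, before any shuffling. The NumPy sketch below mimics that behaviour on toy data; `split_last_fraction` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def split_last_fraction(x, y, validation_split=0.2):
    """Hold out the last `validation_split` fraction of the samples."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Toy data: 10 samples with labels 0..9
x = np.arange(10).reshape(10, 1)
y = np.arange(10)

(x_tr, y_tr), (x_val, y_val) = split_last_fraction(x, y)
print(len(x_tr), len(x_val))  # 8 2
```

Because the split is positional, a dataset that is sorted by class should be shuffled before it is passed to `fit` with `validation_split`.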

We can also train on a dataset but use different final layers in the model. If, for instance, you were interested in fine-tuning the model on some dataset with a single binary classification output, you could do something like:

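# drop the 10-way softmax head (the final Dense layer and its Activation)
model.pop()
model.pop()
# a single sigmoid unit outputs the probability of the positive class
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adadelta')

# x_train, y_train: your own binary-labelled dataset (not defined here)
model.fit(x_train, y_train)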
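One detail worth checking when the head is reduced to a single unit is the output activation. A softmax normalizes across the units of a layer, so with only one unit it returns 1 for every input, whereas a sigmoid maps the single logit to a probability between 0 and 1. The NumPy sketch below illustrates the difference on made-up logits; the helper functions are written out by hand rather than taken from Keras.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up logits: three samples, one output unit each
logits = np.array([[-2.0], [0.0], [3.0]])

softmax_out = softmax(logits)
sigmoid_out = sigmoid(logits)

print(softmax_out.ravel())  # [1. 1. 1.]
print(sigmoid_out.ravel())
```

A sigmoid output pairs naturally with a binary cross-entropy loss, while a softmax output with categorical cross-entropy needs at least two units.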

While this example is quite small and the weights are easily learned, it isn't uncommon for network weights to take a few days or even weeks to learn. Also, having a large pre-trained network can be useful both to gauge your own network's results and to incorporate into other aspects of your deep learning model.

Using VGG-16 for Transfer Learning

There are a few well-known pre-trained models and weights that you could plausibly train on your own computer, but the training time is often much too long and requires specialized hardware. VGG-16 is perhaps one of the best known of these, but there are many others, and Caffe has a nice listing of them.

Using VGG-16 is quite simple and gives you a previously trained model that is quite adaptable, without having to spend a large amount of time training. With this type of model, we can load the model with its pre-trained weights and then replace the final layers to adapt it to, for instance, a binary classification problem. Using these pre-trained networks usually takes specialized hardware, so it may not work on all computers and GPUs.

You need to download the pre-trained weights, available here; there is also a gist explaining their general use.

from keras.models import Sequential
from keras.layers.core import Flatten, Dense, Dropout
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.optimizers import Adam

model = Sequential()
model.add(ZeroPadding2D((1, 1), input_shape=(3, 224, 224)))
model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))

model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(128, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(128, 3, 3, activation='relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))

model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))

model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))

model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1, 1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))

model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1000, activation='softmax'))


model.load_weights('vgg16_weights.h5')

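# freeze every layer that is kept (all but the last two, which are removed below)
for l in model.layers[:-2]:
   l.trainable = False
# remove the 1000-way softmax layer and the Dropout before it
model.pop()
model.pop()
model.add(Dropout(0.5))
# single sigmoid unit for binary classification
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])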
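To get a rough sense of why training a network like this from scratch is expensive, the parameter count can be tallied directly from the layer listing above. The sketch below is plain Python arithmetic, not a Keras call. It assumes 3x3 kernels with bias terms, padding that preserves the spatial size, and a 2x2 pooling step after each block, as in the listing.

```python
# Filters per conv layer for each of the five convolutional blocks listed above
conv_blocks = [[64, 64], [128, 128], [256, 256, 256], [512, 512, 512], [512, 512, 512]]
dense_units = [4096, 4096, 1000]

channels, size, total_params = 3, 224, 0
for block in conv_blocks:
    for filters in block:
        # 3x3 kernel weights for every input/output channel pair, plus one bias per filter
        total_params += 3 * 3 * channels * filters + filters
        channels = filters
    size //= 2  # each block ends with 2x2 max pooling

flat = channels * size * size  # length of the vector produced by Flatten()
inputs = flat
for units in dense_units:
    total_params += inputs * units + units  # weights plus biases
    inputs = units

print(flat, total_params)  # 25088 138357544
```

Most of the roughly 138 million parameters sit in the first fully connected layer, which is one reason to freeze the pre-trained layers and train only a small new head.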

You should now be able to try out your own models to experience the benefits of transfer learning.

About the author

Graham Annett is an NLP engineer at Kip. He has been interested in deep learning for a bit over a year and has worked with and contributed to Keras. He can be found on GitHub or here.