Stacked Denoising Autoencoders

In this article by John Hearty, author of the book Advanced Machine Learning with Python, we discuss stacked denoising autoencoders. While autoencoders are valuable tools in themselves, significant gains in accuracy can be obtained by stacking autoencoders to form a deep network. This is achieved by feeding the representation created by the encoder on one layer into the next layer's encoder as its input.

Stacked Denoising Autoencoders (SdA) are currently in use by many leading data science teams for sophisticated natural language analyses as well as a broad range of signal, image, and text analyses.

The implementation of the SdA will be very familiar after the previous chapter's discussion of deep belief networks. The SdA is used in much the same way as the RBMs in our deep belief networks were used. Each layer of the deep architecture has a dA component and a sigmoid component, with the autoencoder component being used to pretrain the sigmoid network. The performance measure used by an SdA is the training set error, with an intensive period of layer-wise pretraining used to gradually align network parameters before a final period of fine-tuning. During fine-tuning, the whole network is trained on labeled training data and scored against validation and test data, over fewer epochs but with larger update steps. The goal is to have the network converge by the end of fine-tuning and deliver an accurate result.

In addition to delivering on the typical advantages of deep networks (the ability to learn feature representations for complex or high-dimensional datasets and train a model without extensive feature engineering), stacked autoencoders have an additional, very interesting property.

Correctly configured stacked autoencoders can capture a hierarchical grouping of their input data. Successive layers of an SdA may learn increasingly high-level features. While the first layer might learn some first-order features from input data (such as edges in a photo image), a second layer may learn some grouping of first-order features (for instance, configurations of edges that correspond to contours or structural elements in the input image).

There's no golden rule to determine how many layers or how large layers should be for a given problem. The best solution is usually to experiment with these model parameters until you find an optimal point. This experimentation is best done with a hyperparameter optimization technique or genetic algorithm (subjects we'll discuss in later chapters of this book).

Higher layers may learn increasingly high-order configurations, enabling an SdA to learn to recognise facial features, alphanumerical characters, or the generalised forms of objects (such as a bird). This is what gives SdAs their capability to learn very sophisticated, high-level abstractions of their input data.

Autoencoders can be stacked indefinitely, and it has been demonstrated that continuing to stack autoencoders can improve the effectiveness of the deep architecture (with the main constraint becoming computing cost in time). In this chapter, we'll look at stacking three autoencoders to solve a natural language processing challenge.

Applying SdA

Now that we've had a chance to understand the advantages and power of the SdA as a deep learning architecture, let's test our skills on a real-world dataset.

For this chapter, let's step away from image datasets and work with the OpinRank Review Dataset, a text dataset of around 259,000 hotel reviews from TripAdvisor, which is accessible via the UCI Machine Learning Repository. This freely available dataset provides review scores (as floating point numbers from 1 to 5) and review text for a broad range of hotels; we'll be applying our SdA to attempt to identify the score of each hotel from its review text.

We'll be applying our autoencoder to analyze a preprocessed version of this data, which is accessible from the GitHub share accompanying this chapter. We'll be discussing the techniques by which we prepare text data in an upcoming chapter. The source data is available at

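In order to get started, we're going to need a stacked denoising autoencoder (hereafter SdA) class:

class SdA(object):

    def __init__(
        self,
        numpy_rng,
        # n_ins, n_outs, and any remaining arguments are not shown in this excerpt
        hidden_layers_sizes=[500, 500],
        corruption_levels=[0.1, 0.1]
    ):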

As we previously discussed, the SdA is created by feeding the encoding from one layer's autoencoder as the input to the subsequent layer. This class supports the configuration of the layer count (given by the length of the hidden_layers_sizes vector, which the corruption_levels vector should match). It also supports differentiated layer sizes (in nodes) at each layer, which can be set using hidden_layers_sizes. As we discussed, the ability to configure successive layers of the autoencoder is critical to developing successful representations.
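To make the stacking idea concrete, here is a minimal sketch in plain Python 3 rather than the Theano code used in this chapter. The helper names (make_layer, encode, build_stack, stack_encode) are hypothetical, the layer sizes are toy values, and the weights are random rather than pretrained; the sketch only shows how layer sizes chain together and how each encoder's output becomes the next encoder's input.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create small random weights and zero biases for one sigmoid encoder."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)]
               for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases


def encode(layer, inputs):
    """Apply one encoder: sigmoid(inputs . W + b)."""
    weights, biases = layer
    outputs = []
    for j, bias in enumerate(biases):
        total = bias + sum(x * weights[i][j] for i, x in enumerate(inputs))
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs


def build_stack(n_ins, hidden_layers_sizes, rng):
    """Chain sizes: each layer's input size is the previous layer's output size."""
    sizes = [n_ins] + list(hidden_layers_sizes)
    return [make_layer(sizes[i], sizes[i + 1], rng)
            for i in range(len(hidden_layers_sizes))]


def stack_encode(layers, inputs):
    """Feed each encoder's representation into the next encoder."""
    representation = inputs
    for layer in layers:
        representation = encode(layer, representation)
    return representation


rng = random.Random(0)
layers = build_stack(n_ins=8, hidden_layers_sizes=[6, 4], rng=rng)
code = stack_encode(layers, [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
print(len(layers), len(code))  # number of layers, size of final representation
```

Here the number of layers follows directly from the length of hidden_layers_sizes, which mirrors how the class derives its depth.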

Next, we need parameters to store the MLP (self.sigmoid_layers) and dA (self.dA_layers) elements of the SdA. To specify the depth of our architecture, we use the self.n_layers parameter, which holds the number of sigmoid and dA layers required:

        self.sigmoid_layers = []
        self.dA_layers = []
        self.params = []
        self.n_layers = len(hidden_layers_sizes)

        assert self.n_layers > 0

Next, we need to construct our sigmoid and dA layers. We begin by setting each layer's input size, either from the size of the input vector or from the size of the preceding hidden layer; the layer's input is likewise either the raw input or the activation of the preceding layer. Following this, the sigmoid_layers and dA_layers components are created, with the dA layer drawing from the dA class we discussed earlier in this article:

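        for i in xrange(self.n_layers):
            # the first layer is sized by the input vector; later layers
            # are sized by the hidden layer below them
            if i == 0:
                input_size = n_ins
            else:
                input_size = hidden_layers_sizes[i - 1]

            # the first layer reads the raw input; later layers read the
            # activation of the layer below
            if i == 0:
                layer_input = self.x
            else:
                layer_input = self.sigmoid_layers[-1].output

            sigmoid_layer = HiddenLayer(rng=numpy_rng,
                                        input=layer_input,
                                        n_in=input_size,
                                        n_out=hidden_layers_sizes[i],
                                        activation=T.nnet.sigmoid)
            self.sigmoid_layers.append(sigmoid_layer)
            self.params.extend(sigmoid_layer.params)

            # the dA shares its weights and hidden bias with the sigmoid layer
            dA_layer = dA(numpy_rng=numpy_rng,
                          theano_rng=theano_rng,
                          input=layer_input,
                          n_visible=input_size,
                          n_hidden=hidden_layers_sizes[i],
                          W=sigmoid_layer.W,
                          bhid=sigmoid_layer.b)
            self.dA_layers.append(dA_layer)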

Having implemented the layers of our SdA, we'll need a final, logistic regression layer to complete the MLP component of the network:

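        # the logistic regression layer sits on top of the last sigmoid layer
        self.logLayer = LogisticRegression(
            input=self.sigmoid_layers[-1].output,
            n_in=hidden_layers_sizes[-1],
            n_out=n_outs
        )
        self.params.extend(self.logLayer.params)

        # cost minimized during fine-tuning, and the error rate on a minibatch
        self.finetune_cost = self.logLayer.negative_log_likelihood(self.y)
        self.errors = self.logLayer.errors(self.y)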

This completes the architecture of our SdA. Next up, we need to generate the training functions used by the SdA class. Each function will take the minibatch index (index) as an argument, together with several other elements; corruption_level and learning_rate are exposed here so that we can adjust them (for example, gradually increase or decrease them) during training. Additionally, we define two variables that mark where the batch starts and ends: batch_begin and batch_end, respectively.

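    def pretraining_functions(self, train_set_x, batch_size):
        index = T.lscalar('index')  # index of a minibatch
        corruption_level = T.scalar('corruption')  # proportion of corruption
        learning_rate = T.scalar('lr')  # learning rate to use
        # beginning and end of a batch, given index
        batch_begin = index * batch_size
        batch_end = batch_begin + batch_size

        pretrain_fns = []
        for dA in self.dA_layers:
            # get the cost and the list of updates for this layer
            cost, updates = dA.get_cost_updates(corruption_level,
                                                learning_rate)
            # compile one pretraining function per layer
            fn = theano.function(
                inputs=[
                    index,
                    theano.Param(corruption_level, default=0.2),
                    theano.Param(learning_rate, default=0.1)
                ],
                outputs=cost,
                updates=updates,
                givens={
                    self.x: train_set_x[batch_begin: batch_end]
                }
            )
            pretrain_fns.append(fn)

        return pretrain_fns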

The ability to dynamically adjust the learning rate is particularly helpful and may be applied in one of two ways. First, once a technique has begun to converge on an appropriate solution, it is very helpful to be able to reduce the learning rate. If you do not do this, you risk creating a situation in which the network oscillates between values located around the optimum without ever reaching it. Second, in some contexts it can be helpful to tie the learning rate to the network's performance measure: if the error rate is high, it can make sense to make larger adjustments until the error rate begins to decrease.
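Both ideas can be sketched in a few lines of plain Python 3. The function names, the inverse-time decay formula, and the linear scaling rule below are illustrative assumptions, not the schedule used by the chapter's code; any decreasing schedule or monotonic mapping from error to step size would serve the same purpose.

```python
def decayed_rate(initial_rate, epoch, decay=0.05):
    """Shrink the learning rate as training progresses (inverse-time decay)."""
    return initial_rate / (1.0 + decay * epoch)


def error_scaled_rate(error_rate, min_rate=0.01, max_rate=0.1):
    """Tie the step size to the error: high error gives larger updates."""
    clipped = min(max(error_rate, 0.0), 1.0)
    return min_rate + (max_rate - min_rate) * clipped


# Option 1: the rate falls steadily over the epochs.
rates = [decayed_rate(0.1, epoch) for epoch in range(0, 40, 10)]

# Option 2: the rate follows the current error rate.
rate_when_error_high = error_scaled_rate(0.5)
rate_when_error_low = error_scaled_rate(0.05)
```

Because corruption_level and learning_rate are ordinary arguments of each pretraining function, a value computed this way could be passed in on every call.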

Each pretraining function we've created takes the minibatch index and can optionally take the corruption level or learning rate. It performs one step of pretraining, applying the weight updates and returning the cost value.
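The two ingredients of a pretraining step, selecting a minibatch by index and corrupting it, can be sketched in plain Python 3. This assumes masking noise (each input zeroed with probability corruption_level), which is the usual choice for a denoising autoencoder; the chapter's dA class is not shown in this excerpt, so treat the corrupt helper and the toy data as hypothetical.

```python
import random


def minibatch(data, index, batch_size):
    """Return the rows of data belonging to minibatch number index."""
    batch_begin = index * batch_size
    batch_end = batch_begin + batch_size
    return data[batch_begin:batch_end]


def corrupt(row, corruption_level, rng):
    """Masking noise: zero each input with probability corruption_level."""
    return [0.0 if rng.random() < corruption_level else value
            for value in row]


rng = random.Random(42)
# Toy data: 10 rows of 4 strictly positive features.
train_set_x = [[float(i + j) for j in range(1, 5)] for i in range(10)]

batch = minibatch(train_set_x, index=2, batch_size=3)  # rows 6, 7, and 8
noisy_batch = [corrupt(row, corruption_level=0.2, rng=rng) for row in batch]
```

The autoencoder is then trained to reconstruct batch from noisy_batch, which is what pushes it toward robust features rather than a simple copy of its input.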

In addition to pretraining, we need to build functions to support the fine-tuning stage, in which the network is trained iteratively over labeled training data and scored against validation and test data to optimize network parameters. The train_fn function implements a single step of fine-tuning. The valid_score function is a Python function that computes a validation score using the error measure produced by the SdA over validation data. Similarly, test_score computes the error score over test data.

To get this process off the ground, we first need to set up training, validation, and test datasets. Each requires two sets (set x and set y), containing the features and class labels, respectively. The required number of minibatches for validation and test is determined, and an index is created to track the current minibatch (and to identify the entries at which a batch starts and ends). Training, validation, and testing occur for each batch and, afterward, both valid_score and test_score are calculated across all batches:

    def build_finetune_functions(self, datasets, batch_size, learning_rate):

        (train_set_x, train_set_y) = datasets[0]
        (valid_set_x, valid_set_y) = datasets[1]
        (test_set_x, test_set_y) = datasets[2]

        # number of minibatches for validation and testing
        n_valid_batches = valid_set_x.get_value(borrow=True).shape[0]
        n_valid_batches /= batch_size
        n_test_batches = test_set_x.get_value(borrow=True).shape[0]
        n_test_batches /= batch_size

        index = T.lscalar('index')  # index of a minibatch

        # gradients of the fine-tuning cost with respect to all parameters
        gparams = T.grad(self.finetune_cost, self.params)

        # one gradient descent update per parameter
        updates = [
            (param, param - gparam * learning_rate)
            for param, gparam in zip(self.params, gparams)
        ]

        train_fn = theano.function(
            inputs=[index],
            outputs=self.finetune_cost,
            updates=updates,
            givens={
                self.x: train_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: train_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='train'
        )

        test_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: test_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: test_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='test'
        )

        valid_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: valid_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: valid_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='valid'
        )

        # scan the entire validation set
        def valid_score():
            return [valid_score_i(i) for i in xrange(n_valid_batches)]

        # scan the entire test set
        def test_score():
            return [test_score_i(i) for i in xrange(n_test_batches)]

        return train_fn, valid_score, test_score
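The scoring functions return one error value per minibatch, so a single validation or test score is usually obtained by averaging that list. The plain Python 3 sketch below mimics this with made-up labels and predictions; batch_error and score are hypothetical stand-ins for the compiled Theano functions. It also shows a side effect of the integer division used to count batches: examples that do not fill a complete minibatch are left out of the score.

```python
def batch_error(predictions, labels):
    """Fraction of misclassified examples in one minibatch."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / float(len(labels))


def score(predictions, labels, batch_size):
    """Per-batch error rates over all complete minibatches."""
    n_batches = len(labels) // batch_size  # an incomplete final batch is dropped
    return [
        batch_error(predictions[i * batch_size:(i + 1) * batch_size],
                    labels[i * batch_size:(i + 1) * batch_size])
        for i in range(n_batches)
    ]


# Made-up star ratings, for illustration only (nine examples).
labels = [1, 5, 3, 4, 5, 1, 2, 2, 4]
predictions = [1, 5, 4, 4, 5, 1, 2, 3, 4]

per_batch = score(predictions, labels, batch_size=4)  # ninth example is skipped
mean_error = sum(per_batch) / len(per_batch)
```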

With the training functionality in place, the following code instantiates our SdA:

numpy_rng = numpy.random.RandomState(89677)
print '... building the model'
sda = SdA(
    numpy_rng=numpy_rng,
    # the input and output sizes passed here are not shown in this excerpt
    hidden_layers_sizes=[240, 170, 100]
)

It should be noted that, at this point, we are trying an initial configuration of layer sizes to see how we do. The layer sizes used here are the product of some initial testing. As we discussed, training the SdA occurs in two stages. The first is a layer-wise pretraining process that loops over all of the SdA's layers. The second is a process of fine-tuning, scored against validation and test data.

To pretrain the SdA, we provide the required corruption levels to train each layer and iterate over the layers using our previously defined pretraining functions:

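print '... getting the pretraining functions'
pretraining_fns = sda.pretraining_functions(train_set_x=train_set_x,
                                            batch_size=batch_size)

print '... pre-training the model'
start_time = time.clock()
corruption_levels = [.1, .2, .2]
for i in xrange(sda.n_layers):
    # go through the pretraining epochs for this layer
    for epoch in xrange(pretraining_epochs):
        # go through the training set, one minibatch at a time
        c = []
        for batch_index in xrange(n_train_batches):
            c.append(pretraining_fns[i](index=batch_index,
                                        corruption=corruption_levels[i],
                                        lr=pretrain_lr))
        print 'Pre-training layer %i, epoch %d, cost ' % (i, epoch),
        print numpy.mean(c)

end_time = time.clock()

print >> sys.stderr, ('The pretraining code for file ' +
                      os.path.split(__file__)[1] +
                      ' ran for %.2fm' % ((end_time - start_time) / 60.))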

At this point, we're able to initialize our SdA class by calling the preceding code, which is stored within this book's GitHub repository, MasteringMLWithPython/Chapter3/

Assessing SdA performance

The SdA will take a significant length of time to run. With 15 epochs per layer and each epoch typically taking an average of 11 minutes, the network will run for around 500 minutes on a modern desktop system with GPU acceleration and a single-threaded GotoBLAS.
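The estimate of roughly 500 minutes is consistent with reading the 11-minute figure as the average time per epoch, across the three layers configured earlier. A quick check under that assumption (fine-tuning time is not included):

```python
n_layers = 3            # hidden_layers_sizes=[240, 170, 100]
epochs_per_layer = 15
minutes_per_epoch = 11  # assumed to be per epoch, not per layer

total_minutes = n_layers * epochs_per_layer * minutes_per_epoch
total_hours = total_minutes / 60.0
print(total_minutes, round(total_hours, 2))  # 495 8.25
```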

On a system without GPU acceleration, the network will take substantially longer to train and it is recommended that you use the alternative, which runs over a significantly smaller input dataset, MasteringMLWithPython/Chapter3/

The results are of a high quality, with a validation error score of 3.22% and test error score of 3.14%. These results are particularly impressive given the ambiguous and sometimes challenging nature of natural language processing applications.

It was noticeable that the network classified more correctly for the 1-star and 5-star rating cases than for the intermediate levels. This is largely due to the ambiguous nature of unpolarized or unemotional language.

Part of the reason that this input data could be classified so well is the significant feature engineering applied to it. While time-consuming and sometimes problematic, well-executed feature engineering combined with an optimized model can, as we've seen, deliver an excellent level of accuracy.


In this article, we introduced the autoencoder, an effective dimensionality reduction technique with some unique applications. We focused on the theory behind the SdA, an extension of autoencoders whereby any number of autoencoders are stacked in a deep architecture.
