This chapter presents Theano as a compute engine and the basics for symbolic computing with Theano. Symbolic computing consists of building graphs of operations that will be optimized later on for a specific architecture, using the computation libraries available for this architecture.
Although this chapter might appear to be a long way from practical applications, understanding the technology is essential for the following chapters: what is it capable of, and what value does it bring? All the following chapters address the applications of Theano for building deep learning architectures.
Theano may be defined as a library for scientific computing; it has been available since 2007 and is particularly suited to deep learning. Two important features are at the core of any deep learning library: tensor operations, and the capability to run the code on CPU or on a Graphics Processing Unit (GPU). These two features enable us to work with massive amounts of multi-dimensional data. Moreover, Theano proposes automatic differentiation, a very useful feature that applies to a much wider range of numerical optimization problems than deep learning alone.
The chapter covers the following topics:
Theano installation and loading
Tensors and algebra
Symbolic programming
Graphs
Automatic differentiation
GPU programming
Profiling
Configuration
Usually, input data is represented with multi-dimensional arrays:
Images have three dimensions: The number of channels, the width, and the height of the image
Sounds and time series have one dimension: The duration
Natural language sequences can be represented by two-dimensional arrays: The duration and the alphabet length or the vocabulary length
We'll see more examples of input data arrays in the future chapters.
In Theano, multi-dimensional arrays are implemented with an abstraction class, named tensor, with many more transformations available than traditional arrays in a computer language such as Python.
At each stage of a neural net, computations such as matrix multiplications involve multiple operations on these multi-dimensional arrays.
Classical arrays in programming languages do not have enough built-in functionalities to quickly and adequately address multi-dimensional computations and manipulations.
Computations on multi-dimensional arrays have a long history of optimizations, with tons of libraries and hardware. One of the most important gains in speed has come from the massively parallel architecture of the GPU, which provides computation on a large number of cores, from a few hundred to a few thousand.
Compared to a traditional CPU, for example a quad-core, 12-core, or 32-core engine, the gains with a GPU can range from 5x to 100x, even if part of the code is still executed on the CPU (data loading, GPU driving, and result outputting). The main bottleneck with the use of a GPU is usually the transfer of data between the memory of the CPU and the memory of the GPU, but still, when well programmed, the use of a GPU brings a significant increase in speed of an order of magnitude. Getting results in days rather than months, or hours rather than days, is an undeniable benefit for experimentation.
The Theano engine has been designed to address the challenges of multi-dimensional arrays and architecture abstraction from the beginning.
There is another undeniable benefit of Theano for scientific computation: the automatic differentiation of functions of multi-dimensional arrays, a well-suited feature for model parameter inference via objective function minimization. Such a feature facilitates experimentation by relieving the pain of computing derivatives by hand, which might not be very complicated, but is prone to many errors.
In this section, we'll install Theano, run it on the CPU and GPU devices, and save the configuration.
The easiest way to install Theano is to use conda, a cross-platform package and environment manager. If conda is not already installed on your operating system, the fastest way to get it is to download the miniconda installer from https://conda.io/miniconda.html. For example, for conda under Linux 64 bit and Python 2.7, use these commands:
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
chmod +x Miniconda2-latest-Linux-x86_64.sh
bash ./Miniconda2-latest-Linux-x86_64.sh
Conda enables us to create new environments in which the version of Python (2 or 3) and the installed packages may differ. The conda root environment uses the same version of Python as the one installed on the system on which you installed conda. To install Theano, run the following command:
conda install theano
Run a Python session and try the following commands to check your configuration:
>>> import theano
>>> theano.config.device
'cpu'
>>> theano.config.floatX
'float64'
>>> print(theano.config)
The last command prints all the configuration of Theano. The theano.config object contains keys to many configuration options.
To infer the configuration options, Theano looks first at the ~/.theanorc file, then at any environment variables that are available, which override the former options, and lastly at the variables set directly in the code, which take precedence over all the others:
>>> theano.config.floatX='float32'
Some of the properties might be read-only and cannot be changed in the code, but floatX, which sets the default floating point precision for floats, is among the properties that can be changed directly in the code.
Theano enables the use of GPUs, the units that are usually used to compute the graphics displayed on the computer screen.
To have Theano work on the GPU as well, a GPU backend library is required on your system.
The CUDA library (for NVIDIA GPU cards only) is the main choice for GPU computations. There is also the OpenCL standard, which is open source but far less developed, and much more experimental and rudimentary on Theano.
Most scientific computations still occur on NVIDIA cards at the moment. If you have an NVIDIA GPU card, download CUDA from the NVIDIA website, https://developer.nvidia.com/cuda-downloads, and install it. The installer will install the latest version of the GPU drivers first, if they are not already installed. It will install the CUDA library in the /usr/local/cuda directory.
Install the cuDNN library, a library by NVIDIA that offers faster implementations of some operations for the GPU. To install it, I usually copy the /usr/local/cuda directory to a new directory, /usr/local/cuda-{CUDA_VERSION}-cudnn-{CUDNN_VERSION}, so that I can choose the version of CUDA and cuDNN depending on the deep learning technology I use and its compatibility.
In your .bashrc profile, add the following lines to set the $PATH and $LD_LIBRARY_PATH variables:
export PATH=/usr/local/cuda-8.0-cudnn-5.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0-cudnn-5.1/lib64:/usr/local/cuda-8.0-cudnn-5.1/lib:$LD_LIBRARY_PATH
N-dimensional GPU arrays have been implemented in Python in six different GPU libraries (Theano/CudaNdarray, PyCUDA/GPUArray, CUDAMAT/CUDAMatrix, PyOpenCL/GPUArray, Clyther, Copperhead), all implementing a subset of NumPy's ndarray. libgpuarray is a backend library that exposes them through a common interface with the same properties.
To install libgpuarray with conda, use this command:
conda install pygpu
To run Theano in GPU mode, you need to configure the config.device variable before execution, since it is a read-only variable once the code is running. Run this command with the THEANO_FLAGS environment variable:
THEANO_FLAGS="device=cuda,floatX=float32" python >>> import theano Using cuDNN version 5110 on context None Mapped name None to device cuda: Tesla K80 (0000:83:00.0) >>> theano.config.device 'gpu' >>> theano.config.floatX 'float32'
The first output lines show that the GPU device has been correctly detected, and specify which GPU is used.
By default, Theano activates CNMeM, a faster CUDA memory allocator. An initial pre-allocation can be specified with the gpuarray.preallocate option. In the end, my launch command is as follows:
THEANO_FLAGS="device=cuda,floatX=float32,gpuarray.preallocate=0.8" python >>> from theano import theano Using cuDNN version 5110 on context None Preallocating 9151/11439 Mb (0.800000) on cuda Mapped name None to device cuda: Tesla K80 (0000:83:00.0)
The first line confirms that cuDNN is active, and the second confirms the memory pre-allocation. The third line gives the default context name (that is, None when the device=cuda flag is set) and the model of GPU used, while the default context name for the CPU will always be cpu.
It is possible to specify a GPU other than the first one by setting the device to cuda0, cuda1, and so on, on multi-GPU computers. It is also possible to run a program on multiple GPUs, in parallel or in sequence (when the memory of one GPU is not sufficient), in particular when training very deep neural nets, such as for the classification of full images described in Chapter 7, Classifying Images with Residual Networks. In this case, the contexts=dev0->cuda0;dev1->cuda1;dev2->cuda2;dev3->cuda3 flag activates multiple GPUs instead of one, and assigns a context name to each GPU device to be used in the code. Here is an example on a 4-GPU instance:
THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1;dev2->cuda2;dev3->cuda3,floatX=float32,gpuarray.preallocate=0.8" python >>> import theano Using cuDNN version 5110 on context None Preallocating 9177/11471 Mb (0.800000) on cuda0 Mapped name dev0 to device cuda0: Tesla K80 (0000:83:00.0) Using cuDNN version 5110 on context dev1 Preallocating 9177/11471 Mb (0.800000) on cuda1 Mapped name dev1 to device cuda1: Tesla K80 (0000:84:00.0) Using cuDNN version 5110 on context dev2 Preallocating 9177/11471 Mb (0.800000) on cuda2 Mapped name dev2 to device cuda2: Tesla K80 (0000:87:00.0) Using cuDNN version 5110 on context dev3 Preallocating 9177/11471 Mb (0.800000) on cuda3 Mapped name dev3 to device cuda3: Tesla K80 (0000:88:00.0)
To assign computations to a specific GPU in this multi-GPU setting, the names we chose, dev0, dev1, dev2, and dev3, have been mapped to the devices (cuda0, cuda1, cuda2, cuda3).
This name mapping enables us to write code that is independent of the underlying GPU assignments and libraries (CUDA or others).
To keep the current configuration flags active at every Python session or execution without using environment variables, save your configuration in the ~/.theanorc file as follows:
[global]
floatX = float32
device = cuda0

[gpuarray]
preallocate = 1
Now you can simply run the python command. You are all set.
In Python, some scientific libraries such as NumPy provide multi-dimensional arrays. Theano doesn't replace NumPy, but works in concert with it; in particular, NumPy is used for the initialization of tensors.
To perform the same computation on CPU and GPU, variables are symbolic and represented by the abstract tensor class, and writing numerical expressions consists of building a computation graph of variable nodes and apply nodes. Depending on the platform on which the computation graph will be compiled, tensors are replaced by either of the following:
A TensorType variable, which has to be on the CPU
A GpuArrayType variable, which has to be on the GPU
That way, code can be written independently of the platform on which it will be executed.
Here are a few tensor objects:
Object class | Number of dimensions | Example
---|---|---
T.scalar | 0-dimensional array | 1, 2.5
T.vector | 1-dimensional array | [0,3,20]
T.matrix | 2-dimensional array | [[2,3],[1,5]]
T.tensor3 | 3-dimensional array | [[[2,3],[1,5]],[[1,2],[3,4]]]
Playing with these Theano objects in the Python shell gives us a better idea:
>>> import theano.tensor as T
>>> T.scalar()
<TensorType(float32, scalar)>
>>> T.iscalar()
<TensorType(int32, scalar)>
>>> T.fscalar()
<TensorType(float32, scalar)>
>>> T.dscalar()
<TensorType(float64, scalar)>
With i, l, f, or d in front of the object name, you initialize a tensor of a given type: integer32, integer64, float32, or float64. For real-valued (floating point) data, it is advised to use the direct form T.scalar() instead of the f or d variants, since the direct form will use your current configuration for floats:
>>> theano.config.floatX = 'float64'
>>> T.scalar()
<TensorType(float64, scalar)>
>>> T.fscalar()
<TensorType(float32, scalar)>
>>> theano.config.floatX = 'float32'
>>> T.scalar()
<TensorType(float32, scalar)>
Symbolic variables do either of the following:
Play the role of placeholders, as a starting point to build your graph of numerical operations (such as addition, multiplication): they receive the flow of the incoming data during the evaluation once the graph has been compiled
Represent intermediate or output results
Symbolic variables and operations are both part of a computation graph that will be compiled either on CPU or GPU for fast execution. Let's write our first computation graph consisting of a simple addition:
>>> x = T.matrix('x')
>>> y = T.matrix('y')
>>> z = x + y
>>> theano.pp(z)
'(x + y)'
>>> z.eval({x: [[1, 2], [1, 3]], y: [[1, 0], [3, 4]]})
array([[ 2.,  2.],
       [ 4.,  7.]], dtype=float32)
First, two symbolic variables, or variable nodes, are created with the names x and y, and an addition operation, an apply node, is applied to both of them to create a new symbolic variable, z, in the computation graph.
The pretty print function, pp, prints the expression represented by a Theano symbolic variable. eval evaluates the value of the output variable, z, when the first two variables, x and y, are initialized with two numerical 2-dimensional arrays.
The following example shows the difference between the variables x and y, and their names x and y:
>>> a = T.matrix()
>>> b = T.matrix()
>>> theano.pp(a + b)
'(<TensorType(float32, matrix)> + <TensorType(float32, matrix)>)'
Without names, it is more complicated to trace the nodes in a large graph. When printing the computation graph, names significantly help diagnose problems, while variables are only used to handle the objects in the graph:
>>> x = T.matrix('x')
>>> x = x + x
>>> theano.pp(x)
'(x + x)'
Here, the original symbolic variable, named x, does not change and stays part of the computation graph. x + x creates a new symbolic variable that we assign to the Python variable x.
Note also that with the names, the plural form initializes multiple tensors at the same time:
>>> x, y, z = T.matrices('x', 'y', 'z')
Now, let's have a look at the different functions to display the graph.
Let's take back the simple addition example and present different ways to display the same information:
>>> x = T.matrix('x')
>>> y = T.matrix('y')
>>> z = x + y
>>> z
Elemwise{add,no_inplace}.0
>>> theano.pp(z)
'(x + y)'
>>> theano.printing.pprint(z)
'(x + y)'
>>> theano.printing.debugprint(z)
Elemwise{add,no_inplace} [id A] ''
 |x [id B]
 |y [id C]
Here, the debugprint function prints the pre-compilation graph, that is, the unoptimized graph. In this case, it is composed of two variable nodes, x and y, and an apply node, the elementwise addition, with the no_inplace option. The inplace option will be used in the optimized graph to save memory, reusing the memory of the input to store the result of the operation.
If the graphviz and pydot libraries have been installed, the pydotprint command outputs a PNG image of the graph:
>>> theano.printing.pydotprint(z)
The output file is available at ~/.theano/compiledir_Linux-4.4--generic-x86_64-with-Ubuntu-16.04-xenial-x86_64-2.7.12-64/theano.pydotprint.gpu.png.

You might have noticed that the z.eval command takes a while to execute the first time. The reason for this delay is the time required to optimize the mathematical expression and compile the code for the CPU or GPU before it is evaluated.
The compiled expression can be obtained explicitly and used as a function that behaves as a traditional Python function:
>>> addition = theano.function([x, y], [z])
>>> addition([[1, 2], [1, 3]], [[1, 0], [3, 4]])
[array([[ 2.,  2.],
       [ 4.,  7.]], dtype=float32)]
The first argument in the function creation is a list of variables representing the input nodes of the graph. The second argument is a list of output variables. To print the post-compilation graph, use this command:
>>> theano.printing.debugprint(addition)
HostFromGpu(gpuarray) [id A] ''   3
 |GpuElemwise{Add}[(0, 0)]<gpuarray> [id B] ''   2
   |GpuFromHost<None> [id C] ''   1
   | |x [id D]
   |GpuFromHost<None> [id E] ''   0
     |y [id F]
>>> theano.printing.pydotprint(addition)
The output file is available at ~/.theano/compiledir_Linux-4.4--generic-x86_64-with-Ubuntu-16.04-xenial-x86_64-2.7.12-64/theano.pydotprint.gpu.png:

This graph has been printed while using the GPU. During compilation, each operation has chosen its available GPU implementation. The main program still runs on the CPU, where the data resides, but a GpuFromHost instruction performs a data transfer from the CPU to the GPU for the input, while the opposite operation, HostFromGpu, fetches the result back for the main program to display it:

Theano performs some mathematical optimizations, such as grouping elementwise operations. For instance, let's add a multiplication to the previous addition:
>>> z = z * x
>>> theano.printing.debugprint(theano.function([x,y],z))
HostFromGpu(gpuarray) [id A] ''   3
 |GpuElemwise{Composite{((i0 + i1) * i0)}}[(0, 0)]<gpuarray> [id B] ''   2
   |GpuFromHost<None> [id C] ''   1
   | |x [id D]
   |GpuFromHost<None> [id E] ''   0
     |y [id F]
The number of nodes in the graph has not increased: the addition and the multiplication have been merged into a single node. Such optimizations make the graph trickier to debug, so we'll show you at the end of this chapter how to disable optimizations for debugging.
Lastly, let's see a bit more about feeding values into the graph with NumPy:
>>> theano.config.floatX
'float32'
>>> x = T.matrix()
>>> x
<TensorType(float32, matrix)>
>>> y = T.matrix()
>>> addition = theano.function([x, y], [x+y])
>>> addition(numpy.ones((2,2)),numpy.zeros((2,2)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 786, in __call__
    allow_downcast=s.allow_downcast)
  File "/usr/local/lib/python2.7/site-packages/theano/tensor/type.py", line 139, in filter
    raise TypeError(err_msg, data)
TypeError: ('Bad input argument to theano function with name "<stdin>:1" at index 0(0-based)', 'TensorType(float32, matrix) cannot store a value of dtype float64 without risking loss of precision. If you do not mind this loss, you can: 1) explicitly cast your data to float32, or 2) set "allow_input_downcast=True" when calling "function".', array([[ 1.,  1.],
       [ 1.,  1.]]))
Executing the function on the NumPy arrays throws an error related to loss of precision, since the NumPy arrays here have float64 and int64 dtypes, but x and y are float32. There are multiple solutions to this; the first is to create the NumPy arrays with the right dtype:
>>> import numpy
>>> addition(numpy.ones((2,2), dtype=theano.config.floatX),numpy.zeros((2,2), dtype=theano.config.floatX))
[array([[ 1., 1.],
[ 1., 1.]], dtype=float32)]
Alternatively, cast the NumPy arrays (in particular for numpy.diag, which does not allow us to choose the dtype directly):
>>> addition(numpy.ones((2,2)).astype(theano.config.floatX),numpy.diag((2,3)).astype(theano.config.floatX))
[array([[ 3.,  1.],
       [ 1.,  4.]], dtype=float32)]
Or we could allow downcasting:
>>> addition = theano.function([x, y], [x+y],allow_input_downcast=True)
>>> addition(numpy.ones((2,2)),numpy.zeros((2,2)))
[array([[ 1.,  1.],
       [ 1.,  1.]], dtype=float32)]
We have seen how to create a computation graph composed of symbolic variables and operations, and compile the resulting expression for an evaluation or as a function, either on GPU or on CPU.
As tensors are very important to deep learning, Theano provides lots of operators to work with tensors. Most operators that exist in scientific computing libraries such as NumPy for numerical arrays have their equivalent in Theano and have a similar name, in order to be more familiar to NumPy's users. But contrary to NumPy, expressions written with Theano can be compiled either on CPU or GPU.
This, for example, is the case for tensor creation:
The T.zeros(), T.ones(), and T.eye() operators take a shape tuple as input
T.zeros_like(), T.ones_like(), and T.identity_like() use the shape of the tensor argument
T.arange(), T.mgrid(), and T.ogrid() are used for range and mesh grid arrays
Let's have a look in the Python shell:
>>> a = T.zeros((2,3))
>>> a.eval()
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
>>> b = T.identity_like(a)
>>> b.eval()
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])
>>> c = T.arange(10)
>>> c.eval()
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Information such as the number of dimensions, ndim, and the type, dtype, is defined at tensor creation and cannot be modified later:
>>> c.ndim
1
>>> c.dtype
'int64'
>>> c.type
TensorType(int64, vector)
Some other information, such as shape, is evaluated by the computation graph:
>>> a = T.matrix()
>>> a.shape
Shape.0
>>> a.shape.eval({a: [[1, 2], [1, 3]]})
array([2, 2])
>>> shape_fct = theano.function([a],a.shape)
>>> shape_fct([[1, 2], [1, 3]])
array([2, 2])
>>> n = T.iscalar()
>>> c = T.arange(n)
>>> c.shape.eval({n:10})
array([10])
The first type of operator on tensors is for dimension manipulation. These operators take a tensor as input and return a new tensor:
Operator | Description
---|---
T.reshape | Reshape the dimensions of the tensor
T.fill | Fill the array with the same value
T.flatten | Return all elements in a 1-dimensional tensor (vector)
T.dimshuffle | Change the order of the dimensions, more or less like NumPy's transpose method – the main difference is that it can be used to add or remove broadcastable dimensions (of length 1)
T.squeeze | Reshape by removing dimensions equal to 1
T.transpose | Transpose
T.swapaxes | Swap dimensions
T.sort, T.argsort | Sort tensor, or indices of the order
For example, the reshape operation's output represents a new tensor, containing the same elements in the same order but in a different shape:
>>> a = T.arange(10)
>>> b = T.reshape( a, (5,2) )
>>> b.eval()
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
The operators can be chained:
>>> T.arange(10).reshape((5,2))[::-1].T.eval()
array([[8, 6, 4, 2, 0],
       [9, 7, 5, 3, 1]])
Notice the use of the traditional [::-1] Python indexing syntax, and of .T for T.transpose.
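Two of the operators in the table above, flatten and dimshuffle, are also available as methods on the tensor itself; here is a small sketch of how they behave (the outputs are indicative):
>>> import theano.tensor as T
>>> a = T.arange(10).reshape((5,2))
>>> a.flatten().eval()                      # back to a 1-dimensional tensor
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a.dimshuffle(1, 0).eval()               # permute the dimensions, like a transpose here
array([[0, 2, 4, 6, 8],
       [1, 3, 5, 7, 9]])
>>> a.dimshuffle(0, 'x', 1).shape.eval()    # 'x' inserts a broadcastable dimension of length 1
array([5, 1, 2])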
The second type of operation on multi-dimensional arrays is the elementwise operators.
The first category of elementwise operations takes two input tensors of the same dimensions and applies a function, f, elementwise, which means on all pairs of elements with the same coordinates in the respective tensors: f([a,b],[c,d]) = [f(a,c), f(b,d)]. For example, here's multiplication:
>>> a, b = T.matrices('a', 'b')
>>> z = a * b
>>> z.eval({a:numpy.ones((2,2)).astype(theano.config.floatX), b:numpy.diag((3,3)).astype(theano.config.floatX)})
array([[ 3.,  0.],
       [ 0.,  3.]])
The same multiplication can be written as follows:
>>> z = T.mul(a, b)
T.add and T.mul accept an arbitrary number of inputs:
>>> z = T.mul(a, b, a, b)
Some elementwise operators accept only one input tensor (f([a,b]) = [f(a), f(b)]):
>>> a = T.matrix()
>>> z = a ** 2
>>> z.eval({a:numpy.diag((3,3)).astype(theano.config.floatX)})
array([[ 9.,  0.],
       [ 0.,  9.]])
Lastly, I would like to introduce the mechanism of broadcasting. When the input tensors do not have the same number of dimensions, the missing dimension will be broadcasted, meaning the tensor will be repeated along that dimension to match the dimension of the other tensor. For example, taking one multi-dimensional tensor and a scalar (0-dimensional) tensor, the scalar will be repeated in an array of the same shape as the multi-dimensional tensor so that the final shapes will match and the elementwise operation will be applied, f([a,b], c) = [ f(a,c), f(b,c) ]:
>>> a = T.matrix()
>>> b = T.scalar()
>>> z = a * b
>>> z.eval({a:numpy.diag((3,3)).astype(theano.config.floatX),b:3})
array([[ 9.,  0.],
       [ 0.,  9.]])
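Broadcasting also works between tensors of different dimensionality when the trailing dimensions match; for instance, a vector is repeated along the rows of a matrix. A small sketch (output shown for floatX=float32):
>>> a = T.matrix()
>>> b = T.vector()
>>> z = a + b    # b is broadcast along the first dimension (the rows)
>>> z.eval({a:numpy.zeros((2,3)).astype(theano.config.floatX),
...         b:numpy.asarray([1,2,3]).astype(theano.config.floatX)})
array([[ 1.,  2.,  3.],
       [ 1.,  2.,  3.]], dtype=float32)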
Here is a list of elementwise operations:
Operator | Other form | Description
---|---|---
T.add, T.sub, T.mul, T.true_div | +, -, *, / | Add, subtract, multiply, divide
T.pow, T.sqrt | ** | Power, square root
T.exp, T.log | | Exponential, logarithm
T.cos, T.sin, T.tan | | Cosine, sine, tangent
T.cosh, T.sinh, T.tanh | | Hyperbolic trigonometric functions
T.int_div, T.mod | //, % | Int div, modulus
T.round, T.floor, T.ceil | | Rounding operators
T.sgn | | Sign
T.and_, T.xor, T.or_, T.invert | &, ^, \|, ~ | Bitwise operators
T.lt, T.le, T.gt, T.ge | <, <=, >, >= | Comparison operators
T.eq, T.neq, T.isclose | | Equality, inequality, or close with tolerance
T.isnan, T.isinf | | Comparison with NaN (not a number) or infinity
T.abs_ | abs() | Absolute value
T.minimum, T.maximum | | Minimum and maximum elementwise
T.clip | | Clip the values between a maximum and a minimum
T.switch | | Switch
T.cast | | Tensor type casting
The elementwise operators always return an array with the same size as the input array. T.switch and T.clip accept three inputs.
In particular, T.switch will perform the traditional switch operator elementwise:
>>> cond = T.vector('cond')
>>> x,y = T.vectors('x','y')
>>> z = T.switch(cond, x, y)
>>> z.eval({ cond:[1,0], x:[10,10], y:[3,2] })
array([ 10.,   2.], dtype=float32)
At the positions where the cond tensor is true, the result has the x value; otherwise, where it is false, it has the y value.
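T.clip, the other three-input operator, bounds each element between a minimum and a maximum value; here is a quick sketch (outputs indicative, with floatX=float32):
>>> a = T.vector('a')
>>> T.clip(a, 0., 1.).eval({a: [-1., 0.5, 2.]})
array([ 0. ,  0.5,  1. ], dtype=float32)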
For the T.switch operator, there is a specific equivalent, ifelse, that takes a scalar condition instead of a tensor condition. It is not an elementwise operation, though, and it supports lazy evaluation (not all elements are computed if the answer is known before it finishes):
>>> from theano.ifelse import ifelse
>>> z=ifelse(1, 5, 4)
>>> z.eval()
array(5, dtype=int8)
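The condition can be any symbolic scalar and the two branches can be full tensor expressions; only the selected branch is computed. A hedged, self-contained sketch (outputs indicative):
>>> import numpy
>>> import theano
>>> import theano.tensor as T
>>> from theano.ifelse import ifelse
>>> a, b = T.scalars('a', 'b')
>>> x, y = T.matrices('x', 'y')
>>> z = ifelse(T.lt(a, b), T.mean(x), T.mean(y))    # lazily evaluates only one of the two means
>>> z.eval({a: 1., b: 2.,
...         x: numpy.ones((2,2), dtype=theano.config.floatX),
...         y: numpy.zeros((2,2), dtype=theano.config.floatX)})
array(1.0, dtype=float32)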
Another type of operation on tensors is the reduction, which in most cases reduces all elements to a scalar value; to compute the output, all the elements of the tensor have to be scanned:
Operator | Description
---|---
T.max, T.argmax | Maximum, index of the maximum
T.min, T.argmin | Minimum, index of the minimum
T.sum, T.prod | Sum or product of elements
T.mean, T.var, T.std | Mean, variance, and standard deviation
T.all, T.any | AND and OR operations with all elements
T.ptp | Range of elements (minimum, maximum)
These operations are also available row-wise or column-wise by specifying an axis, the dimension along which the reduction is performed:
>>> a = T.matrix('a')
>>> T.max(a).eval({a:[[1,2],[3,4]]})
array(4.0, dtype=float32)
>>> T.max(a,axis=0).eval({a:[[1,2],[3,4]]})
array([ 3.,  4.], dtype=float32)
>>> T.max(a,axis=1).eval({a:[[1,2],[3,4]]})
array([ 2.,  4.], dtype=float32)
A third category of operations is the linear algebra operators, such as matrix multiplication, where C = AB is defined by C[i,j] = sum_k A[i,k] * B[k,j]. For vectors, this is also called the inner product: <a, b> = sum_i a[i] * b[i].
Operator | Description
---|---
T.dot | Matrix multiplication/inner product
T.outer | Outer product
There are also generalized (T.tensordot, to specify the axes) and batched (batched_dot, batched_tensordot) versions of these operators.
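As a quick illustration of T.dot on vectors and matrices (outputs shown for floatX=float32):
>>> x, y = T.vectors('x', 'y')
>>> T.dot(x, y).eval({x: [1., 2.], y: [3., 4.]})      # inner product: 1*3 + 2*4
array(11.0, dtype=float32)
>>> A, B = T.matrices('A', 'B')
>>> T.dot(A, B).eval({A: [[1., 2.], [3., 4.]], B: [[1., 0.], [0., 1.]]})
array([[ 1.,  2.],
       [ 3.,  4.]], dtype=float32)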
Lastly, a few very useful operators remain that do not belong to any of the previous categories: T.concatenate concatenates the tensors along the specified dimension, T.stack creates a new dimension to stack the input tensors, and T.stacklists creates new patterns to stack tensors together:
>>> a = T.arange(10).reshape((5,2))
>>> b = a[::-1]
>>> b.eval()
array([[8, 9],
       [6, 7],
       [4, 5],
       [2, 3],
       [0, 1]])
>>> a.eval()
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> T.concatenate([a,b]).eval()
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9],
       [8, 9],
       [6, 7],
       [4, 5],
       [2, 3],
       [0, 1]])
>>> T.concatenate([a,b],axis=1).eval()
array([[0, 1, 8, 9],
       [2, 3, 6, 7],
       [4, 5, 4, 5],
       [6, 7, 2, 3],
       [8, 9, 0, 1]])
>>> T.stack([a,b]).eval()
array([[[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]],
       [[8, 9],
        [6, 7],
        [4, 5],
        [2, 3],
        [0, 1]]])
An equivalent of the NumPy expressions a[5:] = 5 and a[5:] += 5 exists in the form of two functions:
>>> a.eval()
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> T.set_subtensor(a[3:], [-1,-1]).eval()
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [-1, -1],
       [-1, -1]])
>>> T.inc_subtensor(a[3:], [-1,-1]).eval()
array([[0, 1],
       [2, 3],
       [4, 5],
       [5, 6],
       [7, 8]])
Unlike NumPy's syntax, the original tensor is not modified; instead, a new symbolic variable is created that represents the result of the modification. The original variable a therefore still refers to the original values, while the returned variable (here left unassigned) represents the updated ones, and the rest of the computation should use that new variable.
It is good practice to always cast float arrays to the theano.config.floatX type:
Either at array creation, with numpy.array(array, dtype=theano.config.floatX)
Or by casting the array with array.astype(theano.config.floatX), so that the correct type is used when compiling for the GPU
For example, let's transfer the data manually to the GPU (for which the default context is None), and for that purpose, we need to use float32 values:
>>> theano.config.floatX = 'float32'
>>> a = T.matrix()
>>> b = a.transfer(None)
>>> b.eval({a:numpy.ones((2,2)).astype(theano.config.floatX)})
gpuarray.array([[ 1.  1.]
 [ 1.  1.]], dtype=float32)
>>> theano.printing.debugprint(b)
GpuFromHost<None> [id A] ''
 |<TensorType(float32, matrix)> [id B]
The transfer(device) functions, such as transfer('cpu'), enable us to move the data from one device to another. This is particularly useful when parts of the graph have to be executed on different devices. Otherwise, Theano adds the transfer functions to the GPU automatically in the optimization phase:
>>> a = T.matrix('a')
>>> b = a ** 2
>>> sq = theano.function([a],b)
>>> theano.printing.debugprint(sq)
HostFromGpu(gpuarray) [id A] ''   2
 |GpuElemwise{Sqr}[(0, 0)]<gpuarray> [id B] ''   1
   |GpuFromHost<None> [id C] ''   0
     |a [id D]
By using the transfer function explicitly, we ask Theano to remove the transfer back to the CPU; leaving the output tensor on the GPU saves a costly transfer:
>>> b = b.transfer(None)
>>> sq = theano.function([a],b)
>>> theano.printing.debugprint(sq)
GpuElemwise{Sqr}[(0, 0)]<gpuarray> [id A] ''   1
 |GpuFromHost<None> [id B] ''   0
   |a [id C]
The default context for the CPU is cpu:
>>> b = a.transfer('cpu')
>>> theano.printing.debugprint(b)
<TensorType(float32, matrix)> [id A]
A hybrid concept between numerical values and symbolic variables is the shared variable. Shared variables can also lead to better performance on the GPU by avoiding transfers. Let's initialize a shared variable with the scalar zero:
>>> from theano import shared
>>> state = shared(0)
>>> state
<TensorType(int64, scalar)>
>>> state.get_value()
array(0)
>>> state.set_value(1)
>>> state.get_value()
array(1)
Shared values are designed to be shared between functions. They can also be seen as an internal state, and they can be used indifferently from GPU- or CPU-compiled code. By default, shared variables are created on the default device (here, cuda), except for scalar integer values (as is the case in the previous example).
It is possible to specify another context, such as cpu. In the case of multiple GPU instances, you define your contexts on the Python command line, and decide on which context to create the shared variables:
PATH=/usr/local/cuda-8.0-cudnn-5.1/bin:$PATH THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1,floatX=float32,gpuarray.preallocate=0.8" python
>>> import theano
Using cuDNN version 5110 on context dev0
Preallocating 9151/11439 Mb (0.800000) on cuda0
Mapped name dev0 to device cuda0: Tesla K80 (0000:83:00.0)
Using cuDNN version 5110 on context dev1
Preallocating 9151/11439 Mb (0.800000) on cuda1
Mapped name dev1 to device cuda1: Tesla K80 (0000:84:00.0)
>>> import theano.tensor as T
>>> import numpy
>>> theano.shared(numpy.random.random((1024, 1024)).astype('float32'),target='dev1')
<GpuArrayType<dev1>(float32, (False, False))>
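A typical deep learning use of shared variables is to hold model parameters: they live on the device across calls and are part of the graph without being inputs of the compiled function. A minimal sketch with illustrative names:
>>> import numpy
>>> import theano
>>> import theano.tensor as T
>>> W = theano.shared(numpy.random.randn(2, 2).astype(theano.config.floatX), name='W')
>>> x = T.vector('x')
>>> predict = theano.function([x], T.dot(W, x))          # W is not an input; it is read from shared storage
>>> predict(numpy.ones(2).astype(theano.config.floatX))  # returns a vector of length 2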
The previous section introduced the function instruction to compile expressions. In this section, we develop some of the arguments in its signature:
def theano.function(inputs, outputs=None, updates=None, givens=None, allow_input_downcast=None, mode=None, profile=None, )
We've already used the allow_input_downcast feature to convert data from float64 to float32, int64 to int32, and so on. The mode and profile features are also displayed because they'll be presented in the optimization and debugging section.
Input variables of a Theano function should be contained in a list, even when there is a single input.
For outputs, it is possible to use a list in the case of multiple outputs to be computed in parallel:
>>> a = T.matrix()
>>> ex = theano.function([a],[T.exp(a),T.log(a),a**2])
>>> ex(numpy.random.randn(3,3).astype(theano.config.floatX))
[array([[ 2.33447003,  0.30287042,  0.63557744],
       [ 0.18511547,  1.34327984,  0.42203984],
       [ 0.87083125,  5.01169062,  6.88732481]], dtype=float32),
 array([[-0.16512829,         nan,         nan],
       [        nan, -1.2203927 ,         nan],
       [        nan,  0.47733498,  0.65735561]], dtype=float32),
 array([[ 0.71873927,  1.42671108,  0.20540957],
       [ 2.84521151,  0.08709242,  0.74417454],
       [ 0.01912885,  2.59781313,  3.72367549]], dtype=float32)]
The second useful argument is updates, which is used to set new values to shared variables once the expression has been evaluated:
>>> w = shared(1.0)
>>> x = T.scalar('x')
>>> mul = theano.function([x],updates=[(w,w*x)])
>>> mul(4)
[]
>>> w.get_value()
array(4.0)
Such a mechanism can be used as an internal state. The shared variable w has been defined outside the function.
With the givens parameter, it is possible to change the value of any symbolic variable in the graph without changing the graph itself. The new value will then be used by all the other expressions that were pointing to it.
The last and most important feature in Theano is automatic differentiation, which means that Theano computes the derivatives of all the previous tensor operators. Such differentiation is performed via the theano.grad operator:
>>> a = T.scalar()
>>> pow = a ** 2
>>> g = theano.grad(pow,a)
>>> theano.printing.pydotprint(g)
>>> theano.printing.pydotprint(theano.function([a],g))

In the optimized graph, theano.grad has computed the gradient of a ** 2 with respect to a, which is a symbolic expression equivalent to 2 * a.
Note that it is only possible to take the gradient of a scalar, but the wrt variables can be arbitrary tensors.
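For example, the cost below is a scalar, but the gradient is taken with respect to a whole weight vector stored in a shared variable, which is the usual gradient descent setting; a hedged sketch with illustrative names and learning rate:
>>> import numpy
>>> import theano
>>> import theano.tensor as T
>>> W = theano.shared(numpy.zeros((3,), dtype=theano.config.floatX), name='W')
>>> x = T.vector('x')
>>> target = numpy.asarray(1.0, dtype=theano.config.floatX)
>>> cost = (T.dot(W, x) - target) ** 2                 # scalar cost
>>> g = theano.grad(cost, W)                           # gradient with respect to the vector W
>>> lr = numpy.asarray(0.1, dtype=theano.config.floatX)
>>> train = theano.function([x], cost, updates=[(W, W - lr * g)])
>>> train(numpy.ones(3, dtype=theano.config.floatX))   # one gradient descent step on W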
A Python for loop can be used outside the symbolic graph, as in a normal Python program. But outside the graph, a traditional Python for loop isn't compiled, so it will not be optimized with parallel and algebra libraries, it cannot be automatically differentiated, and it introduces costly data transfers if the computation subgraph has been optimized for the GPU.
That's why a symbolic operator, T.scan, is designed to create a for loop as an operator inside the graph. Theano will unroll the loop into the graph structure, and the whole unrolled loop will be compiled on the target architecture together with the rest of the computation graph. Its signature is as follows:
def scan(fn, sequences=None, outputs_info=None, non_sequences=None,
         n_steps=None, truncate_gradient=-1, go_backwards=False,
         mode=None, name=None, profile=False, allow_gc=None, strict=False)
The scan operator is very useful to implement array loops, reductions, maps, multi-dimensional derivatives such as the Jacobian or the Hessian, and recurrences.
The scan operator runs the fn function repeatedly for n_steps. If n_steps is None, the operator will determine the number of steps from the length of the sequences:
Note
The step function fn is a function that builds a symbolic graph, and it will only get called once. However, that graph will then be compiled into another Theano function that will be called repeatedly. Some users try to pass a compiled Theano function as fn, which is not possible.
Sequences are the lists of input variables to loop over. The number of steps will correspond to the shortest sequence in the list. Let's have a look:
>>> a = T.matrix()
>>> b = T.matrix()
>>> def fn(x): return x + 1
>>> results, updates = theano.scan(fn, sequences=a)
>>> f = theano.function([a], results, updates=updates)
>>> f(numpy.ones((2,3)).astype(theano.config.floatX))
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.]], dtype=float32)
The scan operator has run the function against all the elements of the input tensor, a, and kept the same shape as the input tensor, (2,3).
Note
It is good practice to add the updates returned by theano.scan to the theano.function call, even if these updates are empty.
The arguments given to the fn function can be much more complicated. T.scan will call the fn function at each step with the following argument list, in the following order:
fn( sequences (if any), prior results (if needed), non-sequences (if any) )
As shown in the following figure, three arrows are directed towards the fn step function and represent the three types of possible input at each time step in the loop:

If specified, the outputs_info parameter is the initial state from which to start the recurrence. The parameter name does not sound very good, but the initial state also gives the shape information of the last state, as well as of all the other states. The initial state can be seen as the first output. The final output will be an array of states.
For example, to compute the cumulative sum of a vector, with an initial state of the sum at 0, use this code:
>>> a = T.vector()
>>> s0 = T.scalar("s0")
>>> def fn( current_element, prior ):
... return prior + current_element
>>> results, updates = theano.scan(fn=fn,outputs_info=s0,sequences=a)
>>> f = theano.function([a,s0], results, updates=updates)
>>> f([0,3,5],0)
array([ 0., 3., 8.], dtype=float32)
When outputs_info is set, the first dimension of the outputs_info and sequence variables is the time step, and the second dimension is the dimensionality of the data at each time step.
In particular, outputs_info has the number of previous time steps required to compute the first step.
Here is the same example, but with a vector at each time step instead of a scalar for the input data:
>>> a = T.matrix() >>> s0 = T.scalar("s0") >>> def fn( current_element, prior ): ... return prior + current_element.sum() >>> results, updates = theano.scan(fn=fn,outputs_info=s0,sequences=a) >>> f = theano.function([a,s0], results, updates=updates) >>> f(numpy.ones((20,5)).astype(theano.config.floatX),0) array([ 5., 10., 15., 20., 25., 30., 35., 40., 45., 50., 55., 60., 65., 70., 75., 80., 85., 90., 95., 100.], dtype=float32)
The twenty steps along the rows (the time dimension) have accumulated the sum of all elements. Note that the initial state (here 0) given by the outputs_info argument is not part of the output sequence.
The recurrent function, fn, may be provided with some fixed data, independent of the step in the loop, thanks to the non_sequences scan parameter:
>>> a = T.vector() >>> s0 = T.scalar("s0") >>> def fn( current_element, prior, non_seq ): ... return non_seq * prior + current_element >>> results, updates = theano.scan(fn=fn,n_steps=10,sequences=a,outputs_info=T.constant(0.0),non_sequences=s0) >>> f = theano.function([a,s0], results, updates=updates) >>> f(numpy.ones((20)).astype(theano.),5) array([ 1.00000000e+00, 6.00000000e+00, 3.10000000e+01, 1.56000000e+02, 7.81000000e+02, 3.90600000e+03, 1.95310000e+04, 9.76560000e+04, 4.88281000e+05, 2.44140600e+06], dtype=float32)
It multiplies the prior value by 5 and adds the new element.
Note that T.scan in the optimized graph on the GPU does not execute different iterations of the loop in parallel, even in the absence of recurrence.
For debugging purposes, Theano can print more verbose information and offers different optimization modes:
>>> theano.config.exception_verbosity='high'
>>> theano.config.mode
'Mode'
>>> theano.config.optimizer='fast_compile'
In order for Theano to use the config.optimizer value, the mode has to be set to Mode, otherwise the value in config.mode will be used:
config.mode / function mode | config.optimizer (*) | Description
---|---|---
FAST_RUN | fast_run | Default; best run performance, slow compilation
FAST_RUN | None | Disable optimizations
FAST_COMPILE | fast_compile | Reduce the number of optimizations, compiles faster
None | | Use the default mode, equivalent to FAST_RUN
NanGuardMode | | NaNs, Infs, and abnormally big values will raise errors
DebugMode | | Self-checks and assertions during compilation
The same values as in config.mode can be used in the mode parameter of the function compilation:
>>> f = theano.function([a,s0], results, updates=updates, mode='FAST_COMPILE')
Disabling optimizations and choosing high verbosity will help you find errors in the computation graph.
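For instance, to locate the first node that produces a NaN or an infinite value, the function can be compiled with NanGuardMode; a small, self-contained sketch (the exact error message depends on your Theano version):
>>> import theano
>>> import theano.tensor as T
>>> from theano.compile.nanguardmode import NanGuardMode
>>> x = T.vector('x')
>>> f = theano.function([x], T.log(x),
...         mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True))
>>> f([-1., 1.])   # log(-1) is NaN: NanGuardMode raises an error naming the faulty apply node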
For debugging on the GPU, you need to set synchronous execution with the CUDA_LAUNCH_BLOCKING environment variable, since GPU execution is fully asynchronous by default:
CUDA_LAUNCH_BLOCKING=1 python
To find out the origin of the latencies in your computation graph, Theano provides a profiling mode.
Activate profiling:
>>> theano.config.profile=True
Activate memory profiling:
>>> theano.config.profile_memory=True
Activate profiling of optimization phase:
>>> theano.config.profile_optimizer=True
Or directly during compilation:
>>> f = theano.function([a,s0], results, profile=True)
>>> f.profile.summary()
Function profiling
==================
  Message: <stdin>:1
  Time in 1 calls to Function.__call__: 1.490116e-03s
  Time in Function.fn.__call__: 1.251936e-03s (84.016%)
  Time in thunks: 1.203537e-03s (80.768%)
  Total compile time: 1.720619e-01s
    Number of Apply nodes: 14
    Theano Optimizer time: 1.382768e-01s
       Theano validate time: 1.308680e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.405691e-02s
       Import time 1.272917e-03s
       Node make_thunk time 2.329803e-02s
Time in all call to theano.grad() 0.000000e+00s
Time since theano import 520.661s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  58.2%   58.2%   0.001s   7.00e-04s   Py   1   1   theano.scan_module.scan_op.Scan
  27.3%   85.4%   0.000s   1.64e-04s   Py   2   2   theano.sandbox.cuda.basic_ops.GpuFromHost
   6.1%   91.5%   0.000s   7.30e-05s   Py   1   1   theano.sandbox.cuda.basic_ops.HostFromGpu
   5.5%   97.0%   0.000s   6.60e-05s   C    1   1   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   1.1%   98.0%   0.000s   3.22e-06s   C    4   4   theano.tensor.elemwise.Elemwise
   0.7%   98.8%   0.000s   8.82e-06s   C    1   1   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.7%   99.4%   0.000s   7.87e-06s   C    1   1   theano.sandbox.cuda.basic_ops.GpuAllocEmpty
   0.3%   99.7%   0.000s   3.81e-06s   C    1   1   theano.compile.ops.Shape_i
   0.3%  100.0%   0.000s   1.55e-06s   C    2   2   theano.tensor.basic.ScalarFromTensor
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  58.2%   58.2%   0.001s   7.00e-04s   Py   1   1   forall_inplace,gpu,scan_fn}
  27.3%   85.4%   0.000s   1.64e-04s   Py   2   2   GpuFromHost
   6.1%   91.5%   0.000s   7.30e-05s   Py   1   1   HostFromGpu
   5.5%   97.0%   0.000s   6.60e-05s   C    1   1   GpuIncSubtensor{InplaceSet;:int64:}
   0.7%   97.7%   0.000s   8.82e-06s   C    1   1   GpuSubtensor{int64:int64:int16}
   0.7%   98.4%   0.000s   7.87e-06s   C    1   1   GpuAllocEmpty
   0.3%   98.7%   0.000s   4.05e-06s   C    1   1   Elemwise{switch,no_inplace}
   0.3%   99.0%   0.000s   4.05e-06s   C    1   1   Elemwise{le,no_inplace}
   0.3%   99.3%   0.000s   3.81e-06s   C    1   1   Shape_i{0}
   0.3%   99.6%   0.000s   1.55e-06s   C    2   2   ScalarFromTensor
   0.2%   99.8%   0.000s   2.86e-06s   C    1   1   Elemwise{Composite{Switch(LT(i0, i1), i0, i1)}}
   0.2%  100.0%   0.000s   1.91e-06s   C    1   1   Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)]
   ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  58.2%   58.2%   0.001s   7.00e-04s   1   12   forall_inplace,gpu,scan_fn}(TensorConstant{10}, GpuSubtensor{int64:int64:int16}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuFromHost.0)
  21.9%   80.1%   0.000s   2.64e-04s   1    3   GpuFromHost(<TensorType(float32, vector)>)
   6.1%   86.2%   0.000s   7.30e-05s   1   13   HostFromGpu(forall_inplace,gpu,scan_fn}.0)
   5.5%   91.6%   0.000s   6.60e-05s   1    4   GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, CudaNdarrayConstant{[ 0.]}, Constant{1})
   5.3%   97.0%   0.000s   6.41e-05s   1    0   GpuFromHost(s0)
   0.7%   97.7%   0.000s   8.82e-06s   1   11   GpuSubtensor{int64:int64:int16}(GpuFromHost.0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
   0.7%   98.4%   0.000s   7.87e-06s   1    1   GpuAllocEmpty(TensorConstant{10})
   0.3%   98.7%   0.000s   4.05e-06s   1    8   Elemwise{switch,no_inplace}(Elemwise{le,no_inplace}.0, TensorConstant{0}, TensorConstant{0})
   0.3%   99.0%   0.000s   4.05e-06s   1    6   Elemwise{le,no_inplace}(Elemwise{Composite{Switch(LT(i0, i1), i0, i1)}}.0, TensorConstant{0})
   0.3%   99.3%   0.000s   3.81e-06s   1    2   Shape_i{0}(<TensorType(float32, vector)>)
   0.3%   99.6%   0.000s   3.10e-06s   1   10   ScalarFromTensor(Elemwise{switch,no_inplace}.0)
   0.2%   99.8%   0.000s   2.86e-06s   1    5   Elemwise{Composite{Switch(LT(i0, i1), i0, i1)}}(TensorConstant{10}, Shape_i{0}.0)
   0.2%  100.0%   0.000s   1.91e-06s   1    7   Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)](Elemwise{le,no_inplace}.0, TensorConstant{0}, Elemwise{Composite{Switch(LT(i0, i1), i0, i1)}}.0, Shape_i{0}.0)
   0.0%  100.0%   0.000s   0.00e+00s   1    9   ScalarFromTensor(Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)].0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
The first concept is symbolic computing, which consists of building graphs that can be compiled and then executed wherever we decide in the Python code. A compiled graph acts as a function that can be called anywhere in the code. The purpose of symbolic computing is to abstract the architecture on which the graph will be executed, as well as the libraries used to compile it. As presented, symbolic variables are typed for the target architecture during compilation.
The second concept is the tensor, and the operators provided to manipulate tensors. Most of these were already available in CPU-based computation libraries such as NumPy or SciPy. They have simply been ported to symbolic computing, which requires their equivalents on the GPU. They use underlying acceleration libraries such as BLAS, NVIDIA CUDA, and cuDNN.
The last concept introduced by Theano is automatic differentiation, a very useful feature in deep learning for backpropagating errors and adjusting the weights following the gradients, a process known as gradient descent. Also, the scan operator enables us to program loops (while..., for...) on the GPU and, like the other operators, it is available through backpropagation as well, which simplifies the training of models a lot.
We are now ready to apply this to deep learning in the next few chapters and have a look at this knowledge in practice.