Hands-On GPU Programming with Python and CUDA

Assessment

Chapter 1, Why GPU Programming?

  1. The first two for loops iterate over every pixel, and each pixel's output is independent of all the others; we can therefore parallelize over these two for loops. The third for loop calculates the final value of a particular pixel, which is an intrinsically recursive (serial) computation.
  2. Amdahl's Law doesn't account for the time it takes to transfer memory between the GPU and the host.
  3. 512 x 512 amounts to 262,144 pixels. This means that the first GPU can only calculate the outputs of half of the pixels at once, while the second GPU can calculate all of the pixels at once; this means the second GPU will be about twice as fast as the first here. The third GPU has more than sufficient cores to calculate all pixels at once, but as we saw in problem 1, the extra cores will be of no use to us here. So the second and third GPUs will be equally fast for this problem.
  4. One issue with generically...

Chapter 2, Setting Up Your GPU Programming Environment

  1. No. CUDA only supports Nvidia GPUs, not Intel HD Graphics or AMD Radeon GPUs.
  2. The examples in this book only use Python 2.7.
  3. Device Manager
  4. lspci
  5. free
  6. .run

Chapter 3, Getting Started with PyCUDA

  1. Yes.
  2. Memory transfers between host/device, and compilation time.
  3. You can, but this will vary depending on your GPU and CPU setup.
  4. Do this using the C ternary operator (?:) for both the point-wise and reduce operations.
  5. If a gpuarray object goes out of scope its destructor is called, which will deallocate (free) the memory it represents on the GPU automatically.
  6. ReductionKernel may perform superfluous operations, which may be necessary depending on how the underlying GPU code is structured. A neutral element will ensure that no values are altered as a result of these superfluous operations.
  7. We should set neutral to the smallest possible value of a signed 32-bit integer (see the sketch after this list).
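
A minimal PyCUDA sketch of answers 4 and 7 (the kernel name and test data below are made up for illustration): a max-reduction built with ReductionKernel, using the C ternary operator as the reduce expression and the smallest signed 32-bit integer as the neutral element, so that superfluous reduction steps can never alter the result.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.reduction import ReductionKernel

    # Max-reduction over signed 32-bit integers. The neutral element is INT_MIN
    # (written as -2147483647 - 1 to avoid a literal-overflow warning), so any
    # superfluous reduction steps leave the final maximum unchanged.
    max_kernel = ReductionKernel(np.int32, neutral="-2147483647 - 1",
                                 reduce_expr="a > b ? a : b",
                                 map_expr="in[i]",
                                 arguments="int *in")

    x = gpuarray.to_gpu(np.random.randint(-1000, 1000, 100000).astype(np.int32))
    print(max_kernel(x).get())   # should agree with x.get().max()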

Chapter 4, Kernels, Threads, Blocks, and Grids

  1. Try it.
  2. Not all of the threads operate on the GPU simultaneously. Much like a CPU switching between tasks in an OS, the individual cores of the GPU switch between the different threads for a kernel.
  3. O((n/640) log n), that is, O(n log n).
  4. Try it.

  1. There is actually no internal grid-level synchronization in CUDA—only block-level (with __syncthreads). We have to synchronize anything above a single block with the host.
  2. Naive: 129 addition operations. Work-efficient: 62 addition operations.
  3. Again, we can't use __syncthreads if we need to synchronize over a large grid of blocks. If we instead synchronize on the host, we can also launch fewer threads on each iteration, freeing up more resources for other operations (a host-synchronized sketch follows this list).
  4. In the case of a naive parallel sum, we will likely be working with only a small number of data points that...
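
A rough sketch of host-side grid synchronization (answers 1 and 3): the array-summing kernel below is hypothetical, but it shows launching a separate kernel per reduction pass, with fewer threads on each pass, and synchronizing on the host between launches since CUDA offers no grid-wide __syncthreads.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void sum_pass(float *x, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] += x[i + stride];
    }
    """)
    sum_pass = mod.get_function("sum_pass")

    x = gpuarray.to_gpu(np.ones(1024, dtype=np.float32))
    stride = 512
    while stride >= 1:
        # each pass needs only 'stride' threads
        sum_pass(x, np.int32(stride),
                 block=(min(stride, 64), 1, 1),
                 grid=(max(stride // 64, 1), 1))
        # synchronize on the host before the next pass reads these results
        pycuda.autoinit.context.synchronize()
        stride //= 2
    print(x.get()[0])   # 1024.0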

Chapter 5, Streams, Events, Contexts, and Concurrency

  1. The performance improves for both; as we increase the number of threads, the GPU reaches peak utilization in both cases, reducing the gains made through using streams.
  2. Yes, you can launch an arbitrary number of kernels asynchronously and synchronize them all with cudaDeviceSynchronize (see the sketch after this list).
  3. Open up your text editor and try it!
  4. High standard deviation would mean that the GPU is being used unevenly, overwhelming the GPU at some points and under-utilizing it at others. A low standard deviation would mean that all launched operations are running generally smoothly.
  5. i. The host can generally handle far fewer concurrent threads than a GPU. ii. Each thread requires its own CUDA context. The GPU can become overwhelmed with excessive contexts, since each has its own memory space and has to handle its own loaded executable code.
...
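
A short sketch of answer 2, assuming PyCUDA (the kernel and array sizes are arbitrary): several kernels are launched asynchronously, each on its own stream, and then everything is synchronized at once with a device-wide synchronization; PyCUDA's context.synchronize() plays the role of cudaDeviceSynchronize here.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void double_kernel(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] *= 2.0f;
    }
    """)
    double_kernel = mod.get_function("double_kernel")

    streams = [drv.Stream() for _ in range(4)]
    arrays = [gpuarray.to_gpu_async(np.random.rand(1024).astype(np.float32), stream=s)
              for s in streams]

    # launch one kernel asynchronously on each stream
    for arr, s in zip(arrays, streams):
        double_kernel(arr, block=(64, 1, 1), grid=(16, 1), stream=s)

    # one device-wide synchronization for all of the launches above
    pycuda.autoinit.context.synchronize()
    print(arrays[0].get()[:4])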

Chapter 6, Debugging and Profiling Your CUDA Code

  1. Memory allocations are automatically synchronized in CUDA.
  2. The lockstep property only holds within a single warp, that is, in blocks of size 32 or less. Here, the two blocks would properly diverge without any lockstep.
  3. The same thing would happen here. This 64-thread block would actually be split into two 32-thread warps.
  4. Nvprof can time individual kernel launches, GPU utilization, and stream usage; any host-side profiler would only see CUDA host functions being launched.
  5. printf is generally easier to use for small-scale projects with relatively short, inline kernels (see the sketch after this list). If you write a very involved CUDA kernel with thousands of lines, then you would probably want to use the IDE to step through and debug your kernel line by line.
  6. This tells CUDA which GPU we want to use.
  7. cudaDeviceSynchronize will ensure that interdependent kernel launches and mem copies...
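
A small sketch of answer 5 (the kernel below is made up): device-side printf is often the quickest way to debug a short, inline kernel from PyCUDA.

    import pycuda.autoinit
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    #include <stdio.h>
    __global__ void hello_kernel()
    {
        printf("Hello from thread %d of block %d\\n", threadIdx.x, blockIdx.x);
    }
    """)
    hello_kernel = mod.get_function("hello_kernel")
    hello_kernel(block=(4, 1, 1), grid=(2, 1))
    # synchronizing flushes the device-side printf buffer to the console
    pycuda.autoinit.context.synchronize()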

Chapter 7, Using the CUDA Libraries with Scikit-CUDA

  1. SBLAH starts with an S, so this function uses 32-bit real floats. ZBLEH starts with a Z, which means it works with 128-bit complex floats.
  2. Hint: set trans = cublas._CUBLAS_OP['T']
  3. Hint: use the Scikit-CUDA wrapper to the dot product, skcuda.cublas.cublasSdot
  4. Hint: build upon the answer to the last problem.
  5. You can put the cuBLAS operations in a CUDA stream and use event objects with this stream to precisely measure the computation times on the GPU (see the sketch after this list).
  6. Since the input appears to cuFFT as complex, it will calculate all of the values, just as NumPy's FFT does.
  7. The dark edge is due to the zero-buffering around the image. This can be mitigated by mirroring the image on its edges rather than by using a zero-buffer.
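
A hedged sketch of answer 5 (array sizes and names are arbitrary): a cuBLAS dot product is placed on a CUDA stream and timed on the GPU with a pair of event objects.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    import pycuda.driver as drv
    from skcuda import cublas

    x = gpuarray.to_gpu(np.random.rand(10**7).astype(np.float32))
    y = gpuarray.to_gpu(np.random.rand(10**7).astype(np.float32))

    stream = drv.Stream()
    start, end = drv.Event(), drv.Event()

    handle = cublas.cublasCreate()
    cublas.cublasSetStream(handle, stream.handle)   # run the cuBLAS call on our stream

    start.record(stream)
    dot = cublas.cublasSdot(handle, x.size, x.gpudata, 1, y.gpudata, 1)
    end.record(stream)
    end.synchronize()

    print('dot product: %f' % dot)
    print('GPU time: %f ms' % start.time_till(end))
    cublas.cublasDestroy(handle)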

Chapter 8, The CUDA Device Function Libraries and Thrust

  1. Try it. (It's actually more accurate than you'd think.)
  2. One application: a Gaussian distribution can be used to add white noise to samples to augment a dataset in machine learning.
  3. No. Since they are generated from different seeds, these lists may have a strong correlation if we concatenate them together. We should use subsequences of the same seed if we plan to concatenate them (see the cuRAND sketch after this list).
  4. Try it.
  5. Hint: remember that matrix multiplication can be thought of as a series of matrix-vector multiplications, while matrix-vector multiplication can be thought of as a series of dot products.
  6. operator() is used to define the functor's actual function.
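
A hedged PyCUDA sketch of answer 3 (kernel name and seed are arbitrary): every thread initializes the cuRAND device API with the same seed but its own subsequence, which gives statistically independent streams that are safe to use together.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    #include <curand_kernel.h>
    extern "C" {
    __global__ void gen_uniform(float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        curandState state;
        // same seed for every thread, but a distinct subsequence per thread
        curand_init(1234, i, 0, &state);
        out[i] = curand_uniform(&state);
    }
    }
    """, no_extern_c=True)

    gen_uniform = mod.get_function("gen_uniform")
    out = gpuarray.empty((1024,), dtype=np.float32)
    gen_uniform(out, block=(32, 1, 1), grid=(32, 1))
    print(out.get()[:5])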

Chapter 9, Implementation of a Deep Neural Network

  1. One problem could be that we haven't normalized our training inputs. Another could be that the training rate was too large.
  2. With a small training rate a set of weights might converge very slowly, or not at all.
  3. A large training rate can lead to a set of weights being over-fitted to particular batches or to this particular training set. Also, it can lead to numerical overflows/underflows, as in the first problem.
  4. Sigmoid.
  5. Softmax.
  6. More updates.

Chapter 10, Working with Compiled GPU Code

  1. Only the EXE file will have the host functions, but both the PTX and EXE will contain the GPU code.
  2. cuCtxDestroy (see the ctypes sketch after this list).
  3. printf with arbitrary input parameters. (Try looking up the printf prototype.)
  4. With a ctypes c_void_p object.
  5. This will allow us to link to the function with its original name from Ctypes.
  6. Device memory allocations and memcopies between device/host are automatically synchronized by CUDA.
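
A minimal ctypes sketch touching answers 2 and 4, assuming a Linux machine where the NVIDIA driver exposes libcuda.so (on some systems it is libcuda.so.1, or nvcuda.dll on Windows): device and context handles from the CUDA Driver API are held in plain c_int/c_void_p objects, and the context is cleaned up with cuCtxDestroy.

    import ctypes

    cuda = ctypes.CDLL('libcuda.so')    # the CUDA Driver API library

    cuda.cuInit(0)

    device = ctypes.c_int(0)
    cuda.cuDeviceGet(ctypes.byref(device), 0)

    context = ctypes.c_void_p()                     # CUcontext handle as a c_void_p
    cuda.cuCtxCreate(ctypes.byref(context), 0, device)

    # ... load a PTX module and launch kernels here ...

    cuda.cuCtxDestroy(context)                      # answer 2: destroy the context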

Chapter 11, Performance Optimization in CUDA

  1. The fact that atomicExch is thread-safe doesn't guarantee that all threads will execute this function at the same time; indeed, different blocks in a grid can be executed at different times.
  2. A block of size 100 will be executed over multiple warps, which will not be synchronized within the block unless we use __syncthreads. Thus, atomicExch may be called multiple times.
  3. Since a warp executes in lockstep by default, and blocks of size 32 or less are executed with a single warp, __syncthreads would be unnecessary.
  4. We use a naïve parallel sum within the warp, but otherwise, we are doing as many sums with atomicAdd as we would with a serial sum (see the atomicAdd sketch after this list). While CUDA automatically parallelizes many of these atomicAdd invocations, we could reduce the total number of required atomicAdd invocations by implementing...
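
A rough PyCUDA sketch related to answers 1 and 4 (kernel and sizes are arbitrary): every thread adds its element into a single output with atomicAdd. The atomics are thread-safe but serialize on the output, so reducing within each warp or block first would cut the number of atomicAdd calls.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void atomic_sum(float *x, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        atomicAdd(out, x[i]);   // thread-safe, but every call serializes on *out
    }
    """)
    atomic_sum = mod.get_function("atomic_sum")

    x = gpuarray.to_gpu(np.ones(1024, dtype=np.float32))
    out = gpuarray.zeros((1,), dtype=np.float32)
    atomic_sum(x, out, block=(32, 1, 1), grid=(32, 1))
    print(out.get()[0])   # 1024.0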

Chapter 12, Where to Go from Here

  1. Two examples: DNA analysis and physics simulations.
  2. Two examples: OpenACC, Numba.
  3. TPUs are only used for machine learning operations and lack the components required to render graphics.
  4. Ethernet.