Answers
- The non-parallelizable part of the code is 10%, or 0.1. Therefore, according to Amdahl's law, the maximum speedup is 1/0.1 = 10×. If we want to reach 90% of that, we want to achieve a speedup of 9×. Amdahl's law gives the overall speedup as 1/(0.1 + 0.9/s), where s is the factor by which the parallelizable part is sped up. Setting this equal to 9 and solving for s gives s=81, so if we assume linear scaling, we need 81× the amount of compute resources. To reach 99% of the maximum speedup, we want to reach a speedup of 9.9×. This corresponds to s=891; in other words, we need 891× the amount of compute resources. In reality, we would need even more resources to account for imperfect scaling.
- Memory bottlenecks, coordination/synchronization/communication between workers, and workload imbalance between workers.
- No. The loop serves to sequentially update a single variable. If we distribute this over multiple threads, we will get non-deterministic behavior as different threads compete to update z.
- The key is that, while the Numba-decorated function looks like a Python function, it no longer is a Python function. Numba compiles the function to machine code, which is comparable in efficiency to compiled C or Fortran.
- No. Because the JITed function is no longer running on the Python interpreter, it is no longer being run line by line, and line profiling becomes meaningless.
- GPU threads are very lightweight, and the GPU is very good at efficiently scheduling those threads and switching between them. If we launch fewer threads than there is work to be done, each thread has to handle several tasks in sequence, so we are doing the scheduler's job by hand and serializing tasks that could have run in parallel. CPU threads are heavyweight, and typically, there is no benefit to adding more threads than the number of cores.
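
The arithmetic in the first answer can be checked with a short script. This is a minimal sketch, not code from the chapter: the function names `amdahl_speedup` and `required_parallel_speedup` are made up for illustration, and the only assumption is Amdahl's law in its usual form with a serial fraction of 0.1.

```python
def amdahl_speedup(serial_fraction, s):
    """Overall speedup when the parallelizable part runs s times faster."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / s)


def required_parallel_speedup(serial_fraction, target):
    """Solve Amdahl's law for s, given a target overall speedup."""
    max_speedup = 1.0 / serial_fraction
    if target >= max_speedup:
        raise ValueError("target must be below the maximum speedup")
    return (1.0 - serial_fraction) / (1.0 / target - serial_fraction)


serial = 0.1  # non-parallelizable fraction from the first answer
for fraction_of_max in (0.90, 0.99):
    target = fraction_of_max / serial  # 90% and 99% of the 10x maximum
    s = required_parallel_speedup(serial, target)
    print(f"{fraction_of_max:.0%} of maximum needs s = {s:.0f}")
```

Rounded to the nearest integer, the two values of s are 81 and 891, matching the answer above. The required resources grow roughly tenfold for the last 9% of the achievable speedup.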