Before a Jupyter Notebook is opened to the public, you should address the optimizations it needs. These cover a gamut of options, from language-specific issues (such as using best-practice R coding style) to deploying your notebook in a highly available environment.
You're reading from Jupyter for Data Science
A Jupyter Notebook is a website. You could host a website on the computer that you are using to display this document. There may be a machine available in your department that is in use as a web server.
If you were to deploy on a local machine, you would have a single-user website where additional users would be blocked from access or would collide with each other. The first step towards publishing your notebook involves using a hosting service that provides multi-user access.
The predominant Jupyter hosting product currently is JupyterHub. To be clear, JupyterHub is installed into a machine under your control. It provides multi-user access to your notebooks. This means you could install JupyterHub on a machine in your environment and only internal users (multiple internal users) could access it.
When JupyterHub starts it begins a hub or controlling agent. The hub will start an instance of a listener or proxy for Jupyter requests. When the proxy...
There are optimizations that you can make to have your notebook scripts run more efficiently. The optimizations are script language dependent. We have covered using Python and R scripts in our notebooks and will cover optimizations that can be made for those two languages.
Jupyter also supports additional languages, such as Scala (commonly used with Spark). These other languages have their own optimization tools and strategies.
Performance tuning your Python scripts can be done using several tools:
- timeit
- Python regular expressions
- String handling
- Loop optimizations
- hotshot profiling
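As a rough illustration of the regular expression, string handling, and loop items in the list above, the following sketch shows one common pattern for each. The sample data and variable names are hypothetical and chosen only for this example.

```python
import re

# Hypothetical sample data for illustration only.
lines = ["error: disk full", "ok", "error: timeout", "ok"]

# Regular expressions: compile the pattern once, outside the loop,
# rather than passing the pattern string to re.match on every iteration.
ERROR_RE = re.compile(r"^error: (.+)$")

messages = []
for line in lines:
    match = ERROR_RE.match(line)
    if match:
        messages.append(match.group(1))

# String handling: build the result with a single join instead of
# repeated "+=" concatenation inside a loop.
report = ", ".join(messages)

# Loop optimizations: compute loop-invariant values once, before the loop.
scale = 1.0 / len(lines)
weights = [len(line) * scale for line in lines]

print(report)
```

Whether any of these changes matters for a given notebook is best confirmed by measuring before and after.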
The timeit module in Python takes a line of code and determines how long it takes to execute. You can also execute the same code repeatedly to see if there are start-up issues that need to be addressed.
timeit is used in this manner:
    import timeit

    # assumes a module named myfunction that defines a function myfunction
    t = timeit.Timer("myfunction('Hello World')", "from myfunction import myfunction")
    t.timeit()
    3.32132323232...
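A self-contained variant of the same idea is sketched below. It assumes a stand-in function defined in the same file (the body of myfunction here is hypothetical) and passes it to the timer through the globals argument, so no separate module is needed. It also uses repeat() to run several rounds, which helps reveal start-up effects.

```python
import timeit


def myfunction(text):
    """Hypothetical stand-in for the function being measured."""
    return text.upper()


# globals= makes myfunction visible to the timed statement.
timer = timeit.Timer("myfunction('Hello World')",
                     globals={"myfunction": myfunction})

# number= sets how many times the statement runs (the default is 1,000,000).
elapsed = timer.timeit(number=10_000)

# repeat() runs several independent rounds; a slow first round can
# point to start-up costs, and the minimum is usually the most stable figure.
rounds = timer.repeat(repeat=3, number=10_000)
best = min(rounds)

print(f"single run: {elapsed:.6f}s, best of 3: {best:.6f}s")
```

The reported times depend on the machine, so they are useful mainly for comparing two versions of the same code.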
As with the earlier discussions in this chapter on optimization, you can also use programming tools to monitor the overall behavior of your notebook. The predominant tool for Linux/Mac environments is memory_profiler. If you start this tool and then your notebook, the profiler will keep track of your notebook's memory use.
With this record, you may be able to reduce your program's memory footprint wherever the profiler shows heavy use. For example, the profiler may highlight that you are creating (and dropping) a large object repeatedly inside a loop. When you go back to your code, you may realize that this allocation could be pulled out of the loop and done just once, or that the size of the allocation could easily be reduced.
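The kind of comparison described here can be sketched with the standard library. Note the substitution: this example uses tracemalloc instead of memory_profiler so that it runs without third-party packages. The two functions and the peak_bytes helper are hypothetical and exist only to show a large allocation being reduced.

```python
import tracemalloc


def peak_bytes(func, *args):
    """Run func and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def total_with_list(n):
    # Materializes all n squares in memory before summing them.
    return sum([i * i for i in range(n)])


def total_with_generator(n):
    # Produces one square at a time, so no large list is allocated.
    return sum(i * i for i in range(n))


list_total, list_peak = peak_bytes(total_with_list, 100_000)
gen_total, gen_peak = peak_bytes(total_with_generator, 100_000)

print(f"list peak: {list_peak} bytes, generator peak: {gen_peak} bytes")
```

Both functions return the same total; only the peak memory differs, which is the kind of signal a memory profiler surfaces.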
Caching is a common programming practice to speed up performance. If the computer does not have to reload a section of code, a variable, or a file, but can access it directly from a cache, performance improves.
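As a minimal in-process sketch of the caching idea, the following uses functools.lru_cache from the standard library. The load_dataset function and its return value are hypothetical; a real notebook would read a file or query a service at that point.

```python
from functools import lru_cache

call_count = 0  # counts how often the expensive body actually runs


@lru_cache(maxsize=None)
def load_dataset(name):
    """Hypothetical expensive load; the body runs once per distinct name."""
    global call_count
    call_count += 1
    return tuple(range(5))  # stand-in for real data


first = load_dataset("sales")   # runs the function body
second = load_dataset("sales")  # served from the cache

print(call_count, load_dataset.cache_info())
```

The second call returns the cached object without re-running the body, so call_count stays at 1.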
There is a mechanism to cache your notebook if you are deploying into a Docker space. Docker is a mechanism for running code in many isolated instances (containers) on one machine. It has become common practice to do so in the Java programming world. Luckily, Docker is very flexible, and a method has been worked out to use Jupyter in Docker as well. Once in Docker, it is a minor adjustment to automatically cache your pages. The underlying tool used is memcached, another widely used tool for caching anything, in this case Jupyter Notebooks.
Securing a notebook can be accomplished by several methods such as:
- Managing authorization
- Securing notebook content
A notebook can be secured to use username/password authorization. Authorization is on by default in your notebook. Under Jupyter it is token/password instead of username/password, as a token is more open to interpretation. See the Jupyter documentation on implementing authorization, as this has changed slightly over time.
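To make the token idea concrete, here is a generic sketch of generating a random token and checking a presented value against it. This is not Jupyter's implementation; the new_token and is_valid helpers are hypothetical and use only the standard library.

```python
import hmac
import secrets


def new_token(nbytes=24):
    """Return a random hex token (two hex characters per byte)."""
    return secrets.token_hex(nbytes)


def is_valid(presented, expected):
    """Compare tokens in constant time to avoid timing side channels."""
    return hmac.compare_digest(presented, expected)


token = new_token()
print(len(token), is_valid(token, token), is_valid("guess", token))
```

A token of this kind carries no user identity, which is one way it differs from a username/password pair.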
A notebook has possible security issues with several parts of standard content that are secured automatically by Jupyter:
- Untrusted HTML is sanitized
- Untrusted JavaScript is not executed
- HTML and JavaScript in markdown cells are not trusted
- Notebook output is not trusted
- Other HTML or JavaScript in the notebook is not trusted
Trust comes down to the question: did the user do this, or did the Jupyter script? Untrusted means the content will not be generated.
Sanitized code is wrapped to force the values to...
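The general idea of neutralizing untrusted markup can be sketched with the standard library's html.escape. This is only an illustration of escaping; Jupyter's own sanitizer works differently, and the untrusted string below is a hypothetical example.

```python
import html

# Hypothetical untrusted content, for example from a shared notebook.
untrusted = "<script>alert('hi')</script>"

# Escaping turns markup characters into entities, so a browser
# displays the text instead of executing it.
escaped = html.escape(untrusted)

print(escaped)
```

After escaping, the string no longer contains a literal script tag, and html.unescape recovers the original text if it is needed for display elsewhere.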
Scaling is the process of serving very large numbers of concurrent users of a notebook without a degradation in performance. The one vendor that is doing this today is Azure. They have thousands of pages and users working at scale daily.
Most amazingly, this is a free service.
You can also share a notebook with others by converting the notebook to a readable form for recipients. Notebooks can be converted to a number of formats using the Download As feature in the notebook File menu.
Notebooks can be converted in this way to the following formats:
- <language> format: This option depends on the language used to create the notebook. For example, an R notebook would have the choice to Download as R script.
- HTML: This representation is the HTML encoding needed to display the page as it appears in your notebook.
- Markdown: A simple plain-text markup format.
- reST: reStructuredText, another lightweight markup format with simpler display constructs than HTML.
- PDF
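To show roughly what a script export involves, the sketch below pulls the code cells out of a notebook's JSON structure. It is a simplified illustration, not the converter that Download As uses, and the notebook dictionary and helper names are hypothetical. It relies only on the fact that an .ipynb file is JSON with a list of cells, each having a cell_type and a source.

```python
import json

# Hypothetical minimal notebook; a real one would come from json.load on a file.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["x = 1\n", "print(x)"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": "y = x + 1"},
    ],
}


def cell_source(cell):
    """A cell's source may be a list of lines or a single string."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def notebook_to_script(nb):
    """Join the source of every code cell, skipping markdown cells."""
    chunks = [cell_source(c) for c in nb.get("cells", [])
              if c.get("cell_type") == "code"]
    return "\n\n".join(chunks) + "\n"


# Round-trip through JSON to mimic reading the notebook from disk.
script = notebook_to_script(json.loads(json.dumps(notebook)))
print(script)
```

Markdown cells and outputs are dropped here; a full converter would typically preserve them as comments or separate files.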
A common practice in the programming world is to maintain a history of the changes made to a program. Over time the different versions of the program are maintained in a software repository where the programmer can retrieve prior versions to return to an older, working state of their program.
In the previous section we mentioned placing your notebook on GitHub. Git is a version control system in wide use. GitHub is an internet-based hosting service for Git repositories. Once you have any software in Git, each committed change is versioned. The next time you update your notebook in GitHub, Git will take the current instance, store it as a version in your history, and make the new instance the current one. Anyone accessing your GitHub repository will see the latest version by default.