You're reading from Jupyter for Data Science

Product type: Book
Published in: Oct 2017
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781785880070
Edition: 1st Edition
Author: Dan Toomey

Dan Toomey has been developing application software for over 20 years. He has worked in a variety of industries and companies, in roles from sole contributor to VP/CTO-level. For the last few years, he has been contracting for companies in the eastern Massachusetts area under Dan Toomey Software Corp. Dan has also written R for Data Science, Jupyter for Data Science, and the Jupyter Cookbook, all with Packt.

Chapter 10. Optimizing Jupyter Notebooks

Before you make a Jupyter Notebook available to others, you should consider the optimizations that need to be in place before the public starts accessing it. Optimizations run the gamut from language-specific issues (such as following best-practice R coding style) to deploying your notebook in a highly available environment.

Deploying notebooks


A Jupyter Notebook is a website. You could host a website on the computer that you are using to display this document. There may be a machine available in your department that is in use as a web server.

If you were to deploy on a local machine, you would have a single-user website where additional users would be blocked from access or would collide with each other. The first step towards publishing your notebook involves using a hosting service that provides multi-user access.

Deploying to JupyterHub

The predominant Jupyter hosting product currently is JupyterHub. To be clear, JupyterHub is installed into a machine under your control. It provides multi-user access to your notebooks. This means you could install JupyterHub on a machine in your environment and only internal users (multiple internal users) could access it.

When JupyterHub starts it begins a hub or controlling agent. The hub will start an instance of a listener or proxy for Jupyter requests. When the proxy...

Optimizing your script


There are optimizations that you can make to have your notebook scripts run more efficiently. The optimizations are script language dependent. We have covered using Python and R scripts in our notebooks and will cover optimizations that can be made for those two languages.

Jupyter also supports additional languages, such as Scala, and engines such as Spark. These have their own optimization tools and strategies.

Optimizing your Python scripts

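Performance tuning your Python scripts can be done using several tools and techniques:

  • timeit
  • Python regular expressions
  • String handling
  • Loop optimizations
  • hotshot profiling (hotshot is available only in Python 2; in Python 3 the standard library profiler is cProfile)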

Determining how long a script takes

The timeit module in Python takes a line of code and determines how long it takes to execute. You can also execute the same code repeatedly to see if there are start-up issues that need to be addressed.

timeit is used in this manner:

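import timeit
# the setup string must import the function itself; here it is assumed to be defined in the notebook
t = timeit.Timer("myfunction('Hello World')", "from __main__ import myfunction")
t.timeit()
3.32132323232...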
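As a self-contained illustration of the same idea, the sketch below times a stand-in function and then repeats the measurement. The body of myfunction is hypothetical (the original does not show it), and the call count of 10,000 is an arbitrary choice to keep the run short.

```python
import timeit


def myfunction(text):
    """Hypothetical stand-in for the function being timed."""
    return text.upper()


# globals= lets the timed statement see names defined in this script or notebook.
timer = timeit.Timer("myfunction('Hello World')", globals=globals())

# Total seconds for 10,000 calls (timeit() defaults to 1,000,000 calls).
total_seconds = timer.timeit(number=10_000)

# Repeat the measurement three times; a noticeably slower first run
# can hint at a start-up cost worth investigating.
runs = timer.repeat(repeat=3, number=10_000)

print(f"single measurement: {total_seconds:.6f} s")
print("repeated measurements:", [f"{r:.6f}" for r in runs])
```

The reported numbers are totals for all calls, so divide by the call count to estimate the cost of one call.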

Monitoring Jupyter


As with the earlier discussions in this chapter on optimization, you can also use programming tools to monitor the overall behavior of your notebook. The predominant tool for Linux/Mac environments is memory_profiler. If you start this tool and then your notebook, the profiler will keep track of your notebook's memory use.

With this record of data points, you may be able to reduce your program's memory footprint if you find a large memory use occurring. For example, the profiler may highlight that you are creating (and dropping) a large memory item continuously inside a loop. When you go back to your code, you may realize that this allocation could be pulled out of the loop and done just once, or that the size of the allocation could easily be reduced.
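The loop example can be sketched as follows. This is not memory_profiler itself: to stay free of third-party dependencies, the sketch swaps in the standard library's tracemalloc module to read peak memory. The function names, loop count, and buffer size are all assumptions made for illustration.

```python
import tracemalloc


def peak_traced_bytes(func):
    """Run func() and return (result, peak bytes traced while it ran)."""
    tracemalloc.start()
    try:
        result = func()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def buffer_inside_loop(rounds=5, size=200_000):
    total = 0
    for i in range(rounds):
        scratch = [0] * size  # large item created (and later dropped) on every pass
        scratch[0] = i
        total += scratch[0]
    return total


def buffer_hoisted(rounds=5, size=200_000):
    total = 0
    scratch = [0] * size  # allocated once, then reused on every pass
    for i in range(rounds):
        scratch[0] = i
        total += scratch[0]
    return total


loop_total, loop_peak = peak_traced_bytes(buffer_inside_loop)
hoisted_total, hoisted_peak = peak_traced_bytes(buffer_hoisted)

print("allocation inside loop, peak bytes:", loop_peak)
print("allocation hoisted out, peak bytes:", hoisted_peak)
```

Both functions compute the same total, so the comparison isolates the allocation pattern rather than the work being done.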

Caching your notebook


Caching is a common programming practice to speed up performance. If the computer does not have to reload a section of code, a variable, or a file, but can access it directly from a cache, performance improves.
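The general principle can be illustrated in a few lines of Python. This is only a sketch of in-process caching with the standard library's functools.lru_cache, not the Docker-based notebook caching discussed in this section; load_dataset and its contents are hypothetical.

```python
from functools import lru_cache

load_count = 0  # counts how often the expensive work is really performed


@lru_cache(maxsize=None)
def load_dataset(name):
    """Hypothetical expensive load; results are cached per dataset name."""
    global load_count
    load_count += 1
    return tuple(range(5))  # stand-in for data read from disk or a network


first = load_dataset("sales")   # performs the load
second = load_dataset("sales")  # served directly from the cache

print("loads performed:", load_count)
print("cache statistics:", load_dataset.cache_info())
```

The second call returns the stored result without running the function body again, which is the saving that any cache aims for.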

There is a mechanism to cache your notebook if you are deploying into a Docker environment. Docker is a mechanism for running code in many isolated containers on one machine. This has become common practice in the Java programming world. Luckily, Docker is very flexible and a method has been worked out to use Jupyter in Docker as well. Once in Docker, it is a minor adjustment to automatically cache your pages. The underlying tool used is memcached, a widely used tool for caching anything, in this case Jupyter Notebooks.

Securing a notebook


Securing a notebook can be accomplished by several methods such as:

  • Managing authorization
  • Securing notebook content

Managing notebook authorization

A notebook can be secured so that it requires credentials before it can be used. Authentication is on by default in your notebook. Jupyter uses a token or password rather than a username/password pair. See the Jupyter documentation on implementing authorization, as this has changed slightly over time.

Securing notebook content

Several parts of standard notebook content pose possible security issues and are secured automatically by Jupyter:

  • Untrusted HTML is sanitized
  • Untrusted JavaScript is not executed
  • HTML and JavaScript in markdown cells are not trusted
  • Notebook output is not trusted
  • Other HTML or JavaScript in the notebook is not trusted

Trust comes down to the question: did the user do this, or did the Jupyter script? Untrusted means it will not be generated.

Sanitized code is wrapped to force the values to...

Scaling Jupyter Notebooks


Scaling is the process of giving very large numbers of concurrent users access to a notebook without a degradation in performance. One vendor that is doing this today is Azure. They have thousands of pages and users working at scale daily.

Most amazingly, this is a free service.

Sharing Jupyter Notebooks


Jupyter Notebooks can be shared by placing the notebook on a server (there are several kinds) or converting the notebook to another format (it will not be interactive, but the content will be available).

Sharing Jupyter Notebook on a notebook server

Built into the notebook configuration are extensions that can be used to expose a notebook server directly. The notebook configuration can be generated using the following command:

jupyter notebook --generate-config

In the resulting jupyter_notebook_config.py file there are settings that can be used to set:

  • IP/port address of your notebook
  • Encryption certificate location
  • Password

By setting these and starting Jupyter, you should be able to access the notebook at the IP address specified from other machines in your network.
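As a rough sketch, the settings listed above might look like this in jupyter_notebook_config.py for the classic Notebook server. The address, port, and file paths are placeholder assumptions, and the password value must be replaced with a real hash. Newer Jupyter Server releases use c.ServerApp in place of c.NotebookApp, so check the names against your installed version.

```python
# Fragment of jupyter_notebook_config.py (classic Notebook server).
# It is read by Jupyter at start-up and is not meant to be run on its own.
c = get_config()  # supplied by Jupyter when it loads this file

# IP/port address of your notebook (placeholder values)
c.NotebookApp.ip = "0.0.0.0"   # listen on all network interfaces
c.NotebookApp.port = 8888
c.NotebookApp.open_browser = False

# Encryption certificate location (hypothetical paths)
c.NotebookApp.certfile = "/path/to/mycert.pem"
c.NotebookApp.keyfile = "/path/to/mykey.key"

# Password: a hashed value, for example one returned by notebook.auth.passwd()
c.NotebookApp.password = "<hashed password>"
```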

Note

You should work with your network security personnel before doing so.

Sharing encrypted Jupyter Notebook on a notebook server

If you specify the certificate information correctly in the previous configuration...

Converting a notebook


You can also share a notebook with others by converting the notebook to a readable form for recipients. Notebooks can be converted to a number of formats using the Download As feature in the notebook File menu.

Notebooks can be converted in this way to the formats:

  • <language> format: This option depends on the language used to create the notebook. For example, an R notebook would have the choice to Download as R script.
  • HTML: This representation is the HTML encoding to display the page as it appears in your notebook using HTML constructs.
  • Markdown: Markdown is a lightweight plain-text markup format.
  • reST: reStructuredText, another lightweight markup format that has simpler display constructs than HTML.
  • PDF.
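The same conversions can also be scripted rather than run from the menu. The sketch below assumes the jupyter nbconvert command-line tool is installed; that tool is not covered in the text above. It builds the command for each format in the list, and the helper names and the notebook filename analysis.ipynb are hypothetical.

```python
import subprocess

# nbconvert --to values matching the formats listed above.
MENU_FORMATS = {"script", "html", "markdown", "rst", "pdf"}


def build_nbconvert_command(notebook_path, output_format):
    """Return the argument list for converting one notebook."""
    if output_format not in MENU_FORMATS:
        raise ValueError(f"format not in the list above: {output_format}")
    return ["jupyter", "nbconvert", "--to", output_format, notebook_path]


def convert_notebook(notebook_path, output_format):
    """Run the conversion; requires Jupyter with nbconvert installed."""
    subprocess.run(build_nbconvert_command(notebook_path, output_format), check=True)


# Only builds and prints the command, so this runs without Jupyter installed.
print(build_nbconvert_command("analysis.ipynb", "html"))
```

Calling convert_notebook performs the actual conversion. PDF output additionally needs a LaTeX installation on the machine.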

Versioning a notebook


A common practice in the programming world is to maintain a history of the changes made to a program. Over time the different versions of the program are maintained in a software repository where the programmer can retrieve prior versions to return to an older, working state of their program.

In the previous section we mentioned placing your notebook on GitHub. Git is a version control system in wide use. GitHub is an internet-based hosting service for Git repositories. Once you have any software in Git, every change you commit is versioned. The next time you update your notebook in GitHub, Git will take the current instance, store it as a version in your history, and place the new instance as the current one, so anyone accessing your GitHub repository will see the latest version by default.

Summary


In this chapter, we deployed our notebook to a set of different environments. We looked into optimizations that can be made to our notebook scripts. We learned about different ways to share our notebook. Lastly, we looked into converting our notebook for users without access to Jupyter.
