
You're reading from  Hands-On Data Science with Anaconda

Product type: Book
Published in: May 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781788831192
Edition: 1st Edition
Authors (2):
Yuxing Yan

Yuxing Yan graduated from McGill University with a PhD in finance. Over the years, he has been teaching various finance courses at eight universities: McGill University and Wilfrid Laurier University (in Canada), Nanyang Technological University (in Singapore), Loyola University of Maryland, UMUC, Hofstra University, University at Buffalo, and Canisius College (in the US). His research and teaching areas include: market microstructure, open-source finance and financial data analytics. He has 22 publications including papers published in the Journal of Accounting and Finance, Journal of Banking and Finance, Journal of Empirical Finance, Real Estate Review, Pacific Basin Finance Journal, Applied Financial Economics, and Annals of Operations Research. He is good at several computer languages, such as SAS, R, Python, Matlab, and C. His four books are related to applying two pieces of open-source software to finance: Python for Finance (2014), Python for Finance (2nd ed., expected 2017), Python for Finance (Chinese version, expected 2017), and Financial Modeling Using R (2016). In addition, he is an expert on data, especially on financial databases. From 2003 to 2010, he worked at Wharton School as a consultant, helping researchers with their programs and data issues. In 2007, he published a book titled Financial Databases (with S.W. Zhu). This book is written in Chinese. Currently, he is writing a new book called Financial Modeling Using Excel — in an R-Assisted Learning Environment. The phrase "R-Assisted" distinguishes it from other similar books related to Excel and financial modeling. 
New features include using a huge amount of public data related to economics, finance, and accounting; an efficient way to retrieve data: 3 seconds for each time series; a free financial calculator, showing 50 financial formulas instantly, 300 websites, 100 YouTube videos, 80 references, paperless for homework, midterms, and final exams; easy to extend for instructors; and especially, no need to learn R.

James Yan

James Yan is an undergraduate student at the University of Toronto (UofT), currently double-majoring in computer science and statistics. He has hands-on knowledge of Python, R, Java, MATLAB, and SQL. During his study at UofT, he has taken many related courses, such as Methods of Data Analysis I and II, Methods of Applied Statistics, Introduction to Databases, Introduction to Artificial Intelligence, and Numerical Methods, including a capstone course on AI in clinical medicine.

Managing Packages

In the preface, we mentioned that this book is for readers who are looking for data science tools. Researchers and practitioners in data science face several important issues. First, they need to understand their raw data: its purpose, its structure, how reliable and complex it is, and how it was collected. Second, they need a good method of processing that data. In other words, they should master at least one computer language, such as R, Python, or Julia. After learning a language's basics, they should turn to its related packages, since how well they understand these packages may determine how far they can go in data science. In this chapter, the following topics will be covered:

  • Introduction to packages, modules, or toolboxes
  • Two examples of using packages
  • Finding...

Introduction to packages, modules, or toolboxes

Over the years, researchers and users have built many packages for specific tasks in various programming languages. In this book, we treat module and toolbox as synonyms for package. In data science, using packages is essential, and doing so has several advantages. First, we don't have to write code from scratch if a package already contains the programs we need. This saves a huge amount of time; in other words, we don't have to reinvent the wheel, which matters especially for developers. Second, packages are usually developed by people with expertise in the relevant area. Because of this, the quality of a package is usually higher than the programs written by a, relatively speaking...

Two examples of using packages

It is always a good idea to use examples to illustrate how useful or important it is to understand some closely related packages. The first example is extremely simple: generate a QR code for the CNN website. It has just two lines. Note that you need to run install.packages("qrcode") if the package is not preinstalled:

> library(qrcode) 
> qrcode_gen("https://www.cnn.com") 

The generated QR code is shown here. Users can use the QR scanner installed on their cell phone to go to the CNN website:

For the second example, we believe the best example for the researchers and users in the area of data science is an R package called rattle. If a user hasn't preinstalled the package, they can type the following line of R code:

> install.packages("rattle")

To launch the package, type the following two lines of R code...

Finding all R packages

For R-related packages, go to http://r-project.org first. Click on CRAN and choose a mirror location, then click Packages on the left-hand side. We can see two lists, as shown here:

As of February 22, 2018, there were 12,173 R packages available. The first list contains all available packages sorted by date of publication (that is, the date each package was last updated, or first published if it was never updated). The second list is sorted by name. If we just want to find relevant packages, either list will do. For example, here is a snapshot of a few lines from the first list:

The first column shows when the packages were last updated, or published if no updates were available. The second column shows the names of the packages, while the last column offers a short description of the usage for each package. We can use keywords to find the packages we want...

Finding all Python packages

To find all Python packages, we can go to https://pypi.python.org/. The following screenshot shows the top part of the website. As of February 22, 2018, there are 130,230 packages available:

To find the packages we want, just click Browse packages and use keywords. For example, after entering Data Science, we will see the following results:

From the previous screenshot, we can see three columns: the first one gives the name of the package; the second one is called Weight, which can be viewed as the popularity index; and the last one offers a short description. The related URL is https://pypi.python.org/pypi?%3Aaction=search&term=data+science&submit=search.

Finding all Julia packages

For the packages written in Julia, we can go to https://pkg.julialang.org/. As of February 22, 2018, there are 1,725 packages available, as shown here:

Again, we can use keywords to search this list. For example, if we use data as our keyword, we will find 94 locations – the first one is shown in the following screenshot:

Finding all Octave packages

At https://octave.sourceforge.io/packages.php, we can find a list of all available packages for Octave:

Again, we can search for keywords. If the word data is used, we will find 10 locations – the first few are shown here:

Task views for R

A task view is a set of R packages grouped by one or more experts around a specific topic. For example, for data visualization, we could choose the task view called Graphics. For text analysis, we could choose the NaturalLanguageProcessing task view. To find a list of all task views, we can go to the R home page at http://r-project.org. After clicking CRAN, choose a mirror server, then click Task Views on the left-hand side. The following screen will be displayed:

If we are interested in data visualization, then we can click on Graphics (see the following screenshot):

To save space, only the top part is shown. The task view gives many R-related packages around the topics of Graphic Display & Visualization. Another great benefit is installing all related packages by issuing just three lines of R code. Assume that we are interested in the task view related...

Finding manuals

For an R package, the best way to find the manual is to locate the directory where the package is installed. Here, we use the R package called rattle as an example:

> library(rattle) 
> path.package('rattle') 
[1] "C:/Users/yany/Documents/R/win-library/3.3/rattle" 

Note that different readers will definitely get different paths. Our result is shown in the following screenshot:

The PDF and HTML manuals are located under the doc subdirectory. It is a great idea to explore these subdirectories. To save space, we will not show the detailed files contained under the subdirectories. The second-best way is to go to http://r-project.org, click CRAN, choose a nearby mirror location, and click Packages on the left-hand side. Then, from one of the two lists, search for the package. After clicking on the package, we will find...

Package dependencies

There are two types of package dependencies. The first is a dependency on the version of the underlying software. For example, take the Octave package called statistics, available at https://octave.sourceforge.io/statistics. As of February 22, 2018, its version was 1.3.0 and it required Octave 4.0.0 or later, as shown in the last line of the following screenshot:

The second type of dependency is between packages. Developers of various packages use many functions embedded in other developed packages. Not only does this save time, but it also means they don't have to reinvent the wheel. From the last line of the previous screenshot, we know that this package depends on another Octave package called io.
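Both kinds of dependency can be illustrated with a small Python sketch. It is not how Octave's pkg command works internally; it only shows the idea of checking a minimum version and installing dependencies first. The PACKAGES table is hypothetical: only the facts that statistics needs io and requires Octave 4.0.0 or later come from the text, and the version requirement for io is an assumption. The standard-library graphlib module used here needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical metadata for illustration. Only "statistics depends on io" and
# "statistics requires Octave >= 4.0.0" come from the text; the requirement
# listed for io is an assumption.
PACKAGES = {
    "statistics": {"requires_octave": "4.0.0", "depends_on": ["io"]},
    "io": {"requires_octave": "4.0.0", "depends_on": []},
}


def parse_version(text):
    """Turn '4.0.0' into (4, 0, 0) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def install_order(packages):
    """Return package names so that every dependency precedes its dependant."""
    graph = {name: set(meta["depends_on"]) for name, meta in packages.items()}
    return list(TopologicalSorter(graph).static_order())


def unmet_requirements(packages, octave_version):
    """List packages whose minimum Octave version exceeds the running one."""
    current = parse_version(octave_version)
    return [
        name
        for name, meta in packages.items()
        if parse_version(meta["requires_octave"]) > current
    ]


print(install_order(PACKAGES))               # io comes before statistics
print(unmet_requirements(PACKAGES, "3.8.2"))  # both need a newer Octave
```

Comparing versions as tuples of integers avoids the trap of comparing them as strings, where "10.0.0" would sort before "4.0.0".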

In the following, we show the process of installation. First, we download the ZIP file from https://octave.sourceforge...

Package management in R

There are three ways to install an R package. The first way is to use the install.packages() function. For example, assume that we plan to install an R package called rattle. We can use the following code to do this:

> install.packages("rattle")

The second way is to click Packages on the menu bar, choose a mirror location, then find the R package from a list (see the following screenshot showing the top part of the list):

The third way to install an R package is to install it from a local ZIP file. To do so, first, manually download a ZIP file to your computer. Then click Packages on the menu bar and choose Install package(s) from local files..., as shown here:

To update a package, click Packages on the menu bar, then choose Update packages... from the drop-down menu (that is, the fifth entry in the previous screenshot). Another way to...

Package management in Python

We can use conda to install Python-related packages (see the related section later in the chapter). Some Python distributions also ship their own package manager, which makes installing packages easy. For example, if we use Canopy by Enthought, we can use its Package Manager, as shown in the following screenshot:

From this, we can find out how many packages were installed and how many are available.
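The count of installed packages can also be obtained without a graphical tool. The following is a minimal sketch, assuming Python 3.8 or later, that uses the standard-library importlib.metadata module to list what is installed for the current interpreter. The helper name installed_packages is our own and is not part of Canopy or conda.

```python
from importlib import metadata


def installed_packages():
    """Return a sorted list of (name, version) pairs for this interpreter."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata is incomplete
            found[name.lower()] = (name, dist.version)
    return [found[key] for key in sorted(found)]


packages = installed_packages()
print(len(packages), "packages installed")
for name, version in packages[:5]:
    print(name, version)
```

This only reports what is installed locally; finding out how many packages are available still requires querying a package index.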

It's quite simple to install or update a package. For example, to install a package, we simply choose one from the list. The same logic applies when we update one. To list all the names defined within a package (functions, classes, submodules, and other attributes), we can use the dir() function:

>>> import matplotlib as mat
>>> x = dir(mat)
>>> print(x)

The related screenshot is shown here:
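Because dir() returns every name, including private ones and submodules, it can help to filter the result. The sketch below is one possible way to keep only the public callables (functions and classes). It uses the standard-library json module so that it runs without matplotlib installed; the helper name public_callables is our own.

```python
import inspect
import json  # any importable module works; json ships with Python


def public_callables(module):
    """Return the sorted names of public functions and classes in a module."""
    return sorted(
        name
        for name, obj in inspect.getmembers(module)
        if not name.startswith("_") and callable(obj)
    )


print(len(dir(json)), "names in total")
print(public_callables(json))
```

The same call works for matplotlib or any other imported package, for example public_callables(mat) after import matplotlib as mat.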

Package management in Julia

To see a list of installed packages, we use the Pkg.status() function, as shown:

To save space, only the first several lines are shown. Alternatively, we can issue Pkg.installed(), which returns a dictionary mapping each installed package name to its installed version, as shown here:

To add or remove a package, we apply the Pkg.add() and Pkg.rm() functions, as shown in this example:

julia> Pkg.add("AbstractTable")
julia> Pkg.rm("AbstractTable")

To get all the latest versions, we issue the following command:

julia> Pkg.update()

Package management in Octave

We will use the Octave package called statistics as an example. First, we find the ZIP file for the package at https://octave.sourceforge.io/statistics/, as shown in the following screenshot:

Second, we change the working directory to the one containing the downloaded archive. Third, we issue pkg install followed by the file name, as shown:

>> pkg install statistics-1.3.0.tar.gz

To see what has changed from previous versions of the statistics package, we run news statistics:

To load and unload a package, we have the following code:

>> pkg load statistics
>> pkg unload statistics

As for all the other functions included in the statistics package, see https://octave.sourceforge.io/statistics/overview.html.

...

Conda – the package manager

After we launch Anaconda Prompt and issue conda help, we will see the following output:

From the previous help menu, we know that we can install, update, and uninstall a package. Usually, we can use conda to install a package. However, we may receive an error message (see the following example):

To update conda itself, we use the conda update -n base conda command, as shown here:

We could find more information about specific Python packages by using the search function, as shown here:

The following table lists several of the most used commands:

Command | Explanation
conda help | Get help about the usages of conda
conda info | Get information related to conda, such as the current version, base environment and the related websites
conda update -n base conda | Update conda itself
conda search matplotlib | Find all versions of this specific Python package...
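The commands in the table can also be issued from a Python script. The following is a minimal sketch, assuming conda may or may not be on the PATH of the machine running it. The helper names conda_command and run_conda are our own and are not part of conda.

```python
import shutil
import subprocess


def conda_command(*args):
    """Build the argument list for a conda call, e.g. conda search matplotlib."""
    return ["conda", *args]


def run_conda(*args):
    """Run conda if it is on PATH; return its text output, or None if absent."""
    executable = shutil.which("conda")
    if executable is None:
        return None
    command = [executable, *conda_command(*args)[1:]]
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


print(conda_command("update", "-n", "base", "conda"))
output = run_conda("info")
print(output if output is not None else "conda was not found on PATH")
```

Passing the arguments as a list rather than as one string avoids shell quoting problems, and checking with shutil.which first keeps the script from failing on machines without Anaconda.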

Creating a set of programs in R and Python

On numerous occasions, for a specific research topic, researchers will collect many datasets and write many programs. Why not write one big program? There are several reasons. First, we might need several steps to finish the project. Second, the project might be too complex, so we divide it into several small portions, with each researcher responsible for one or a few of them. Third, following the flow of the whole process, we might want several parts, such as one for processing data, one for running various regressions, and one for summarizing the results. Because of this, we need a way of putting all the programs together. In the following example, we will show you how to achieve this in both R and Python. For R, assume that we have the following functions:

pv_f<-function...
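For Python, one possible way to put several small programs together is to keep each in its own file and import them all from one place. The sketch below is an illustration under stated assumptions, not the book's own code: the file names pv_part.py and fv_part.py and the functions present_value and future_value are hypothetical stand-ins, written to a temporary directory so that the example is self-contained.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical program files, each holding one small function.
PROGRAMS = {
    "pv_part.py": "def present_value(fv, r, n):\n    return fv / (1 + r) ** n\n",
    "fv_part.py": "def future_value(pv, r, n):\n    return pv * (1 + r) ** n\n",
}


def load_programs(directory):
    """Import every .py file in a directory and collect its public functions."""
    sys.path.insert(0, str(directory))
    importlib.invalidate_caches()
    functions = {}
    try:
        for path in sorted(Path(directory).glob("*.py")):
            module = importlib.import_module(path.stem)
            for name, obj in vars(module).items():
                if callable(obj) and not name.startswith("_"):
                    functions[name] = obj
    finally:
        sys.path.remove(str(directory))
    return functions


with tempfile.TemporaryDirectory() as tmp:
    for filename, source in PROGRAMS.items():
        Path(tmp, filename).write_text(source)
    toolkit = load_programs(tmp)

print(sorted(toolkit))
print(round(toolkit["present_value"](100, 0.1, 1), 2))
```

In a real project the files would live in a permanent directory, and each researcher could maintain their own file without touching the others.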

Finding environmental variables

For R, we can use the Sys.getenv() function to find all environmental variables:

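To save space, only the top part is shown. Again, different users will get different results. For Python, we use the following commands, which show the module search path held in sys.path (the environment variables themselves are available through os.environ):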

import sys
sys.path

The top part of the output is shown here:

For Julia, we use ENV, a dictionary-like object, as in the following:

For Octave, we can use the getenv() function, as shown:

>> getenv('path') 
ans = C:\Octave\Octave-4.0.0\bin;C:\Program Files\Silverfrost\FTN95;C:\Perl\site\bin;C:\Perl\bin;C:\windows\system32;C:\windows;
C:\windows\System32\Wbem;C:\windows\System32\WindowsPowerShell\v1.0;C:\Program Files\Intel\OpenCL SDK\2.0\bin\x86;
C:\Program Files\Common Files\Roxio Shared\DLLShared;C:\Program Files\Common Files\Roxio Shared\10.0\DLLShared;C:\Program Files\MATLAB\R2013a\bin;
C:\Anaconda;C:\Anaconda\Scripts;C:\Program Files\Windows Kits\8.1\Windows Performance...

Summary

In this chapter, we first discussed the importance of managing packages. Then, we showed how to find all the available packages for R, Python, Julia, and Octave, how to install and update individual packages, and how to find the manuals for those packages. In addition, we explained the issue of package dependencies and how to make our programming a little easier when dealing with packages. The topic of system environment variables was touched on as well.

In Chapter 7, Optimization in Anaconda, we will discuss several topics around optimization, such as general issues for optimization problems and expressing various kinds of optimization problems (for example, LP and quadratic optimization). Several examples are offered to make our discussion more practice-oriented, such as how to choose an optimal stock portfolio and how to optimize wealth and resources to...

Review questions and exercises

  1. Why is understanding various packages important?
  2. What are package dependencies?
  3. For R, Python, Julia, and Octave, find out how many packages are available for each of them, today.
  4. How do we install a package in R, Python, and Julia?
  5. How do we update a package in R, Python, and Julia?
  6. What is the task view for R?
  7. How do we install all R packages included in a task view?
  8. After an R package is installed, how do you find its related directory? What is the usage to find its related directory? You could use the R package called healthcare as an example. Note that the package is about tools for healthcare machine learning.
  9. Find out more details about the task views related to the subject of Econometrics. Then install all related R packages. How many are there?
  10. How do we update one R package? How would we do it for Octave?
  11. How do we find the manual for...