You're reading from Mastering pandas - Second Edition

Product type: Book
Published in: Oct 2019
Reading level: Intermediate
ISBN-13: 9781789343236
Edition: 2nd Edition
Author: Ashish Kumar

Ashish Kumar is a seasoned data science professional, a published author, and a thought leader in the field of data science and machine learning. An IIT Madras graduate and a Young India Fellow, he has around 7 years of experience in implementing and deploying data science and machine learning solutions for challenging industry problems in both hands-on and leadership roles. His core areas of expertise include Natural Language Processing, IoT analytics, R Shiny product development, and ensemble ML methods. He is fluent in Python and R and teaches a popular ML course at Simplilearn. When not crunching data, Ashish sneaks off to the next hip beach around and enjoys the company of his Kindle. He also trains and mentors data science aspirants and fledgling start-ups.

Installation of pandas and Supporting Software

Before we can start using pandas for data analysis, we need to make sure that the software is installed and the environment is in proper working order. This chapter deals with the installation of Python (if necessary), the pandas library, and all necessary dependencies for the Windows, macOS/X, and Linux platforms. The topics we address include, among other things, selecting a version of Python, installing Python, and installing pandas.

The steps outlined in the following section should work for the most part, but your mileage may vary depending upon the setup. On different operating system versions, the scripts may not always work perfectly, and the third-party software packages already in the system may sometimes conflict with the instructions provided.

The following topics will be covered in this chapter:

  • Selecting a...

Selecting a version of Python to use

This is a classic battle among Python developers: Python 2.7.x or Python 3.x, which is better? For a long time, Python 2.7.x topped the charts because it was regarded as the stable version. More than 70% of projects used Python 2.7 in 2016. This number began to fall, and by 2017 it was 63%. This shift was driven by the announcement that Python 2.7 would no longer be maintained after January 1, 2020, meaning that there would be no more bug fixes or new releases. Some libraries released after this announcement are only compatible with Python 3.x, and several businesses have started migrating to Python 3.x. Hence, as of 2018, Python 3.x is the preferred version.

For further information, please see https://wiki.python.org/moin/Python2orPython3.
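
When choosing between versions, it helps to confirm which interpreter a script is actually running under. The following is a minimal sketch using only the standard library; the helper name is_python3 is hypothetical and introduced here purely for illustration.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3.x or later."""
    return version_info[0] >= 3


# Report the version of the interpreter running this script.
print("Running Python {}.{}.{}".format(*sys.version_info[:3]))
print("Python 3.x interpreter:", is_python3())
```

The function accepts any version tuple, so the same check can be applied to a version read from a configuration file or a log.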

The main differences between Python 2.x and 3 include better Unicode...

Standalone Python installation

Here, we detail the standalone installation of Python on multiple platforms—Linux, Windows, and macOS/X. Standalone means just the IDLE IDE, interpreter, and some basic packages. Another option is to download from a distribution, which is a richer version and comes pre-installed with many utilities.

Linux

If you're using Linux, Python will most probably come pre-installed. If you're not sure, type the following at the terminal prompt:

       which python

Python is likely to be found in one of the following folders on Linux, depending on your distribution and particular installation:

  • /usr/bin/python
  • /bin/python
  • /usr/local/bin/python
  • /opt/local/bin/python
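
The same lookup can be done from within Python. The sketch below uses shutil.which from the standard library, which searches the PATH much like the which command does; the helper name locate_interpreters and the command names it searches for are assumptions made for illustration.

```python
import shutil
import sys


def locate_interpreters(names=("python", "python3")):
    """Map each command name to its full path on PATH, or None if it is absent."""
    return {name: shutil.which(name) for name in names}


# sys.executable is the interpreter running this script.
print("Current interpreter:", sys.executable)
for name, path in locate_interpreters().items():
    print(name, "->", path or "not found on PATH")
```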

You...

Installation of Python and pandas using Anaconda

After a standalone installation of Python, each library has to be installed separately. Ensuring version compatibility between newly installed libraries and their dependencies is a bit of a hassle. This is where a third-party distribution like Anaconda comes in handy. Anaconda is a widely used distribution for Python/R, designed for developing scalable data science solutions.

What is Anaconda?

Anaconda is an open source Python/R distribution, developed to seamlessly manage packages, dependencies, and environments. It is compatible with Windows, Linux, and macOS and requires 3 GB of disk space. It needs this space to download and install quite a collection...

Dependency packages for pandas

Please note that if you are using the Anaconda distribution, you don't need to install pandas separately and hence don't need to worry about installing the dependencies. It is still good to know which dependency packages pandas uses under the hood, to better understand how it functions.

At the time of writing, the latest stable version of pandas is 0.23.4. The various dependencies, along with the associated download locations, are as follows:

Package | Required | Description | Download location
NumPy: 1.9.0 or higher | Required | NumPy library for numerical operations | http://www.numpy.org/
python-dateutil 2.5.0 | Required | Date manipulation and utility library | http://labix.org/
Pytz | Required | Time zone support | http://sourceforge.net/
Setuptools 24.2.0 | Required | Packaging Python projects...
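
To see which of these dependencies are present in a given environment, the installed versions can be queried from Python. This is a minimal sketch using importlib.metadata (available in Python 3.8 and later); the helper name installed_versions is hypothetical, and the package list simply repeats the names from the table above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each distribution name to its version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(
    ["numpy", "python-dateutil", "pytz", "setuptools", "pandas"]
).items():
    print(name, "->", version or "not installed")
```

Comparing the reported versions against the minimums in the table is left to the reader, since version strings need proper parsing rather than plain string comparison.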

Review of items installed with Anaconda

Anaconda installs more than 200 packages and several IDEs. Some of the widely used packages that get installed are: NumPy, pandas, scipy, scikit-learn, matplotlib, seaborn, beautifulsoup4, nltk, and dask.

Packages that are not installed along with Anaconda can be installed manually through conda, Anaconda's package manager. Packages can also be upgraded through conda. Conda fetches packages from the Anaconda repository, which is huge and has more than 1400 packages. The following commands install and update packages through conda:

  • To install, use conda install pandas
  • To update, use conda update pandas
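
When environment setup is scripted, these commands can be assembled in Python before being handed to a process runner. The sketch below only builds and prints the command; the helper name conda_command is hypothetical, and the --yes flag (which answers conda's confirmation prompt) is added on the assumption that the command will run unattended.

```python
import shutil


def conda_command(action, package):
    """Build the argument list for `conda install` or `conda update`."""
    if action not in ("install", "update"):
        raise ValueError("action must be 'install' or 'update'")
    # --yes skips the interactive confirmation prompt.
    return ["conda", action, "--yes", package]


print(" ".join(conda_command("install", "pandas")))
# Check whether a conda executable is reachable before trying to run anything.
print("conda on PATH:", shutil.which("conda") is not None)
```

The returned list is in the form that subprocess.run expects, so it can be executed once conda is confirmed to be available.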

The following IDEs are installed with Anaconda:

  • JupyterLab
  • Jupyter Notebook
  • QTConsole
  • Spyder

The IDEs can be launched either through conda or Anaconda Navigator.

Anaconda Navigator is a GUI that lets...

Cross tooling – combining pandas awesomeness with R, Julia, H2O.ai, and Azure ML Studio

pandas can be regarded as a "wonder tool" for applications such as data manipulation, data cleaning, and handling time series data. It is fast and efficient, and it is powerful enough to handle small to intermediate datasets. The best part is that the use of pandas is not restricted to Python. There are methods that let other frameworks, such as R, Julia, Azure ML Studio, and H2O.ai, draw on the strengths of pandas. This practice of using the benefits of a superior framework from another tool is called cross-tooling and is frequently applied. One of the main reasons it exists is that it is almost impossible for one tool to have all the functionalities. Suppose one task has two sub-tasks: sub-task 1 can be done well in R while the sub-task...

Command line tricks for pandas

The command line is an important tool in the arsenal of pandas users. It can serve as an efficient and faster, though more tedious, complement to pandas. Many data operations, such as breaking a huge file into multiple chunks or cleaning a data file of unsupported characters, can be performed on the command line before feeding the data to pandas.

The head function of pandas is extremely useful for quickly assessing the data. The command line's head utility makes this even more convenient, because it previews a file without loading it into pandas:

# Get the first 10 rows
$ head myData.csv

# Get the first 5 rows
$ head -n 5 myData.csv

# Get 100 bytes of data
$ head -c 100 myData.csv
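
For comparison, a similar preview is possible from pandas itself. This is a minimal sketch that assumes a small in-memory CSV as a stand-in for myData.csv; the column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for myData.csv: a header line plus 20 data rows.
csv_text = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(20))

# nrows limits parsing to the first 5 data rows; the header line is not counted.
preview = pd.read_csv(io.StringIO(csv_text), nrows=5)
print(preview)

# DataFrame.head gives the same rows once the full file is already loaded.
full = pd.read_csv(io.StringIO(csv_text))
print(full.head(5))
```

Note the difference in counting: head -n 5 returns five lines including the header, whereas nrows=5 returns five data rows in addition to the header.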

The translate (tr) utility replaces characters. The following command converts all uppercase characters in a text file to lowercase characters:

$ cat upper.txt | tr "[:upper...
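
A comparable character-for-character replacement can be sketched in Python with str.translate. The example below covers ASCII letters only, and the sample string is made up for illustration; it is an analogue of the idea, not a reproduction of the command above.

```python
import string

# Translation table mapping A-Z to a-z, one character to one character.
UPPER_TO_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def to_lowercase(text):
    """Replace every ASCII uppercase letter with its lowercase counterpart."""
    return text.translate(UPPER_TO_LOWER)


sample = "Mastering PANDAS, Second Edition"
print(to_lowercase(sample))  # mastering pandas, second edition
```

For plain lowercasing, str.lower() is simpler and also handles non-ASCII letters; the translation table is shown because it generalizes to arbitrary character replacements.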

Options and settings for pandas

pandas lets users modify some display and formatting options.

The get_option() and set_option() functions let the user view the current setting and change it, and reset_option() restores the default:

import pandas as pd

# View the current setting
pd.get_option("display.max_rows")
# Output: 60

# Raise the limit
pd.set_option("display.max_rows", 120)
pd.get_option("display.max_rows")
# Output: 120

# Restore the default
pd.reset_option("display.max_rows")
pd.get_option("display.max_rows")
# Output: 60

The preceding commands view, set, and reset the maximum number of rows displayed when a DataFrame is printed. Some of the other useful display options are the following:

  • max_columns: Set the number of columns to be displayed.
  • chop_threshold: Float values below the limit set here will be displayed as zeros.
  • colheader_justify: Set the justification for the column header.
  • date_dayfirst: Setting to 'True' prints day first...
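
When a setting should apply only to a short stretch of code, pandas also provides the option_context context manager, which restores the previous values on exit. The following is a minimal sketch; the values 10 and 5 are arbitrary choices for illustration.

```python
import pandas as pd

# Remember the setting in force before the temporary change.
default_rows = pd.get_option("display.max_rows")

# Options are passed as alternating name/value pairs.
with pd.option_context("display.max_rows", 10, "display.max_columns", 5):
    inside_rows = pd.get_option("display.max_rows")
    inside_cols = pd.get_option("display.max_columns")
    print("Inside the block:", inside_rows, inside_cols)

# Outside the block, the earlier value is back in effect.
restored_rows = pd.get_option("display.max_rows")
print("Restored:", restored_rows == default_rows)
```

This avoids having to pair every set_option() call with a matching reset_option() call.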

Summary

Before we delve into the awesomeness of pandas, it is mission critical that we install Python and pandas correctly, choose the right IDEs, and set the right options. In this chapter, we discussed these and more. Here is a summary of key takeaways from the chapter:

  • Python 3.x is the preferred version, although some users still run version 2.7, which was long regarded as more stable and scientific-computation friendly.
  • Support and bug fixes for version 2.7 have now stopped.
  • Translating code from one version to the other is a breeze. One can also use both versions together using the virtualenv package, which comes pre-installed with Anaconda.
  • Anaconda is a popular Python distribution that comes with 700+ libraries/packages and several popular IDEs, such as Jupyter and Spyder.
  • Python code is callable from, and usable in, other tools, like R, Azure ML Studio, H2O.ai, and Julia.
  • Some of the day...