
Python Data Analysis - Third Edition

By Avinash Navlani , Armando Fandango , Ivan Idris
About this book
Data analysis enables you to generate value from small and big data by discovering new patterns and trends, and Python is one of the most popular tools for analyzing a wide variety of data. With this book, you’ll get up and running using Python for data analysis by exploring the different phases and methodologies used in data analysis and learning how to use modern libraries from the Python ecosystem to create efficient data pipelines. Starting with the essential statistical and data analysis fundamentals using Python, you’ll perform complex data analysis and modeling, data manipulation, data cleaning, and data visualization using easy-to-follow examples. You’ll then understand how to conduct time series analysis and signal processing using ARMA models. As you advance, you’ll get to grips with smart processing and data analytics using machine learning algorithms such as regression, classification, Principal Component Analysis (PCA), and clustering. In the concluding chapters, you’ll work on real-world examples to analyze textual and image data using natural language processing (NLP) and image analytics techniques, respectively. Finally, the book will demonstrate parallel computing using Dask. By the end of this data analysis book, you’ll be equipped with the skills you need to prepare data for analysis and create meaningful data visualizations for forecasting values from data.
Publication date:
February 2021
Publisher
Packt
Pages
478
ISBN
9781789955248

 
Getting Started with Python Libraries

As you already know, Python has become one of the most popular general-purpose languages and offers a complete toolkit for data science work. Python offers numerous libraries, such as NumPy, Pandas, SciPy, Scikit-Learn, Matplotlib, Seaborn, and Plotly. These libraries provide a complete ecosystem for data analysis that is used by data analysts, data scientists, and business analysts. Python also offers other advantages, such as flexibility, ease of learning, fast development, a large active community, and the ability to work on complex numeric, scientific, and research applications. All these features make it a first choice for data analysis.

In this chapter, we will focus on various data analysis processes, such as KDD, SEMMA, and CRISP-DM. After this, we will provide a comparison between data analysis and data science, as well as the roles and different skillsets for data analysts and data scientists. Finally, we will shift our focus and start installing various Python libraries, IPython, Jupyter Lab, and Jupyter Notebook. We will also look at various advanced features of Jupyter Notebooks.

In this introductory chapter, we will cover the following topics:

  • Understanding data analysis
  • The standard process of data analysis
  • The KDD process
  • SEMMA
  • CRISP-DM
  • Comparing data analysis and data science
  • The skillsets of data analysts and data scientists
  • Installing Python 3
  • Software used in this book
  • Using IPython as a shell
  • Using Jupyter Lab
  • Using Jupyter Notebooks
  • Advanced features of Jupyter Notebooks

Let's get started!

 

Understanding data analysis

The 21st century is the century of information. We are living in the age of information, which means that almost every aspect of our daily life is generating data. Business operations, government operations, and social posts are also generating huge amounts of data. This data accumulates day by day because it is continually generated by business, government, scientific, engineering, health, social, climate, and environmental activities. In all these domains of decision-making, we need a systematic, generalized, effective, and flexible system for the analytical and scientific process so that we can gain insights into the data that is being generated.

In today's smart world, data analysis offers an effective decision-making process for business and government operations. Data analysis is the activity of inspecting, pre-processing, exploring, describing, and visualizing the given dataset. The main objective of the data analysis process is to discover the required information for decision-making. Data analysis offers multiple approaches, tools, and techniques, all of which can be applied to diverse domains such as business, social science, and fundamental science.

Let's look at some of the core fundamental data analysis libraries of the Python ecosystem:

  • NumPy: This is a short form of numerical Python. It is a core scientific library in Python for handling multidimensional arrays and matrices, and it provides methods for efficient mathematical computation.
  • SciPy: This is also a powerful scientific computing library for performing scientific, mathematical, and engineering operations.
  • Pandas: This is a data exploration and manipulation library that offers tabular data structures such as DataFrames and various methods for data analysis and manipulation.
  • Scikit-learn: The name comes from "SciKit" (SciPy Toolkit). It is a machine learning library that offers a variety of supervised and unsupervised algorithms, such as regression, classification, dimensionality reduction, cluster analysis, and anomaly detection.
  • Matplotlib: This is a core data visualization library and is the base library for many other visualization libraries in Python. It offers 2D and 3D plots, graphs, charts, and figures for data exploration. It runs on top of NumPy.
  • Seaborn: This is based on Matplotlib and offers easy-to-draw, high-level, and more organized statistical plots.
  • Plotly: Plotly is a data visualization library. It offers high-quality, interactive graphs, such as scatter charts, line charts, bar charts, histograms, boxplots, heatmaps, and subplots.

Installation instructions for the required libraries and software will be provided throughout this book when they're needed. In the meantime, let's discuss various data analysis processes, such as the standard process, KDD, SEMMA, and CRISP-DM.

 

The standard process of data analysis

Data analysis refers to investigating the data, finding meaningful insights from it, and drawing conclusions. The main goal of this process is to collect, filter, clean, transform, explore, describe, and visualize the data, and then communicate the resulting insights to support decision-making. Generally, the data analysis process comprises the following phases:

  1. Collecting Data: Collect and gather data from several sources.
  2. Preprocessing Data: Filter, clean, and transform the data into the required format.
  3. Analyzing and Finding Insights: Explore, describe, and visualize the data and find insights and conclusions.
  4. Insights Interpretations: Understand the insights and find the impact each variable has on the system.
  5. Storytelling: Communicate your results in the form of a story so that a layman can understand them.
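
The five phases above can be sketched with the standard library alone. This is a minimal illustration rather than a prescribed implementation: the CSV content, the column names, and the function names (collect, preprocess, analyze, interpret) are all hypothetical and chosen only to mirror the phase names.

```python
import csv
import io
import statistics

# Hypothetical raw data standing in for a collected source (phase 1).
RAW_CSV = """region,units,price
north,10,2.5
south,,3.0
east,7,2.0
west,12,abc
north,8,2.5
"""


def collect(text):
    """Phase 1: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def preprocess(rows):
    """Phase 2: drop rows with missing or non-numeric fields, convert types."""
    clean = []
    for row in rows:
        try:
            clean.append({
                "region": row["region"],
                "units": int(row["units"]),
                "price": float(row["price"]),
            })
        except ValueError:
            continue  # skip rows that cannot be parsed
    return clean


def analyze(rows):
    """Phase 3: describe the data with a few summary statistics."""
    revenue = [r["units"] * r["price"] for r in rows]
    return {
        "rows": len(rows),
        "mean_revenue": statistics.mean(revenue),
        "max_revenue": max(revenue),
    }


def interpret(summary):
    """Phases 4 and 5: turn the numbers into a plain-language statement."""
    return (
        f"{summary['rows']} valid sales records; average revenue per record "
        f"is {summary['mean_revenue']:.2f}, with a peak of "
        f"{summary['max_revenue']:.2f}."
    )


rows = preprocess(collect(RAW_CSV))
summary = analyze(rows)
print(interpret(summary))
```

In a real project, each phase would be far richer (for example, visualization in phase 3), but the flow from raw source to a readable statement stays the same.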

We can summarize these steps of the data analysis process via the following process diagram:

In this section, we have covered the standard data analysis process, which emphasizes finding interpretable insights and converting them into a user story. In the next section, we will discuss the KDD process.

 

The KDD process

The KDD acronym stands for Knowledge Discovery from Data or Knowledge Discovery in Databases. Many people treat KDD as a synonym for data mining. Data mining is the process of discovering interesting patterns in data. The main objective of KDD is to extract or discover hidden, interesting patterns from large databases, data warehouses, and other web and information repositories. The KDD process has seven major phases:

  1. Data Cleaning: In this first phase, data is preprocessed. Here, noise is removed, missing values are handled, and outliers are detected.
  2. Data Integration: In this phase, data from different sources is combined and integrated together using data migration and ETL tools.
  3. Data Selection: In this phase, the data relevant to the analysis task is retrieved.
  4. Data Transformation: In this phase, data is engineered into the form required for analysis.
  5. Data Mining: In this phase, data mining techniques are used to discover useful and unknown patterns.
  6. Pattern Evaluation: In this phase, the extracted patterns are evaluated.
  7. Knowledge Presentation: After pattern evaluation, the extracted knowledge needs to be visualized and presented to business people for decision-making purposes.
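
As a rough illustration of several KDD phases, the following sketch cleans two small transaction sources, integrates them, mines co-occurring item pairs, and evaluates the pairs against a support threshold. The data, the function names, and the 0.5 threshold are assumptions made for this example; pair counting is only one simple stand-in for a data mining technique.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction records from two sources (illustrative data only).
store_a = [["bread", "milk"], ["bread", "butter", "milk"], ["milk", None]]
store_b = [["bread", "butter"], ["bread", "milk", "butter"], []]


def clean(transactions):
    """Data cleaning: drop missing items and empty transactions."""
    cleaned = [[item for item in t if item is not None] for t in transactions]
    return [t for t in cleaned if t]


def mine_pairs(transactions):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for t in transactions:
        # Transformation: normalize each transaction to sorted unique items.
        counts.update(combinations(sorted(set(t)), 2))
    return counts


def evaluate(counts, n_transactions, min_support=0.5):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {
        pair: count / n_transactions
        for pair, count in counts.items()
        if count / n_transactions >= min_support
    }


# Data integration: combine the cleaned sources into one dataset.
transactions = clean(store_a) + clean(store_b)
patterns = evaluate(mine_pairs(transactions), len(transactions))

# Knowledge presentation: report patterns from most to least frequent.
for pair, support in sorted(patterns.items(), key=lambda kv: -kv[1]):
    print(f"{pair[0]} + {pair[1]}: support {support:.2f}")
```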

The complete KDD process is shown in the following diagram:

KDD is an iterative process for enhancing data quality, integration, and transformation to get a more improved system. Now, let's discuss the SEMMA process.

 

SEMMA

The SEMMA acronym stands for Sample, Explore, Modify, Model, and Assess. This sequential data mining process was developed by SAS. The SEMMA process has five major phases:

  1. Sample: In this phase, we identify different databases and merge them. After this, we select the data sample that's sufficient for the modeling process.
  2. Explore: In this phase, we understand the data, discover the relationships among variables, visualize the data, and get initial interpretations.
  3. Modify: In this phase, data is prepared for modeling. This phase involves dealing with missing values, detecting outliers, transforming features, and creating new additional features.
  4. Model: In this phase, the main concern is selecting and applying different modeling techniques, such as linear and logistic regression, backpropagation networks, KNN, support vector machines, decision trees, and Random Forest.
  5. Assess: In this last phase, the predictive models that have been developed are evaluated using performance evaluation measures.
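
The five SEMMA phases can be walked through on a toy dataset. The sketch below is an assumption-laden illustration: the data follows y = 2x + 1 by construction, missing values are simply dropped, the model is a hand-written least squares line, and the helper names are hypothetical.

```python
import random
import statistics

# Hypothetical dataset: y = 2x + 1, with a few missing y values (None).
data = [(x, 2 * x + 1) for x in range(20)]
data[3] = (3, None)
data[11] = (11, None)

# Sample: draw a reproducible subset for modeling.
rng = random.Random(42)
sample = rng.sample(data, 12)

# Explore: look at a basic statistic of the observed targets.
observed = [y for _, y in sample if y is not None]
print("mean of observed y:", statistics.mean(observed))

# Modify: drop rows with a missing target (one simple way to handle them).
prepared = [(x, y) for x, y in sample if y is not None]


def fit_line(points):
    """Model: ordinary least squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Assess: average squared difference between predictions and targets."""
    errors = [(slope * x + intercept - y) ** 2 for x, y in points]
    return sum(errors) / len(errors)


slope, intercept = fit_line(prepared)

# Assess on rows that were not part of the modeling sample.
holdout = [p for p in data if p not in sample and p[1] is not None]
mse = mean_squared_error(holdout, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, holdout mse={mse:.4f}")
```

Assessing on rows outside the modeling sample keeps the evaluation separate from the fit, which is the point of the final phase.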

The following diagram shows this process:

The preceding diagram shows the steps involved in the SEMMA process. SEMMA emphasizes model building and assessment. Now, let's discuss the CRISP-DM process.

 

CRISP-DM

CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. CRISP-DM is a well-defined, well-structured, and well-proven process for machine learning, data mining, and business intelligence projects. It is a robust, flexible, cyclic, useful, and practical approach to solving business problems. The process discovers hidden valuable information or patterns from several databases. The CRISP-DM process has six major phases:

  1. Business Understanding: In this first phase, the main objective is to understand the business scenario and requirements for designing an analytical goal and initial action plan.
  2. Data Understanding: In this phase, the main objective is to understand the data and its collection process, perform data quality checks, and gain initial insights.
  3. Data Preparation: In this phase, the main objective is to prepare analytics-ready data. This involves handling missing values, outlier detection and handling, normalizing data, and feature engineering. This phase is the most time-consuming for data scientists/analysts.
  4. Modeling: This is the most exciting phase of the whole process since this is where you design the model for prediction purposes. First, the analyst needs to decide on the modeling technique and develop models based on data.
  5. Evaluation: Once the model has been developed, it's time to assess and test the model's performance on validation and test data using model evaluation measures such as MSE, RMSE, and R-Square for regression, and accuracy, precision, recall, and the F1-measure for classification.
  6. Deployment: In this final phase, the model that was chosen in the previous step will be deployed to the production environment. This requires a team effort from data scientists, software developers, DevOps experts, and business professionals.
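
The evaluation measures named in the Evaluation phase can be computed directly from their definitions. The following sketch uses plain Python; the function names and the small example vectors are hypothetical, and in practice a library such as Scikit-learn would normally provide these measures.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R-squared for paired true/predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative values only.
reg = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
clf = classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(reg)
print(clf)
```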

The following diagram shows the full cycle of the CRISP-DM process:

The standard process focuses on discovering insights and making interpretations in the form of a story, while KDD focuses on data-driven pattern discovery and visualizing this. SEMMA mainly focuses on model building tasks, while CRISP-DM focuses on business understanding and deployment. Now that we know about some of the processes surrounding data analysis, let's compare data analysis and data science to find out how they are related, as well as what makes them different from one another.

 

Comparing data analysis and data science

Data analysis is the process in which data is explored in order to discover patterns that help us make business decisions. It is one of the subdomains of data science. Data analysis methods and tools are widely utilized in several business domains by business analysts, data scientists, and researchers. Its main objective is to improve productivity and profits. Data analysis extracts and queries data from different sources, performs exploratory data analysis, visualizes data, prepares reports, and presents it to the business decision-making authorities.

On the other hand, data science is an interdisciplinary area that uses a scientific approach to extract insights from structured and unstructured data. Data science is an umbrella term that covers data analytics, data mining, machine learning, and other related domains. It is not limited to exploratory data analysis; it is also used to develop predictive models, such as stock price, weather, disease, and fraud forecasts, and recommendation systems, such as movie, book, and music recommendations.

The roles of data analysts and data scientists

A data analyst collects, filters, and processes data, and applies the required statistical concepts to capture patterns, trends, and insights from it and prepare reports for making decisions. The main objective of the data analyst is to help companies solve business problems using discovered patterns and trends. The data analyst also assesses the quality of the data and handles the issues concerning data acquisition. A data analyst should be proficient in writing SQL queries, finding patterns, using visualization tools, and using reporting tools such as Microsoft Power BI, IBM Cognos, Tableau, QlikView, Oracle BI, and more.

Data scientists are more technical and mathematical than data analysts. Data scientists are research- and academic-oriented, whereas data analysts are more application-oriented. Data scientists are expected to predict a future event, whereas data analysts extract significant insights out of data. Data scientists develop their own questions, while data analysts find answers to given questions. Finally, data scientists focus on what is going to happen, whereas data analysts focus on what has happened so far. We can summarize these two roles using the following table:

Features | Data Scientist | Data Analyst
-------- | -------------- | ------------
Background | Predict future events and scenarios based on data | Discover meaningful insights from the data
Role | Formulate questions that can profit the business | Solve the business questions to make decisions
Type of data | Work on both structured and unstructured data | Only work on structured data
Programming | Advanced programming | Basic programming
Skillset | Knowledge of statistics, machine learning algorithms, NLP, and deep learning | Knowledge of statistics, SQL, and data visualization
Tools | R, Python, SAS, Hadoop, Spark, TensorFlow, and Keras | Excel, SQL, R, Tableau, and QlikView

Now that we know what defines a data analyst and data scientist, as well as how they are different from each other, let's have a look at the various skills that you would need to become one of them.

 

The skillsets of data analysts and data scientists

A data analyst is someone who discovers insights from data and creates value out of it. This helps decision-makers understand how the business is performing. Data analysts must acquire the following skills:

  • Exploratory Data Analysis (EDA): EDA is an essential skill for data analysts. It helps with inspecting data to discover patterns, test hypotheses, and check assumptions.
  • Relational Database: Knowledge of at least one of the relational database tools, such as MySQL or PostgreSQL, is mandatory. SQL is a must for working on relational databases.
  • Visualization and BI Tools: A picture speaks more than words. Visuals have more of an impact on people and are a clear and easy option for representing insights. Visualization and BI tools such as Tableau, QlikView, MS Power BI, and IBM Cognos can help analysts visualize data and prepare reports.
  • Spreadsheet: Knowledge of MS Excel, WPS Office, LibreOffice Calc, or Google Sheets is mandatory for storing and managing data in tabular form.
  • Storytelling and Presentation Skills: The art of storytelling is another necessary skill. A data analyst should be an expert in connecting data facts to an idea or an incident and turning it into a story.
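
To give a feel for the SQL skill mentioned in the list above, here is a small sketch that runs an aggregate query from Python using the standard library's sqlite3 module. SQLite is used here only because it needs no server; the orders table and its contents are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0), ("alice", 30.0), ("carol", 200.0)],
)

# Aggregate query: total spend per customer, highest first.
totals = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, total in totals:
    print(f"{customer}: {total:.2f}")
```

The same GROUP BY and ORDER BY pattern carries over to MySQL or PostgreSQL with only the connection code changing.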

On the other hand, the primary job of a data scientist is to solve problems using data. In order to do this, they need to understand the client's requirements, their domain, and their problem space, and ensure that the client gets exactly what they want. The tasks that data scientists undertake vary from company to company. Some companies employ data analysts but give them the title of data scientist just to glorify the job designation. Some combine data analyst tasks with data engineering and call the role data scientist; others assign data scientists to machine learning-intensive tasks along with data visualization.

A data scientist has to be a jack of all trades and wear multiple hats, including those of a data analyst, statistician, mathematician, programmer, and ML or NLP engineer. Most people are not experts in all these trades, and becoming skilled enough requires a lot of effort and patience. This is why data science cannot be learned in 3 or 6 months. Learning data science is a journey. A data scientist should have a wide variety of skills, such as the following:

  • Mathematics and Statistics: Most machine learning algorithms are based on mathematics and statistics. Knowledge of mathematics helps data scientists develop custom solutions.
  • Databases: Knowledge of SQL allows data scientists to interact with the database and collect the data for prediction and recommendation.
  • Machine Learning: Knowledge of supervised machine learning techniques such as regression analysis, classification techniques, and unsupervised machine learning techniques such as cluster analysis, outlier detection, and dimensionality reduction.
  • Programming Skills: Knowledge of programming helps data scientists automate their suggested solutions. Knowledge of Python and R is recommended.
  • Storytelling and Presentation skills: Communicating the results in the form of storytelling via PowerPoint presentations.
  • Big Data Technology: Knowledge of big data platforms such as Hadoop and Spark helps data scientists develop big data solutions for large-scale enterprises.
  • Deep Learning Tools: Deep learning tools such as TensorFlow and Keras are utilized in NLP and image analytics.

Apart from these skillsets, knowledge of web scraping packages/tools for extracting data from diverse sources and web application frameworks such as Flask or Django for designing prototype solutions is also useful. That covers the skillset for data science professionals.

Now that we have covered the basics of data analysis and data science, let's dive into the basic setup needed to get started with data analysis. In the next section, we'll learn how to install Python.

 

Installing Python 3

The installer file for Python 3 can easily be downloaded from the official website (https://www.python.org/downloads/) for Windows, Linux, and Mac 32-bit or 64-bit systems. The installer can be run by double-clicking on it. The installation also includes an IDE named "IDLE" that can be used for development. We will dive deeper into each of the operating systems in the next few sections.

Python installation and setup on Windows

This book is based on the latest Python 3 version. All the code that will be used in this book is written in Python 3, so we need to install Python 3 before we can start coding. Python is an open source, distributed, and freely available language. It is also licensed for commercial use. There are many implementations of Python, including commercial implementations and distributions. In this book, we will focus on the standard Python implementation, which is guaranteed to be compatible with NumPy.

You can download Python 3.9.x from the Python official website: https://www.python.org/downloads/. Here, you can find installation files for Windows, Linux, Mac OS X, and other OS platforms. You can find instructions for installing and using Python for various operating systems at https://docs.python.org/3.7/using/index.html.

You need to have Python 3.5.x or above installed on your system. The sunset date for Python 2.7 was moved from 2015 to 2020, and at the time of writing, Python 2.7 is no longer supported or maintained by the Python community.

At the time of writing this book, we had Python 3.8.3 installed as a prerequisite on our Windows 10 virtual machine: https://www.python.org/ftp/python/3.8.3/python-3.8.3.exe.
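
A quick way to confirm that the interpreter meets the minimum version stated above is to compare sys.version_info against it. This is a small sketch; the helper name is_supported and the MIN_VERSION constant are introduced here for illustration.

```python
import sys

MIN_VERSION = (3, 5)  # minimum version stated in this chapter


def is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


print("Running Python", ".".join(str(part) for part in sys.version_info[:3]))
print("Supported:", is_supported())
```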

Python installation and setup on Linux

Installing Python on Linux is significantly easier compared to the other OSes. To install the foundational libraries, run the following command-line instruction:

$ pip3 install numpy scipy pandas matplotlib jupyter notebook

You may need to prefix the preceding command with sudo if you don't have sufficient rights on the machine that you are using.
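
After running the pip3 command, one way to check which libraries are importable is to ask Python's import machinery with importlib.util.find_spec. The sketch below only reports whether each module can be found; the list of names and the installed helper are assumptions for this example, and "notebook" is used as the import name for Jupyter Notebook.

```python
from importlib.util import find_spec

# Import names of the libraries installed by the pip3 command.
LIBRARIES = ["numpy", "scipy", "pandas", "matplotlib", "notebook"]


def installed(names):
    """Map each top-level module name to whether it can be found."""
    return {name: find_spec(name) is not None for name in names}


for name, found in installed(LIBRARIES).items():
    print(f"{name}: {'installed' if found else 'missing'}")
```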

Python installation and setup on Mac OS X with a GUI installer

Python can be installed via the installation file from the Python official website. The installer file can be downloaded from its official web page (https://www.python.org/downloads/mac-osx/) for macOS. This installer also has an IDE named "IDLE" that can be used for development.

Python installation and setup on Mac OS X with brew

For Mac systems, you can use the Homebrew package manager to install Python. Homebrew makes it easier for developers, researchers, and scientists to install the applications they need. The brew install command installs an application such as python3; Python packages such as NLTK or SpaCy are then typically installed with pip.

To install the most recent version of Python, you need to execute the following command in a Terminal:

$ brew install python3

After installation, you can confirm the version of Python you've installed by running the following command:

$ python3 --version
Python 3.7.4

You can also open the Python Shell from the command line by running the following command:

$ python3

Now that we know how to install Python on our system, let's dive into the actual tools that we will need to start data analysis.

 

Software used in this book

Let's discuss the software that will be used in this book. In this book, we are going to use the Anaconda distribution to analyze data. Before installing it, let's understand what Anaconda is.

A Python program can run on any system that has Python installed. We can write a program in Notepad and run it from the command prompt. We can also write and run Python programs in different IDEs, such as Jupyter Notebook, Spyder, and PyCharm. Anaconda is a freely available, open source distribution containing various data manipulation IDEs and several packages such as NumPy, SciPy, Pandas, Scikit-learn, and so on for data analysis purposes. Anaconda can easily be downloaded and installed, as follows:

  1. Download the installer from https://www.anaconda.com/distribution/.
  2. Select the operating system that you are using.
  3. From the Python 3.7 section, select the 32-bit or 64-bit installer option and start downloading.
  4. Run the installer by double-clicking on it.
  5. Once the installation is complete, check your program in the Start menu or search for Anaconda in the Start menu.

Anaconda also includes Anaconda Navigator, which is a desktop GUI application that can be used to launch applications such as Jupyter Notebook, Spyder, RStudio, Visual Studio Code, and JupyterLab:

Now, let's look at IPython, a shell-based computing environment for data analysis.

 

Using IPython as a shell

IPython is an interactive shell that is comparable to an interactive computing environment such as MATLAB or Mathematica. This interactive shell was created for the purpose of quick experimentation. It is a very useful tool for data professionals who are performing small experiments.

IPython shell offers the following features:

  • Easy access to system commands.
  • Easy editing of inline commands.
  • Tab completion, which helps you find commands and speed up your task.
  • Command History, which helps you view previously used commands.
  • Easily execute external Python scripts.
  • Easy debugging with the Python debugger.

Now, let's execute some commands on IPython. To start IPython, use the following command on the command line:

$ ipython3

When you run the preceding command, the following window will appear:

Now, let's understand and execute some commands that the IPython shell provides:

  • History commands: The history command is used to check the list of previously used commands. The following screenshot shows how to use the history command in IPython:
  • System commands: We can also run system commands from IPython using the exclamation mark (!). The input after the exclamation mark is treated as a system command. For example, !date will display the current date of the system, while !pwd will show the current working directory:
  • Writing functions: We can write functions as we would write them in any IDE, such as Jupyter Notebook, Python IDLE, PyCharm, or Spyder. Let's look at an example of a function:
  • Quitting the IPython shell: You can exit the IPython shell using quit(), exit(), or Ctrl + D:
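
As a small illustration of the kind of function you might type at the IPython prompt, here is a sketch in plain Python. The function name vector_sum and the sample values are hypothetical and chosen only for this example. The same code should run unchanged in IPython, a script, or a notebook:

```python
def vector_sum(first, second):
    """Add two equal-length sequences element by element."""
    if len(first) != len(second):
        raise ValueError("both sequences must have the same length")
    return [a + b for a, b in zip(first, second)]


result = vector_sum([1, 2, 3], [10, 20, 30])
print(result)  # [11, 22, 33]
```

After the function has been defined in the shell, it stays available for the rest of the session, so it can be called again with different arguments.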

In this subsection, we have looked at a few basic commands we can use on the IPython shell. Now, let's discuss how we can use the help command in the IPython shell.

Reading manual pages

In the IPython shell, we can read the documentation for available functions using the help command. It is not necessary to type the full name of a function. You can type the first few characters and then press the Tab key, and IPython will find the name you are looking for. For example, let's use NumPy's arange() function. There are two ways we can find help about functions:

  • Use the help function: Type help followed by the first few characters of the function name. Then press the Tab key, select a function using the arrow keys, and press the Enter key:
  • Use a question mark: We can also use a question mark after the name of the function. The following screenshot shows an example of this:
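
Much of the text that help and the question mark display comes from the object's docstring, so the same information is reachable from plain Python. The sketch below uses only the standard library. The helper name describe is hypothetical, and the built-in sorted() stands in for a library function:

```python
import inspect


def describe(obj):
    """Return the cleaned-up docstring of `obj`, or a short notice if it has none."""
    return inspect.getdoc(obj) or "No documentation available."


# Print the docstring of the built-in sorted() function.
print(describe(sorted))

# In an interactive session, help(sorted) displays similar text in a pager.
```

Writing docstrings for your own functions means that they show up in the same way when someone asks for help on them.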

In this subsection, we looked at the help and question mark support that's provided for module functions. We can also get help from library documentation. Let's discuss how to get documentation for data analysis in Python libraries.

Where to find help and references to Python data analysis libraries

The following table lists the documentation websites for the Python data analysis libraries we have discussed in this chapter:

Packages/Software    Documentation URL
NumPy                https://numpy.org/doc/
SciPy                https://docs.scipy.org/doc/
Pandas               https://pandas.pydata.org/docs/
Matplotlib           https://matplotlib.org/3.2.1/contents.html
Seaborn              https://seaborn.pydata.org/
Scikit-learn         https://scikit-learn.org/stable/
Anaconda             https://www.anaconda.com/distribution/

You can also find answers to various Python programming questions related to NumPy, SciPy, Pandas, Matplotlib, Seaborn, and Scikit-learn on Stack Overflow. Issues with any of these libraries can be raised in their GitHub repositories.

 

Using JupyterLab

JupyterLab is a next-generation web-based user interface for Project Jupyter. It combines tools for data analysis and machine learning development, such as a text editor, notebooks, code consoles, and terminals. It's a flexible and powerful tool that should be a part of any data analyst's toolkit:

You can install JupyterLab using conda, pip, or pipenv.

To install using conda, we can use the following command:

$ conda install -c conda-forge jupyterlab

To install using pip, we can use the following command:

$ pip install jupyterlab

To install using pipenv, we can use the following command:

$ pipenv install jupyterlab
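
Whichever installer you use, you can check from Python whether the package is present in the current environment. This is a minimal sketch that assumes Python 3.8 or later, where importlib.metadata is part of the standard library. The helper name installed_version is hypothetical:

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of `distribution`, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("jupyterlab")
if version is None:
    print("JupyterLab is not installed in this environment.")
else:
    print("JupyterLab version:", version)
```

The check only looks at the environment of the interpreter that runs it, so run it with the same Python that you used for the installation.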

In this section, we have learned how to install JupyterLab. In the next section, we will focus on Jupyter Notebooks.

 

Using Jupyter Notebooks

Jupyter Notebook is a web application that's used to create data analysis notebooks that contain code, text, figures, links, mathematical equations, and charts. Recently, the community introduced the next generation of web-based Jupyter Notebooks, called JupyterLab. You can take a look at these notebook collections at the following links:

Often, these notebooks are used as educational tools or to demonstrate Python software. We can import or export notebooks either from plain Python code or from the special notebook format. The notebooks can be run locally, or we can make them available online by running a dedicated notebook server. Certain cloud computing solutions, such as Wakari, PiCloud, and Google Colaboratory, allow you to run notebooks in the cloud.
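
To make the special notebook format more concrete: an .ipynb file is a JSON document. The sketch below wraps a string of Python source in a minimal notebook structure using only the standard library. The function name source_to_notebook and the file name in the comment are hypothetical, and the field layout reflects the version 4 notebook format as we understand it. In practice, dedicated tools such as nbformat and nbconvert handle this conversion more robustly:

```python
import json


def source_to_notebook(source):
    """Wrap Python source code in a minimal, single-cell notebook dictionary."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,  # the cell has not been run yet
                "metadata": {},
                "outputs": [],
                "source": source,
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


notebook = source_to_notebook("print('Hello, notebook')")
text = json.dumps(notebook, indent=1)
print(text)
# Saving `text` to a file such as example.ipynb (a hypothetical name) is
# expected to produce a notebook that Jupyter can open.
```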

"Jupyter" is an acronym that stands for Julia, Python, and R. Initially, the developers implemented it for these three languages, but now, it is used for various other languages, including C, C++, Scala, Perl, Go, PySpark, and Haskell:

Jupyter Notebook offers the following features:

  • It has the ability to edit code in the browser with proper indentation.
  • It has the ability to execute code from the browser.
  • It has the ability to display output in the browser.
  • It can render graphs, images, and videos in cell output.
  • It has the ability to export notebooks to PDF, HTML, Python file, and LaTeX formats.

We can also use both Python 2 and Python 3 in Jupyter Notebooks. Running the following commands in the Anaconda prompt creates a separate conda environment, with the ipykernel package installed, for each version:

# For Python 2.7
conda create -n py27 python=2.7 ipykernel

# For Python 3.5
conda create -n py35 python=3.5 ipykernel
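
Once a notebook is running on one of these kernels, a quick way to confirm which interpreter is in use is to inspect the sys module from a cell. This is a small sketch. It uses str.format rather than f-strings so that it should also run on the older Python versions named above:

```python
import sys

# Major and minor version of the interpreter that is executing this code.
major, minor = sys.version_info[:2]
print("Running Python {}.{}".format(major, minor))

# Path to the interpreter. For a conda environment, this typically points
# inside that environment's directory.
print("Interpreter: {}".format(sys.executable))
```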

Now that we know about the various tools and libraries and have installed Python, let's move on to some of the advanced features of the most commonly used tool, Jupyter Notebook.

 

Advanced features of Jupyter Notebooks

Jupyter Notebook offers various advanced features, such as keyboard shortcuts, installing other kernels, executing shell commands, and using various extensions for faster data analysis operations. Let's get started and understand these features one by one.

Keyboard shortcuts

Users can find all the shortcut commands that can be used inside Jupyter Notebook by selecting the Keyboard Shortcuts option in the Help menu or by pressing Cmd + Shift + P (Ctrl + Shift + P on Windows and Linux). The latter makes the quick select bar appear, which contains all the shortcut commands, along with a brief description of each. The bar is easy to use and is handy when you forget a shortcut:

Installing other kernels

Jupyter can run multiple kernels for different languages, and it is easy to set up an environment for a particular language in Anaconda. For example, an R kernel can be set up by using the following command in Anaconda:

$ conda install -c r r-essentials

The R kernel should then appear, as shown in the following screenshot:

Running shell commands

In Jupyter Notebook, users can run shell commands on both Unix and Windows. The shell offers a command-line interface for communicating with the operating system. To run a shell command from a cell, prefix it with ! (an exclamation mark):
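
The ! prefix is notebook syntax rather than plain Python, so it does not work in an ordinary script. When the same effect is needed in portable Python code, the standard library's subprocess module is one option. The following sketch assumes Python 3.7 or later (for the capture_output parameter) and simply asks the current interpreter for its version:

```python
import subprocess
import sys

# Ask the current Python interpreter for its version, much as
# `!python --version` would in a notebook cell.
completed = subprocess.run(
    [sys.executable, "--version"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode the output to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)
print(completed.stdout.strip() or completed.stderr.strip())
```

Passing the command as a list avoids going through the system shell, which keeps the example independent of whether it runs on Unix or Windows.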

Extensions for Notebook

Notebook extensions (or nbextensions) add more features compared to basic Jupyter Notebooks. These extensions improve the user's experience and interface. Users can easily select any of the extensions by selecting the NBextensions tab.

To install nbextension in Jupyter Notebook using conda, run the following command:

conda install -c conda-forge jupyter_nbextensions_configurator

To install nbextension in Jupyter Notebook using pip, run the following command:

pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install

If you get permission errors on macOS, just run the following command:

pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install --user

All the configurable nbextensions will be shown in a different tab, as shown in the following screenshot:

Now, let's explore a few useful features of Notebook extensions:

  • Hinterland: This provides an autocompleting menu for each keypress that's made in cells and behaves like PyCharm:
  • Table of Contents: This extension shows all the headings in the sidebar or navigation menu. It is resizable, draggable, collapsible, and dockable:

  • Execute Time: This extension shows when each cell was executed and how long the cell code took to run:
  • Spellchecker: Spellchecker checks and verifies the spellings that are written in each cell and highlights any incorrectly written words.
  • Variable Inspector: This extension keeps track of the user's workspace. It shows the names of all the variables that the user created, along with their type, size, shape, and value.
  • Slideshow: Notebook results can be communicated via Slideshow. This is a great tool for telling stories. Users can easily convert Jupyter Notebooks into slides without the use of PowerPoint. As shown in the following screenshot, Slideshow can be started using the Slideshow option in the cell toolbar of the view menu:

Jupyter Notebook also allows you to show or hide any cell in Slideshow. After adding the Slideshow option to the cell toolbar of the view menu, you can use a Slide Type drop-down list in each cell and select various options, as shown in the following screenshot:

  • Embedding PDF documents: Jupyter Notebook users can easily add PDF documents. Run the following code to embed a PDF document:
from IPython.display import IFrame
IFrame('https://arxiv.org/pdf/1811.02141.pdf', width=700, height=400)

This results in the following output:

  • Embedding YouTube videos: Jupyter Notebook users can easily add YouTube videos. Run the following code to embed a YouTube video:
from IPython.display import YouTubeVideo
YouTubeVideo('ukzFI9rgwfU', width=700, height=400)

This results in the following output:

With that, you now understand what data analysis is, the process it follows, and the roles that it involves. You have also learned how to install Python and use JupyterLab and Jupyter Notebook. You will learn more about various Python libraries and data analysis techniques in the upcoming chapters.

 

Summary

In this chapter, we have discussed various data analysis processes, including KDD, SEMMA, and CRISP-DM. We then discussed the roles and skillsets of data analysts and data scientists. After that, we installed NumPy, SciPy, Pandas, Matplotlib, IPython, Jupyter Notebook, Anaconda, and JupyterLab, all of which we will be using in this book. Instead of installing all those modules individually, you can install Anaconda, which comes with NumPy, Pandas, SciPy, and Scikit-learn built in.

Then, we got a vector addition program working and learned how NumPy offers superior performance compared to the other libraries. We explored the available documentation and online resources. In addition, we discussed JupyterLab, Jupyter Notebook, and their features.

In the next chapter, Chapter 2, NumPy and Pandas, we will take a look at NumPy and Pandas under the hood and explore some of the fundamental concepts surrounding arrays and DataFrames.

About the Authors
  • Avinash Navlani

    Avinash Navlani has over 8 years of experience working in data science and AI. Currently, he is working as a senior data scientist, improving products and services for customers by using advanced analytics, deploying big data analytical tools, creating and maintaining models, and onboarding compelling new datasets. Previously, he was a university lecturer, where he trained and educated people in data science subjects such as Python for analytics, data mining, machine learning, database management, and NoSQL. Avinash has been involved in research activities in data science and has been a keynote speaker at many conferences in India.

  • Armando Fandango

    Dr. Armando creates AI-empowered products by leveraging reinforcement learning, deep learning, and distributed computing. Armando has provided thought leadership in diverse roles at small and large enterprises, including Accenture, Nike, Sonobi, and IBM, along with advising high-tech AI-based start-ups. Armando has authored several books, including Mastering TensorFlow, TensorFlow Machine Learning Projects, and Python Data Analysis, and has published research in international journals and presented his research at conferences. Dr. Armando’s current research and product development interests lie in the areas of reinforcement learning, deep learning, edge AI, and AI in simulated and real environments (VR/XR/AR).

  • Ivan Idris

    Ivan Idris has an MSc in experimental physics. His graduation thesis had a strong emphasis on applied computer science. After graduating, he worked for several companies as a Java developer, data warehouse developer, and QA analyst. His main professional interests are business intelligence, big data, and cloud computing. Ivan Idris enjoys writing clean, testable code and interesting technical articles. Ivan Idris is the author of NumPy 1.5. Beginner's Guide and NumPy Cookbook by Packt Publishing.
