2. Data Exploration with Jupyter
Overview
In this chapter, we'll finally get our hands on some data and work through an exploratory analysis, computing informative metrics and creating visualizations along the way. By the end of this chapter, you will be able to use the pandas Python library to load tabular data and run calculations on it, and the seaborn Python library to create visualizations.
Introduction
So far, we have taken a glance at the data science ecosystem and jumped into learning about Jupyter, the tool that we'll be using throughout this book for our coding exercises and activities. Now, we'll shift our focus away from learning about Jupyter and start actually using it for analysis.
Data visualization and exploration are important steps in the data science process. They are how you learn about your data and confirm that you understand it. Visualizations help you discover unusual records in a dataset and present those findings to others.
In addition to understanding and gaining fundamental trust in data, your analysis may lead to the discovery of patterns and insights in the data. In some cases, these patterns can prompt further research and ultimately be very beneficial to your business.
Applied knowledge of a high-level programming language such as Python or R will make datasets accessible to you, from...
Our First Analysis – the Boston Housing Dataset
The dataset we'll be looking at in this section is the so-called Boston Housing dataset. It contains US census data concerning houses in various areas around the city of Boston. Each sample corresponds to a unique area and has about a dozen measures. We should think of samples as rows and measures as columns. This data was first published in 1978 and is quite small, containing only 506 samples.
Now that we know something about the context of the dataset, let's decide on a rough plan for the exploration and analysis stages. Such a plan would normally be shaped by the questions under study. In this case, however, the goal is not to answer a specific question but to show Jupyter in action and illustrate some basic data analysis methods.
Our general approach to this analysis will be to do the following:
- Load the data into Jupyter using a pandas DataFrame
- Quantitatively understand the features
- Look for...
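
As a minimal sketch of the first two steps in the list above, the following loads a small table into a pandas DataFrame and computes summary statistics. The rows are made-up illustrative values, not real Boston Housing records, and the column names (RM, LSTAT, MEDV) are assumed here to match the commonly distributed version of the dataset.

```python
import io

import pandas as pd

# Illustrative rows only: these values are invented for this sketch and are
# NOT taken from the real Boston Housing data. The column names are an
# assumption about how the dataset is usually labeled.
csv_text = """RM,LSTAT,MEDV
6.5,5.0,24.0
6.0,9.0,21.5
7.0,4.0,34.0
5.5,12.0,18.0
"""

# Step 1: load the tabular data into a pandas DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Step 2: get a quantitative feel for the features.
print(df.shape)    # number of (rows, columns)
print(df.dtypes)   # data type of each column

# Count, mean, standard deviation, min, quartiles, and max per column.
summary = df.describe()
print(summary)

# Number of missing values in each column.
missing = df.isnull().sum()
print(missing)
```

With a real file, the call to `pd.read_csv` would take a file path in place of the in-memory text used here.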
Summary
In this chapter, we ran an exploratory analysis in a live Jupyter Notebook environment. In doing so, we used visualizations such as scatter plots, histograms, and violin plots to deepen our understanding of the data. We also performed simple predictive modeling, a topic that will be the focus of the following chapters in this book.
In the next chapter, we will discuss how to approach predictive analytics and what to consider when preparing the data for modeling. We'll use pandas to explore methods of data preprocessing, such as filling missing data, converting from categorical to numeric features, and splitting data into training and testing sets.
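
As a rough preview of those three preprocessing steps, here is a small pandas-only sketch. The table, its column names, and the choice of a median fill and an 80/20 split are all assumptions made for illustration. They are not necessarily the methods used in the next chapter.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch.
df = pd.DataFrame({
    "rooms": [6.0, None, 7.0, 5.0, 6.0],
    "zone": ["A", "B", "A", "C", "B"],
    "price": [24.0, 21.0, 34.0, 18.0, 22.0],
})

# Filling missing data: replace missing values with the column median.
df["rooms"] = df["rooms"].fillna(df["rooms"].median())

# Converting categorical to numeric features: one-hot encode "zone",
# which yields one indicator column per category.
encoded = pd.get_dummies(df, columns=["zone"])

# Splitting into training and testing sets: randomly sample 80% of the rows
# for training and keep the remaining rows for testing.
train = encoded.sample(frac=0.8, random_state=0)
test = encoded.drop(train.index)

print(encoded.columns.tolist())
print(len(train), len(test))
```

Fixing `random_state` makes the split reproducible between runs.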