Welcome to Learning pandas! In this book, we will go on a journey that will see us learning pandas, an open source data analysis library for the Python programming language. The pandas library provides high-performance and easy-to-use data structures and analysis tools built with Python. pandas brings to Python many good things from the statistical programming language R, specifically data frame objects and R packages such as plyr and reshape2, and places them in a single library that you can use from within Python.
In this first chapter, we will take the time to understand pandas and how it fits into the bigger picture of data analysis. This will give the reader who is interested in pandas a feeling for its place in that bigger picture, rather than a complete focus on the details of using pandas. The goal is that, while learning pandas, you also learn why those features exist in support of performing data analysis tasks.
So, let's jump in. In this chapter, we will cover:
- What pandas is, why it was created, and what it gives you
- How pandas relates to data analysis and data science
- The processes involved in data analysis and how it is supported by pandas
- General concepts of data and analytics
- Basic concepts of data analysis and statistical analysis
- Types of data and their applicability to pandas
- Other libraries in the Python ecosystem that you will likely use with pandas
pandas is a Python library containing high-level data structures and tools that have been created to help Python programmers to perform powerful data analysis. The ultimate purpose of pandas is to help you quickly discover information in data, with information being defined as an underlying meaning.
The development of pandas was begun in 2008 by Wes McKinney; it was open sourced in 2009. pandas is currently supported and actively developed by various organizations and contributors.
pandas was initially designed with finance in mind, specifically for its ability to manipulate time series data and process historical stock information. The processing of financial information has many challenges, the following being a few:
- Representing security data, such as a stock's price, as it changes over time
- Matching the measurement of multiple streams of data at identical times
- Determining the relationship (correlation) of two or more streams of data
- Representing times and dates as first-class entities
- Converting the period of samples of data, either up or down
To do this processing, a tool was needed that allows us to retrieve, index, clean and tidy, reshape, combine, slice, and perform various analyses on both single- and multidimensional data, including heterogeneous-typed data that is automatically aligned along a set of common index labels. This is where pandas comes in, having been created with many useful and powerful features such as the following:
- Fast and efficient Series and DataFrame objects for data manipulation with integrated indexing
- Intelligent data alignment using indexes and labels
- Integrated handling of missing data
- Facilities for converting messy data into orderly data (tidying)
- Built-in tools for reading and writing data between in-memory data structures and files, databases, and web services
- The ability to process data stored in many common formats such as CSV, Excel, HDF5, and JSON
- Flexible reshaping and pivoting of sets of data
- Smart label-based slicing, fancy indexing, and subsetting of large datasets
- Columns can be inserted and deleted from data structures for size mutability
- Aggregating or transforming data with a powerful data grouping facility to perform split-apply-combine on datasets
- High-performance merging and joining of datasets
- Hierarchical indexing facilitating working with high-dimensional data in a lower-dimensional data structure
- Extensive features for time series data, including date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting, and lagging
- Highly optimized for performance, with critical code paths written in Cython or C
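As a small taste of several of these features, the following sketch (with made-up values) shows the intelligent, label-based alignment that pandas performs automatically when combining two Series:

```python
import pandas as pd

# Two Series with overlapping, but not identical, index labels
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Arithmetic aligns on labels automatically; unmatched labels yield NaN
total = s1 + s2
print(total)  # a NaN, b 12.0, c 23.0, d NaN
```

Notice that no explicit matching of rows was needed; pandas aligned the shared labels `b` and `c` and marked the rest as missing.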
The robust feature set, combined with its seamless integration with Python and other tools within the Python ecosystem, has given pandas wide adoption in many domains. It is in use in a wide variety of academic and commercial domains, including finance, neurosciences, economics, statistics, advertising, and web analytics. It has become one of the most preferred tools for data scientists to represent data for manipulation and analysis.
Python has long been exceptional for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language such as R. This is very important: those familiar with Python, a more general-purpose programming language than R (which is more of a statistical package), gain many of R's data representation and manipulation features while remaining entirely within an incredibly rich Python ecosystem.
Combined with IPython, Jupyter notebooks, and a wide range of other libraries, the environment for performing data analysis in Python excels in performance, productivity, and the ability to collaborate, compared to many other tools. This has led to the widespread adoption of pandas by many users in many industries.
We live in a world in which massive amounts of data are produced and stored every day. This data comes from a plethora of information systems, devices, and sensors. Almost everything you do, and items you use to do it, produces data which can be, or is, captured.
This has been greatly enabled by the ubiquitous nature of services that are connected to networks, and by the great increases in data storage facilities; this, combined with the ever-decreasing cost of storage, has made it practical to capture and store even the most trivial of data.
This has led to massive amounts of data being piled up and ready for access. But this data is spread out all over cyberspace and cannot actually be referred to as information. It tends to be a collection of recorded events, whether financial transactions, your interactions with social networks, or your personal health monitor tracking your heartbeat throughout the day. This data is stored in all kinds of formats, is located in scattered places, and, beyond its raw nature, does not give much insight.
Logically, the overall process can be broken into three major areas of discipline:
- Data manipulation
- Data analysis
- Data science
These three disciplines can and do have a lot of overlap. Where each ends and the others begin is open to interpretation. For the purposes of this book we will define each as in the following sections.
Data is distributed all over the planet. It is stored in different formats. It has widely varied levels of quality. Because of this there is a need for tools and processes for pulling data together and into a form that can be used for decision making. This requires many different tasks and capabilities from a tool that manipulates data in preparation for analysis. The features needed from such a tool include:
- Programmability for reuse and sharing
- Access to data from external sources
- Storing data locally
- Indexing data for efficient retrieval
- Alignment of data in different sets based upon attributes
- Combining data in different sets
- Transformation of data into other representations
- Cleaning data from cruft
- Effective handling of bad data
- Grouping data into common baskets
- Aggregation of data of like characteristics
- Application of functions to calculate meaning or perform transformations
- Query and slicing to explore pieces of the whole
- Restructuring into other forms
- Modeling distinct categories of data such as categorical, continuous, discrete, and time series
- Resampling data to different frequencies
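A few of these capabilities, such as grouping data into common baskets and aggregating like characteristics, can be sketched in just a couple of lines of pandas; the data here is invented purely for illustration:

```python
import pandas as pd

# Invented sales records to be grouped into common baskets
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "units": [10, 20, 30, 40],
})

# split-apply-combine: split by region, sum each group, combine the results
per_region = sales.groupby("region")["units"].sum()
print(per_region)  # East 40, West 60
```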
There are many data manipulation tools in existence. Each differs in support for the items on this list, how they are deployed, and how they are utilized by their users. These tools include relational databases (SQL Server, Oracle), spreadsheets (Excel), event processing systems (such as Spark), and more generic tools such as R and pandas.
Data analysis is the process of creating meaning from data. Data with quantified meaning is often called information. Data analysis creates information from data through the creation of data models and the application of mathematics to find patterns. It often overlaps with data manipulation, and the distinction between the two is not always clear. Many data manipulation tools also contain analysis functions, and data analysis tools often provide data manipulation capabilities.
Data science is the process of using statistics and data analysis processes to create an understanding of phenomena within data. Data science usually starts with information and applies a more complex domain-based analysis to the information. These domains span many fields such as mathematics, statistics, information science, computer science, machine learning, classification, cluster analysis, data mining, databases, and visualization. Data science is multidisciplinary. Its methods of analysis are often very different from one domain to another.
pandas first and foremost excels in data manipulation. All of the needs itemized earlier will be covered using pandas in this book. Data manipulation is the core of pandas and is most of what we will focus on.
It is worth noting that pandas has a specific design goal: emphasizing data manipulation. That said, pandas does provide several features for performing data analysis. These capabilities typically revolve around descriptive statistics and functions required for finance, such as correlations.
Therefore, pandas itself is not a data science toolkit. It is more of a manipulation tool with some analysis capabilities. pandas explicitly leaves complex statistical, financial, and other types of analyses to other Python libraries, such as SciPy, NumPy, scikit-learn, and leans upon graphics libraries such as matplotlib and ggvis for data visualization.
This focus is actually a strength of pandas over other languages such as R as pandas applications are able to leverage an extensive network of robust Python frameworks already built and tested elsewhere by the Python community.
The primary goal of this book is to thoroughly teach you how to use pandas to manipulate data. But there is a secondary, and perhaps no less important, goal of showing how pandas fits into the processes that a data analyst/scientist performs in everyday life.
One description of the steps involved in the process of data analysis is given on the pandas web site:
- Munging and cleaning data
- Organization into a form suitable for communication
This small list is a good initial definition, but it fails to cover the overall scope of the process and why many features implemented in pandas were created. The following expands upon this process and sets the framework for what is to come throughout this journey.
The proposed process is one that will be referred to as The Data Process and is represented in the following diagram:
This process sets up a framework for defining logical steps that are taken in working with data. For now, let's take a quick look at each of these steps in the process and some of the tasks that you as a data analyst using pandas will perform.
The first step in any data problem is to identify what it is you want to figure out. This is referred to as ideation, of coming up with an idea of what we want to do and prove. Ideation generally relates to hypothesizing about patterns in data that can be used to make intelligent decisions.
These decisions are often within the context of a business, but also within other disciplines such as the sciences and research. The in-vogue thing right now is understanding the operations of businesses, as there are often copious amounts of money to be made in understanding data.
But what kinds of decision are we typically looking to make? The following are several questions for which answers are commonly asked:
- Why did something happen?
- Can we predict the future using historical data?
- How can I optimize operations in the future?
This list is by no means exhaustive, but it does cover a sizable percentage of the reasons why anyone undertakes these endeavors. To get answers to these questions, one must be involved with collecting and understanding data relative to the problem. This involves defining what data is going to be researched, what the benefit is of the research, how the data is going to be obtained, what the success criteria are, and how the information is going to be eventually communicated.
pandas itself does not provide tools to assist in ideation. But once you have gained understanding and skill in using pandas, you will naturally realize how it helps you formulate ideas. This is because you will be armed with a powerful tool you can use to frame many complicated hypotheses.
Once you have an idea you must then find data to try and support your hypothesis. This data can come from within your organization or from external data providers. This data normally is provided as archived data or can be provided in real-time (although pandas is not well known for being a real-time data processing tool).
Data is often very raw, even if obtained from data sources that you have created or from within your organization. Being raw means that the data can be disorganized, may be in various formats, and may be erroneous; relative to supporting your analysis, it may be incomplete and need manual augmentation.
There is a lot of free data in the world. Much data is not free, and actually costs significant amounts of money to obtain. Some is freely available through public APIs, and other data is available only by subscription. Data you pay for is often cleaner, but this is not always the case.
In either case, pandas provides a robust and easy-to-use set of tools for retrieving data from various sources and that may be in many different formats. pandas also gives us the ability to not only retrieve data, but to also provide an initial structuring of the data via pandas data structures without needing to manually create complex coding, which may be required in other tools or programming languages.
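As a minimal sketch of this retrieval and initial structuring, the following reads CSV content into a DataFrame. An in-memory buffer stands in for a real file here; in practice you would pass a path instead, and the filename mentioned in the comment is hypothetical:

```python
import io
import pandas as pd

# An in-memory buffer standing in for a file on disk; in practice you would
# pass a path instead, for example pd.read_csv("prices.csv") (a hypothetical file)
csv_data = io.StringIO("symbol,close\nAAPL,150.0\nMSFT,300.0\n")

prices = pd.read_csv(csv_data)
print(prices.dtypes)  # pandas infers a type for each column
```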
During preparation, raw data is made ready for exploration. This preparation is often a very interesting process. It is very frequently the case that data from the wild is fraught with all kinds of issues related to quality. You will likely spend a lot of time dealing with these quality issues, and often it is a very non-trivial amount of time.
Why? Well there are a number of reasons:
- The data is simply incorrect
- Parts of the dataset are missing
- Data is not represented using measurements appropriate for your analysis
- The data is in formats not convenient for your analysis
- Data is at a level of detail not appropriate for your analysis
- Not all the fields you need are available from a single source
- The representation of data differs depending upon the provider
The preparation process focuses on solving these issues. pandas provides many great facilities for preparing data, a process often referred to as tidying up data. These facilities include intelligent means of handling missing data, converting data types, performing format conversions, changing the frequency of measurements, joining data from multiple sets of data, mapping/converting symbols into shared representations, and grouping data, among many others. We will cover all of these in depth.
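A tiny sketch of tidying, using invented data, might look like the following, which addresses two of the issues above (missing values and an inconvenient data type):

```python
import numpy as np
import pandas as pd

# A messy dataset: a missing measurement and numbers stored as strings
raw = pd.DataFrame({
    "temp": [20.5, np.nan, 23.0],
    "count": ["1", "2", "3"],
})

tidy = raw.copy()
tidy["temp"] = tidy["temp"].fillna(tidy["temp"].mean())  # fill the gap with the mean
tidy["count"] = tidy["count"].astype(int)                # convert strings to integers
```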
Exploration involves being able to interactively slice and dice your data to try and make quick discoveries. Exploration can include various tasks such as:
- Examining how variables relate to each other
- Determining how the data is distributed
- Finding and excluding outliers
- Creating quick visualizations
- Quickly creating new data representations or models to feed into more permanent and detailed modeling processes
Exploration is one of the great strengths of pandas. While exploration can be performed in most programming languages, each has its own level of ceremony—how much non-exploratory effort must be performed—before actually getting to discoveries.
When used with the read-eval-print-loop (REPL) nature of IPython and/or Jupyter notebooks, pandas creates an exploratory environment that is almost free of ceremony. The expressiveness of the syntax of pandas lets you describe complex data manipulation constructs succinctly, and the result of every action you take upon your data is immediately presented for your inspection. This allows you to quickly determine the validity of the action you just took without having to recompile and completely rerun your programs.
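For instance, in an IPython or Jupyter session, each of the following expressions (over made-up data) immediately returns a result you can inspect before deciding on your next step:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0],
                   "volume": [100, 150, 120, 130]})

# Each expression returns its result immediately for inspection
print(df.head(2))         # a quick peek at the first rows
print(df["price"].max())  # a scalar answer to a quick question
print(df.describe())      # summary statistics for every numeric column
```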
In the modeling stage you formalize your discoveries found during exploration into an explicit explanation of the steps and data structures required to get to the desired meaning contained within your data. This is the model, a combination of both data structures as well as steps in code to get from the raw data to your information and conclusions.
The modeling process is iterative where, through an exploration of the data, you select the variables required to support your analysis, organize the variables for input to analytical processes, execute the model, and determine how well the model supports your original assumptions. It can include a formal modeling of the structure of the data, but can also combine techniques from various analytic domains such as (and not limited to) statistics, machine learning, and operations research.
To facilitate this, pandas provides extensive data modeling facilities. It is in this step that you will move more from exploring your data, to formalizing the data model in DataFrame objects, and ensuring the processes to create these models are succinct. Additionally, by being based in Python, you get to use its full power to create programs to automate the process from beginning to end. The models you create are executable.
From an analytic perspective, pandas provides several capabilities, most notably integrated support for descriptive statistics, which can get you to your goal for many types of problems. And because pandas is Python-based, if you need more advanced analytic capabilities, it is very easy to integrate with other parts of the extensive Python scientific environment.
The penultimate step of the process is presenting your findings to others, typically in the form of a report or presentation. You will want to create a persuasive and thorough explanation of your solution. This can often be done using various plotting tools in Python and manually creating a presentation.
Jupyter notebooks are a powerful tool in creating presentations for your analyses with pandas. These notebooks provide a means of both executing code and providing rich markdown capabilities to annotate and describe the execution at multiple points in the application. These can be used to create very effective, executable presentations that are visually rich with pieces of code, stylized text, and graphics.
An important piece of research is sharing and making your research reproducible. It is often said that if other researchers cannot reproduce your experiment and results, then you didn't prove a thing.
Fortunately, for you, by having used pandas and Python, you will be able to easily make your analysis reproducible. This can be done by sharing the Python code that drives your pandas code, as well as the data.
Jupyter notebooks also provide a convenient means of packaging both the code and its output in a form that can be easily shared with anyone else who has a Jupyter Notebook installation. And there are many free and secure sharing sites on the internet that allow you to either create or deploy your Jupyter notebooks for sharing.
Something very important to understand about data manipulation, analysis, and science is that it is an iterative process. Although there is a natural forward flow along the stages previously discussed, you will end up going forwards and backwards in the process. For instance, while in the exploration phase you may identify anomalies in the data that relate to data purity issues from the preparation stage, and need to go back and rectify those issues.
This is part of the fun of the process. You are on an adventure to solve your initial problem, all the while gaining incremental insights about the data you are working with. These insights may lead you to ask new questions, to more exact questions, or to a realization that your initial questions were not the actual questions that needed to be asked. The process is truly a journey and not necessarily the destination.
The following gives a quick mapping of the steps in the process to where you will learn about them in this book. Do not fret if the steps that are earlier in the process are in later chapters. The book will walk you through this in a logical progression for learning pandas, and you can refer back from the chapters to the relevant stage in the process.
Ideation is the creative process in data science. You need to have the idea. The fact that you are reading this book qualifies you, as you must be looking to analyze some data now or want to in the future.
Retrieval of data is primarily covered in Chapter 9, Accessing Data.
Preparation of data is primarily covered in Chapter 10, Tidying Up your Data, but it is also a common thread running through most of the chapters.
Exploration spans Chapter 3, Representing Univariate Data with the Series, through Chapter 15, Historical Stock Price Analysis, covering most of the chapters of the book. But the most focused chapters for exploration are Chapter 14, Visualization, and Chapter 15, Historical Stock Price Analysis, in both of which we begin to see the results of data analysis.
Modeling has its focus in Chapter 3, Representing Univariate Data with the pandas Series, and Chapter 4, Representing Tabular and Multivariate Data with the pandas DataFrame, and also Chapter 11, Combining, Relating, and Reshaping Data through Chapter 13, Time-Series Modelling, with a specific focus towards finance in Chapter 15, Historical Stock Price Analysis.
Presentation is the primary purpose of Chapter 14, Visualization.
Reproduction flows throughout the book, as the examples are provided as Jupyter notebooks. By working in notebooks, you are by default using a tool for reproduction and have the ability to share notebooks in various ways.
When learning pandas and data analysis you will come across many concepts in data, modeling and analysis. Let's examine several of these concepts and how they relate to pandas.
Working with data in the wild you will come across several broad categories of data that will need to be coerced into pandas data structures. They are important to understand as the tools required to work with each type vary.
pandas is inherently used for manipulating structured data, but it provides several tools to facilitate the conversion of unstructured data into a form we can manipulate.
Structured data is any type of data that is organized as fixed fields within a record or file, such as data in relational databases and spreadsheets. Structured data depends upon a data model, which is the defined organization and meaning of the data and often how the data should be processed. This includes specifying the type of the data (integer, float, string, and so on), and any restrictions on the data, such as the number of characters, maximum and minimum values, or a restriction to a certain set of values.
Structured data is the type of data that pandas is designed to utilize. As we will see first with the Series and then with the DataFrame, pandas organizes structured data into one or more columns of data, each of a single and specific data type, and then a series of zero or more rows of data.
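A minimal sketch of this organization, with invented values, might be the following; note that each column carries its own specific type:

```python
import pandas as pd

# Each column holds values of a single, specific data type
people = pd.DataFrame({
    "name": ["Ada", "Grace"],     # strings
    "age": [36, 45],              # integers (int64)
    "height_m": [1.70, 1.52],     # floats (float64)
})
print(people.dtypes)
```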
Unstructured data is data that is without any defined organization and which specifically does not break down into stringently defined columns of specific types. This can consist of many types of information such as photos and graphic images, videos, streaming sensor data, web pages, PDF files, PowerPoint presentations, emails, blog entries, wikis, and word processing documents.
While pandas does not manipulate unstructured data directly, it provides a number of facilities to extract structured data from unstructured sources. As a specific example that we will examine, pandas has tools to retrieve web pages and extract specific pieces of content into a DataFrame.
Semi-structured data fits in between structured and unstructured data. It can be considered a type of structured data, but one that lacks a strict data model. JSON is a form of semi-structured data: while well-formed JSON has a defined format, there is no specific schema that is always strictly enforced. Much of the time, the data will be in a repeatable pattern that can be easily converted into structured data types like the pandas DataFrame, but the process may need some guidance from you to specify or coerce data types.
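As a brief illustration, a small JSON document (invented here) with a repeatable pattern can be coerced into a DataFrame with just the standard library and pandas:

```python
import json
import pandas as pd

# Semi-structured JSON records: a repeatable pattern, but no enforced schema
text = '[{"city": "Oslo", "temp": 3}, {"city": "Lima", "temp": 22}]'
records = json.loads(text)

weather = pd.DataFrame(records)  # coerced into a structured table
print(weather)
```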
When modeling data in pandas, we will be modeling one or more variables and looking to find statistical meaning amongst the values or across multiple variables. This definition of a variable is not in the sense of a variable in a programming language but one of statistical variables.
A variable is any characteristic, number, or quantity that can be measured or counted. A variable is so named because the value may vary between data units in a population and may change in value over time. Stock value, age, sex, business income and expenses, country of birth, capital expenditure, class grades, eye color, and vehicle type are examples of variables.
There are several broad types of statistical variables that we will come across when using pandas:
A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values. Each of the possible values is often referred to as a level. Categorical variables in pandas are represented by Categoricals, a pandas data type which corresponds to categorical variables in statistics. Examples of categorical variables are gender, social class, blood types, country affiliations, observation time, or ratings such as Likert scales.
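A brief sketch, using an invented sample of blood types stored with the pandas category dtype:

```python
import pandas as pd

# A column of repeated labels stored using the pandas category dtype
blood = pd.Series(["A", "O", "B", "O", "A"], dtype="category")

print(blood.cat.categories)  # the levels: A, B, and O
print(blood.value_counts())  # observations per level
```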
A continuous variable is a variable that can take on infinitely many (an uncountable number of) values. Observations can take any value between a certain set of real numbers. Examples of continuous variables include height, time, and temperature. Continuous variables in pandas are represented by either float or integer types (native to Python), typically in collections that represent multiple samplings of the specific variable.
A discrete variable is a variable where the values are based on a count from a set of distinct whole values. A discrete variable cannot be a fractional value between any two values. Examples of discrete variables include the number of registered cars, number of business locations, and number of children in a family, all of which measure whole units (for example 1, 2, or 3 children). Discrete variables are normally represented in pandas by integers (or occasionally floats), again normally in collections of two or more samplings of a variable.
Time series data is a first-class entity within pandas. Time adds an important, extra dimension to samples of variables within pandas. Often variables are independent of the time they were sampled at; that is, the time at which they are sampled is not important. But in many cases it is. A time series forms a sample of a variable at specific time intervals, where the observations have a natural temporal ordering.
A stochastic model for a time series will generally reflect the fact that observations close together in time will be more closely related than observations that are further apart. Time series models will often make use of the natural one-way ordering of time so that values for a given period will be expressed as deriving in some way from past values rather than from future values.
A common scenario with pandas is financial data where a variable represents the value of a stock as it changes at regular intervals throughout the day. We often want to determine changes in the rate of change of the price at specific intervals. We may also want to correlate the price of multiple stocks across specific intervals of time.
This is such an important and robust capability in pandas that we will spend an entire chapter examining the concept.
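As a small preview (with made-up values), generating a date range and converting the period of the samples looks like this:

```python
import pandas as pd

# Six daily observations indexed by a generated date range
idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Convert the period of the samples downwards: daily -> two-day means
every_2d = ts.resample("2D").mean()
print(every_2d)  # 1.5, 3.5, 5.5
```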
In this text, we will only approach the periphery of statistics and the technical processes of data analysis. But several analytical concepts are worth noting, some of which have implementations created directly within pandas. Others will need to rely on other libraries such as SciPy, but you may also come across them while working with pandas, so an initial shout-out is valuable.
Qualitative analysis is the scientific study of data that can be observed but cannot be measured. It focuses on cataloging the qualities of data. Examples of qualitative data can be:
- The softness of your skin
- How elegantly someone runs
Quantitative analysis is the study of actual values within data, with real measurements of items presented as data. Normally, these are values such as:
pandas deals primarily with quantitative data, providing you with extensive tools for representing observations of variables. pandas does not provide for qualitative analysis, but does let you represent qualitative information.
Statistics, from a certain perspective, is the practice of studying variables, and specifically the observation of those variables. Much of statistics is based upon doing this analysis for a single variable, which is referred to as univariate analysis. Univariate analysis is the simplest form of analyzing data. It does not deal with causes or relationships and is normally used to describe or summarize data, and to find patterns in it.
Multivariate analysis is a modeling technique where there exist two or more output variables that affect the outcome of an experiment. Multivariate analysis is often related to concepts such as correlation and regression, which help us understand the relationships between multiple variables, as well as how those relationships affect the outcome.
pandas primarily provides fundamental univariate analysis capabilities. And these capabilities are generally descriptive statistics, although there is inherent support for concepts such as correlations (as they are very common in finance and other domains).
Other more complex statistics can be performed with StatsModels. Again, this is not per se a weakness of pandas, but a specific design decision to let those concepts be handled by other dedicated Python libraries.
Descriptive statistics are functions that summarize a given dataset, typically where the dataset represents a population or sample of a single variable (univariate data). They describe the dataset and form measures of a central tendency and measures of variability and dispersion.
For example, the following are descriptive statistics:
- The distribution (for example, normal, Poisson)
- The central tendency (for example, mean, median, and mode)
- The dispersion (for example, variance, standard deviation)
As we will see, the pandas Series and DataFrame objects have integrated support for a large number of descriptive statistics.
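For example, with an invented sample of scores, the three kinds of descriptive statistics listed above are each a single method call:

```python
import pandas as pd

scores = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(scores.mean())    # central tendency: 5.0
print(scores.median())  # 4.5
print(scores.std())     # dispersion: sample standard deviation, about 2.14
```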
Inferential statistics differs from descriptive statistics in that inferential statistics attempts to infer conclusions from data instead of simply summarizing it. Examples of inferential statistics include:
- Chi-square tests
These inferential techniques are generally deferred from pandas to other tools such as SciPy and StatsModels.
Stochastic models are a form of statistical modeling that includes one or more random variables, and typically makes use of time series data. The purpose of a stochastic model is to estimate the probability that an outcome will fall within a specific forecast, so as to predict conditions for different situations.
An example of stochastic modeling is the Monte Carlo simulation. The Monte Carlo simulation is often used for financial portfolio evaluation by simulating the performance of a portfolio based upon repeated simulation of the portfolio in markets that are influenced by various factors and the inherent probability distributions of the constituent stock returns.
With the DataFrame, often holding time series data, pandas gives us the fundamental data structure needed to get up and running with stochastic models. While it is possible to code your own stochastic models and analyses using pandas and Python, in many cases domain-specific libraries such as PyMC facilitate this type of modeling.
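To make this concrete, here is a minimal Monte Carlo sketch using NumPy and pandas. All of the parameters (mean return, volatility, number of paths) are hypothetical assumptions chosen for illustration, not a real portfolio model:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded for reproducibility

# Hypothetical assumptions: 5% mean annual return, 20% annual volatility,
# 252 trading days, 1,000 simulated price paths
n_days, n_paths = 252, 1000
daily_returns = rng.normal(0.05 / n_days, 0.20 / np.sqrt(n_days),
                           size=(n_days, n_paths))

# Compound the daily returns into price paths, each starting at 100
paths = pd.DataFrame(100 * (1 + daily_returns).cumprod(axis=0))

# Estimate the chance the portfolio ends the year below its starting value
prob_loss = (paths.iloc[-1] < 100).mean()
```

Each column of the DataFrame is one simulated path, so DataFrame operations such as `.iloc[-1]` let us summarize all simulations at once.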
Bayesian statistics is an approach to statistical inference derived from Bayes' theorem, a mathematical equation built from simple probability axioms. It allows an analyst to calculate any conditional probability of interest, where a conditional probability is simply the probability of event A given that event B has occurred.
In Bayesian terms, the data events have already occurred and have been collected, so their probabilities are known. Using Bayes' theorem, we can then calculate the probability of various things of interest, given (or conditional upon) this already observed data.
Bayesian modeling is beyond the scope of this book, but again the underlying data models are well handled using pandas and then actually analyzed using libraries such as PyMC.
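Although Bayesian modeling itself is out of scope, Bayes' theorem is just arithmetic: P(A|B) = P(B|A) × P(A) / P(B). The following worked example uses hypothetical numbers for a diagnostic test to show the calculation:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical example: a test for a condition affecting 1% of a population
p_a = 0.01              # P(A): prior probability of having the condition
p_b_given_a = 0.95      # P(B|A): test sensitivity (positive given condition)
p_b_given_not_a = 0.05  # P(B|not A): false positive rate

# Law of total probability: overall chance of a positive test, P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior: probability of having the condition given a positive test
p_a_given_b = p_b_given_a * p_a / p_b
```

Even with a sensitive test, the posterior here is only about 16%, because the condition is rare; this counterintuitive result is exactly the kind of insight Bayes' theorem provides.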
Correlation is one of the most common statistics and is directly built into the pandas DataFrame. A correlation is a single number that describes the degree of relationship between two variables, and specifically between two sequences of observations of those variables.
A common example of using a correlation is to determine how closely the prices of two stocks follow each other as time progresses. If their changes move closely together, the two stocks have a high correlation, and if there is no discernible pattern they are uncorrelated. This is valuable information that can be used in a number of investment strategies.
The level of correlation of two stocks can also vary slightly with the time frame of the entire dataset, as well as the interval. Fortunately, pandas has powerful capabilities for us to easily change these parameters and rerun correlations. We will look at correlations in several places later in the book.
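A minimal sketch of this built-in support, using hypothetical prices for two stocks:

```python
import pandas as pd

# Hypothetical closing prices for two stocks over five days
stock_a = pd.Series([100.0, 102.0, 101.0, 103.0, 105.0])
stock_b = pd.Series([50.0, 51.0, 50.5, 51.5, 52.5])

# Pearson correlation of the two price series (the default method)
corr = stock_a.corr(stock_b)

# Correlating daily percentage changes instead of raw prices is common,
# since price levels can correlate spuriously over time
returns_corr = stock_a.pct_change().corr(stock_b.pct_change())
```

In this contrived example, stock B is exactly half of stock A, so both correlations are 1.0; real price series will of course fall somewhere between -1 and 1.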
Regression is a statistical measure that estimates the strength of the relationship between a dependent variable and a series of other (independent) variables. It can be used to understand the relationships between variables. An example in finance would be understanding the relationship between commodity prices and the stocks of businesses dealing in those commodities.
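As an illustration of the idea, the following sketch fits an ordinary least-squares line to hypothetical commodity and stock prices using NumPy's `polyfit` (StatsModels offers far richer regression tools, as noted later in this chapter):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a commodity price (x) and a related stock price (y)
df = pd.DataFrame({
    "commodity": [10.0, 12.0, 14.0, 16.0, 18.0],
    "stock":     [25.0, 29.0, 33.0, 37.0, 41.0],
})

# Ordinary least-squares fit of a line: stock = slope * commodity + intercept
slope, intercept = np.polyfit(df["commodity"], df["stock"], deg=1)
```

The data here is constructed so that the stock is exactly 2 × commodity + 5, so the fit recovers those coefficients; in practice the residuals around the fitted line tell you how strong the relationship really is.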
pandas forms one small, but important, part of the data analysis and data science ecosystem within Python. As a reference, here are a few other important Python libraries worth noting. The list is not exhaustive, but outlines several you will likely come across.
NumPy (http://www.numpy.org/) is the cornerstone toolbox for scientific computing with Python, and is included in most distributions of modern Python. It is actually a foundational toolbox from which pandas was built, and when using pandas you will almost certainly use it frequently. NumPy provides, among other things, support for multidimensional arrays with basic operations on them and useful linear algebra functions.
The use of the array features of NumPy goes hand in hand with pandas, specifically the pandas Series object. Most of our examples will reference NumPy, but since the pandas Series so closely wraps and extends the NumPy array, we will, except for a few brief situations, not delve into the details of NumPy itself.
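The following sketch, with illustrative values, shows how closely the two interoperate: a Series can be built from a NumPy array, NumPy functions accept a Series directly, and the underlying array is always recoverable:

```python
import numpy as np
import pandas as pd

# A pandas Series wraps a NumPy array and adds a labeled index
arr = np.array([1.5, 2.5, 3.5])
s = pd.Series(arr, index=["a", "b", "c"])

# NumPy universal functions accept a Series and return one,
# preserving the index labels
squared = np.square(s)

# The underlying NumPy values are always accessible
values = s.to_numpy()
```

This round-tripping is why most NumPy idioms carry over to pandas with little or no change.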
SciPy (https://www.scipy.org/) provides a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more.
StatsModels (http://statsmodels.sourceforge.net/) is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator. Researchers across fields may find that StatsModels fully meets their needs for statistical computing and data analysis in Python. Its features include:
- Linear regression models
- Generalized linear models
- Discrete choice models
- Robust linear models
- Many models and functions for time series analysis
- Nonparametric estimators
- A collection of datasets as examples
- A wide range of statistical tests
- Input-output tools for producing tables in a number of formats (text, LaTeX, HTML) and for reading Stata files into NumPy and pandas
- Plotting functions
- Extensive unit tests to ensure correctness of results
scikit-learn (http://scikit-learn.org/) is a machine learning library built on NumPy, SciPy, and matplotlib. It offers simple and efficient tools for common tasks in data analysis such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
PyMC (https://github.com/pymc-devs/pymc) is a Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large number of problems. Along with core sampling functionality, PyMC includes methods for summarizing output, plotting, goodness of fit, and convergence diagnostics.
Python has a rich set of frameworks for data visualization. Two of the most popular are matplotlib and the newer seaborn.
Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter Notebook, web application servers, and several graphical user interface toolkits.
pandas contains very tight integration with matplotlib, including functions as part of Series and DataFrame objects that automatically call matplotlib. This does not mean that pandas is limited to just matplotlib. As we will see, this can be easily changed to others, such as ggplot (a Python port of R's ggplot2) and seaborn.
Seaborn (http://seaborn.pydata.org/introduction.html) is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for NumPy and pandas data structures and statistical routines from SciPy and StatsModels. It provides additional functionality beyond matplotlib, and also by default demonstrates a richer and more modern visual style than matplotlib.
In this chapter, we went on a tour of the how and why of pandas, data manipulation/analysis, and data science. This started with an overview of why pandas exists, what functionality it contains, and how it relates to concepts of data manipulation, analysis, and data science.
Then we covered a process for data analysis to set a framework for why certain functions exist in pandas. These include retrieving data, organizing and cleaning it up, doing exploration, and then building a formal model, presenting your findings, and being able to share and reproduce the analysis.
Next, we covered several concepts involved in data and statistical modeling. This included covering many common analysis techniques and concepts, so as to introduce you to these and make you more familiar when they are explored in more detail in subsequent chapters.
pandas is also a part of a larger Python ecosystem of libraries that are useful for data analysis and science. While this book will focus only on pandas, there are other libraries that you will come across and that were introduced so you are familiar with them when they crop up.
We are ready to begin using pandas. In the next chapter, we will ease ourselves into pandas, starting with obtaining a Python and pandas environment and an overview of Jupyter Notebooks, and then getting a quick introduction to the pandas Series and DataFrame objects before delving into them in more depth in subsequent chapters.