You're reading from Julia Cookbook
In this chapter, you will learn how to visualize and present data and how to analyze the findings from the data science approach you have adopted to solve a particular problem. There are various types of visualization for displaying your findings, such as bar plots, scatter plots, and pie charts, and it is very important to choose a method that reflects your findings in a sensible and aesthetically pleasing manner.
Importance of visualizations and reporting in data science:
Visualization is the art of displaying quantitative information in a sensible, legible, and aesthetically pleasing way. It consists of plotting quantitative information in the form of various graphs as well as compiling the analyses and results in a precise and legible report.
Visualizations and reports should always be prepared in such a way that the person or group to whom they are presented can follow and appreciate them with minimal background...
Arrays are one of the fundamental data structures used in data analysis to store various types of data. They are also a quick way to store columns or dimensions in data, for statistical analysis as well as exploratory analysis through plots and visualization. Arrays are also very easy to plot, as they are simple. When a visualization is being done with two columns of a dataset, it means that the two column values are taken in the form of separate arrays and then plotted against each other, which again makes arrays very important.
To get started with this recipe, you have to install the Gadfly library. This can be done using the following command:
using Pkg  # needed on Julia 1.x, where Pkg is not loaded by default
Pkg.add("Gadfly")
Next, we import the library by its name, Gadfly. This can be done as follows:
using Gadfly
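With the library loaded, two arrays can be plotted against each other in a single call. The following is a minimal sketch rather than the book's own listing; the names xs and ys and the random data are assumptions made purely for illustration.

```julia
using Gadfly

# Hypothetical data: ten x positions and ten random y values.
xs = collect(1:10)
ys = rand(10)

# The arrays are passed through the x and y aesthetics;
# Geom.point draws one marker per (x, y) pair.
p = plot(x = xs, y = ys, Geom.point)
```

In the REPL or a notebook the returned plot is rendered automatically. Both arrays must have the same length.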
Dataframes are one of the data structures on which most analytics and machine learning implementations are built. They are among the most popular ways of representing tabular data. They are made up of several arrays and similar data structures, and they can store data of multiple types, including logical, string, and numeric data. Visualizations can therefore be drawn from one or multiple columns of the same dataframe, which makes it easy for the analyst to express the numerical information it holds.
To get started with this recipe, you have to install the Gadfly library as you did in the previous recipe.
As we will be using datasets from R packages, we also need to import the RDatasets package. This can be done simply with the using ... syntax, which we use for importing packages:
using RDatasets
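Once both packages are loaded, a dataframe can be passed directly to plot() and its columns referenced by name. The sketch below assumes the iris dataset from RDatasets and uses a box plot purely as an illustration; it is not necessarily the example this recipe goes on to use.

```julia
using Gadfly, RDatasets

# Load the iris dataset that ships with RDatasets as a dataframe.
iris = dataset("datasets", "iris")

# Take a quick look at the shape and the column names before plotting.
println(size(iris))
println(names(iris))

# Columns are referenced by name: one categorical column on the x axis
# and one numeric column on the y axis.
p = plot(iris, x = :Species, y = :PetalLength, Geom.boxplot)
```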
In data science and statistical modeling, there are several instances where an analyst needs to use several functions for both the transformation and the exploratory analytics steps. Gadfly makes it very simple to plot them, whether as separate functions or as several functions stacked in a single plot.
As we already specified, we will use the Gadfly plotting library for this recipe too. So, follow the installation steps from the previous recipes.
Let's start with a basic function plot to get familiar with the syntax. A good function to start with is the sine function, which can be invoked as sin. The function can be included directly in the plot command, along with the lower and upper limits of the x axis. The syntax is:
plot(function, lower_limit, upper_limit)
This can be done as follows:
plot(sin, 0, 30)
Similarly, if we want to plot multiple functions on a single plot, we can do just like we did in the previous...
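One way to stack several functions in a single plot is to pass them as an array in place of the single function. This is a hedged sketch; the choice of sin and cos is an assumption for illustration.

```julia
using Gadfly

# Passing an array of functions draws each one as its own curve
# over the same x range, here from 0 to 30.
p = plot([sin, cos], 0, 30)
```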
Exploratory data analytics is one of the most important processes in a data science workflow. It is simply a thorough exploration of the data to find any possible patterns that can be identified through basic statistics and the shape of the data. It is mostly done with the help of plots, as visual information is much easier to comprehend than complex statistical terms. So, in this recipe, we will go through some exploratory analytics methods with the help of plots.
The Gadfly library, which we used for our recipes, also contains most of the plots that are frequently used for exploratory data analytics. We will use the same library for this purpose too. So, to install the library, you can follow the installation steps mentioned in the previous recipes.
We will also use datasets from the RDatasets package, which contains datasets that are in the data repository of the R programming language. So, to install the RDatasets package and invoke...
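As an illustration of the kind of exploratory plot this enables, the sketch below draws a density curve for one numeric column, split by a categorical column. The iris dataset and the column choice are assumptions, not necessarily what the recipe itself uses.

```julia
using Gadfly, RDatasets

iris = dataset("datasets", "iris")

# One smoothed density curve per species shows where the values
# of each group are concentrated and how much the groups overlap.
p = plot(iris, x = :SepalLength, color = :Species, Geom.density)
```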
Line plots, as we have already seen in the preceding examples, are very effective when it comes to exploratory data analytics. They can be used both to understand correlations and look at data trends. So, by further making use of aesthetics, we can make them more interesting and informative.
We will use the Gadfly library, which we have used in the preceding recipes. So, to install the library, you can follow the installation steps mentioned in the previous recipes.
Let's start with a basic line plot of the incidence of melanoma in the respective years. This can be seen as a typical time series plot, where the x axis is a time variable and the y axis is the variable that is parameterized by time. To plot this, we simply pass the dataset to the plot() function and include the Geom.line geometry, as follows:
plot(dataset("lattice", "melanoma"), x = "Year", y = "Incidence", Geom.line)
We can also have multiple line...
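One way to draw several lines in one plot is to stack the series into long form and group them with the color aesthetic. The sketch below uses made-up sine and cosine series; the names xs, x_all, y_all, and labels are hypothetical.

```julia
using Gadfly

# Hypothetical data: two series sampled on the same grid.
xs = collect(0:0.1:10)
n = length(xs)

# Stack both series into long form and label each row with its series name.
x_all = vcat(xs, xs)
y_all = vcat(sin.(xs), cos.(xs))
labels = vcat(fill("sin", n), fill("cos", n))

# The color aesthetic groups the rows, so Geom.line draws one line per label.
p = plot(x = x_all, y = y_all, color = labels, Geom.line)
```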
Scatter plots are the most basic plots in exploratory analytics. They help the analyst get a rough idea of the data distribution and the relationship between the corresponding columns, which in turn helps identify some prominent patterns in the data.
We will use the Gadfly library, which we used in the preceding recipes. So, to install the library, you can follow the installation steps mentioned in the previous recipes.
Let's start by plotting a simple scatter plot of two iris features: the sepal length and the sepal width. This will help us identify the relationship between the two features of the flower. This can be done with a plot() call similar to the one in the preceding recipe, but with the Geom.point geometry instead of Geom.line. This can be done as follows:
plot(dataset("datasets", "iris"), x = "SepalLength", y = "SepalWidth", Geom.point)
Next, we will try to put in some aesthetics on the plot to make it more informative...
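A common way to make such a scatter plot more informative is to map a categorical column to the color aesthetic and label the axes. This is a minimal sketch under that assumption; the axis label text is illustrative.

```julia
using Gadfly, RDatasets

iris = dataset("datasets", "iris")

# Coloring the points by species separates the three groups visually;
# the Guide elements replace the default axis labels.
p = plot(iris, x = :SepalLength, y = :SepalWidth, color = :Species,
         Geom.point,
         Guide.xlabel("Sepal length"), Guide.ylabel("Sepal width"))
```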
Histograms are one of the best ways of getting a visual sense of the three main statistics of a dataset, the mean, median, and mode, since they show where the values are concentrated. Histograms also help analysts get a very clear understanding of the distribution of data. They can be drawn for numerical data and, in the form of bar counts, for categorical data as well.
We will use the Gadfly library, which we used for understanding and plotting data in the preceding recipes. So, to install the library, you can follow the installation steps mentioned in the previous recipes.
A basic histogram is a simple set of adjacent bars, which shows the distribution of a particular feature in a dataset. This can be plotted using the plot() function, with Geom.histogram as the geometry. We will use the diamonds dataset for the purpose. This can be done as follows:
plot(dataset("ggplot2", "diamonds"), x = "Price", Geom.histogram)
As with earlier plots, color aesthetics can be used to differentiate...
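A hedged sketch of a colored histogram follows: the Cut column is mapped to the color aesthetic so that each bar is split by cut quality. The bincount value of 30 is an arbitrary choice for illustration.

```julia
using Gadfly, RDatasets

diamonds = dataset("ggplot2", "diamonds")

# Each bar is subdivided by cut quality; bincount fixes the number of bins
# instead of letting Gadfly choose it automatically.
p = plot(diamonds, x = :Price, color = :Cut, Geom.histogram(bincount = 30))
```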
Now that we have gone through how to plot the most important visualizations and their customizations in the Gadfly library, we will see how to customize them even further. The Gadfly library allows the analyst to tweak and customize visualizations almost completely, so that they better fit the properties of the dataset, which makes it very flexible for our purposes.
We will use the Gadfly library, which we used in the preceding recipes. So, to install the library, you can follow the installation steps mentioned in the previous recipes.
The limits of the axes can be customized or transformed to the logarithmic scale with the Scale.x_log parameter in the plot() function. This would help in visualizing exponentially increasing data or data in different scales. We will scale the x axis in this example. This can be done as follows:
plot(x = rand(10), y = rand(10), Scale.x_log)
The minimum and maximum values in the plot or in a particular...
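One way to pin the axis range is through the minvalue and maxvalue keywords of the continuous scales. The sketch below assumes random data in the unit interval; the title text and the variable names are hypothetical.

```julia
using Gadfly

# Hypothetical data in the unit interval.
xs = rand(10)
ys = rand(10)

# Scale.x_continuous and Scale.y_continuous accept minvalue and maxvalue,
# which fix the axis range instead of fitting it to the data.
p = plot(x = xs, y = ys, Geom.point,
         Scale.x_continuous(minvalue = 0, maxvalue = 1),
         Scale.y_continuous(minvalue = 0, maxvalue = 1),
         Guide.title("Fixed axis limits"))
```

Fixing the limits this way keeps several plots comparable when they are shown side by side.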