By the end of this chapter, you will be able to:
Describe scikit-learn and its main advantages
Use the scikit-learn API
Perform data preprocessing
Explain the difference between supervised and unsupervised models, as well as the importance of choosing the right algorithm for each dataset
Scikit-learn is a well-documented and easy-to-use library that facilitates the application of machine learning algorithms by using simple methods, which ultimately enables beginners to model data without the need for deep knowledge of the math behind the algorithms. Additionally, thanks to the ease of use of this library, it allows the user to implement different approximations (create different models) for a data problem. Moreover, by removing the task of coding the algorithm, scikit-learn allows teams to focus their attention on analyzing the results of the model to arrive at crucial conclusions.
Spotify, a world-leading company in the field of music streaming, uses scikit-learn because it allows them to implement multiple models for a data problem and connect them easily to their existing development. This shortens the path to a useful model and lets the company plug the result into its current app with little effort.
On the other hand, booking.com uses scikit-learn due to the wide variety of algorithms that the library offers, which allows them to fulfill the different data analysis tasks that the company relies on, such as building recommendation engines, detecting fraudulent activities, and managing the customer service team.
Considering the preceding points, this chapter begins with an explanation of scikit-learn and its main uses and advantages, and then moves on to provide a brief explanation of the scikit-learn API syntax and features. Additionally, the process to represent, visualize, and normalize data is shown. The aforementioned information will be useful to understand the different steps taken to develop a machine learning model.
Created in 2007 by David Cournapeau as part of a Google Summer of Code project, scikit-learn is an open source Python library made to facilitate the process of building models based on built-in machine learning and statistical algorithms, without the need for hard-coding. The main reasons for its popular use are its complete documentation, its easy-to-use API, and the many collaborators who work every day to improve the library.
You can find the documentation for scikit-learn at the following link: http://scikit-learn.org.
Scikit-learn is mainly used to model data, and not as much to manipulate or summarize it. It offers its users an easy-to-use, uniform API for applying different models with little learning effort and without requiring deep knowledge of the math behind them.
Some of the math topics that you need to know about to understand the models are linear algebra, probability theory, and multivariate calculus. For more information on these models, visit: https://towardsdatascience.com/the-mathematics-of-machine-learning-894f046c568.
The models available under the scikit-learn library fall into two categories: supervised and unsupervised, both of which will be explained in depth in later sections. This category classification will help to determine which model to use for a particular dataset to get the most information out of it.
Besides its main use for interpreting data to train models, scikit-learn is also used to do the following:
Perform predictions, where new data is fed to the model to predict an outcome
Carry out cross validation and performance metrics analysis to understand the results obtained from the model, and thereby improve its performance
Obtain sample datasets to test algorithms over them
Perform feature extraction to extract features from images or text data
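As a rough illustration of the uses listed above, the following sketch loads one of the sample datasets bundled with scikit-learn, fits a model, predicts outcomes, and runs cross-validation. The choice of a decision tree and of five folds is an assumption made for the example, not something prescribed by this chapter.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Obtain a sample dataset that ships with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)

# Train a model, then feed data to it to predict an outcome.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)
predictions = model.predict(X[:5])

# Cross-validation: five train/test splits, one accuracy score per split.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(predictions)
print(scores.mean())
```

Predicting on rows the model has already seen, as done here for brevity, says little about real performance; the cross-validation scores are the more informative number.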
Although scikit-learn is considered the preferred Python library for beginners in the world of machine learning, there are several large companies around the world using it, as it allows them to improve their product or services by applying the models to already existing developments. It also permits them to quickly implement tests over new ideas.
You can visit the following website to find out which companies are using scikit-learn and what they are using it for: http://scikit-learn.org/stable/testimonials/testimonials.html.
In conclusion, scikit-learn is an open source Python library that offers a uniform API for applying most machine learning tasks (both supervised and unsupervised) to data problems. Its main use is for modeling data; nevertheless, it should not be limited to that, as the library also allows users to predict outcomes based on the model being trained, as well as to analyze the performance of the model.
The following is a list of the main advantages of using scikit-learn for machine learning purposes:
Ease of use: Scikit-learn is characterized by a clean API, with a small learning curve in comparison to other libraries such as TensorFlow or Keras. The API is popular for its uniformity and straightforward approach. Users of scikit-learn do not necessarily need to understand the math behind the models.
Uniformity: Its uniform API makes it very easy to switch from model to model, as the basic syntax required for one model is the same for others.
Documentation/Tutorials: The library is completely backed up by documentation, which is effortlessly accessible and easy to understand. Additionally, it also offers step-by-step tutorials that cover all of the topics required to develop any machine learning project.
Reliability and Collaborations: As an open source library, scikit-learn benefits from the inputs of multiple collaborators who work each day to improve its performance. This participation from many experts from different contexts helps to develop not only a more complete library but also a more reliable one.
Coverage: As you scan the list of components that the library has, you will discover that it covers most machine learning tasks, ranging from supervised models such as classification and regression algorithms to unsupervised models such as clustering and dimensionality reduction. Moreover, due to its many collaborators, new models tend to be added in relatively short amounts of time.
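The uniformity point above can be sketched as follows: two unrelated algorithms are trained and scored through exactly the same method calls. The two estimators and their settings are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Different algorithms, identical fit/score syntax.
models = [LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5)]

accuracies = {}
for model in models:
    model.fit(X, y)                                     # same call for every model
    accuracies[type(model).__name__] = model.score(X, y)  # accuracy on the training data

print(accuracies)
```

Swapping in another classifier would only require changing the entries of the `models` list.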
The following is a list of the main disadvantages of using scikit-learn for machine learning purposes:
Inflexibility: Due to its ease of use, the library tends to be inflexible. This means that users do not have much liberty in parameter tuning or model architecture. This becomes an issue as beginners move to more complex projects.
Not Good for Deep Learning: As mentioned previously, the performance of the library falls short when tackling complex machine learning projects. This is especially true for deep learning, as scikit-learn does not support deep neural networks with the necessary architecture or power.
In general terms, scikit-learn is an excellent beginner's library as it requires little effort to learn its use and has many complementary materials designed to facilitate its application. Due to the contributions of several collaborators, the library stays up to date and is applicable to most current data problems.
On the other hand, it is a fairly simple library, not fit for more complex data problems such as deep learning. Likewise, it is not recommended for users who wish to take its abilities to a higher level by playing with the different parameters that are available in each model.
The main objective of machine learning is to build models by interpreting data. To do so, it is highly important to feed the data in a way that is readable by the computer. To feed data into a scikit-learn model, it must be represented as a table or matrix of the required dimension, which will be discussed in the following section.
Most tables fed into machine learning problems are two-dimensional, meaning that they contain rows and columns. Conventionally, each row represents an observation (an instance), whereas each column represents a characteristic (feature) of each observation.
The following table is a fragment of a sample dataset of scikit-learn. The purpose of the dataset is to differentiate among three types of iris plants based on their characteristics. Hence, in the table, each row embodies a plant and each column denotes the value of one feature for every plant:
From the preceding explanation, the following snapshot shows data that corresponds to a plant with sepal length of 5.1, sepal width of 3.5, petal length of 1.4, and petal width of 0.2. The plant belongs to the setosa species:
For many data problems, one of the features of your dataset will be used as a label. This means that out of all the other features, this one is the target to which the model should generalize the data. For example, in the preceding table, we might choose the species as the target feature, and so we would like the model to find patterns based on the other features to determine whether a plant belongs to the setosa species. Therefore, it is important to learn how to separate the target matrix from the features matrix.
Features Matrix: The features matrix comprises data from each instance for all features, except the target. It can be either created using a NumPy array or a Pandas DataFrame, and its dimensions are [n_i, n_f], where n_i denotes the number of instances (such as a person) and n_f denotes the number of features (such as age). Generally, the features matrix is stored in a variable named X.
Target Matrix: Unlike the features matrix, the target matrix is usually one-dimensional since it only carries one feature for all instances, meaning that its length is n_i (the number of instances). Nevertheless, there are some occasions where multiple targets are required, and so the dimensions of the matrix become [n_i, n_t], where n_t is the number of targets to consider.
Similar to the features matrix, the target matrix is usually created as a NumPy array or a Pandas series. The values of the target array may be discrete or continuous. Generally, the target matrix is stored in a variable named Y.
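To make the dimensions concrete, here is a minimal sketch with made-up data (five people, two features). The values and the name `Y_multi` are hypothetical and serve only to show the shapes described above.

```python
import numpy as np

# Features matrix: 5 instances (people) x 2 features (age, height in cm).
X = np.array([[25, 170], [32, 165], [47, 180], [51, 175], [62, 160]])

# Single target: one value per instance, so a one-dimensional array of length n_i.
Y = np.array([0, 0, 1, 1, 1])

# Two targets per instance: shape becomes [n_i, n_t].
Y_multi = np.array([[0, 1.2], [0, 0.8], [1, 3.1], [1, 2.7], [1, 4.0]])

print(X.shape, Y.shape, Y_multi.shape)
```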
All of the exercises and activities in these chapters will be primarily developed in Jupyter Notebook. It is recommended to keep a separate notebook for different assignments, unless advised otherwise. Also, to load a sample dataset, the seaborn library will be used, as it displays the data as a table. Other ways to load data will be explained in further sections.
In this exercise, we will be loading the iris dataset, and creating features and target matrices using this dataset.
For the exercises and activities within this chapter, you will need to have Python 3.6, seaborn, Jupyter, Matplotlib, and Pandas installed on your system.
Open a Jupyter Notebook to implement this exercise. In the cmd or terminal, navigate to the desired path and use the following command: jupyter notebook.
Load the iris dataset using the seaborn library. To do so, you first need to import the seaborn library, and then use the load_dataset() function, as shown in the following code:
import seaborn as sns
iris = sns.load_dataset('iris')
As we can see from the preceding code, after importing the library, an alias (sns) is assigned to make it easier to use throughout the script.
The load_dataset() function loads datasets from an online repository. The data from the dataset is stored in a variable named iris.
Create a variable, X, to store the features. Use the drop() function to include all of the features but the target, which in this case is named species. Then, print out the top 10 instances of the variable:
X = iris.drop('species', axis=1)
X.head(10)
The axis parameter in the preceding snippet denotes whether you want to drop the label from rows (axis = 0) or columns (axis = 1).
The printed output should look as follows:
Print the shape of your new variable using the X.shape command:
X.shape
(150, 4)
The first value indicates the number of instances in the dataset (150), and the second value represents the number of features (4).
Create a variable, Y, that will store the target values. There is no need to use a function for this. Use indexing to grab only the desired column. Indexing allows you to access a section of a larger element. In this case, we want to grab the column named species. Then, print out the top 10 values of the variable:
Y = iris['species']
Y.head(10)
The printed output should look as follows:
Print the shape of your new variable by using the Y.shape command:
Y.shape
(150,)
The shape should be one-dimensional with length equal to the number of instances (150).
Congratulations! You have successfully created the features and target matrices of a dataset.
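The effect of the axis parameter of drop(), used in the exercise above, can be seen on a tiny hand-made DataFrame. The column names and values here are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "target": [0, 1, 0]})

# axis=1: look for the label among the columns and drop that column.
without_column = df.drop("target", axis=1)

# axis=0: look for the label among the row index and drop that row.
without_row = df.drop(0, axis=0)

print(without_column.shape)
print(without_row.shape)
```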
Generally, the preferred way to represent data is by using two-dimensional tables, where the rows represent the number of observations, also known as instances, and the columns represent the characteristics of those instances, commonly known as features.
For data problems that require target labels, the data table needs to be partitioned into a features matrix and a target matrix. The features matrix will contain the values of all features but the target, for each instance, making it a two-dimensional matrix. On the other hand, the target matrix will only contain the value of the target feature for all entries, making it a one-dimensional matrix.
In this activity, we will attempt to load a dataset and create the features and target matrices by choosing the appropriate target feature for the objective of the study. Let's look at the following scenario: you work in the safety department of a cruise company. The company wants to include more lower-deck cabins, but it wants to be sure that the measure will not increase the number of fatalities in the case of an accident. The company has provided your team with a dataset of the Titanic passenger list to determine whether lower-deck passengers are less likely to survive. Your job is to select the target feature that most likely helps to achieve this objective.
To choose the target feature, remember that the target should be the outcome that we want the data to explain. For instance, if we want to know what features play a role in determining a plant's species, the species should be the target value.
Follow the steps below to complete this activity:
Load the titanic dataset using the seaborn library. The first couple of rows should look like this:
Select your preferred target feature for the goal of this activity.
Create both the features matrix and the target matrix. Make sure that you store the data from the features matrix in a variable, X, and the data from the target matrix in another variable, Y.
Print out the shape of each of the matrices, which should match the following values:
Features matrix: (891, 14)
Target matrix: (891,)
For the computer to be able to understand the data proficiently, it is necessary to not only feed the data in a standardized way but also make sure that the data does not contain outliers or noisy data, or even missing entries. This is important because failing to do so might result in the system making assumptions that are not true to the data. This will cause the model to train at a slower pace and to be less accurate due to misleading interpretations of data.
Moreover, data preprocessing does not end there. Models do not work the same way, and each one makes different assumptions. This means that we need to preprocess in terms of the model that is going to be used. For example, some models accept only numerical data, whereas others work with nominal and numerical data.
To achieve better results during data preprocessing, a good practice is to transform (preprocess) the data in different ways, and then test the different transformations in different models. That way, you will be able to select the right transformation for the right model.
Data that is missing information or that contains outliers or noise is considered to be messy data. Failing to perform any preprocessing to transform the data can lead to poorly created models of the data, due to the introduction of bias and information loss. Some of the issues with data that should be avoided will be explained here.
Features where a few instances have values, as well as instances where there are no values for any feature, are considered missing data. As you can see from the following image, the vertical red rectangle represents a feature with only 3 values out of 10, and the horizontal rectangle represents an instance with no values at all:
Conventionally, a feature missing more than 5 to 10% of its values is considered to be missing data, and so needs to be dealt with. On the other hand, all instances that have missing values for all features should be eliminated as they do not provide any information to the model, and, on the contrary, may end up introducing bias.
When dealing with a feature with a high absence rate, it is recommended to either eliminate it or fill it with values. The most popular ways to replace the missing values are as follows:
Mean imputation: Replacing missing values with the mean or median of the features' available values
Regression imputation: Replacing missing values with the predicted values obtained from a regression function
While mean imputation is a simpler approach to implement, it may introduce bias as it evens out all instances in that matter. On the other hand, even though the regression approach matches the data to its predicted value, it may end up overfitting the model as all values introduced follow a function.
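The difference between the two approaches can be sketched on a small, invented table in which weight is missing for the tallest person. The column names, the values, and the use of a straight-line fit via `np.polyfit` as the regression function are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 200.0],
    "weight": [50.0, 60.0, 70.0, 80.0, np.nan],
})

# Mean imputation: every missing weight gets the mean of the known weights.
mean_filled = df["weight"].fillna(df["weight"].mean())

# Regression imputation: fit weight = slope * height + intercept on complete rows,
# then fill each missing weight with the value predicted from that row's height.
known = df.dropna()
slope, intercept = np.polyfit(known["height"], known["weight"], deg=1)
predicted = slope * df["height"] + intercept
regression_filled = df["weight"].fillna(predicted)

print(mean_filled.iloc[4], regression_filled.iloc[4])
```

Mean imputation ignores the fact that this person is unusually tall, whereas the regression value follows the fitted line exactly, which is the kind of trade-off described above.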
Lastly, when the missing values are found in a text feature such as gender, the best course of action would be to either eliminate them or replace them with a class labeled uncategorized or something similar. This is mainly because it is not possible to apply either mean or regression imputation over text.
Labeling missing values with a new category (uncategorized) is mostly done when eliminating them removes an important part of the dataset, and hence is not an appropriate course of action. In this case, even though the new label may have an effect on the model depending on the rationale used to label the missing values, leaving them empty is an even worse alternative as it causes the model to make assumptions on its own.
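Both options for a text feature can be sketched in a couple of lines. The sample values are invented, and the label "uncategorized" simply follows the wording used above.

```python
import numpy as np
import pandas as pd

gender = pd.Series(["female", "male", np.nan, "female", np.nan])

# Option 1: eliminate the instances with a missing value.
dropped = gender.dropna()

# Option 2: keep every instance and label the gaps with a new category.
labeled = gender.fillna("uncategorized")

print(labeled.value_counts())
```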
To learn more on how to detect and handle missing values, feel free to visit the following page: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4.
Outliers are values that are far from the mean. This means that if the values from an attribute follow a Gaussian distribution, the outliers are located at the tails.
Outliers can be global or local. The former group represents those values that are far from the entire set of values of a feature. For example, when analyzing data from all members of a neighborhood, a global outlier would be a person who is 180 years old (as shown in the following diagram (A)). The latter, on the other hand, represents values that are far from a subgroup of values of that feature. For the same example that we saw previously, a local outlier would be a college student who is 70 years old (B), which would normally differ from other college students in that neighborhood:
Considering both examples that have been given, outliers do not evaluate whether the value is possible. While a person aged 180 years is not plausible, a 70-year-old college student might be a possibility, yet both are categorized as outliers as they can both affect the performance of the model.
A straightforward approach to detect outliers consists of visualizing the data to determine whether it follows a Gaussian distribution, and if it does, classifying as outliers those values that lie farther from the mean than a chosen threshold, typically between three and six standard deviations. Nevertheless, there is not an exact rule to determine an outlier, and the decision to select the number of standard deviations is subjective and will vary from problem to problem.
For example, if the dataset is reduced by 40% by setting three standard deviations as the parameter to rule out values, it would be appropriate to change the number of standard deviations to four.
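How much the chosen number of standard deviations matters can be seen in a small sketch. The helper `find_outliers` and the age values are hypothetical and exist only for this example.

```python
import pandas as pd

# Ages 20 to 59 plus one implausible value.
ages = pd.Series(list(range(20, 60)) + [180])

def find_outliers(values, n_std=3):
    """Return the values lying more than n_std standard deviations from the mean."""
    lower = values.mean() - n_std * values.std()
    upper = values.mean() + n_std * values.std()
    return values[(values < lower) | (values > upper)]

outliers_3 = find_outliers(ages, n_std=3)
outliers_6 = find_outliers(ages, n_std=6)

print(list(outliers_3))
print(list(outliers_6))
```

With a threshold of three standard deviations the extreme age is flagged, while with six it is not, partly because the extreme value itself inflates the standard deviation of such a small sample.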
On the other hand, when dealing with text features, detecting outliers becomes even trickier as there are no standard deviations to use. In this case, counting the occurrences of each class value would help to determine whether a certain class is indispensable or not. For instance, in clothing sizes, having a size XXS that represents less than 5% of the entire dataset might not be necessary.
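Counting the share of each class is straightforward with pandas. In this sketch the clothing sizes are invented, and the 5% cut-off is taken from the example above rather than being a fixed rule.

```python
import pandas as pd

sizes = pd.Series(["S"] * 30 + ["M"] * 40 + ["L"] * 27 + ["XXS"] * 3)

# Fraction of the dataset that each class represents.
share = sizes.value_counts(normalize=True)

# Classes that fall below the chosen cut-off.
rare = share[share < 0.05].index.tolist()

print(share)
print(rare)
```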
Once the outliers are detected, there are three common ways to handle them:
Delete the outlier: For outliers that are true values, it is best to completely delete them to avoid skewing the analysis. This may also be a good idea for outliers that are mistakes, if the number of outliers is too large to analyze each one and assign it a new value.
Define a top: Defining a top might also be useful for true values. For instance, if you realize that all values above a certain threshold behave the same way, you can consider topping that value with the threshold.
Assign a new value: If the outlier is clearly a mistake, you can assign a new value using one of the techniques that we discussed for missing values (mean or regression imputation).
The decision to use each of the preceding approaches depends on the outlier type and number. Most of the time, if the number of outliers represents a small proportion of the total size of the dataset, there is no point in treating the outlier in any way other than deleting it.
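The three approaches can be sketched side by side on a small Series. The data and the cut-off of 100 are hypothetical; in practice the threshold would come from the detection step.

```python
import pandas as pd

ages = pd.Series([22.0, 35.0, 41.0, 29.0, 180.0, 38.0])
threshold = 100  # hypothetical cut-off above which a value counts as an outlier

# 1. Delete the outlier: keep only the values within the threshold.
deleted = ages[ages <= threshold]

# 2. Define a top: cap every value above the threshold at the threshold.
topped = ages.clip(upper=threshold)

# 3. Assign a new value: replace outliers with the mean of the remaining values.
replaced = ages.where(ages <= threshold, deleted.mean())

print(len(deleted), topped.max(), replaced.iloc[4])
```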
Noisy data corresponds to values that are not correct or possible. This includes numerical (outliers that are mistakes) and nominal values (for example, a person's gender misspelled as "fimale"). Like outliers, noisy data can be treated by deleting the values completely or by assigning them a new value.
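For nominal noise such as misspellings, assigning a new value often amounts to mapping each wrong spelling to the intended class. The `corrections` dictionary below is a hypothetical example that would have to be built by inspecting the actual data.

```python
import pandas as pd

gender = pd.Series(["female", "male", "fimale", "male", "femal"])

# Inspect the distinct values first to spot spellings that should not exist.
print(gender.unique())

# Map each noisy spelling to the class it was meant to be.
corrections = {"fimale": "female", "femal": "female"}
cleaned = gender.replace(corrections)

print(cleaned.unique())
```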
In this exercise, we will be using the titanic dataset as an example to demonstrate how to deal with messy data:
Open a Jupyter Notebook to implement this exercise.
Load the titanic dataset and store it in a variable called titanic. Use the following code:
import seaborn as sns
titanic = sns.load_dataset('titanic')
Next, create a variable called age to store the values of that feature from the dataset. Print out the top 10 values of the age variable:
age = titanic['age']
age.head(10)
The output will appear as follows:
As you can see, the feature has NaN (Not a Number) values, which represent missing values.
Check the shape of the age variable. Then, count the number of NaN values to determine how to handle them. Use the isnull() function to find the NaN values, and use the sum() function to sum them all:
age.shape
(891,)
age.isnull().sum()
177
The NaN values make up 19.87% of the variable (177 of 891 values). Although this is not high enough to consider removing the entire feature, it is well above the 5 to 10% threshold mentioned earlier, so the missing values need to be handled.
Let's choose the mean imputation methodology to replace the missing values. To do so, compute the mean of the available values. Use the following code:
mean = age.mean()
mean = mean.round()
mean
The mean comes to be 30.
Replace all missing values with the mean. Use the fillna() function. To check that the values have been replaced, print the first ten values again:
age = age.fillna(mean)
age.head(10)
The printed output is shown below:
As you can see in the preceding screenshot, the age of the instance with index 5 has changed from NaN to 30, which is the mean that was calculated previously. The same procedure occurs for all 177 NaN values.
Import Matplotlib and graph a histogram of the age variable. Use Matplotlib's hist() function. To do so, type in the following code:
import matplotlib.pyplot as plt
plt.hist(age)
plt.show()
The histogram should look like it does in the following diagram, and as we can see, its distribution is Gaussian-like:
Discover the outliers in the data. Let's use three standard deviations as the measure to calculate the min and max values.
As discussed previously, the min value is determined by calculating the mean of all of the values and subtracting three standard deviations from it. Use the following code to set the min value and store it in a variable named min_val:
min_val = age.mean() - (3 * age.std())
min_val
The min value comes to be around −9.248. According to the min value, there are no outliers at the left tail of the Gaussian distribution. This makes sense, given that the distribution is tilted slightly to the left.
Opposite to the min value, for the max value, the standard deviations are added to the mean to calculate the higher threshold. Calculate the max value, as shown in the following code, and store it in a variable named max_val:
max_val = age.mean() + (3 * age.std())
max_val
The max value, which comes to around 68.766, determines that instances with ages above 68.76 years represent outliers. As you can see in the preceding diagram, this also makes sense as there are few instances over that threshold and they are in fact far away from the bell of the Gaussian distribution.
Count the number of instances that are above the max value to decide how to handle them.
First, using indexing, call the values in age that are above the max value, and store them in a variable called outliers. Then, count the outliers using count():
outliers = age[age > max_val]
outliers.count()
The output shows us that there are seven outliers. Print out the outliers by typing in outliers and check that the correct values were stored:
As the number of outliers is small, and they correspond to true outliers, they can be deleted.
For this exercise, we will be deleting the instances from the age variable to understand the complete procedure of dealing with outliers. However, later, the deletion of outliers will be handled in consideration of all features, in order to delete the entire instance, and not just the age values.
Redefine the value stored in age by using indexing to include only values below the max threshold. Then, print the shape of age:
age = age[age <= max_val]
age.shape
(884,)
As you can see, the shape of age has been reduced by seven, which was the number of outliers.
Congratulations! You have successfully cleaned out a Pandas Series. This process serves as a guide for cleaning a dataset later on.
To summarize, we have discussed the importance of preprocessing data, as failing to do so may introduce bias in the model, which affects the training time of the model and its performance. Some of the main forms of messy data are missing values, outliers, and noise.
Missing values, as their name suggests, are those values that are left empty or null. When dealing with many missing values, it is important to handle them by deletion or by assigning new values. Two ways to assign new values were also discussed: mean imputation and regression imputation.
Outliers are values that fall far from the mean of all the values of a feature. One way to detect outliers is by selecting all the values that fall outside the range defined by the mean minus and plus a chosen number of standard deviations, typically between three and six. Outliers may be mistakes (values that are not possible) or true values, and they should be handled differently. While true outliers may be deleted or topped, mistakes should be replaced with other values when possible.
Finally, noisy data corresponds to values that are, regardless of their proximity to the mean, mistakes or typos in the data. They can be of numeric, ordinal, or nominal types.
Categorical features are those that comprise discrete values typically belonging to a finite set of categories. Categorical data can be nominal or ordinal. Nominal refers to categories that do not follow a specific order, such as music genre or city names, whereas ordinal refers to categories with a sense of order, such as clothing sizes or level of education.
Even though improvements in many machine learning algorithms have enabled the algorithms to understand categorical data types such as text, the process of transforming them into numeric values facilitates the training process of the model, which results in faster running times and better performance. This is mainly due to the elimination of semantics available in each category, as well as the fact that the conversion into numeric values allows you to scale all of the features of the dataset equally, as explained previously.
How does it work? Feature engineering generates a label encoding that assigns a numeric value to each category; this value will then replace the category in the dataset. For example, a variable called genre with the classes pop, rock, and country can be converted as follows:
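One way to picture such a conversion is the sketch below, which encodes the genre example both by hand and with scikit-learn's LabelEncoder. The particular numbers in `mapping` are an arbitrary choice for illustration; LabelEncoder, for its part, numbers the classes in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

genre = pd.Series(["pop", "rock", "country", "rock", "pop"])

# Manual label encoding: choose a number for each category and substitute it.
mapping = {"pop": 0, "rock": 1, "country": 2}
manual = genre.map(mapping)

# Automatic label encoding: categories are numbered in alphabetical order.
enc = LabelEncoder()
encoded = enc.fit_transform(genre)

print(list(manual))
print(list(enc.classes_), list(encoded))
```

Either way, each category is replaced by a single integer; only the assignment of integers to categories differs.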
In this exercise, we will be converting the text data within the embark_town feature of the titanic dataset into numerical data. Follow these steps:
Use the same Jupyter Notebook that you created for the last exercise.
Import scikit-learn's LabelEncoder() class, as well as the Pandas library. Use the following code:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
Create a variable called em_town and store the information of that feature from the titanic dataset that was imported in the previous exercise. Print the top 10 values from the new variable:
em_town = titanic['embark_town']
em_town.head(10)
The output looks as follows:
As you can see, the variable contains text data.
Convert the text data into numeric values. Use the class that was imported previously (LabelEncoder):
enc = LabelEncoder()
new_label = pd.Series(enc.fit_transform(em_town.astype('str')))
First of all, initialize the class by typing in the first line of code. Second, create a new variable called new_label and use the built-in method fit_transform() from the class, which will assign a numeric value to each category and output the result. We use the pd.Series() function to convert the output from the label encoder into a Pandas Series. Print out the top 10 values of the new variable:
new_label.head(10)
As you can see, the text categories of the variable have been converted into numeric values.
Congratulations! You have successfully converted text data into numeric values.
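One detail of the conversion used in the exercise is worth checking: calling astype('str') turns any missing value into the literal string 'nan', which the encoder then treats as a category of its own. The sketch below shows this on a small invented Series rather than on the titanic data, and also recovers the mapping between categories and numbers.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

towns = pd.Series(["Southampton", "Cherbourg", np.nan, "Queenstown", "Southampton"])

enc = LabelEncoder()
codes = enc.fit_transform(towns.astype("str"))  # the NaN becomes the string 'nan'

# classes_ lists the categories in the order of their numeric codes.
mapping = dict(zip(enc.classes_, range(len(enc.classes_))))

# inverse_transform converts the numbers back into the original labels.
recovered = enc.inverse_transform(codes)

print(mapping)
print(list(codes))
```

If a separate category for missing values is not wanted, the missing entries can be handled first using one of the approaches discussed for messy data.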
While improvements in machine learning have made dealing with text features easier for some algorithms, it is best to convert them into numeric values. This is mainly important as it eliminates the complexity of dealing with semantics, not to mention that it gives the flexibility to change from model to model, without any limitations.
Text data conversion is done via feature engineering, where every text category is assigned a numeric value that replaces it. Furthermore, even though this can be done manually, there are powerful built-in classes and methods that facilitate this process. One example of this is the use of scikit-learn's LabelEncoder class.
Why is it important to rescale data? Because even though the data may be fed to a model using different scales for each feature, the lack of homogeneity can cause the algorithm to lose its ability to discover patterns from the data due to the assumptions it has to make to understand it, thereby slowing down the training process and negatively affecting the model's performance.
Data rescaling helps the model run faster, without any burden or responsibility to learn from the invariance present in the dataset. Moreover, a model trained over equally scaled data assigns the same weights to all parameters, which allows the algorithm to generalize to all features and not just to those with higher values, irrespective of their meaning.
An example of a dataset with different scales is one that contains different features, one measured in kilograms, another measuring temperature, and another counting the number of children. Even though the values of each attribute are true, the scale of each one of them highly differs from that of the other. For example, while the values in kilograms can go higher than 100, the children count will typically not go further than 10.
Two of the most popular ways to rescale data are data normalization and data standardization. There is no rule on selecting the methodology to transform data to scale it, as all datasets behave differently. The best practice is to transform the data using two or three rescaling methodologies and test the algorithms in each one of them in order to choose the one that best fits the data based on the performance.
Rescaling methodologies are to be used individually. When testing different rescaling methodologies, the transformation of data should be done independently. Each transformation can be tested over a model, and the best suited one should be chosen for further steps.
Normalization: Data normalization in machine learning consists of rescaling the values of all features such that they lie in a range between 0 and 1. This serves the purpose of equating attributes of different scales.
The following equation allows you to normalize the values of a feature:
$$z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$
Here, $z_i$ corresponds to the ith normalized value, $x_i$ to the ith original value, and $x$ represents all values.
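Standardization: This is a rescaling technique that transforms the data so that it has a mean equal to 0 and a standard deviation equal to 1. It shifts and rescales the values, but it does not change the shape of their distribution.
One simple way of standardizing a feature is shown in the following equation:
$$z_i = \frac{x_i - \operatorname{mean}(x)}{\operatorname{std}(x)}$$
Here, $z_i$ corresponds to the ith standardized value, $x_i$ to the ith original value, and $x$ represents all values.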
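To make the two rescaling formulas concrete, here is a small sketch that applies them to a hypothetical array of weights in kilograms using NumPy. The data is made up, and the use of the sample standard deviation (ddof=1) is an assumption chosen to match the default behavior of pandas.

```python
import numpy as np

# Hypothetical feature values (weights in kilograms), for illustration only.
x = np.array([50.0, 60.0, 70.0, 80.0, 100.0])

# Normalization: maps the smallest value to 0 and the largest value to 1.
z_normalized = (x - x.min()) / (x.max() - x.min())

# Standardization: subtracts the mean and divides by the standard deviation.
# ddof=1 gives the sample standard deviation, which is the pandas default.
z_standardized = (x - x.mean()) / x.std(ddof=1)

print(z_normalized)
print(z_standardized)
```

After normalization, all values fall between 0 and 1. After standardization, the values are centered on zero, so some of them are negative.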
This section covers the normalization and standardization of data, using the titanic dataset as an example. Use the same Jupyter Notebook that you created for the last exercise:
Using the age variable that was created in the first exercise of this notebook, normalize the data using the preceding formula and store it in a new variable called age_normalized. Print out the top 10 values:
age_normalized = (age - age.min()) / (age.max() - age.min())
age_normalized.head(10)
As you can see in the preceding screenshot, all of the values have been converted to their equivalents in a range between 0 and 1. By performing the normalization for all of the features, the model will be trained on the features of the same scale.
Again, using the age variable, standardize the data using the formula for standardization, and store it in a variable called age_standardized. Print out the top 10 values:
age_standardized = (age - age.mean()) / age.std()
age_standardized.head(10)
Unlike normalization, standardization centers the values around zero, so they can be negative as well as positive.
Print out the mean and standard deviation of the age_standardized variable to confirm its mean of 0 and standard deviation of 1:
print("Mean: " + str(age_standardized.mean()))
print("Standard Deviation: " + str(age_standardized.std()))
The output is as follows:
Mean: 9.645376503530772e-17
Standard Deviation: 1.0
As you can see, the mean approximates to 0, and the standard deviation is equal to 1, which means that the standardization of the data was successful.
Congratulations! You have successfully applied rescaling methods to your data.
In conclusion, we have covered the final step in data preprocessing, which consists of rescaling data. This process was done in a dataset with features of different scales, with the objective of homogenizing the way data is represented to facilitate the comprehension of the data by the model.
Failing to rescale data will cause the model to train at a slower pace and might negatively affect the performance of the model.
Two methodologies for data rescaling were explained in this topic: normalization and standardization. On one hand, normalization rescales the data to a range from 0 to 1. On the other hand, standardization rescales the data so that it has a mean of 0 and a standard deviation of 1.
Given that there is no rule for selecting the appropriate rescaling methodology, the recommended course of action is to transform the data using two or three rescaling methodologies independently, and then train the model with each transformation to evaluate the methodology that behaves best.
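One possible way to carry out this comparison is sketched below. It is not part of the chapter's exercises: the Wine dataset bundled with scikit-learn, the k-nearest neighbors classifier, and five-fold cross-validation are all assumptions chosen for illustration. Each rescaling method is applied independently to the original data inside its own pipeline.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative dataset bundled with scikit-learn; its features have
# very different scales.
X, y = load_wine(return_X_y=True)

# Each candidate is tested on its own, starting from the original data.
candidates = {
    "no rescaling": None,
    "normalization": MinMaxScaler(),
    "standardization": StandardScaler(),
}

scores = {}
for name, scaler in candidates.items():
    steps = [] if scaler is None else [scaler]
    # The pipeline fits the scaler on the training folds only.
    model = make_pipeline(*steps, KNeighborsClassifier())
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The methodology with the highest average score would be the one to keep for the following steps, although the ranking depends on the dataset and the algorithm.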
You continue to work for the safety department at a cruise company. As you did great work selecting the ideal target feature to develop the study, the department has decided to commission you to preprocess the dataset as well. For this purpose, you need to use all the techniques you have learned about previously to preprocess the dataset and get it ready for model training. The following steps serve to guide you in that direction:
Load the dataset and create the features and target matrices by typing in the following code:
import seaborn as sns
titanic = sns.load_dataset('titanic')
X = titanic[['sex','age','fare','class','embark_town','alone']]
Y = titanic['survived']
Check for missing values and outliers in all the features of the features matrix (X). Choose a methodology to handle them.
notnull(): To detect non-missing values. For instance, X[X["age"].notnull()] will retrieve all the rows in X, except those that are missing values under the column age.
value_counts(): To count the occurrence of unique values of an array. For example, X["sex"].value_counts() will count the number of times the classes male and female are present.
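The two helper methods above can be tried out on a small, hypothetical DataFrame before applying them to the real dataset. The data below is made up and only mimics two of the Titanic columns. The call to isnull().sum() is an additional pandas method, not mentioned in the steps, that counts missing values per column.

```python
import numpy as np
import pandas as pd

# Small hypothetical features matrix that mimics two Titanic columns.
X = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male"],
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
})

# Keep only the rows whose age is not missing.
X_with_age = X[X["age"].notnull()]

# Count how often each category appears in the sex column.
sex_counts = X["sex"].value_counts()

# Count the number of missing values in each column.
missing_per_column = X.isnull().sum()

print(X_with_age)
print(sex_counts)
print(missing_per_column)
```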
Convert all text features into their numeric representations.
Rescale your data, either by normalizing or standardizing.
Results may vary depending on the choices you made. However, you must be left with a dataset with no missing values, outliers, or text features, and with data rescaled.
The objective of the scikit-learn API is to provide an efficient and unified syntax to make machine learning accessible to non-machine learning experts, as well as to facilitate and popularize its use among several industries.
Although it has many collaborators, the scikit-learn API was built and has been updated by considering a set of principles that prevent framework code proliferation, where different pieces of code perform similar functions. Instead, it promotes simple conventions and consistency. Due to this, the scikit-learn API is consistent among all models, and once the main functionalities have been learned, it can be widely used.
The scikit-learn API is divided into three complementary interfaces that share a common syntax and logic: the estimator, the predictor, and the transformer. The estimator interface is used for creating models and fitting the data into them; the predictor, as the name suggests, is used to make predictions based on the models trained before; and finally, the transformer is used for converting data.
The estimator is considered to be the core of the entire API, as it is the interface in charge of fitting the models to the input data. It works by initializing the model to be used, and then applying a fit() method that triggers the learning process to build a model based on the data.
The fit() method receives as arguments the training data, in two separate variables, the features matrix, and the target matrix (conventionally called X_train and Y_train). For unsupervised models, the method only takes in the first argument (X_train).
This method creates the model trained to the input data, which can later be used for predicting.
Some models take other arguments besides the training data, which are called hyperparameters. These are passed when the model is initialized rather than to the fit() method. They are initially set to their default values, but can be tuned to improve the performance of the model, which will be discussed in further sections.
The following is an example of a model being trained:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, Y_train)
First, you need to import the type of algorithm to be used from scikit-learn, for example, a Gaussian Naïve Bayes algorithm for classification. It is good practice to import only the algorithm to be used, and not the entire library, as this keeps the code explicit about what it relies on and avoids cluttering the namespace.
To find out the syntax to import a different model, use the documentation of scikit-learn. Go to the following link, click over the algorithm that you wish to implement, and you will find the instructions there: http://scikit-learn.org/stable/user_guide.html.
The second line of code initializes the model and stores it in a variable. Lastly, the model is fit to the input data.
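The training example above assumes that X_train and Y_train already exist. The following self-contained sketch uses a tiny, made-up dataset so that it can be run as is. The hyperparameter value passed to GaussianNB is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training data: six instances, two features, two classes.
X_train = np.array([
    [1.0, 2.0], [1.2, 1.9], [0.9, 2.2],
    [4.0, 5.0], [4.2, 4.8], [3.9, 5.1],
])
Y_train = np.array([0, 0, 0, 1, 1, 1])

# Hyperparameters are passed when the model is initialized.
model = GaussianNB(var_smoothing=1e-8)

# fit() triggers the learning process on the features and target matrices.
model.fit(X_train, Y_train)

# get_params() returns the hyperparameters the model was created with.
params = model.get_params()
print(params)
print(model.classes_)
```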
In addition to this, the estimator also offers other complementary tasks, as follows:
Feature extraction, which involves transforming input data into numerical features that can be used for machine learning purposes
Feature selection, which selects the features in your data that most contribute to the prediction output of the model
Dimensionality reduction, which takes higher-dimensional data and converts it into a lower dimension
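As an illustration of one of these complementary tasks, the sketch below reduces randomly generated five-dimensional data to two dimensions. The choice of PCA as the dimensionality reduction technique and the random data are assumptions made for this example. The point is that the class follows the same initialize-then-fit pattern as a model.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 instances with 5 features each.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))

# Initialize the estimator, stating the number of dimensions to keep.
pca = PCA(n_components=2)

# fit() learns the projection from the data; no target matrix is needed.
pca.fit(X_train)

# transform() applies the learned projection.
X_reduced = pca.transform(X_train)
print(X_reduced.shape)
```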
As explained previously, the predictor takes the model created by the estimator and extends it to perform predictions on unseen data. In general terms, for supervised models, it feeds the model a new set of data, usually called X_test, to get a corresponding target or label based on the parameters learned during the training of the model.
Moreover, some unsupervised models can also benefit from the predictor. While this method does not output a specific target value, it can be useful to assign a new instance to a cluster.
Following the preceding example, the implementation of the predictor can be seen as follows:
Y_pred = model.predict(X_test)
We apply the predict() method to the previously trained model, and input the new data as an argument to the method.
In addition to predicting, the predictor can also implement methods that are in charge of quantifying the confidence of the prediction, also called the performance of the model. These confidence functions vary from model to model, but their main objective is to determine how far the prediction is from reality. This is done by taking an X_test with its corresponding Y_test and comparing it to the predictions made with the same X_test.
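A minimal sketch of this comparison is shown below, using a made-up dataset with a Gaussian Naïve Bayes classifier. Accuracy is used here as the performance measure, which is an assumption for illustration, as other models and problems call for other metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical training and testing data with two well-separated classes.
X_train = np.array([
    [1.0, 2.0], [1.2, 1.9], [0.9, 2.2],
    [4.0, 5.0], [4.2, 4.8], [3.9, 5.1],
])
Y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[1.1, 2.0], [4.1, 5.0]])
Y_test = np.array([0, 1])

model = GaussianNB()
model.fit(X_train, Y_train)

# Predict labels for the unseen data.
Y_pred = model.predict(X_test)

# Compare the predictions against the true labels of the same X_test.
accuracy = accuracy_score(Y_test, Y_pred)

# For classifiers, score() performs the prediction and the comparison at once.
score = model.score(X_test, Y_test)
print(accuracy, score)
```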
As we saw previously, data is usually transformed before being fed to a model. Considering this, the API contains a transform() method that allows you to perform some preprocessing techniques.
It can be used both as a starting point to transform the input data of the model (X_train), as well as further along to modify data that will be fed to the model for predictions. This latter application is crucial to get accurate results, as it ensures that the new data follows the same distribution as the data used to train the model.
The following is an example of a transformer that standardizes the values of the training data:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
As you can see, after importing and initializing the transformer, it needs to be fit to the data before it can transform it. The fitted transformer can then be applied to the test data:
X_test = scaler.transform(X_test)
The advantage of the transformer is that once it has been applied to the training dataset, it stores the values used for transforming the training data; this can be used to transform the test dataset to the same distribution.
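The stored values can be inspected directly. The sketch below uses made-up training and testing data to show that the scaler learns its statistics from the training data only and reuses them for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: weight in kilograms and number of children.
X_train = np.array([[50.0, 0.0], [60.0, 1.0], [70.0, 2.0], [80.0, 3.0]])
X_test = np.array([[65.0, 1.5], [90.0, 4.0]])

scaler = StandardScaler()
# fit() stores the mean and standard deviation of each training feature.
scaler.fit(X_train)
print(scaler.mean_)
print(scaler.scale_)

X_train_scaled = scaler.transform(X_train)
# The test data is transformed with the statistics learned from X_train.
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```

Because the first test instance equals the training mean of each feature, it is mapped to zero in both columns. Note that the scaler is never fit to X_test.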
In conclusion, we discussed one of the main benefits of using scikit-learn, which is its API. This API follows a consistent structure that makes it easy for non-experts to apply machine learning algorithms.
To train a model with scikit-learn, the first step is to initialize the model class and fit it to the input data using the estimator interface, which is done by calling the fit() method of the class. Finally, once the model has been trained, it is possible to predict new values using the predictor by calling the predict() method of the class.
Additionally, scikit-learn has a transformer interface that allows you to transform data as needed. This is useful for applying preprocessing methods to the training data, which can then also be used to transform the testing data to follow the same distribution.
Machine learning is divided into two main categories: supervised and unsupervised learning.
Supervised learning consists of understanding the relation between a given set of features and a target value, also known as a label or class. For instance, it can be used for modeling the relationship between a person's demographic information and their ability to pay loans, as shown in the following table:
Models trained to foresee these relationships can then be applied to predict labels for new data. As we can see from the preceding example, a bank that builds such a model can then input data from loan applicants to determine if they are likely to pay back the loan.
These models can be further divided into classification and regression tasks, which are explained as follows.
Classification tasks are used to build models out of data with discrete categories as labels; for instance, a classification task can be used to predict whether a person will pay a loan. You can have more than two discrete categories, such as predicting the ranking of a horse in a race, but they must be a finite number.
Most classification tasks output the prediction as the probability of an instance to belong to each output label. The assigned label is the one with the highest probability, as can be seen in the following diagram:
Some of the most common classification algorithms are as follows:
Decision trees: This algorithm follows a tree-like architecture that simulates the decision process given a previous decision.
Naïve Bayes classifier: This algorithm relies on a group of probabilistic equations based on Bayes' theorem, which assumes independence among features. It has the ability to consider several attributes.
Artificial neural networks (ANNs): These replicate the structure and performance of a biological neural network to perform pattern recognition tasks. An ANN consists of interconnected neurons, laid out with a set architecture. They pass information to one another until a result is achieved.
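To illustrate how a classifier outputs one probability per label, the following sketch trains two of the algorithms listed above on the Iris dataset that ships with scikit-learn. The dataset choice is an assumption made for illustration, and the models are evaluated on training instances only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# The Iris dataset has three discrete labels.
X, y = load_iris(return_X_y=True)

results = {}
for model in (DecisionTreeClassifier(random_state=0), GaussianNB()):
    model.fit(X, y)
    # One row per instance and one column per label.
    probabilities = model.predict_proba(X[:5])
    # The assigned label is the one with the highest probability.
    labels = model.classes_[np.argmax(probabilities, axis=1)]
    predicted = model.predict(X[:5])
    results[type(model).__name__] = (probabilities, labels, predicted)

for name, (probabilities, labels, predicted) in results.items():
    print(name, labels, predicted)
```

Both models expose the same fit(), predict(), and predict_proba() methods, so swapping one algorithm for another requires changing only the class that is initialized.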
Regression tasks, on the other hand, are used for data with continuous quantities as labels; for example, a regression task can be used for predicting house prices. This means that the value is represented by a quantity and not by a set of possible outputs. Output labels can be of integer or float types:
The most popular algorithm for regression tasks is linear regression. In its simplest form, it consists of only one independent feature (x) whose relation with its dependent feature (y) is linear. Due to its simplicity, it is often overlooked, even though it performs very well for simple data problems.
Other, more complex regression algorithms include regression trees and support vector regression, as well as ANNs once again.
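A minimal sketch of a regression task is shown below. The data is made up so that it follows the line y = 3x + 2 exactly, which lets the learned parameters be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data that follows y = 3x + 2 exactly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_train = x.reshape(-1, 1)  # scikit-learn expects a 2D features matrix
Y_train = 3.0 * x + 2.0

model = LinearRegression()
model.fit(X_train, Y_train)

# The learned slope and intercept of the line.
print(model.coef_, model.intercept_)

# The prediction is a continuous quantity, not a category.
Y_pred = model.predict(np.array([[6.0]]))
print(Y_pred)
```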
In conclusion, for supervised learning problems, each instance has a correct answer, also known as a label or class. The algorithms under this category aim to understand the data and then predict the class of a given set of features. Depending on the type of class (continuous or discrete), the supervised algorithms can be divided into classification or regression tasks.
Unsupervised learning consists of fitting the model to the data without any relationship to an output label; such data is known as unlabeled data. This means that algorithms under this category seek to understand the data and find patterns in it. For instance, unsupervised learning can be used to understand the profile of people belonging to a neighborhood, as shown in the following diagram:
When applying a predictor over these algorithms, no target label is given as output. The prediction, only available for some models, consists of placing the new instance into one of the subgroups of data that has been created.
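The following sketch shows what such a prediction looks like with k-means, one of the models that supports it. The data points and the choice of two clusters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data with two visibly separate groups.
X_train = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
# fit() receives only the features matrix; there is no target matrix.
model.fit(X_train)

# predict() places a new instance into one of the clusters already created.
new_instance = np.array([[1.1, 1.0]])
cluster = model.predict(new_instance)[0]

print(model.labels_)
print(cluster)
```

The cluster numbers themselves carry no meaning. What matters is that the new instance receives the same number as the training instances that it resembles.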
Unsupervised learning is further divided into different tasks, but the most popular one is clustering, which will be discussed next.
Clustering tasks involve creating groups of data (clusters) such that the instances within a group are similar to each other and differ visibly from the instances in other groups. The output of any clustering algorithm is a label, which assigns the instance to the cluster of that label:
The preceding diagram shows a group of clusters, each of a different size, based on the number of instances that belong to each cluster. Considering this, even though clusters do not need to have the same number of instances, it is possible to set the minimum number of instances per cluster to avoid overfitting the data into tiny clusters of very specific data.
Some of the most popular clustering algorithms are as follows:
k-means: This focuses on separating the instances into n clusters of equal variance by minimizing the sum of the squared distances between each point and the centroid of its cluster.
Mean-shift clustering: This creates clusters by using centroids. Candidate centroids are updated iteratively to be the mean of the points within a given region.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN): This determines clusters as areas with a high density of points, separated by areas with low density.
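As a sketch of density-based clustering, the snippet below runs DBSCAN on made-up data that contains two dense groups and one isolated point. The eps and min_samples values are assumptions tuned to this toy data, with min_samples acting as the minimum number of instances needed to form a dense region.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense groups and one isolated instance.
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
    [4.5, 15.0],
])

# eps is the neighborhood radius; min_samples is the number of instances
# (including the point itself) required for a region to count as dense.
model = DBSCAN(eps=0.5, min_samples=2)

# fit_predict() fits the model and returns the cluster label of each instance.
labels = model.fit_predict(X)
print(labels)
```

DBSCAN labels the isolated instance as -1, which marks it as noise instead of forcing it into a cluster. Unlike k-means, the number of clusters does not have to be set in advance.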
In conclusion, unsupervised algorithms are designed to understand data when there is no label or class that indicates a correct answer for each set of features. The most common types of unsupervised algorithms are the clustering methods that allow you to classify a population into different groups.
Machine learning consists of constructing models, some of which are based on complicated mathematical concepts, to understand data. Scikit-learn is an open source Python library that is meant to facilitate the process of applying these models to data problems, without much complex math knowledge required.
This chapter first covered an important step in solving a data problem, that is, representing the data in a tabular manner. Then, the steps involved in the creation of features and target matrices, data preprocessing, and choosing an algorithm were also covered.
Finally, after selecting the type of algorithm that best suits the data problem, the construction of the model can begin through the use of the scikit-learn API, which has three interfaces: estimators, predictors, and transformers. Thanks to the uniformity of the API, learning to use the methods for one algorithm is enough to enable their use for others.
With all of this in mind, in the next chapter, we will focus on detailing the process of applying an unsupervised algorithm to a real-life dataset.