
You're reading from Mastering Numerical Computing with NumPy

Product type: Book
Published in: Jun 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781788993357
Edition: 1st
Authors (3):
Umit Mert Cakmak

Umit Mert Cakmak is a data scientist at IBM, where he excels at helping clients solve complex data science problems, from inception to delivery of deployable assets. His research spans multiple disciplines beyond his industry and he likes sharing his insights at conferences, universities, and meet-ups.

Tiago Antao

Tiago Antao is a bioinformatician currently working in the field of genomics. A former computer scientist, Tiago moved into computational biology with an MSc in Bioinformatics from the Faculty of Sciences at the University of Porto (Portugal) and a PhD on the spread of drug-resistant malaria from the Liverpool School of Tropical Medicine (UK). As a postdoctoral researcher, Tiago worked with human datasets at the University of Cambridge (UK) and with mosquito whole-genome sequencing data at the University of Oxford (UK), before helping to set up the bioinformatics infrastructure at the University of Montana. He currently works as a data engineer in the biotechnology field in Boston, MA. He is one of the co-authors of Biopython, a major bioinformatics package written in Python.

Mert Cuhadaroglu

Mert Cuhadaroglu is a BI developer at EPAM, developing E2E analytics solutions for complex business problems in various industries, mostly investment banking, FMCG, media, communication, and pharma. He consistently uses advanced statistical models and ML algorithms to provide actionable insights. Throughout his career, he has worked in several other industries, such as banking and asset management. He continues his academic research in AI for trading algorithms.


Predicting Housing Prices Using Linear Regression

In this chapter, we will introduce supervised learning and predictive modeling by implementing linear regression. In the previous chapter, you learned about exploratory analysis but had not yet looked at modeling. Here, we will create a linear regression model to predict housing market prices. Broadly speaking, we are going to predict a target variable with the help of its relationship with other variables. Linear regression is very widely used and is one of the simplest supervised machine learning models; it is essentially about fitting a line to the observed data. We will start our journey by explaining supervised learning and linear regression. Then, we will analyze the crucial concepts of linear regression, such as independent and dependent variables, hyperparameters, loss and error functions, and stochastic...

Supervised learning and linear regression

Machine learning gives computer systems the ability to learn without explicit programming. One of the most common types of machine learning is supervised learning. Supervised learning covers a family of algorithms that formulate a learning problem and solve it by mapping inputs to outputs using historical data. The algorithm analyzes the inputs and their corresponding outputs, then links them together to find a relationship (learning). Finally, given a new dataset, it predicts the output by using what it has learned.

In order to differentiate between supervised and unsupervised learning, we can think in terms of input/output-based modeling. In supervised learning, the computer system is supervised with a label for every set of input data. In unsupervised learning, the computer system uses only the input data, without any labels...
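The contrast above can be sketched in a few lines of code. This is a minimal illustration with hypothetical toy data, not part of the chapter's own examples: a supervised estimator receives both inputs and labels, while an unsupervised one receives inputs only.

```python
# Minimal contrast between supervised and unsupervised learning
# on hypothetical toy data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # input data
y = np.array([2.0, 4.0, 6.0, 8.0])           # labels (outputs)

# Supervised: the model is "supervised" by the labels y.
supervised = LinearRegression().fit(X, y)
print(supervised.predict([[5.0]]))           # learned mapping, ~10.0 here

# Unsupervised: only the inputs X are used; no labels at all.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(unsupervised.labels_)                  # cluster assignments
```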

Independent and dependent variables

As we mentioned in the previous subsection, linear regression is used to predict the value of a variable based on other variables. We are investigating the relationship between the input variables, X, and the output variable, Y.

In linear regression, the dependent variable is the variable that we want to predict. We call it the dependent variable because of the assumption behind linear regression: the model assumes that this variable depends on the variables on the other side of the equation, which are called independent variables.

In a simple regression model, the model explains how the dependent variable changes based on a single independent variable.

As an example, let's imagine that we want to analyze how sales values are affected by changes in the price of a given product. If you read this sentence carefully, you can...
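The price/sales example can be sketched as a simple regression, with price as the independent variable and sales as the dependent variable. The numbers below are hypothetical, invented purely for illustration:

```python
# Simple linear regression of sales (dependent) on price (independent),
# using hypothetical data points.
import numpy as np

price = np.array([10.0, 12.0, 14.0, 16.0, 18.0])   # independent variable X
sales = np.array([100.0, 92.0, 83.0, 76.0, 68.0])  # dependent variable Y

# Fit sales = b0 + b1 * price by ordinary least squares.
b1, b0 = np.polyfit(price, sales, deg=1)
print(f"sales ≈ {b0:.1f} + ({b1:.1f}) * price")
```

The negative slope captures the intuition in the text: as price rises, sales fall.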

Hyperparameters

Before we start, it's worth explaining why we call them hyperparameters and not parameters. In machine learning, model parameters can be learned from the data, which means that while you train your model, you fit the model's parameters. Hyperparameters, on the other hand, are usually set before we start training the model. To give an example, you can think of the coefficients in a regression model as model parameters. Examples of hyperparameters include the learning rate in many different models and the number of clusters (k) in k-means clustering.
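The distinction can be made concrete with a short sketch on hypothetical toy data: the regression coefficients come out of fitting, while k for k-means goes in before fitting.

```python
# Parameters vs. hyperparameters on hypothetical toy data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Model parameters are learned FROM the data during fitting:
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)    # learned coefficient and intercept

# Hyperparameters are set BEFORE training, e.g. the number of
# clusters k in k-means:
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                   # result depends on the chosen k
```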

Another important topic is the relationship between model parameters and hyperparameters, and how together they shape our machine learning model, in other words, the hypothesis of our model. In machine learning, parameters are used for configuring the model, and this configuration tailors the algorithm...

Loss and error functions

In the previous subsections, we explained supervised and unsupervised learning. Regardless of which machine learning algorithm is used, our main challenge is optimization. In an optimization problem, we are actually trying to minimize the loss function. Imagine a case where you are trying to optimize your monthly savings. With a fixed income, what you would do is minimize your spending, in other words, minimize your loss function.

A very common way to build a loss function is to start with the difference between the predicted value and the actual value. In general, we estimate the parameters of our model and then make a prediction. The main measurement we can use to evaluate how good our prediction is involves calculating the difference between the predicted and actual values:
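As a concrete instance of this idea, the mean squared error loss squares those differences and averages them. The numbers here are hypothetical, used only to show the computation:

```python
# Mean squared error: start from the prediction/actual differences,
# square them, and average (hypothetical values).
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # actual values
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # predicted values

errors = y_pred - y_true                   # per-sample differences
mse = np.mean(errors ** 2)                 # squared, then averaged
print(mse)
```

Squaring keeps positive and negative errors from canceling out and penalizes large misses more heavily.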

In different models, different loss functions are used. For...

Univariate linear regression with gradient descent

In this subsection, we will implement univariate linear regression for the Boston housing dataset, which we used for exploratory data analysis in the previous chapter. Before we fit the regression line, let's import the necessary libraries and load the dataset as follows:

In [1]: import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was renamed to model_selection
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
%matplotlib inline
In [2]: from sklearn.datasets import load_boston
dataset = load_boston()
samples, label, feature_names = dataset.data, dataset.target, dataset.feature_names
In [3]: bostondf = pd.DataFrame(dataset.data)
bostondf.columns = dataset.feature_names
bostondf['Target Price'] = dataset.target
...
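The omitted fitting steps can be sketched as follows. This is a hedged illustration of univariate regression trained with batch gradient descent on synthetic data rather than the Boston columns, so the snippet stays self-contained; the variable names and the learning rate are illustrative choices, not the chapter's own:

```python
# Univariate linear regression fitted with batch gradient descent
# on synthetic data (true slope 3.0, true intercept 5.0).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 5.0 + rng.normal(0, 0.5, 100)

w, b = 0.0, 0.0    # model parameters, learned from the data
lr = 0.01          # learning rate, a hyperparameter set in advance

for _ in range(5000):
    y_hat = w * x + b
    # Gradients of the mean squared error loss w.r.t. w and b:
    dw = 2 * np.mean((y_hat - y) * x)
    db = 2 * np.mean(y_hat - y)
    # Step against the gradient to reduce the loss:
    w -= lr * dw
    b -= lr * db

print(w, b)   # should approach the true values 3.0 and 5.0
```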

Using linear regression to model housing prices

In this section, we will perform multivariate linear regression on the same dataset. In contrast to the previous section, we will use the sklearn library to show you several ways of building linear regression models. Before we fit the model, we will trim the dataset proportionally from both ends by using the trimboth() method. By doing this, we will cut off the outliers:

In [14]: import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
from sklearn.linear_model import LinearRegression
In [15]: from sklearn.datasets import load_boston
dataset = load_boston()
In [16]: samples, label, feature_names = dataset.data, dataset.target, dataset.feature_names
In [17]: samples_trim = stats.trimboth(samples, 0.1)
label_trim...
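To see what trimboth() actually does, here is a hedged sketch on synthetic data (not the Boston columns): it sorts along axis 0 and slices the given proportion off each end, so trimming 10% from both ends of 200 rows leaves 160.

```python
# scipy.stats.trimboth sorts along axis 0 and removes the given
# proportion from both ends (synthetic data for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
samples = rng.normal(0, 1, (200, 3))          # 200 rows, three features

samples_trim = stats.trimboth(samples, 0.1)   # cut 10% from each end
print(samples.shape)                          # (200, 3)
print(samples_trim.shape)                     # (160, 3)
```

One thing to keep in mind: because trimboth() sorts each column independently before trimming, the trimmed rows no longer correspond across arrays, so trimming features and labels separately discards the original row pairing.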

Summary

Linear regression is one of the most common techniques for modeling the relationship between continuous variables, and it is very widely used in industry. We started the modeling part of the book with linear regression not just because it's very popular, but because it's a relatively simple technique that contains most of the elements that almost every machine learning algorithm has.

In this chapter, we learned about supervised and unsupervised learning and built a linear regression model using the Boston housing dataset. We touched upon important concepts such as hyperparameters, loss functions, and gradient descent. The main purpose of this chapter was to give you sufficient knowledge to build and tune a linear regression model and understand, step by step, what it does. We looked at two practical cases where we...

