Reader small image

You're reading from  Large Scale Machine Learning with Python

Product typeBook
Published inAug 2016
Reading LevelIntermediate
PublisherPackt
ISBN-139781785887215
Edition1st Edition
Languages
Right arrow
Authors (2):
Bastiaan Sjardin
Bastiaan Sjardin
author image
Bastiaan Sjardin

Bastiaan Sjardin is a data scientist and founder with a background in artificial intelligence and mathematics. He has a MSc degree in cognitive science obtained at the University of Leiden together with on campus courses at Massachusetts Institute of Technology (MIT). In the past 5 years, he has worked on a wide range of data science and artificial intelligence projects. He is a frequent community TA at Coursera in the social network analysis course from the University of Michigan and the practical machine learning course from Johns Hopkins University. His programming languages of choice are Python and R. Currently, he is the cofounder of Quandbee (http://www.quandbee.com/), a company providing machine learning and artificial intelligence applications at scale.
Read more about Bastiaan Sjardin

Alberto Boschetti
Alberto Boschetti
author image
Alberto Boschetti

Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a Ph.D. in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges ranging from natural language processing (NLP) and behavioral analysis to machine learning and distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending meet-ups, conferences, and other events.
Read more about Alberto Boschetti

View More author details
Right arrow

Out-of-core CART with H2O


Up until now, we have only dealt with desktop solutions for CART models. In Chapter 4, Neural Networks and Deep Learning, we introduced H2O for deep learning out of memory that provided a powerful scalable method. Luckily, H2O also provides tree ensemble methods utilizing its powerful parallel Hadoop ecosystem. As we covered GBM and random forest extensively in previous sections, let's get to it right away. For this exercise, we will use the spam dataset that we used before.

Random forest and gridsearch on H2O

Let's implement a random forest with gridsearch hyperparameter optimization. In this section, we first load the spam dataset from the URL source:

import pandas as pd
import numpy as np
import os
import xlrd
import urllib
import h2o

#set your path here
os.chdir('/yourpath/')

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data'
filename='spamdata.data'
urllib.urlretrieve(url, filename)

Now that we have loaded the data, we can...

lock icon
The rest of the page is locked
Previous PageNext Page
You have been reading a chapter from
Large Scale Machine Learning with Python
Published in: Aug 2016Publisher: PacktISBN-13: 9781785887215

Authors (2)

author image
Bastiaan Sjardin

Bastiaan Sjardin is a data scientist and founder with a background in artificial intelligence and mathematics. He has a MSc degree in cognitive science obtained at the University of Leiden together with on campus courses at Massachusetts Institute of Technology (MIT). In the past 5 years, he has worked on a wide range of data science and artificial intelligence projects. He is a frequent community TA at Coursera in the social network analysis course from the University of Michigan and the practical machine learning course from Johns Hopkins University. His programming languages of choice are Python and R. Currently, he is the cofounder of Quandbee (http://www.quandbee.com/), a company providing machine learning and artificial intelligence applications at scale.
Read more about Bastiaan Sjardin

author image
Alberto Boschetti

Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a Ph.D. in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges ranging from natural language processing (NLP) and behavioral analysis to machine learning and distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending meet-ups, conferences, and other events.
Read more about Alberto Boschetti