Building Machine Learning Systems with Python


Product type Book
Published in Jul 2013
Publisher Packt
ISBN-13 9781782161400
Pages 290 pages
Edition 1st Edition

Table of Contents (20 chapters)

Building Machine Learning Systems with Python
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
1. Getting Started with Python Machine Learning
2. Learning How to Classify with Real-world Examples
3. Clustering – Finding Related Posts
4. Topic Modeling
5. Classification – Detecting Poor Answers
6. Classification II – Sentiment Analysis
7. Regression – Recommendations
8. Regression – Recommendations Improved
9. Classification III – Music Genre Classification
10. Computer Vision – Pattern Recognition
11. Dimensionality Reduction
12. Big(ger) Data
Where to Learn More about Machine Learning
Index

Chapter 12. Big(ger) Data

While computers keep getting faster and have more memory, the size of the data has grown as well. In fact, data has grown faster than computational speed, and this means that it has grown faster than our ability to process it.

It is not easy to say what is big data and what is not, so we will adopt an operational definition: when data is so large that it becomes too cumbersome to work with, we refer to it as big data. In some areas, this might mean petabytes of data or trillions of transactions, that is, data that will not fit on a single hard drive. In other cases, the data may be one hundred times smaller, but still difficult to work with.

We will first build upon some of the experience of the previous chapters and work with what we can call the medium data setting (not quite big data, but not small either). For this, we will use a package called jug, which allows you to do the following:

  • Break up your pipeline into tasks

  • Cache (memoize) intermediate results

  • Make use of multiple cores...

Learning about big data


The expression "big data" does not refer to a specific amount of data, whether measured in the number of examples or in the number of gigabytes, terabytes, or petabytes that the data takes up. It means the following:

  • Data has been growing faster than processing power

  • Some of the methods and techniques that worked well in the past now need to be redone, as they do not scale well

  • Your algorithms cannot assume that the entire dataset fits in RAM

  • Managing data becomes a major task in itself

  • Using computer clusters or multicore machines becomes a necessity and not a luxury

This chapter will focus on this last piece of the puzzle: how to use multiple cores (either on the same machine or on separate machines) to speed up and organize your computations. This will also be useful in other medium-sized data tasks.
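To make the idea of spreading work over several cores concrete, here is a minimal sketch that uses only Python's standard `multiprocessing` module rather than jug. The function names (`sum_of_squares`, `split`, `total_sum_of_squares`) and the toy workload are assumptions made for illustration; they do not come from the book.

```python
from multiprocessing import Pool


def sum_of_squares(chunk):
    """Work done by one worker: sum x*x over one slice of the data."""
    return sum(x * x for x in chunk)


def split(data, n_chunks):
    """Split a list into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def total_sum_of_squares(data, n_chunks, map_fn=map):
    """Combine the per-chunk results; map_fn decides where the work runs."""
    return sum(map_fn(sum_of_squares, split(data, n_chunks)))


if __name__ == "__main__":
    data = list(range(1_000_000))
    # With no argument, Pool() sizes itself from the CPU count the system reports.
    with Pool() as pool:
        parallel = total_sum_of_squares(data, 8, map_fn=pool.map)
    # The serial run uses the built-in map and must give the same answer.
    assert parallel == total_sum_of_squares(data, 8)
    print(parallel)
```

Passing the mapping function as a parameter keeps the splitting and combining logic identical in the serial and parallel runs, so the two are easy to check against each other.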

Using jug to break up your pipeline into tasks


Often, we have a simple pipeline: we preprocess the initial data, compute features, and then we need to call a machine learning algorithm with the resulting features.

Jug is a package developed by Luis Pedro Coelho, one of the authors of this book. It is open source (using the liberal MIT License) and can be useful in many areas but was designed specifically around data analysis problems. It simultaneously solves several problems, for example:

  • It can memoize results to a disk (or a database), which means that if you ask it to compute something you have computed before, the result is instead read from the disk.

  • It can use multiple cores or even multiple computers on a cluster. Jug was also designed to work very well in batch computing environments that use a queuing system such as Portable Batch System (PBS), the Load Sharing Facility (LSF), or the Oracle Grid Engine (OGE, earlier known as Sun Grid Engine). This will be used in the second half...
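The caching behavior described in the first point can be sketched with the standard library alone. The following is not jug's implementation or its API; it is a simplified illustration of memoizing to disk, and the names `memoize_to_disk`, `compute_features`, and `cache_dir` are hypothetical.

```python
import hashlib
import os
import pickle
import tempfile
from functools import wraps


def memoize_to_disk(cache_dir):
    """Decorator factory: cache a function's results as pickle files."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args):
            # The cache key is a hash of the function name and its arguments.
            key = hashlib.sha1(pickle.dumps((func.__name__, args))).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):
                # Computed before: read the result back instead of recomputing.
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


cache_dir = tempfile.mkdtemp()  # throwaway directory for this demonstration
calls = []  # records every real computation, to make cache hits visible


@memoize_to_disk(cache_dir)
def compute_features(values):
    calls.append(values)
    return [v * v for v in values]


first = compute_features((1, 2, 3))   # computed and written to disk
second = compute_features((1, 2, 3))  # read back from disk, not recomputed
```

Because the results live in files rather than in memory, a second run of the same script (or another process pointed at the same directory) would also find them. This sketch ignores concerns a real tool has to handle, such as concurrent writers and changes to the function's code.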

Using Amazon Web Services (AWS)


When you have a lot of data and a lot of computation, you might start to crave more computing power. Amazon (aws.amazon.com/) allows you to rent computing power by the hour. Thus, you can access a large amount of computing power without having to commit in advance by purchasing a large number of machines (and paying the costs of managing the infrastructure). There are other competitors in this market, but Amazon is the largest player, so we briefly cover it here.

Amazon Web Services (AWS) is a large set of services. We will focus only on the Elastic Compute Cloud (EC2) service. This service offers you virtual machines and disk space, which can be allocated and deallocated quickly.

There are three modes of use: a reserved mode, whereby you prepay to have cheaper per-hour access; a fixed per-hour rate; and a variable rate which depends on the overall compute market (when there is less demand, the costs are lower; when there is more demand, the prices go up).
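The choice between the reserved mode and the fixed per-hour rate comes down to simple arithmetic: prepaying is worthwhile once the hours used, multiplied by the per-hour saving, exceed the prepayment. The sketch below works this out. All prices in it are made up for illustration and are not Amazon's; the function and parameter names are likewise hypothetical.

```python
def break_even_hours(upfront_fee, reserved_rate, hourly_rate):
    """Hours of use after which prepaying becomes cheaper than paying per hour."""
    if reserved_rate >= hourly_rate:
        raise ValueError("reserved rate must be lower than the per-hour rate")
    return upfront_fee / (hourly_rate - reserved_rate)


def cheaper_mode(hours, upfront_fee, reserved_rate, hourly_rate):
    """Compare the total cost of the two modes for a given number of hours."""
    reserved_cost = upfront_fee + hours * reserved_rate
    per_hour_cost = hours * hourly_rate
    return "reserved" if reserved_cost < per_hour_cost else "per-hour"


# Hypothetical prices: 100 paid upfront, then 0.25 per hour,
# compared with 0.50 per hour and no prepayment.
threshold = break_even_hours(upfront_fee=100.0, reserved_rate=0.25, hourly_rate=0.5)
short_job = cheaper_mode(300, 100.0, 0.25, 0.5)
long_job = cheaper_mode(500, 100.0, 0.25, 0.5)
```

With these made-up numbers, a job shorter than the threshold is cheaper at the per-hour rate and a longer one is cheaper in the reserved mode. The variable rate is left out because its price is not known in advance.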

For...

Summary


We saw how to use jug, a little Python framework, to manage computations in a way that takes advantage of multiple cores or multiple machines. Although this framework is generic, it was built specifically to address the data analysis needs of its author (who is also an author of this book). Therefore, it has several aspects that make it fit in with the rest of the Python machine learning environment.

We also learned about AWS, the Amazon cloud. Using cloud computing is often a more effective use of resources than building in-house computing capacity. This is particularly true if your needs are not constant, but changing. StarCluster even allows for clusters that automatically grow as you launch more jobs and shrink as they terminate.

This is the end of the book. We have come a long way. We learned how to perform classification when we have labeled data and clustering when we do not. We learned about dimensionality reduction and topic modeling to make sense of large datasets. Towards...
