Building Machine Learning Systems with Python
Overview

  • Master Machine Learning using a broad set of Python libraries and start building your own Python-based ML systems
  • Covers classification, regression, feature engineering, and much more guided by practical examples
  • A scenario-based tutorial that gets you into the mindset of a machine learner (data exploration) and shows you how to apply it to your new or existing projects

Book Details

Language : English
Paperback : 290 pages [ 235mm x 191mm ]
Release Date : July 2013
ISBN : 1782161406
ISBN 13 : 9781782161400
Author(s) : Willi Richert, Luis Pedro Coelho
Topics and Technologies : All Books, Big Data and Business Intelligence, Open Source, Python

Table of Contents

Preface
Chapter 1: Getting Started with Python Machine Learning
Chapter 2: Learning How to Classify with Real-world Examples
Chapter 3: Clustering – Finding Related Posts
Chapter 4: Topic Modeling
Chapter 5: Classification – Detecting Poor Answers
Chapter 6: Classification II – Sentiment Analysis
Chapter 7: Regression – Recommendations
Chapter 8: Regression – Recommendations Improved
Chapter 9: Classification III – Music Genre Classification
Chapter 10: Computer Vision – Pattern Recognition
Chapter 11: Dimensionality Reduction
Chapter 12: Big(ger) Data
Appendix: Where to Learn More about Machine Learning
Index
  • Chapter 1: Getting Started with Python Machine Learning
    • Machine learning and Python – the dream team
    • What the book will teach you (and what it will not)
    • What to do when you are stuck
    • Getting started
      • Introduction to NumPy, SciPy, and Matplotlib
      • Installing Python
      • Chewing data efficiently with NumPy and intelligently with SciPy
      • Learning NumPy
        • Indexing
        • Handling non-existing values
        • Comparing runtime behaviors
      • Learning SciPy
    • Our first (tiny) machine learning application
      • Reading in the data
      • Preprocessing and cleaning the data
      • Choosing the right model and learning algorithm
        • Before building our first model
        • Starting with a simple straight line
        • Towards some advanced stuff
        • Stepping back to go forward – another look at our data
        • Training and testing
        • Answering our initial question
    • Summary
  • Chapter 2: Learning How to Classify with Real-world Examples
    • The Iris dataset
      • The first step is visualization
      • Building our first classification model
        • Evaluation – holding out data and cross-validation
    • Building more complex classifiers
    • A more complex dataset and a more complex classifier
      • Learning about the Seeds dataset
      • Features and feature engineering
      • Nearest neighbor classification
    • Binary and multiclass classification
    • Summary
  • Chapter 3: Clustering – Finding Related Posts
    • Measuring the relatedness of posts
      • How not to do it
      • How to do it
    • Preprocessing – similarity measured as similar number of common words
      • Converting raw text into a bag-of-words
      • Counting words
      • Normalizing the word count vectors
      • Removing less important words
      • Stemming
        • Installing and using NLTK
        • Extending the vectorizer with NLTK's stemmer
      • Stop words on steroids
      • Our achievements and goals
    • Clustering
      • KMeans
      • Getting test data to evaluate our ideas on
      • Clustering posts
    • Solving our initial challenge
      • Another look at noise
    • Tweaking the parameters
    • Summary
  • Chapter 4: Topic Modeling
    • Latent Dirichlet allocation (LDA)
      • Building a topic model
    • Comparing similarity in topic space
      • Modeling the whole of Wikipedia
    • Choosing the number of topics
    • Summary
  • Chapter 5: Classification – Detecting Poor Answers
    • Sketching our roadmap
    • Learning to classify classy answers
      • Tuning the instance
      • Tuning the classifier
    • Fetching the data
      • Slimming the data down to chewable chunks
      • Preselection and processing of attributes
      • Defining what is a good answer
    • Creating our first classifier
      • Starting with the k-nearest neighbor (kNN) algorithm
      • Engineering the features
      • Training the classifier
      • Measuring the classifier's performance
      • Designing more features
    • Deciding how to improve
      • Bias-variance and its trade-off
      • Fixing high bias
      • Fixing high variance
      • High bias or low bias
    • Using logistic regression
      • A bit of math with a small example
      • Applying logistic regression to our postclassification problem
    • Looking behind accuracy – precision and recall
    • Slimming the classifier
    • Ship it!
    • Summary
  • Chapter 6: Classification II – Sentiment Analysis
    • Sketching our roadmap
    • Fetching the Twitter data
    • Introducing the Naive Bayes classifier
      • Getting to know the Bayes theorem
      • Being naive
      • Using Naive Bayes to classify
      • Accounting for unseen words and other oddities
      • Accounting for arithmetic underflows
    • Creating our first classifier and tuning it
      • Solving an easy problem first
      • Using all the classes
      • Tuning the classifier's parameters
    • Cleaning tweets
    • Taking the word types into account
      • Determining the word types
      • Successfully cheating using SentiWordNet
      • Our first estimator
      • Putting everything together
    • Summary
  • Chapter 7: Regression – Recommendations
    • Predicting house prices with regression
      • Multidimensional regression
      • Cross-validation for regression
    • Penalized regression
      • L1 and L2 penalties
      • Using Lasso or Elastic nets in scikit-learn
    • P greater than N scenarios
      • An example based on text
      • Setting hyperparameters in a smart way
      • Rating prediction and recommendations
    • Summary
  • Chapter 8: Regression – Recommendations Improved
    • Improved recommendations
      • Using the binary matrix of recommendations
      • Looking at the movie neighbors
      • Combining multiple methods
    • Basket analysis
      • Obtaining useful predictions
      • Analyzing supermarket shopping baskets
      • Association rule mining
      • More advanced basket analysis
    • Summary
  • Chapter 9: Classification III – Music Genre Classification
    • Sketching our roadmap
    • Fetching the music data
      • Converting into a wave format
    • Looking at music
      • Decomposing music into sine wave components
    • Using FFT to build our first classifier
      • Increasing experimentation agility
      • Training the classifier
      • Using the confusion matrix to measure accuracy in multiclass problems
      • An alternate way to measure classifier performance using receiver operator characteristic (ROC)
    • Improving classification performance with Mel Frequency Cepstral Coefficients
    • Summary
  • Chapter 10: Computer Vision – Pattern Recognition
    • Introducing image processing
    • Loading and displaying images
      • Basic image processing
        • Thresholding
        • Gaussian blurring
        • Filtering for different effects
      • Adding salt and pepper noise
        • Putting the center in focus
      • Pattern recognition
      • Computing features from images
      • Writing your own features
    • Classifying a harder dataset
    • Local feature representations
    • Summary
  • Chapter 11: Dimensionality Reduction
    • Sketching our roadmap
    • Selecting features
      • Detecting redundant features using filters
        • Correlation
        • Mutual information
      • Asking the model about the features using wrappers
    • Other feature selection methods
    • Feature extraction
      • About principal component analysis (PCA)
        • Sketching PCA
        • Applying PCA
      • Limitations of PCA and how LDA can help
    • Multidimensional scaling (MDS)
    • Summary
  • Chapter 12: Big(ger) Data
    • Learning about big data
    • Using jug to break up your pipeline into tasks
      • About tasks
      • Reusing partial results
      • Looking under the hood
      • Using jug for data analysis
    • Using Amazon Web Services (AWS)
      • Creating your first machines
        • Installing Python packages on Amazon Linux
        • Running jug on our cloud machine
      • Automating the generation of clusters with starcluster
    • Summary

                            Willi Richert

                            Willi Richert has a PhD in Machine Learning and Robotics, and he currently works for Microsoft in the Core Relevance Team of Bing, where he is involved in a variety of machine learning areas such as active learning and statistical machine translation.

                            Luis Pedro Coelho

                            Luis Pedro Coelho is a Computational Biologist: someone who uses computers as a tool to understand biological systems. Within this large field, Luis works in Bioimage Informatics, which is the application of machine learning techniques to the analysis of images of biological specimens. His main focus is on the processing of large-scale image data. With robotic microscopes, it is possible to acquire hundreds of thousands of images in a day, and visual inspection of all the images becomes impossible. Luis has a PhD from Carnegie Mellon University, which is one of the leading universities in the world in the area of machine learning. He is also the author of several scientific publications. Luis started developing open source software in 1998 as a way to apply to real code what he was learning in his computer science courses at the Technical University of Lisbon. In 2004, he started developing in Python and has contributed to several open source libraries in this language. He is the lead developer of mahotas, the popular computer vision package for Python, and has contributed to several other machine learning packages.

                            Code Downloads

                            Download the code and support files for this book.


                            Submit Errata

                            Please let us know if you have found any errors not already listed here by completing our errata submission form. Our editors will check them and add them to this list. Thank you.


                            Errata

                            - 45 submitted: last submission 04 Jul 2014

                            Errata type: Code | Page number: 30

                            The text says "Let fbt2 be the winning polynomial of degree 2" but it is not shown how it is declared.

                            With fbt2 the author was referring to the polynomial of degree 2; the code was not shown explicitly because it follows directly from the degree-1 fit shown earlier.

                            In the code file analyze_webstats.py, you can see how it is calculated:

                            train = sorted(shuffled[split_idx:])
                            fbt1 = sp.poly1d(sp.polyfit(xb[train], yb[train], 1))
                            fbt2 = sp.poly1d(sp.polyfit(xb[train], yb[train], 2))
                            fbt3 = sp.poly1d(sp.polyfit(xb[train], yb[train], 3))
                            fbt10 = sp.poly1d(sp.polyfit(xb[train], yb[train], 10))
                            fbt100 = sp.poly1d(sp.polyfit(xb[train], yb[train], 100))
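
                            For reference, the fitted poly1d objects are callable, so the winning degree-2 polynomial can be evaluated directly (an illustrative usage sketch, not code from the book):

                            # predict the y value at x = 5 using the degree-2 fit
                            print(fbt2(5))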

                            Errata type: Code | Page number: 30

                            Line 8 in the first code snippet:

                            clf.fit(X, Y)

                            should be:

                            clf.fit(X_train, Y_train)

                            Errata type: Code | Page number: 35

                            # use numpy operations to get setosa features
                            is_setosa = (labels == 'setosa')

                            The variable labels is not defined earlier. Replace that code with:

                            is_setosa = (target == 0)

                            Alternatively, the author suggests adding the following lines before is_setosa = (labels == 'setosa'):

                            target_names = data['target_names']
                            labels = target_names[target]

                            Errata type: Code | Page number: 54

                            Next to last code block; the method body should be indented:

                            >>> import scipy as sp
                            >>> def dist_raw(v1, v2):
                            ...     delta = v1-v2
                            ...     return sp.linalg.norm(delta.toarray())
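
                            As a quick sanity check of the fixed function (a minimal sketch; the sparse vectors are invented for illustration):

                            >>> from scipy.sparse import csr_matrix
                            >>> v1 = csr_matrix([[1, 0, 2]])
                            >>> v2 = csr_matrix([[0, 1, 2]])
                            >>> dist_raw(v1, v2)
                            1.4142135623730951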

                            Errata type: Code | Page number: 55

                            Missing line where "dist" is set to "dist_raw". The method "dist_raw" is defined at the bottom of page 54, but then "dist" is used on page 55, which is confusing. The omitted line is included in the downloaded code.
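
                            Presumably the omitted line is just the assignment below (an assumption based on the description above; the downloaded code is authoritative):

                            dist = dist_raw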

                            Errata type: Code | Page number: 58

                            This should have a line break before u'imagin':

                            >>> s.stem("imagination")
                            u'imagin'

                            Errata type: Code | Page number: 67

                            'comp.sys.ma c.hardware' should not have a space. The output shown in the book was generated by the author with the errant name:

                            >>> print(len(train_data.filenames))
                            3414

                            The actual result is 4119 if the group name is fixed.

                            Errata type: Code | Page number: 68

                            "We now have a pool of 3,414..."
                            See previous error on Page 67. The errant space in the code causes the entire "comp.sys.mac.hardward" group to be skipped.

                             

                            Note from the authors: Code for Chapter 6

                            As you might know, Twitter changes its API from time to time. When the authors edited the book, they asked Twitter whether it would release the data to ease this foreseeable pain point for readers. Unfortunately, this was not possible. They have therefore put all of the book's code examples on GitHub, where they actively maintain them. In the case of Chapter 6, Twitter changed its API in version 1.1 to require user authentication. Willi has pushed an updated version of the code that correctly handles the new API 1.1: https://github.com/luispedro/BuildingMachineLearningSystemsWithPython/tree/master/ch06

                            Errata type: Typo | Page number: 261

                            In the Books section, the line:

                            If you are interested in that aspect of machine learning, then we recommend Pattern Recognition and Machine Learning, C. Bishop, Springer

                            should have the book title, Pattern Recognition and Machine Learning, set in italics.

                            Errata type: Typo | Page number: 66

                            Line 18, For convenience, the dataset module also contains...

                            Should be

                            For convenience, the datasets module also contains...

                            Errata type: Typo | Page number: 13

                            The Text on the bottom of the page says "In this case, it's a one-dimensional array of five elements."

                            Should be

                            In this case, it's a one-dimensional array of six elements.

                            Errata type: Typo | Page number: 245

                            As we saw in Chapter 10, Computer Vision–Pattern Recognition Finding Related Posts, this can easily be done by changing the computation code feature.

                            Should be:

                            As we saw in Chapter 10, Computer Vision–Pattern Recognition, this can easily be done by changing the computation code feature.

                            Errata type: Typo | Page number: 247

                            We will now look back at Chapter 10, Computer Vision–Pattern Recognition Finding Related Posts.

                            Should be:

                            We will now look back at Chapter 10, Computer Vision–Pattern Recognition.

                            Errata type: Code | Page number: 194

                            X, Y = [],[] in read_ceps() function

                            Should be

                            X, y = [],[]

                            Errata type: Graphics | Page number: 127

                            The third formula should have a log in the last term like this.

                            C_{best} = \arg\max_{c \in C} \left[ \log P(C=c) + \sum_k \log P(F_k|C=c) \right]

                            Errata type: Graphics | Page number: 127

                            The second formula is as follows:

                            C_{best} = \arg\max_{c \in C} \left[ \log P(C=c) + \log P(F_1|C=c) + \log P(F_2|C=c) \right]

                            Suggestion | Page number: 126

                            One of our readers has suggested that the second formula be given as follows:

                            \log\left(P(C) \cdot P(F_1|C) \cdot P(F_2|C)\right) = \log P(C) + \log P(F_1|C) + \log P(F_2|C)

                            The original formula given in the book is, however, correct.
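
                            The identity is easy to verify numerically, and summing logs instead of multiplying raw probabilities is exactly what protects the classifier against arithmetic underflow (the probability values below are made up for illustration):

                            import numpy as np

                            p_c, p_f1, p_f2 = 0.5, 0.1, 0.25   # hypothetical P(C), P(F1|C), P(F2|C)

                            lhs = np.log(p_c * p_f1 * p_f2)                   # log of the product
                            rhs = np.log(p_c) + np.log(p_f1) + np.log(p_f2)   # sum of the logs

                            print(np.isclose(lhs, rhs))  # True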

                            Errata type: Code | Page number: 76

                            Last line in code:
                            corpus = corpora.BleiCorpus('./data/ap/ap.dat', '/data/ap/vocab.txt')

                            should be

                            corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt')

                            Errata type: Code | Page number: 14

                            In the code snippet at the bottom of the page, the array has five elements instead of six, so the results of the operations differ.

                            Errata type: Code | Page number: 71

                            z = (len(post[0]), post[0], dataset.target_names[post[1]]) for post in post_group

                            has a syntax error (a list comprehension needs enclosing brackets) and can be fixed like this:

                            z = [(len(post[0]), post[0], dataset.target_names[post[1]]) for post in post_group]

                            Errata type: Code | Page number: 111

                            last code line: precision_recall_curve(y_test, clf.predict(X_test)

                            Should be

                            precision_recall_curve(y_test, clf.predict(X_test))

                            Errata type: Code | Page number: 110

                            false negative (FN) -> true negative (TN)
                            true negative (TN) -> false negative (FN)

                            that instance is said to be a false negative

                            Should be

                            that instance is said to be a true negative
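
                            A tiny worked example of the distinction (the labels are invented; 1 = positive class, 0 = negative class):

                            y_true = [1, 0, 0, 1, 0]
                            y_pred = [1, 0, 1, 0, 0]

                            # true negative: actually negative AND predicted negative
                            tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
                            # false negative: actually positive BUT predicted negative
                            fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
                            print(tn, fn)  # 2 1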

                            Errata type: Code | Page number: 122

                            In the middle of page:

                            P(F2=1|C="pos") = 2/4 = 0.25

                            Should be:

                            P(F2=1|C="pos") = 2/4 = 0.5

                            Errata type: Code | Page number: 23

                            At the top of the page:

                            print(res) should be print(residuals)

                            Errata type: Code | Page number: 26

                            In the code snippet, the second-to-last line is:

                            print("Error inflection=%f" % (fa + fb_error))

                            when it should be the sum of the two errors, fa_error + fb_error, not a function plus an error.

                            Correct code is:

                            print("Error inflection=%f" % (fa_error + fb_error))

                            Errata type: Code | Page number: 36

                            The text reads:

                            Accuracy is simply the fraction of examples that the model classifies correctly:

                            best_acc = -1.0
                            for fi in xrange(features.shape[1]):
                              # We are going to generate all possible threshold for this feature
                              thresh = features[:,fi].copy()
                              thresh.sort()
                              # Now test all thresholds:
                              for t in thresh:
                                pred = (features[:,fi] > t)
                                acc = (pred == virginica).mean()
                                if acc > best_acc:
                                  best_acc = acc
                                  best_fi = fi
                                  best_t = t

                            The correct code should be:

                            best_acc = -1.0
                            best_fi = -1.0
                            best_t = -1.0

                            for fi in xrange(features.shape[1]):
                                # We are going to generate all possible thresholds for this feature
                                thresh = features[:, fi].copy()
                                thresh.sort()

                                # Now test all thresholds:
                                for t in thresh:
                                    pred = (features[:,fi] > t)
                                    acc = (labels[pred] == 'virginica').mean()
                                    if acc > best_acc:
                                        best_acc = acc
                                        best_fi = fi
                                        best_t = t
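
                            Once the loop finishes, best_fi and best_t define a one-feature threshold model. Applying it to a new example might look like this (a sketch under the assumption that example is a feature vector shaped like a row of features):

                            def apply_threshold_model(example):
                                # predict virginica when the selected feature exceeds the learned threshold
                                if example[best_fi] > best_t:
                                    return 'virginica'
                                return 'other'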

                            Errata type: Typo | Page number: 37

                            If we run it on the whole data, the best model that we get is split on the petal length.

                            Should be:

                            If we run it on the whole data, the best model that we get is split on the petal width.

                            Errata type: Code | Page number: 38 | Location 818

                            "Training error" should be "Training accurancy" 

                            The author has also updated the code available on github


                            Errata type: Technical | Page number: 123

                            The last element in the first column should read "text" instead of "awesome text". However, the value of F1 will change to "1" from "0", because the tweet contains "awesome" and F1 denotes the number of times "awesome" appears in the tweet text.
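
                            For illustration, such a count feature could be computed like this (a minimal sketch; the helper name is made up):

                            def count_awesome(tweet):
                                # F1: number of times "awesome" appears as a word in the tweet
                                return tweet.lower().split().count("awesome")

                            print(count_awesome("awesome text"))  # 1
                            print(count_awesome("text"))          # 0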

                            Errata type: Code | Page number: 35


                            The first line in the last code snippet should change from

                            if features[:,2] < 2: print 'Iris Setosa'

                            to

                            if fs[2] < 2: print 'Iris Setosa'

                            Errata type: Code | Page number: 174


                            In the first code block, line 5, use the following:

                            dataset = [[int(tok) for tok in line.strip().split()]
                                       for line in GzipFile('retail.dat.gz')]

                            instead of

                            dataset = [[int(tok) for tok in ,line.strip().split()]
                                       for line in GzipFile('retail.dat.gz')]

                            Errata type: Code | Page number: 177

                            Use the following:

                            for item in itemset:
                                consequent = frozenset([item])
                                antecendent = itemset - consequent
                                base = 0.0
                                # acount: antecedent count
                                acount = 0.0
                                ccount = 0.0

                            instead of

                            for item in itemset:
                                antecendent = itemset - consequent
                                base = 0.0
                                # acount: antecedent count
                                acount = 0.0

                            Errata type: Technical | Page number: 205


                            Use the following:

                            'Ridler Calvard' method

                            instead of

                            'Ridley Calvard' method

                            Errata type: Code | Page number: 175

                            new_itemsets = []
                            for iset in itemsets:
                                for v in valid:
                                    if v not in iset:
                                        newset = = (ell | set([v_]))
                                        c_newset = 0
                                        for d in dataset:
                                            if d.issuperset(c):
                                                c_newset += 1
                                        if c_newset > minsupport:
                                            newsets.append(newset)

                            should be:

                            new_itemsets = []
                            for iset in itemsets:
                                for v in valid:
                                    if v not in iset:
                                        newset = (iset | set([v]))
                                        c_newset = 0
                                        for d in dataset:
                                            if d.issuperset(newset):
                                                c_newset += 1
                                        if c_newset > minsupport:
                                            new_itemsets.append(newset)

                            Errata type: Technical | Page number: 188

                            Replace "only 4 out of 24 jazz songs have been correctly classified—that is only 16 percent"

                            with

                            "only 7 out of 24 jazz songs have been correctly classified—that is only 29 percent"

                            Errata type: Code | Page number: 67

                            correction:

                            vectorized = vectorizer.fit_transform(dataset.data)
                            should be
                            vectorized = vectorizer.fit_transform(train_data.data)

                            Errata type: Technical | Page number: 103, 104, 105

                            The graphs on these pages should depict the dashed line as "test error" and not "train error".

                            Errata type: Code | Page number: 223

                            Solution: use

                            >>> from scipy.stats import pearsonr

                            instead of

                            >>> from import scipy.stats import pearsonr

                            Errata type: Code | Page number: 209

                            use the following :
                            W = np.exp(-2.*(X**2+ Y**2))
                            # Normalize again to 0..1
                            W = W - W.min()
                            W = W / W.ptp()
                            W = W[:,:,None] # This adds a dummy third dimension to W

                            instead of

                            W = np.exp(-2.*(X**2+ Y**2))
                            # Normalize again to 0..1
                            W = W - C.min()
                            W = W / C.ptp()
                            W = C[:,:,None] # This adds a dummy third dimension to W
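
                            For context, a mask like W is typically used to blend a sharp image with a blurred copy so that the center stays in focus. A sketch under assumed names (lena as the grayscale input and blurred as a Gaussian-blurred copy; this is not the book's exact code):

                            import numpy as np
                            import mahotas as mh

                            lena = mh.imread('lenna.jpg', as_grey=True)
                            blurred = mh.gaussian_filter(lena, 8.)  # heavily blurred copy

                            # rebuild the center-weighted mask as above, on -1..1 coordinates
                            Y, X = np.mgrid[:lena.shape[0], :lena.shape[1]]
                            Y = Y / float(Y.max()) * 2 - 1
                            X = X / float(X.max()) * 2 - 1
                            W = np.exp(-2.*(X**2+ Y**2))
                            W = W - W.min()
                            W = W / W.ptp()
                            # (for a color image, add the dummy third dimension W[:,:,None] as above)

                            # weighted blend: sharp in the center, blurred toward the edges
                            output = lena * W + blurred * (1 - W)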

                            Errata type: Code | Page number: 194

                            On page 194, the indentation should be as follows (with X, Y corrected to X, y, per the erratum for the same page above):

                            def read_ceps(genre_list, base_dir=GENRE_DIR):
                                X, y = [], []
                                for label, genre in enumerate(genre_list):
                                    for fn in glob.glob(os.path.join(base_dir, genre, "*.ceps.npy")):
                                        ceps = np.load(fn)
                                        num_ceps = len(ceps)
                                        X.append(
                                            np.mean(ceps[int(num_ceps / 10):int(num_ceps * 9 / 10)], axis=0))
                                        y.append(label)
                                return np.array(X), np.array(y)

                            Errata type: Code | Page number: 168

                            use the following:

                            movie_likeness = np.zeros((nmovies,nmovies))


                            instead of

                            movie_likeness = np.zeros((nmovies,nmovies))
                            allms = np.ones(nmovies, bool)
                            cs = np.zeros(nmovies)

                            Errata type: Technical | Page number: 177

                            In the table, replace "13791" by "1379".

                            Errata type: Code | Page number: 162

                            On page 162, replace "return x, xl" with "return xc, xl".

                            Errata type: Technical | Page number: 215

                            The text should be

                            "In this dataset, however, the texture is not a clear marker of the class."

                            instead of

                            "In this dataset, however, the texture is a clear marker of the class."

                            Errata type: Code | Page number: 169

                            Use the following:

                            def nn_movie(movie_likeness, reviews, uid, mid):
                                likes = movie_likeness[mid].argsort()
                                # reverse the sorting so that most alike are in the beginning
                                likes = likes[::-1]
                                # returns the rating for the most similar movie available
                                for ell in likes:
                                    if reviews[uid, ell] > 0:
                                        return reviews[uid, ell]

                            instead of

                            def nn_movie(movie_likeness, reviews, uid, mid):
                                likes = movie_likeness[mid].argsort()
                                # reverse the sorting so that most alike are in the beginning
                                likes = likes[::-1]
                                # returns the rating for the most similar movie available
                                for ell in likes:
                                    if reviews[u, ell] > 0:
                                        return reviews[u, ell]

                            Errata type: Code | Page number: 254

                            On page 254, under the Running jug on our cloud machine section, the sentence "We can now download the data and code from Chapter 12, Computer Vision–Pattern Recognition Finding Related Posts, as follows" and the code snippet after it should be removed.

                            Errata type: Code | Page number: 247

                            The code should read:

                            @TaskGenerator
                            def label_for(f):
                                # 3 for "jpg", 1 for the dot, and 2 for the number
                                return f[:-(3+1+2)]

                            and then

                            labels = map(label_for, filenames)

                            Errata type: Technical | Page number: 31

                            Use "scikit-learn"

                            instead of "Scikits-learn"

                            Errata type: Code | Page number: 35

                            Use the following:

                            def apply_model(example):
                                if example[2] < 2: print 'Iris Setosa'
                                else: print 'Iris Virginica or Iris Versicolour'

                            instead of

                            if features[:,2] < 2: print 'Iris Setosa'
                            else: print 'Iris Virginica or Iris Versicolour'

                            Errata type: Typo | Page number: 92

                            Use "PostTypeId" instead of "PostType" in the table.

                            Errata type: Typo | Page number: 167

                            Use the following:
                            'NumPy ships with np.corrcoef, which computes correlations. '
                            instead of
                            'NumPy ships with np.corrcoeff, which computes correlations. '

                            Errata type: Code | Page number: 207

                            Use the following:

                            lena = mh.imread('lenna.jpg', as_grey=True)

                            instead of

                            im = mh.imread('lenna.jpg', as_grey=True)

                            Errata type: Typo | Page number: 107

                            Use the following:
                            'by replacing y with log(odds)'

                            instead of
                            'by replacing y with p'

                            Errata type: Code | Page number: 112

                            Use:

                            >>> medium = np.argsort(scores)[len(scores) / 2]
                            >>> thresholds = np.hstack(([0], thresholds[medium]))
                            >>> idx80 = precisions >= 0.8
                            >>> print("P=%.2f R=%.2f thresh=%.2f" % (precision[idx80][0], recall[idx80][0], threshold[idx80][0]))
                            P=0.81 R=0.37 thresh=0.63

                            instead of

                            >>> thresholds = np.hstack(([0], thresholds[medium]))
                            >>> idx80 = precisions >= 0.8
                            >>> print("P=%.2f R=%.2f thresh=%.2f" % (precision[idx80][0], recall[idx80][0], threshold[idx80][0]))
                            P=0.81 R=0.37 thresh=0.63


                            Errata type: Typo | Page number: 93

                            Use: "The PostTypeId attribute, for example, is necessary to distinguish between questions and answers. It will not be picked to serve as a feature, but we will need it to filter the data."

                            instead of

                            "The PostType attribute, for example, is only necessary to distinguish between questions and answers. Furthermore, we can distinguish between them later by checking for the ParentId attribute. So, we keep it for questions too, and set it to 1."

                            Errata type: Code | Page number: 99

                            The snippet printed in the book reads:

                            # which we don't want to count
                                link_count_in_code += len(link_match.findall(match_str))
                                links = link_match.findall(s)
                                link_count = len(links)
                                link_count -= link_count_in_code
                                html_free_s = re.sub(" +", " ", tag_match.sub('',
                                    code_free_s)).replace("\n", "")
                                link_free_s = html_free_s

                            The code snippet should be:

                            ...
                            links = link_match.findall(s)
                            link_count = len(links)
                            link_count -= link_count_in_code
                            html_free_s = re.sub(" +", " ",
                                tag_match.sub('', code_free_s)).replace("\n", "")
                            link_free_s = html_free_s

                            # remove links from text before counting words
                            for link in links:
                                if link.lower().startswith("http://"):
                                    link_free_s = link_free_s.replace(link, '')
                            ...

                            Errata type: Graphics | Page number: 131

                            The graph on this page should indicate the P/R AUC value as 67 percent.

                            Errata type: Typo | Page number: 207

                            Use the following:
                            'Only about 5 percent of these values will be true.'
                            instead of
                            'Only 1 percent of these values will be true.'

                            Sample chapters

                            You can view our sample chapters and prefaces of this title on PacktLib or download sample chapters in PDF format.


                            What you will learn from this book

                            • Build a classification system that can be applied to text, images, or sounds
                            • Use scikit-learn, a Python open-source library for machine learning
                            • Explore the mahotas library for image processing and computer vision
                            • Build a topic model of the whole of Wikipedia
                            • Get to grips with recommendations using basket analysis
                            • Use the Jug package for data analysis
                            • Employ Amazon Web Services to run analyses on the cloud
                            • Recommend products to users based on past purchases

                            In Detail

                            Machine learning, the field of building systems that learn from data, is exploding on the Web and elsewhere. Python is a wonderful language in which to develop machine learning applications. As a dynamic language, it allows for fast exploration and experimentation, and an increasing number of machine learning libraries are being developed for Python.

                            Building Machine Learning Systems with Python shows you exactly how to find patterns in raw data. The book starts by brushing up on your Python ML knowledge and introducing libraries, and then moves on to more serious projects on datasets, modeling, and recommendations, improving recommendations through examples, and working through sound and image processing in detail.

                            Using open-source tools and libraries, readers will learn how to apply methods to text, images, and sounds. You will also learn how to evaluate, compare, and choose machine learning techniques.

                            Written for Python programmers, Building Machine Learning Systems with Python teaches you how to use open-source libraries to solve real problems with machine learning. The book is based on real-world examples that the user can build on.

                            Readers will learn how to write programs that classify the quality of StackOverflow answers or whether a music file is Jazz or Metal. They will learn regression, demonstrated by recommending movies to users. Advanced topics such as topic modeling (finding a text's most important topics), basket analysis, and cloud computing are covered, as well as many other interesting aspects.

                            Building Machine Learning Systems with Python will give you the tools and understanding required to build your own systems, which are tailored to solve your problems.

                            Approach

                            A practical, scenario-based tutorial, this book will help you get to grips with machine learning with Python and start building your own machine learning projects. By the end of the book you will have learnt critical aspects of Python machine learning projects and experienced the power of ML-based systems by actually working on them.

                            Who this book is for

                            This book is for Python programmers who are beginners in machine learning and want to learn it. Readers are expected to know Python and to be able to install and use open-source libraries. They are not expected to know machine learning, although the book can also serve as an introduction to some Python libraries for readers who do. This book does not go into the detail of the mathematics behind the algorithms.

                            This book primarily targets Python developers who want to learn and build machine learning into their own projects, or who want to provide machine learning support to their existing projects, and see it implemented effectively.
