

Supervised learning

Packt
19 Dec 2014
50 min read
In this article by Dan Toomey, author of the book R for Data Science, we will learn about the supervised learning, which involves the use of a target variable and a number of predictor variables that are put into a model to enable the system to predict the target. This is also known as predictive modeling. (For more resources related to this topic, see here.) As mentioned, in supervised learning we have a target variable and a number of possible predictor variables. The objective is to associate the predictor variables in such a way so as to accurately predict the target variable. We are using some portion of observed data to learn how our model behaves and then testing that model on the remaining observations for accuracy. We will go over the following supervised learning techniques: Decision trees Regression Neural networks Instance based learning (k-NN) Ensemble learning Support vector machines Bayesian learning Bayesian inference Random forests Decision tree For decision tree machine learning, we develop a logic tree that can be used to predict our target value based on a number of predictor variables. The tree has logical points, such as if the month is December, follow the tree logic to the left; otherwise, follow the tree logic to the right. The last leaf of the tree has a predicted value. For this example, we will use the weather data in the rattle package. We will develop a decision tree to be used to determine whether it will rain tomorrow or not based on several variables. Let's load the rattle package as follows: > library(rattle) We can see a summary of the weather data. This shows that we have some real data over a year from Australia: > summary(weather)      Date                     Location     MinTemp     Min.   :2007-11-01   Canberra     :366   Min.   :-5.300 1st Qu.:2008-01-31   Adelaide     : 0   1st Qu.: 2.300 Median :2008-05-01   Albany       : 0   Median : 7.450 Mean   :2008-05-01   Albury       : 0   Mean   : 7.266 3rd Qu.:2008-07-31   AliceSprings : 0   3rd Qu.:12.500 Max.   :2008-10-31   BadgerysCreek: 0   Max.   :20.900                      (Other)     : 0                      MaxTemp         Rainfall       Evaporation       Sunshine     Min.   : 7.60   Min.   : 0.000   Min.  : 0.200   Min.   : 0.000 1st Qu.:15.03   1st Qu.: 0.000   1st Qu.: 2.200   1st Qu.: 5.950 Median :19.65   Median : 0.000   Median : 4.200   Median : 8.600 Mean   :20.55   Mean   : 1.428   Mean   : 4.522   Mean   : 7.909 3rd Qu.:25.50   3rd Qu.: 0.200   3rd Qu.: 6.400   3rd Qu.:10.500 Max.   :35.80   Max.   :39.800   Max.   :13.800   Max.   :13.600                                                    NA's   :3       WindGustDir   WindGustSpeed   WindDir9am   WindDir3pm NW     : 73   Min.   :13.00   SE     : 47   WNW   : 61 NNW   : 44   1st Qu.:31.00   SSE   : 40   NW     : 61 E     : 37   Median :39.00   NNW   : 36   NNW   : 47 WNW   : 35   Mean   :39.84   N     : 31   N     : 30 ENE   : 30   3rd Qu.:46.00   NW     : 30   ESE   : 27 (Other):144   Max.   :98.00   (Other):151   (Other):139 NA's   : 3   NA's   :2       NA's   : 31   NA's   : 1 WindSpeed9am     WindSpeed3pm   Humidity9am     Humidity3pm   Min.   : 0.000   Min.   : 0.00   Min.   :36.00   Min.   :13.00 1st Qu.: 6.000   1st Qu.:11.00   1st Qu.:64.00   1st Qu.:32.25 Median : 7.000   Median :17.00   Median :72.00   Median :43.00 Mean   : 9.652   Mean   :17.99   Mean   :72.04   Mean   :44.52 3rd Qu.:13.000   3rd Qu.:24.00   3rd Qu.:81.00   3rd Qu.:55.00 Max.   :41.000   Max.   :52.00   Max.   :99.00   Max.   
:96.00 NA's   :7                                                       Pressure9am     Pressure3pm       Cloud9am       Cloud3pm   Min.   : 996.5   Min.   : 996.8   Min.   :0.000   Min.   :0.000 1st Qu.:1015.4   1st Qu.:1012.8   1st Qu.:1.000   1st Qu.:1.000 Median :1020.1   Median :1017.4   Median :3.500   Median :4.000 Mean   :1019.7   Mean   :1016.8   Mean   :3.891   Mean   :4.025 3rd Qu.:1024.5   3rd Qu.:1021.5   3rd Qu.:7.000   3rd Qu.:7.000 Max.   :1035.7   Max.   :1033.2   Max.   :8.000   Max.   :8.000 Temp9am         Temp3pm         RainToday RISK_MM Min.   : 0.100   Min.   : 5.10   No :300   Min.   : 0.000 1st Qu.: 7.625   1st Qu.:14.15   Yes: 66   1st Qu.: 0.000 Median :12.550   Median :18.55             Median : 0.000 Mean   :12.358   Mean   :19.23             Mean   : 1.428 3rd Qu.:17.000   3rd Qu.:24.00             3rd Qu.: 0.200 Max.   :24.700   Max.   :34.50           Max.   :39.800                                                            RainTomorrow No :300     Yes: 66       We will be using the rpart function to develop a decision tree. The rpart function looks like this: rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...) The various parameters of the rpart function are described in the following table: Parameter Description formula This is the formula used for the prediction. data This is the data matrix. weights These are the optional weights to be applied. subset This is the optional subset of rows of data to be used. na.action This specifies the action to be taken when y, the target value, is missing. method This is the method to be used to interpret the data. It should be one of these: anova, poisson, class, or exp. If not specified, the algorithm decides based on the layout of the data. … These are the additional parameters to be used to control the behavior of the algorithm.  
Let's create a subset as follows: > weather2 <- subset(weather,select=-c(RISK_MM)) > install.packages("rpart") >library(rpart) > model <- rpart(formula=RainTomorrow ~ .,data=weather2, method="class") > summary(model) Call: rpart(formula = RainTomorrow ~ ., data = weather2, method = "class") n= 366   CPn split       rel error     xerror   xstd 1 0.19696970     0 1.0000000 1.0000000 0.1114418 2 0.09090909      1 0.8030303 0.9696970 0.1101055 3 0.01515152     2 0.7121212 1.0151515 0.1120956 4 0.01000000     7 0.6363636 0.9090909 0.1073129   Variable importance Humidity3pm WindGustSpeed     Sunshine WindSpeed3pm       Temp3pm            24           14          12             8             6 Pressure3pm       MaxTemp       MinTemp   Pressure9am       Temp9am            6             5             4             4             4 Evaporation         Date   Humidity9am     Cloud3pm     Cloud9am             3             3             2             2             1      Rainfall            1 Node number 1: 366 observations,   complexity param=0.1969697 predicted class=No   expected loss=0.1803279 P(node) =1    class counts:   300   66    probabilities: 0.820 0.180 left son=2 (339 obs) right son=3 (27 obs) Primary splits:    Humidity3pm < 71.5   to the left, improve=18.31013, (0 missing)    Pressure3pm < 1011.9 to the right, improve=17.35280, (0 missing)    Cloud3pm   < 6.5     to the left, improve=16.14203, (0 missing)    Sunshine   < 6.45   to the right, improve=15.36364, (3 missing)    Pressure9am < 1016.35 to the right, improve=12.69048, (0 missing) Surrogate splits:    Sunshine < 0.45   to the right, agree=0.945, adj=0.259, (0 split) (many more)… As you can tell, the model is complicated. The summary shows the progression of the model development using more and more of the data to fine-tune the tree. We will be using the rpart.plot package to display the decision tree in a readable manner as follows: > library(rpart.plot) > fancyRpartPlot(model,main="Rain Tomorrow",sub="Chapter 12") This is the output of the fancyRpartPlot function Now, we can follow the logic of the decision tree easily. For example, if the humidity is over 72, we are predicting it will rain. Regression We can use a regression to predict our target value by producing a regression model from our predictor variables. We will be using the forest fire data from http://archive.ics.uci.edu. We will load the data and get the following summary: > forestfires <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv") > summary(forestfires)        X               Y           month     day         FFMC     Min.   :1.000   Min.   :2.0   aug   :184   fri:85 Min.   :18.70 1st Qu.:3.000   1st Qu.:4.0   sep   :172   mon:74   1st Qu.:90.20 Median :4.000   Median :4.0   mar   : 54   sat:84   Median :91.60 Mean   :4.669   Mean   :4.3   jul   : 32   sun:95   Mean   :90.64 3rd Qu.:7.000   3rd Qu.:5.0  feb   : 20   thu:61   3rd Qu.:92.90 Max.   :9.000   Max.   :9.0   jun   : 17   tue:64   Max.   :96.20                                (Other): 38   wed:54                      DMC             DC             ISI             temp     Min.   : 1.1   Min.   : 7.9   Min.   : 0.000   Min.   : 2.20 1st Qu.: 68.6   1st Qu.:437.7   1st Qu.: 6.500   1st Qu.:15.50 Median :108.3   Median :664.2   Median : 8.400   Median :19.30 Mean   :110.9   Mean   :547.9   Mean   : 9.022   Mean   :18.89 3rd Qu.:142.4   3rd Qu.:713.9   3rd Qu.:10.800   3rd Qu.:22.80 Max.   :291.3   Max.   :860.6   Max.   :56.100   Max.   
:33.30                                                                         RH             wind           rain             area       Min.   : 15.00   Min.   :0.400   Min.   :0.00000   Min.   :   0.00 1st Qu.: 33.00   1st Qu.:2.700   1st Qu.:0.00000   1st Qu.:   0.00 Median : 42.00   Median :4.000   Median :0.00000   Median :   0.52 Mean   : 44.29   Mean   :4.018   Mean   :0.02166   Mean   : 12.85 3rd Qu.: 53.00   3rd Qu.:4.900   3rd Qu.:0.00000   3rd Qu.:   6.57 Max.   :100.00   Max.   :9.400   Max.   :6.40000   Max.   :1090.84 I will just use the month, temperature, wind, and rain data to come up with a model of the area (size) of the fires using the lm function. The lm function looks like this: lm(formula, data, subset, weights, na.action,    method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,    singular.ok = TRUE, contrasts = NULL, offset, ...) The various parameters of the lm function are described in the following table: Parameter Description formula This is the formula to be used for the model data This is the dataset subset This is the subset of dataset to be used weights These are the weights to apply to factors … These are the additional parameters to be added to the function Let's load the data as follows: > model <- lm(formula = area ~ month + temp + wind + rain, data=forestfires) Looking at the generated model, we see the following output: > summary(model) Call: lm(formula = area ~ month + temp + wind + rain, data = forestfires) Residuals:    Min     1Q Median     3Q     Max -33.20 -14.93   -9.10   -1.66 1063.59 Coefficients:            Estimate Std. Error t value Pr(>|t|) (Intercept) -17.390     24.532 -0.709   0.4787 monthaug     -10.342     22.761 -0.454   0.6498 monthdec     11.534     30.896   0.373   0.7091 monthfeb       2.607     25.796   0.101   0.9196 monthjan       5.988     50.493   0.119   0.9056 monthjul     -8.822    25.068 -0.352   0.7251 monthjun     -15.469     26.974 -0.573   0.5666 monthmar     -6.630     23.057 -0.288   0.7738 monthmay       6.603     50.053   0.132   0.8951 monthnov     -8.244     67.451 -0.122   0.9028 monthoct     -8.268    27.237 -0.304   0.7616 monthsep     -1.070     22.488 -0.048   0.9621 temp           1.569     0.673   2.332   0.0201 * wind           1.581     1.711   0.924   0.3557 rain         -3.179     9.595 -0.331   0.7406 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1   Residual standard error: 63.99 on 502 degrees of freedom Multiple R-squared: 0.01692, Adjusted R-squared: -0.0105 F-statistic: 0.617 on 14 and 502 DF, p-value: 0.8518 Surprisingly, the month has a significant effect on the size of the fires. I would have guessed that whether or not the fires occurred in August or similar months would have effected any discernable difference. Also, the temperature has such a minimal effect. Further, the model is using the month data as categorical. If we redevelop the model (without temperature), we have a better fit (notice the multiple R-squared value drops to 0.006 from 0.01), as shown here: > model <- lm(formula = area ~ month + wind + rain, data=forestfires) > summary(model)   Call: lm(formula = area ~ month + wind + rain, data = forestfires)   Residuals:    Min     1Q Median     3Q     Max -22.17 -14.39 -10.46   -3.87 1072.43   Coefficients:           Estimate Std. 
Error t value Pr(>|t|) (Intercept)   4.0126   22.8496   0.176   0.861 monthaug     4.3132   21.9724   0.196   0.844 monthdec     1.3259   30.7188   0.043   0.966 monthfeb     -1.6631   25.8441 -0.064   0.949 monthjan     -6.1034   50.4475 -0.121   0.904 monthjul     6.4648   24.3021   0.266   0.790 monthjun     -2.4944   26.5099 -0.094   0.925 monthmar     -4.8431   23.1458 -0.209   0.834 monthmay     10.5754   50.2441   0.210   0.833 monthnov     -8.7169   67.7479 -0.129   0.898 monthoct     -0.9917   27.1767 -0.036   0.971 monthsep     10.2110   22.0579   0.463   0.644 wind         1.0454     1.7026   0.614   0.540 rain         -1.8504     9.6207 -0.192   0.848   Residual standard error: 64.27 on 503 degrees of freedom Multiple R-squared: 0.006269, Adjusted R-squared: -0.01941 F-statistic: 0.2441 on 13 and 503 DF, p-value: 0.9971 From the results, we can see R-squared of close to 0 and p-value almost 1; this is a very good fit. If you plot the model, you will get a series of graphs. The plot of the residuals versus fitted values is the most revealing, as shown in the following graph: > plot(model) You can see from the graph that the regression model is very accurate: Neural network In a neural network, it is assumed that there is a complex relationship between the predictor variables and the target variable. The network allows the expression of each of these relationships. For this model, we will use the liver disorder data from http://archive.ics.uci.edu. The data has a few hundred observations from patients with liver disorders. The variables are various measures of blood for each patient as shown here: > bupa <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data") > colnames(bupa) <- c("mcv","alkphos","alamine","aspartate","glutamyl","drinks","selector") > summary(bupa)      mcv           alkphos         alamine     Min.   : 65.00   Min.   : 23.00   Min.   : 4.00 1st Qu.: 87.00   1st Qu.: 57.00   1st Qu.: 19.00 Median : 90.00   Median : 67.00   Median : 26.00 Mean   : 90.17   Mean   : 69.81   Mean   : 30.36 3rd Qu.: 93.00   3rd Qu.: 80.00   3rd Qu.: 34.00 Max.   :103.00   Max.   :138.00   Max.   :155.00    aspartate       glutamyl         drinks     Min.   : 5.00   Min.   : 5.00   Min.  : 0.000 1st Qu.:19.00   1st Qu.: 15.00   1st Qu.: 0.500 Median :23.00   Median : 24.50   Median : 3.000 Mean   :24.64   Mean   : 38.31   Mean   : 3.465 3rd Qu.:27.00   3rd Qu.: 46.25   3rd Qu.: 6.000 Max.   :82.00   Max.   :297.00   Max. :20.000    selector   Min.   :1.000 1st Qu.:1.000 Median :2.000 Mean   :1.581 3rd Qu.:2.000 Max.   :2.000 We generate a neural network using the neuralnet function. The neuralnet function looks like this: neuralnet(formula, data, hidden = 1, threshold = 0.01,                stepmax = 1e+05, rep = 1, startweights = NULL,          learningrate.limit = NULL,          learningrate.factor = list(minus = 0.5, plus = 1.2),          learningrate=NULL, lifesign = "none",          lifesign.step = 1000, algorithm = "rprop+",          err.fct = "sse", act.fct = "logistic",          linear.output = TRUE, exclude = NULL,          constant.weights = NULL, likelihood = FALSE) The various parameters of the neuralnet function are described in the following table: Parameter Description formula This is the formula to converge. data This is the data matrix of predictor values. hidden This is the number of hidden neurons in each layer. stepmax This is the maximum number of steps in each repetition. Default is 1+e5. 
rep This is the number of repetitions. Let's generate the neural network as follows: > nn <- neuralnet(selector~mcv+alkphos+alamine+aspartate+glutamyl+drinks, data=bupa, linear.output=FALSE, hidden=2) We can see how the model was developed via the result.matrix variable in the following output: > nn$result.matrix                                      1 error                 100.005904355153 reached.threshold       0.005904330743 steps                 43.000000000000 Intercept.to.1layhid1   0.880621509705 mcv.to.1layhid1       -0.496298308044 alkphos.to.1layhid1     2.294158313786 alamine.to.1layhid1     1.593035613921 aspartate.to.1layhid1 -0.407602506759 glutamyl.to.1layhid1   -0.257862634340 drinks.to.1layhid1     -0.421390527261 Intercept.to.1layhid2   0.806928998059 mcv.to.1layhid2       -0.531926150470 alkphos.to.1layhid2     0.554627946150 alamine.to.1layhid2     1.589755874579 aspartate.to.1layhid2 -0.182482440722 glutamyl.to.1layhid2   1.806513419058 drinks.to.1layhid2     0.215346602241 Intercept.to.selector   4.485455617018 1layhid.1.to.selector   3.328527160621 1layhid.2.to.selector   2.616395644587 The process took 43 steps to come up with the neural network once the threshold was under 0.01 (0.005 in this case). You can see the relationships between the predictor values. Looking at the network developed, we can see the hidden layers of relationship among the predictor variables. For example, sometimes mcv combines at one ratio and on other times at another ratio, depending on its value. Let's load the neural network as follows: > plot(nn) Instance-based learning R programming has a nearest neighbor algorithm (k-NN). The k-NN algorithm takes the predictor values and organizes them so that a new observation is applied to the organization developed and the algorithm selects the result (prediction) that is most applicable based on nearness of the predictor values in the new observation. The nearest neighbor function is knn. The knn function call looks like this: knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE) The various parameters of the knn function are described in the following table: Parameter Description train This is the training data. test This is the test data. cl This is the factor of true classifications. k This is the Number of neighbors to consider. l This is the minimum vote for a decision. prob This is a Boolean flag to return proportion of winning votes. use.all This is a Boolean variable for tie handling. TRUE means use all votes of max distance I am using the auto MPG dataset in the example of using knn. First, we load the dataset : > data <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", na.string="?") > colnames(data) <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","model.year","origin","car.name") > summary(data)      mpg         cylinders     displacement     horsepower Min.   : 9.00  Min.   :3.000   Min.   : 68.0   150   : 22 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   90     : 20 Median :23.00   Median :4.000   Median :148.5   88     : 19 Mean   :23.51   Mean   :5.455   Mean   :193.4   110   : 18 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100   : 17 Max.   :46.60   Max.   :8.000   Max.   :455.0   75     : 14                                                  (Other):288      weight     acceleration     model.year       origin     Min.   :1613   Min. : 8.00   Min.   :70.00   Min.   
:1.000 1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000 Median :2804   Median :15.50   Median :76.00   Median :1.000 Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573 3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000 Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000                                                                           car.name ford pinto   : 6 amc matador   : 5 ford maverick : 5 toyota corolla: 5 amc gremlin   : 4 amc hornet   : 4 (Other)       :369   There are close to 400 observations in the dataset. We need to split the data into a training set and a test set. We will use 75 percent for training. We use the createDataPartition function in the caret package to select the training rows. Then, we create a test dataset and a training dataset using the partitions as follows: > library(caret) > training <- createDataPartition(data$mpg, p=0.75, list=FALSE) > trainingData <- data[training,] > testData <- data[-training,] > model <- knn(train=trainingData, test=testData, cl=trainingData$mpg) NAs introduced by coercion The error message means that some numbers in the dataset have a bad format. The bad numbers were automatically converted to NA values. Then the inclusion of the NA values caused the function to fail, as NA values are not expected in this function call. First, there are some missing items in the dataset loaded. We need to eliminate those NA values as follows: > completedata <- data[complete.cases(data),] After looking over the data several times, I guessed that the car name fields were being parsed as numerical data when there was a number in the name, such as Buick Skylark 320. I removed the car name column from the test and we end up with the following valid results; > drops <- c("car.name") > completeData2 <- completedata[,!(names(completedata) %in% drops)] > training <- createDataPartition(completeData2$mpg, p=0.75, list=FALSE) > trainingData <- completeData2[training,] > testData <- completeData2[-training,] > model <- knn(train=trainingData, test=testData, cl=trainingData$mpg) We can see the results of the model by plotting using the following command. However, the graph doesn't give us much information to work on. > plot(model) We can use a different kknn function to compare our model with the test data. I like this version a little better as you can plainly specify the formula for the model. Let's use the kknn function as follows: > library(kknn) > model <- kknn(formula = formula(mpg~.), train = trainingData, test = testData, k = 3, distance = 1) > fit <- fitted(model) > plot(testData$mpg, fit) > abline(a=0, b=1, col=3) I added a simple slope to highlight how well the model fits the training data. It looks like as we progress to higher MPG values, our model has a higher degree of variance. I think that means we are missing predictor variables, especially for the later model, high MPG series of cars. That would make sense as government mandate and consumer demand for high efficiency vehicles changed the mpg for vehicles. Here is the graph generated by the previous code: Ensemble learning Ensemble learning is the process of using multiple learning methods to obtain better predictions. For example, we could use a regression and k-NN, combine the results, and end up with a better prediction. We could average the results of both or provide heavier weight towards one or another of the algorithms, whichever appears to be a better predictor. 
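A minimal sketch of this idea in R, assuming the trainingData and testData frames from the k-NN example above are still loaded (the choice of predictors and the equal 50/50 weights are illustrative only, not a listing from the book), could look like this: > form <- mpg ~ cylinders + displacement + weight + acceleration + model.year > lm.model <- lm(form, data = trainingData)   # base learner 1: linear regression > lm.pred <- predict(lm.model, newdata = testData) > knn.model <- kknn(form, train = trainingData, test = testData, k = 3)   # base learner 2: k-NN > knn.pred <- fitted(knn.model) > ensemble.pred <- 0.5 * lm.pred + 0.5 * knn.pred   # equal-weight average of the two predictions > mean((testData$mpg - ensemble.pred)^2)   # ensemble mean squared error > mean((testData$mpg - lm.pred)^2)         # regression alone, for comparison > mean((testData$mpg - knn.pred)^2)        # k-NN alone, for comparison Comparing the three mean squared errors shows whether the blend beats either base learner on its own; if one model is consistently stronger, shifting more weight towards it (or learning the weights on a validation set) is the usual next step.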
Support vector machines We covered support vector machines (SVM), but I will run through an example here. As a reminder, SVM is concerned with binary data. We will use the spam dataset from Hewlett Packard (part of the kernlab package). First, let's load the data as follows: > library(kernlab) > data("spam") > summary(spam)      make           address           all             num3d         Min.   :0.0000   Min.   : 0.000   Min.   :0.0000   Min.   : 0.00000 1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.: 0.00000 Median :0.0000   Median : 0.000   Median :0.0000   Median : 0.00000 Mean   :0.1046   Mean   : 0.213   Mean   :0.2807   Mean   : 0.06542 3rd Qu.:0.0000   3rd Qu.: 0.000   3rd Qu.:0.4200   3rd Qu.: 0.00000 Max.   :4.5400   Max.   :14.280   Max.   :5.1000   Max.   :42.81000 … There are 58 variables with close to 5000 observations, as shown here: > table(spam$type) nonspam   spam    2788   1813 Now, we break up the data into a training set and a test set as follows: > index <- 1:nrow(spam) > testindex <- sample(index, trunc(length(index)/3)) > testset <- spam[testindex,] > trainingset <- spam[-testindex,] Now, we can produce our SVM model using the svm function. The svm function looks like this: svm(formula, data = NULL, ..., subset, na.action =na.omit, scale = TRUE) The various parameters of the svm function are described in the following table: Parameter Description formula This is the formula model data This is the dataset subset This is the subset of the dataset to be used na.action This contains what action to take with NA values scale This determines whether to scale the data Let's use the svm function to produce a SVM model as follows: > library(e1071) > model <- svm(type ~ ., data = trainingset, method = "C-classification", kernel = "radial", cost = 10, gamma = 0.1) > summary(model) Call: svm(formula = type ~ ., data = trainingset, method = "C-classification",    kernel = "radial", cost = 10, gamma = 0.1) Parameters:    SVM-Type: C-classification SVM-Kernel: radial        cost: 10      gamma: 0.1 Number of Support Vectors: 1555 ( 645 910 ) Number of Classes: 2 Levels: nonspam spam We can test the model against our test dataset and look at the results as follows: > pred <- predict(model, testset) > table(pred, testset$type) pred     nonspam spam nonspam     891 104 spam         38 500 Note, the e1071 package is not compatible with the current version of R. Given its usefulness I would expect the package to be updated to support the user base. So, using SVM, we have a 90 percent ((891+500) / (891+104+38+500)) accuracy rate of prediction. Bayesian learning With Bayesian learning, we have an initial premise in a model that is adjusted with new information. We can use the MCMCregress method in the MCMCpack package to use Bayesian regression on learning data and apply the model against test data. Let's load the MCMCpack package as follows: > install.packages("MCMCpack") > library(MCMCpack) We are going to be using the transplant data on transplants available at http://lib.stat.cmu.edu/datasets/stanford. (The dataset on the site is part of the web page, so I copied into a local CSV file.) The data shows expected transplant success factor, the actual transplant success factor, and the number of transplants over a time period. So, there is a good progression over time as to the success of the program. We can read the dataset as follows: > transplants <- read.csv("transplant.csv") > summary(transplants)    expected         actual       transplants   Min.   : 0.057   Min.   
: 0.000   Min.   : 1.00 1st Qu.: 0.722   1st Qu.: 0.500   1st Qu.: 9.00 Median : 1.654   Median : 2.000   Median : 18.00 Mean   : 2.379   Mean   : 2.382   Mean   : 27.83 3rd Qu.: 3.402   3rd Qu.: 3.000   3rd Qu.: 40.00 Max.   :12.131   Max.   :18.000   Max.   :152.00 We use Bayesian regression against the data— note that we are modifying the model as we progress with new information using the MCMCregress function. The MCMCregress function looks like this: MCMCregress(formula, data = NULL, burnin = 1000, mcmc = 10000,    thin = 1, verbose = 0, seed = NA, beta.start = NA,    b0 = 0, B0 = 0, c0 = 0.001, d0 = 0.001, sigma.mu = NA, sigma.var = NA,    marginal.likelihood = c("none", "Laplace", "Chib95"), ...) The various parameters of the MCMCregress function are described in the following table: Parameter Description formula This is the formula of model data This is the dataset to be used for model … These are the additional parameters for the function Let's use the Bayesian regression against the data as follows: > model <- MCMCregress(expected ~ actual + transplants, data=transplants) > summary(model) Iterations = 1001:11000 Thinning interval = 1 Number of chains = 1 Sample size per chain = 10000 1. Empirical mean and standard deviation for each variable,    plus standard error of the mean:                Mean     SD Naive SE Time-series SE (Intercept) 0.00484 0.08394 0.0008394     0.0008388 actual     0.03413 0.03214 0.0003214     0.0003214 transplants 0.08238 0.00336 0.0000336     0.0000336 sigma2     0.44583 0.05698 0.0005698     0.0005857 2. Quantiles for each variable:                2.5%     25%     50%     75%   97.5% (Intercept) -0.15666 -0.05216 0.004786 0.06092 0.16939 actual     -0.02841 0.01257 0.034432 0.05541 0.09706 transplants 0.07574 0.08012 0.082393 0.08464 0.08890 sigma2       0.34777 0.40543 0.441132 0.48005 0.57228 The plot of the data shows the range of results, as shown in the following graph. Look at this in contrast to a simple regression with one result. > plot(model) Random forests Random forests is an algorithm that constructs a multitude of decision trees for the model of the data and selects the best of the lot as the final result. We can use the randomForest function in the kernlab package for this function. The randomForest function looks like this: randomForest(formula, data=NULL, ..., subset, na.action=na.fail) The various parameters of the randomForest function are described in the following table: Parameter Description formula This is the formula of model data This is the dataset to be used subset This is the subset of the dataset to be used na.action This is the action to take with NA values For an example of random forest, we will use the spam data, as in the section Support vector machines. First, let's load the package and library as follows: > install.packages("randomForest") > library(randomForest) Now, we will generate the model with the following command (this may take a while): > fit <- randomForest(type ~ ., data=spam) Let's look at the results to see how it went: > fit Call: randomForest(formula = type ~ ., data = spam)                Type of random forest: classification                      Number of trees: 500 No. 
of variables tried at each split: 7        OOB estimate of error rate: 4.48% Confusion matrix:         nonspam spam class.error nonspam   2713   75 0.02690100 spam       131 1682 0.07225593 We can look at the relative importance of the data variables in the final model, as shown here: > head(importance(fit))        MeanDecreaseGini make           7.967392 address       12.654775 all           25.116662 num3d           1.729008 our           67.365754 over           17.579765 Ordering the data shows a couple of the factors to be critical to the determination. For example, the presence of the exclamation character in the e-mail is shown as a dominant indicator of spam mail: charExclamation   256.584207 charDollar       200.3655348 remove           168.7962949 free              142.8084662 capitalAve       137.1152451 capitalLong       120.1520829 your             116.6134519 Unsupervised learning With unsupervised learning, we do not have a target variable. We have a number of predictor variables that we look into to determine if there is a pattern. We will go over the following unsupervised learning techniques: Cluster analysis Density estimation Expectation-maximization algorithm Hidden Markov models Blind signal separation Cluster analysis Cluster analysis is the process of organizing data into groups (clusters) that are similar to each other. For our example, we will use the wheat seed data available at http://www.uci.edu, as shown here: > wheat <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", sep="t") Let's look at the raw data: > head(wheat) X15.26 X14.84 X0.871 X5.763 X3.312 X2.221 X5.22 X1 1 14.88 14.57 0.8811 5.554 3.333 1.018 4.956 1 2 14.29 14.09 0.9050 5.291 3.337 2.699 4.825 1 3 13.84 13.94 0.8955 5.324 3.379 2.259 4.805 1 4 16.14 14.99 0.9034 5.658 3.562 1.355 5.175 1 5 14.38 14.21 0.8951 5.386 3.312 2.462 4.956 1 6 14.69 14.49 0.8799 5.563 3.259 3.586 5.219 1 We need to apply column names so we can see the data better: > colnames(wheat) <- c("area", "perimeter", "compactness", "length", "width", "asymmetry", "groove", "undefined") > head(wheat)    area perimeter compactness length width asymmetry groove undefined 1 14.88     14.57     0.8811 5.554 3.333     1.018 4.956         1 2 14.29     14.09     0.9050 5.291 3.337     2.699 4.825         1 3 13.84     13.94     0.8955 5.324 3.379     2.259 4.805         1 4 16.14     14.99     0.9034 5.658 3.562     1.355 5.175         1 5 14.38     14.21     0.8951 5.386 3.312     2.462 4.956         1 6 14.69     14.49     0.8799 5.563 3.259     3.586 5.219         1 The last column is not defined in the data description, so I am removing it: > wheat <- subset(wheat, select = -c(undefined) ) > head(wheat)    area perimeter compactness length width asymmetry groove 1 14.88     14.57     0.8811 5.554 3.333     1.018 4.956 2 14.29     14.09     0.9050 5.291 3.337     2.699 4.825 3 13.84     13.94     0.8955 5.324 3.379     2.259 4.805 4 16.14     14.99     0.9034 5.658 3.562     1.355 5.175 5 14.38     14.21     0.8951 5.386 3.312     2.462 4.956 6 14.69    14.49     0.8799 5.563 3.259     3.586 5.219 Now, we can finally produce the cluster using the kmeans function. 
The kmeans function looks like this: kmeans(x, centers, iter.max = 10, nstart = 1,        algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",                      "MacQueen"), trace=FALSE) The various parameters of the kmeans function are described in the following table: Parameter Description x This is the dataset centers This is the number of centers to coerce data towards … These are the additional parameters of the function Let's produce the cluster using the kmeans function: > fit <- kmeans(wheat, 5) Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1) Unfortunately, there are some rows with missing data, so let's fix this using the following command: > wheat <- wheat[complete.cases(wheat),] Let's look at the data to get some idea of the factors using the following command: > plot(wheat) If we try looking at five clusters, we end up with a fairly good set of clusters with an 85 percent fit, as shown here: > fit <- kmeans(wheat, 5) > fit K-means clustering with 5 clusters of sizes 29, 33, 56, 69, 15 Cluster means:      area perimeter compactness   length   width asymmetry   groove 1 16.45345 15.35310   0.8768000 5.882655 3.462517 3.913207 5.707655 2 18.95455 16.38879   0.8868000 6.247485 3.744697 2.723545 6.119455 3 14.10536 14.20143   0.8777750 5.480214 3.210554 2.368075 5.070000 4 11.94870 13.27000   0.8516652 5.229304 2.870101 4.910145 5.093333 5 19.58333 16.64600   0.8877267 6.315867 3.835067 5.081533 6.144400 Clustering vector: ... Within cluster sum of squares by cluster: [1] 48.36785 30.16164 121.63840 160.96148 25.81297 (between_SS / total_SS = 85.4 %) If we push to 10 clusters, the performance increases to 92 percent. Density estimation Density estimation is used to provide an estimate of the probability density function of a random variable. For this example, we will use sunspot data from Vincent arlbuck site. Not clear if sunspots are truly random. Let's load our data as follows: > sunspots <- read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.month.csv") > summary(sunspots)        X             time     sunspot.month   Min.   :   1   Min.   :1749   Min.   : 0.00 1st Qu.: 795   1st Qu.:1815   1st Qu.: 15.70 Median :1589   Median :1881   Median : 42.00 Mean   :1589   Mean   :1881   Mean   : 51.96 3rd Qu.:2383   3rd Qu.:1948   3rd Qu.: 76.40 Max.   :3177   Max.   :2014   Max.   :253.80 > head(sunspots) X     time sunspot.month 1 1 1749.000         58.0 2 2 1749.083         62.6 3 3 1749.167         70.0 4 4 1749.250         55.7 5 5 1749.333         85.0 6 6 1749.417        83.5 We will now estimate the density using the following command: > d <- density(sunspots$sunspot.month) > d Call: density.default(x = sunspots$sunspot.month) Data: sunspots$sunspot.month (3177 obs.); Bandwidth 'bw' = 7.916        x               y           Min.   :-23.75   Min.   :1.810e-07 1st Qu.: 51.58   1st Qu.:1.586e-04 Median :126.90   Median :1.635e-03 Mean   :126.90   Mean   :3.316e-03 3rd Qu.:202.22   3rd Qu.:5.714e-03 Max.   :277.55   Max.   :1.248e-02 A plot is very useful for this function, so let's generate one using the following command: > plot(d) It is interesting to see such a wide variation; maybe the data is pretty random after all. We can use the density to estimate additional periods as follows: > N<-1000 > sunspots.new <- rnorm(N, sample(sunspots$sunspot.month, size=N, replace=TRUE)) > lines(density(sunspots.new), col="blue") It looks like our density estimate is very accurate. 
Expectation-maximization Expectation-maximization (EM) is an unsupervised clustering approach that adjusts the data for optimal values. When using EM, we have to have some preconception of the shape of the data/model that will be targeted. This example reiterates the example on the Wikipedia page, with comments. The example tries to model the iris species from the other data points. Let's load the data as shown here: > iris <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data") > colnames(iris) <- c("SepalLength","SepalWidth","PetalLength","PetalWidth","Species") > modelName = "EEE" Each observation has sepal length, width, petal length, width, and species, as shown here: > head(iris) SepalLength SepalWidth PetalLength PetalWidth     Species 1         5.1       3.5         1.4       0.2 Iris-setosa 2         4.9       3.0         1.4       0.2 Iris-setosa 3         4.7       3.2         1.3       0.2 Iris-setosa 4         4.6       3.1         1.5       0.2 Iris-setosa 5         5.0       3.6         1.4       0.2 Iris-setosa 6         5.4       3.9         1.7       0.4 Iris-setosa We are estimating the species from the other points, so let's separate the data as follows: > data = iris[,-5] > z = unmap(iris[,5]) Let's set up our mstep for EM, given the data, categorical data (z) relating to each data point, and our model type name: > msEst <- mstep(modelName, data, z) We use the parameters defined in the mstep to produce our model, as shown here: > em(modelName, data, msEst$parameters) $z                [,1]         [,2]         [,3] [1,] 1.000000e+00 4.304299e-22 1.699870e-42 … [150,] 8.611281e-34 9.361398e-03 9.906386e-01 $parameters$pro [1] 0.3333333 0.3294048 0.3372619 $parameters$mean              [,1]     [,2]     [,3] SepalLength 5.006 5.941844 6.574697 SepalWidth 3.418 2.761270 2.980150 PetalLength 1.464 4.257977 5.538926 PetalWidth 0.244 1.319109 2.024576 $parameters$variance$d [1] 4 $parameters$variance$G [1] 3 $parameters$variance$sigma , , 1            SepalLength SepalWidth PetalLength PetalWidth SepalLength 0.26381739 0.09030470 0.16940062 0.03937152 SepalWidth   0.09030470 0.11251902 0.05133876 0.03082280 PetalLength 0.16940062 0.05133876 0.18624355 0.04183377 PetalWidth   0.03937152 0.03082280 0.04183377 0.03990165 , , 2 , , 3 … (there was little difference in the 3 sigma values) Covariance $parameters$variance$Sigma            SepalLength SepalWidth PetalLength PetalWidth SepalLength 0.26381739 0.09030470 0.16940062 0.03937152 SepalWidth   0.09030470 0.11251902 0.05133876 0.03082280 PetalLength 0.16940062 0.05133876 0.18624355 0.04183377 PetalWidth   0.03937152 0.03082280 0.04183377 0.03990165 $parameters$variance$cholSigma             SepalLength SepalWidth PetalLength PetalWidth SepalLength -0.5136316 -0.1758161 -0.32980960 -0.07665323 SepalWidth   0.0000000 0.2856706 -0.02326832 0.06072001 PetalLength   0.0000000 0.0000000 -0.27735855 -0.06477412 PetalWidth   0.0000000 0.0000000 0.00000000 0.16168899 attr(,"info") iterations       error 4.000000e+00 1.525131e-06 There is quite a lot of output from the em function. The highlights for me were the three sigma ranges were the same and the error from the function was very small. So, I think we have a very good estimation of species using just the four data points. Hidden Markov models The hidden Markov models (HMM) is the idea of observing data assuming it has been produced by a Markov model. The problem is to discover what that model is. I am using the Python example on Wikipedia for HMM. 
For an HMM, we need states (assumed to be hidden from observer), symbols, transition matrix between states, emission (output) states, and probabilities for all. The Python information presented is as follows: states = ('Rainy', 'Sunny') observations = ('walk', 'shop', 'clean') start_probability = {'Rainy': 0.6, 'Sunny': 0.4} transition_probability = {    'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},    'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},    } emission_probability = {    'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},    'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},    } trans <- matrix(c('Rainy', : {'Rainy': 0.7, 'Sunny': 0.3},    'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},    } We convert these to use in R for the initHmm function by using the following command: > hmm <- initHMM(c("Rainy","Sunny"), c('walk', 'shop', 'clean'), c(.6,.4), matrix(c(.7,.3,.4,.6),2), matrix(c(.1,.4,.5,.6,.3,.1),3)) > hmm $States [1] "Rainy" "Sunny" $Symbols [1] "walk" "shop" "clean" $startProbs Rainy Sunny 0.6   0.4 $transProbs        to from   Rainy Sunny Rainy   0.7   0.4 Sunny   0.3   0.6 $emissionProbs        symbols states walk shop clean Rainy 0.1 0.5   0.3 Sunny 0.4 0.6   0.1 The model is really a placeholder for all of the setup information needed for HMM. We can then use the model to predict based on observations, as follows: > future <- forward(hmm, c("walk","shop","clean")) > future        index states         1         2         3 Rainy -2.813411 -3.101093 -4.139551 Sunny -1.832581 -2.631089 -5.096193 The result is a matrix of probabilities. For example, it is more likely to be Sunny when we observe walk. Blind signal separation Blind signal separation is the process of identifying sources of signals from a mixed signal. Primary component analysis is one method of doing this. An example is a cocktail party where you are trying to listen to one speaker. For this example, I am using the decathlon dataset in the FactoMineR package, as shown here: > library(FactoMineR) > data(decathlon) Let's look at the data to get some idea of what is available: > summary(decathlon) 100m           Long.jump     Shot.put       High.jump Min.   :10.44   Min.   :6.61   Min.   :12.68   Min.   :1.850 1st Qu.:10.85   1st Qu.:7.03   1st Qu.:13.88   1st Qu.:1.920 Median :10.98   Median :7.30   Median :14.57   Median :1.950 Mean   :11.00   Mean   :7.26   Mean   :14.48   Mean   :1.977 3rd Qu.:11.14   3rd Qu.:7.48   3rd Qu.:14.97   3rd Qu.:2.040 Max.   :11.64   Max.   :7.96   Max.   :16.36   Max.   :2.150 400m           110m.hurdle       Discus       Pole.vault   Min.   :46.81   Min.   :13.97   Min.   :37.92   Min.   :4.200 1st Qu.:48.93   1st Qu.:14.21   1st Qu.:41.90   1st Qu.:4.500 Median :49.40   Median :14.48   Median :44.41   Median :4.800 Mean   :49.62   Mean   :14.61 Mean   :44.33   Mean   :4.762 3rd Qu.:50.30   3rd Qu.:14.98   3rd Qu.:46.07   3rd Qu.:4.920 Max.   :53.20   Max.   :15.67   Max.   :51.65   Max.   :5.400 Javeline       1500m           Rank           Points   Min.   :50.31   Min.   :262.1   Min.   : 1.00   Min.   :7313 1st Qu.:55.27   1st Qu.:271.0   1st Qu.: 6.00   1st Qu.:7802 Median :58.36   Median :278.1   Median :11.00   Median :8021 Mean   :58.32   Mean   :279.0   Mean   :12.12   Mean   :8005 3rd Qu.:60.89   3rd Qu.:285.1   3rd Qu.:18.00   3rd Qu.:8122 Max.   :70.52   Max.   :317.0   Max.   :28.00   Max.   
:8893    Competition Decastar:13 OlympicG:28 The output looks like performance data from a series of events at a track meet: > head(decathlon)        100m   Long.jump Shot.put High.jump 400m 110m.hurdle Discus SEBRLE 11.04     7.58   14.83     2.07 49.81       14.69 43.75 CLAY   10.76     7.40   14.26     1.86 49.37       14.05 50.72 KARPOV 11.02     7.30   14.77     2.04 48.37       14.09 48.95 BERNARD 11.02     7.23   14.25     1.92 48.93       14.99 40.87 YURKOV 11.34     7.09   15.19     2.10 50.42       15.31 46.26 WARNERS 11.11     7.60   14.31     1.98 48.68       14.23 41.10        Pole.vault Javeline 1500m Rank Points Competition SEBRLE       5.02   63.19 291.7   1   8217   Decastar CLAY         4.92   60.15 301.5   2   8122   Decastar KARPOV       4.92   50.31 300.2   3   8099   Decastar BERNARD       5.32   62.77 280.1   4   8067   Decastar YURKOV       4.72   63.44 276.4   5   8036   Decastar WARNERS       4.92   51.77 278.1   6   8030   Decastar Further, this is performance of specific individuals in track meets. We run the PCA function by passing the dataset to use, whether to scale the data or not, and the type of graphs: > res.pca = PCA(decathlon[,1:10], scale.unit=TRUE, ncp=5, graph=T) This produces two graphs: Individual factors map Variables factor map The individual factors map lays out the performance of the individuals. For example, we see Karpov who is high in both dimensions versus Bourginon who is performing badly (on the left in the following chart): The variables factor map shows the correlation of performance between events. For example, doing well in the 400 meters run is negatively correlated with the performance in the long jump; if you did well in one, you likely did well in the other as well. Here is the variables factor map of our data: Questions Factual Which supervised learning technique(s) do you lean towards as your "go to" solution? Why are the density plots for Bayesian results off-center? When, how, and why? How would you decide on the number of clusters to use? Find a good rule of thumb to decide the number of hidden layers in a neural net. Challenges Investigate other blind signal separation techniques, such as ICA. Use other methods, such as poisson, in the rpart function (especially if you have a natural occurring dataset). Summary In this article, we looked into various methods of machine learning, including both supervised and unsupervised learning. With supervised learning, we have a target variable we are trying to estimate. With unsupervised, we only have a possible set of predictor variables and are looking for patterns. In supervised learning, we looked into using a number of methods, including decision trees, regression, neural networks, support vector machines, and Bayesian learning. In unsupervised learning, we used cluster analysis, density estimation, hidden Markov models, and blind signal separation. Resources for Article: Further resources on this subject: Machine Learning in Bioinformatics [article] Data visualization [article] Introduction to S4 Classes [article]


Navigation Mesh Generation

Packt
19 Dec 2014
9 min read
In this article by Curtis Bennett and Dan Violet Sagmiller, authors of the book Unity AI Programming Essentials, we will learn about navigation meshes in Unity. Navigation mesh generation controls how AI characters are able to travel around a game level and is one of the most important topics in game AI. In this article, we will provide an overview of navigation meshes and look at the algorithm for generating them. Then, we'll look at different options of customizing our navigation meshes better. To do this, we will be using RAIN 2.1.5, a popular AI plugin for Unity by Rival Theory, available for free at http://rivaltheory.com/rain/download/. In this article, you will learn about: How navigation mesh generation works and the algorithm behind it Advanced options for customizing navigation meshes Creating advanced navigation meshes with RAIN (For more resources related to this topic, see here.) An overview of a navigation mesh To use navigation meshes, also referred to as NavMeshes, effectively the first things we need to know are what exactly navigation meshes are and how they are created. A navigation mesh is a definition of the area an AI character could travel to in a level. It is a mesh, but it is not intended to be rendered or seen by the player, instead it is used by the AI system. A NavMesh usually does not cover all the area in a level (if it did we wouldn't need one) since it's just the area a character can walk. The mesh is also almost always a simplified version of the geometry. For instance, you could have a cave floor in a game with thousands of polygons along the bottom showing different details in the rock, but for the navigation mesh the areas would just be a handful of very large polys giving a simplified view of the level. The purpose of navigation mesh is to provide this simplified representation to the rest of the AI system a way to find a path between two points on a level for a character. This is its purpose; let's discuss how they are created. It used to be a common practice in the games industry to create navigation meshes manually. A designer or artist would take the completed level geometry and create one using standard polygon mesh modelling tools and save it out. As you might imagine, this allowed for nice, custom, efficient meshes, but was also a big time sink, since every time the level changed the navigation mesh would need to be manually edited and updated. In recent years, there has been more research in automatic navigation mesh generation. There are many approaches to automatic navigation mesh generation, but the most popular is Recast, originally developed and designed by Mikko Monomen. Recast takes in level geometry and a set of parameters defining the character, such as the size of the character and how big of steps it can take, and then does a multipass approach to filter and create the final NavMesh. The most important phase of this is voxelizing the level based on an inputted cell size. This means the level geometry is divided into voxels (cubes) creating a version of the level geometry where everything is partitioned into different boxes called cells. Then the geometry in each of these cells is analyzed and simplified based on its intersection with the sides of the boxes and is culled based on things such as the slope of the geometry or how big a step height is between geometry. This simplified geometry is then merged and triangulated to make a final navigation mesh that can be used by the AI system. 
The source code and more information on the original C++ implementation of Recast is available at https://github.com/memononen/recastnavigation. Advanced NavMesh parameters Now that we understand how navigation mesh generations works, let's look at the different parameters you can set to generate them in more detail. We'll look at how to do these with RAIN: Open Unity and create a new scene and a floor and some blocks for walls. Download RAIN from http://rivaltheory.com/rain/download/ and import it into your scene. Then go to RAIN | Create Navigation Mesh. Also right-click on the RAIN menu and choose Show Advanced Settings. The setup should look something like the following screenshot: Now let's look at some of the important parameters: Size: This is the overall size of the navigation mesh. You'll want the navigation mesh to cover your entire level and use this parameter instead of trying to scale up the navigation mesh through the Scale transform in the Inspector window. For our demo here, set the Size parameter to 20. Walkable Radius: This is an important parameter to define the character size of the mesh. Remember, each mesh will be matched to the size of a particular character, and this is the radius of the character. You can visualize the radius for a character by adding a Unity Sphere Collider script to your object (by going to Component | Physics | Sphere Collider) and adjusting the radius of the collider. Cell Size: This is also a very important parameter. During the voxel step of the Recast algorithm, this sets the size of the cubes to inspect the geometry. The smaller the size, the more detailed and finer mesh, but longer the processing time for Recast. A large cell size makes computation fast but loses detail. For example, here is a NavMesh from our demo with a cell size of 0.01: You can see the finer detail here. Here is the navigation mesh generated with a cell size of 0.1: Note the difference between the two screenshots. In the former, walking through the two walls lower down in our picture is possible, but in the latter with a larger cell size, there is no path even though the character radius is the same. Problems like this become greater with larger cell sizes. The following is a navigation mesh with a cell size of 1: As you can see, the detail becomes jumbled and the mesh itself becomes unusable. With such differing results, the big question is how large should a cell size be for a level? The answer is that it depends on the required result. However, one important consideration is that as the processing time to generate one is done during development and not at runtime even if it takes several minutes to generate a good mesh, it can be worth it to get a good result in the game. Setting a small cell size on a large level can cause mesh processing to take a significant amount of time and consume a lot of memory. It is a good practice to save the scene before attempting to generate a complex navigation mesh. The Size, Walkable Radius, and Cell Size parameters are the most important parameters when generating the navigation mesh, but there are more that are used to customize the mesh further: Max Slope: This is the largest slope that a character can walk on. This is how much a piece of geometry that is tilted can still be walked on. If you take the wall and rotate it, you can see it is walkable: The preceding is a screenshot of a walkable object with slope. Step Height: This is how high a character can step from one object to another. 
For example, if you have steps between two blocks, as shown in the following screenshot, this would define how far in height the blocks can be apart and whether the area is still considered walkable: This is a screenshot of the navigation mesh with step height set to connect adjacent blocks. Walkable Height: This is the vertical height that is needed for the character to walk. For example, in the previous illustration, the second block is not walkable underneath because of the walkable height. If you raise it to a least one unit off the ground and set the walkable height to 1, the area underneath would become walkable:   You can see a screenshot of the navigation mesh with walkable height set to allow going under the higher block. These are the most important parameters. There are some other parameters related to the visualization and to cull objects. We will look at culling more in the next section. Culling areas Being able to set up areas as walkable or not is an important part of creating a level. To demo this, let's divide the level into two parts and create a bridge between the two. Take our demo and duplicate the floor and pull it down. Then transform one of the walls to a bridge. Then, add two other pieces of geometry to mark areas that are dangerous to walk on, like lava. Here is an example setup: This is a basic scene with a bridge to cross. If you recreate the navigation mesh now, all of the geometry will be covered and the bridge won't be recognized. To fix this, you can create a new tag called Lava and tag the geometry under the bridge with it. Then, in the navigation meshes' RAIN component, add Lava to the unwalkable tags. If you then regenerate the mesh, only the bridge is walkable. This is a screenshot of a navigation mesh areas under bridge culled: Using layers and the walkable tag you can customize navigation meshes. Summary Navigation meshes are an important part of game AI. In this article, we looked at the different parameters to customize navigation meshes. We looked at things such as setting the character size and walkable slopes and discussed the importance of the cell size parameter. We then saw how to customize our mesh by tagging different areas as not walkable. This should be a good start for designing navigation meshes for your games. Resources for Article: Further resources on this subject: Components in Unity [article] Enemy and Friendly AIs [article] Introduction to AI [article]


Evolving the data model

Packt
19 Dec 2014
11 min read
In this article by C. Y. Kan, author of the book Cassandra Data Modeling and Analysis, we will see the techniques of how to evolve an existing Cassandra data model in detail. Meanwhile, the techniques of modeling by query will be demonstrated as well. (For more resources related to this topic, see here.) The Stock Screener Application is good enough to retrieve and analyze a single stock at one time. However, scanning just a single stock looks very limited in practical use. A slight improvement can be made here; it can handle a bunch of stocks instead of one. This bunch of stocks will be stored as Watch List in the Cassandra database. Accordingly, the Stock Screener Application will be modified to analyze the stocks in the Watch List, and therefore it will produce alerts for each of the stocks being watched based on the same screening rule. For the produced alerts, saving them in Cassandra will be beneficial for backtesting trading strategies and continuous improvement of the Stock Screener Application. They can be reviewed from time to time without having to review them on the fly. Backtesting is a jargon used to refer to testing a trading strategy, investment strategy, or a predictive model using existing historical data. It is also a special type of cross-validation applied to time series data. In addition, when the number of the stocks in the Watch List grows to a few hundred, it will be difficult for a user of the Stock Screener Application to recall what the stocks are by simply referring to their stock codes. Hence, it would be nice to have the name of the stocks added to the produced alerts to make them more descriptive and user-friendly. Finally, we might have an interest in finding out how many alerts were generated on a particular stock over a specified period of time and how many alerts were generated on a particular date. We will use CQL to write queries to answer these two questions. By doing so, the modeling by query technique can be demonstrated. The enhancement approach The enhancement approach consists of four change requests in total. First, we will conduct changes in the data model and then the code will be enhanced to provide the new features. Afterwards, we will test run the enhanced Stock Screener Application again. The parts of the Stock Screener Application that require modifications are highlighted in the following figure. It is remarkable that two new components are added to the Stock Screener Application. The first component, Watch List, governs Data Mapper and Archiver to collect stock quote data of those stocks in the Watch List from Yahoo! Finance. The second component is Query. It provides two Queries on Alert List for backtesting purposes: Watch List Watch List is a very simple table that merely stores the stock code of its constituents. It is rather intuitive for a relational database developer to define the stock code as the primary key, isn't it? Nevertheless, remember that in Cassandra, the primary key is used to determine the node that stores the row. As Watch List is expected to not be a very long list, it would be more appropriate to put all of its rows on the same node for faster retrieval. But how can we do that? We can create an additional column, say watch_list_code, for this particular purpose. The new table is called watchlist and will be created in the packtcdma keyspace. 
The CQL statement is shown in chapter06_001.py: # -*- coding: utf-8 -*- # program: chapter06_001.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to create watchlist def create_watchlist(ss):    ## create watchlist table if not exists    ss.execute('CREATE TABLE IF NOT EXISTS watchlist (' +                'watch_list_code varchar,' +                'symbol varchar,' +                'PRIMARY KEY (watch_list_code, symbol))')       ## insert AAPL, AMZN, and GS into watchlist    ss.execute("INSERT INTO watchlist (watch_list_code, " +                "symbol) VALUES ('WS01', 'AAPL')")    ss.execute("INSERT INTO watchlist (watch_list_code, " +                "symbol) VALUES ('WS01', 'AMZN')")    ss.execute("INSERT INTO watchlist (watch_list_code, " +                "symbol) VALUES ('WS01', 'GS')") ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## create watchlist table create_watchlist(session) ## close Cassandra connection cluster.shutdown() The create_watchlist function creates the table. Note that the watchlist table has a compound primary key made of watch_list_code and symbol. A Watch List called WS01 is also created, which contains three stocks, AAPL, AMZN, and GS. Alert List It is produced by a Python program and enumerates the date when the close price was above its 10-day SMA, that is, the signal and the close price at that time. Note that there were no stock code and stock name. We will create a table called alertlist to store the alerts with the code and name of the stock. The inclusion of the stock name is to meet the requirement of making the Stock Screener Application more user-friendly. Also, remember that joins are not allowed and denormalization is really the best practice in Cassandra. This means that we do not mind repeatedly storing (duplicating) the stock name in the tables that will be queried. A rule of thumb is one table for one query; as simple as that. The alertlist table is created by the CQL statement, as shown in chapter06_002.py: # -*- coding: utf-8 -*- # program: chapter06_002.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to create alertlist def create_alertlist(ss):    ## execute CQL statement to create alertlist table if not exists    ss.execute('CREATE TABLE IF NOT EXISTS alertlist (' +                'symbol varchar,' +                'price_time timestamp,' +                'stock_name varchar,' +                'signal_price float,' +                'PRIMARY KEY (symbol, price_time))') ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## create alertlist table create_alertlist(session) ## close Cassandra connection cluster.shutdown() The primary key is also a compound primary key that consists of symbol and price_time. Adding the descriptive stock name Until now, the packtcdma keyspace has three tables, which are alertlist, quote, and watchlist. To add the descriptive stock name, one can think of only adding a column of stock name to alertlist only. As seen in the previous section, this has been done. So, do we need to add a column for quote and watchlist? It is, in fact, a design decision that depends on whether these two tables will be serving user queries. 
What a user query means is that the table will be used to retrieve rows for a query raised by a user. If a user wants to know the close price of Apple Inc. on June 30, 2014, it is a user query. On the other hand, if the Stock Screener Application uses a query to retrieve rows for its internal processing, it is not a user query. Therefore, if we want quote and watchlist to return rows for user queries, they need the stock name column; otherwise, they do not need it. The watchlist table is only for internal use by the current design, and so it need not have the stock name column. Of course, if in future, the Stock Screener Application allows a user to maintain Watch List, the stock name should also be added to the watchlist table. However, for quote, it is a bit tricky. As the stock name should be retrieved from the Data Feed Provider, which is Yahoo! Finance in our case, the most suitable time to get it is when the corresponding stock quote data is retrieved. Hence, a new column called stock_name is added to quote, as shown in chapter06_003.py: # -*- coding: utf-8 -*- # program: chapter06_003.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to add stock_name column def add_stockname_to_quote(ss):    ## add stock_name to quote    ss.execute('ALTER TABLE quote ' +                'ADD stock_name varchar') ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## add stock_name column add_stockname_to_quote(session) ## close Cassandra connection cluster.shutdown() It is quite self-explanatory. Here, we use the ALTER TABLE statement to add the stock_name column of the varchar data type to quote. Queries on alerts As mentioned previously, we are interested in two questions: How many alerts were generated on a stock over a specified period of time? How many alerts were generated on a particular date? For the first question, alertlist is sufficient to provide an answer. However, alertlist cannot answer the second question because its primary key is composed of symbol and price_time. We need to create another table specifically for that question. This is an example of modeling by query. Basically, the structure of the new table for the second question should resemble the structure of alertlist. We give that table a name, alert_by_date, and create it as shown in chapter06_004.py: # -*- coding: utf-8 -*- # program: chapter06_004.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to create alert_by_date table def create_alertbydate(ss):    ## create alert_by_date table if not exists    ss.execute('CREATE TABLE IF NOT EXISTS alert_by_date (' +               'symbol varchar,' +                'price_time timestamp,' +                'stock_name varchar,' +                'signal_price float,' +                'PRIMARY KEY (price_time, symbol))') ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## create alert_by_date table create_alertbydate(session) ## close Cassandra connection cluster.shutdown() When compared to alertlist in chapter06_002.py, alert_by_date only swaps the order of the columns in the compound primary key. One might think that a secondary index can be created on alertlist to achieve the same effect. 
Nonetheless, in Cassandra, a secondary index cannot be created on columns that are already engaged in the primary key. Always be aware of this constraint. We now finish the modifications on the data model. It is time for us to enhance the application logic in the next section. Summary This article extends the Stock Screener Application by a number of enhancements. We made changes to the data model to demonstrate the modeling by query techniques and how denormalization can help us achieve a high-performance application. Resources for Article: Further resources on this subject: An overview of architecture and modeling in Cassandra [Article] About Cassandra [Article] Basic Concepts and Architecture of Cassandra [Article]

Mastering Splunk: Lookups

Packt
17 Dec 2014
24 min read
In this article, by James Miller, author of the book Mastering Splunk, we will discuss Splunk lookups and workflows. The topics that will be covered in this article are as follows: The value of a lookup Design lookups File lookups Script lookups (For more resources related to this topic, see here.) Lookups Machines constantly generate data, usually in a raw form that is most efficient for processing by machines, but not easily understood by "human" data consumers. Splunk has the ability to identify unique identifiers and/or result or status codes within the data. This gives you the ability to enhance the readability of the data by adding descriptions or names as new search result fields. These fields contain information from an external source such as a static table (a CSV file) or the dynamic result of a Python command or a Python-based script. Splunk's lookups can use information within returned events or time information to determine how to add other fields from your previously defined external data sources. To illustrate, here is an example of a Splunk static lookup that: Uses the Business Unit value in an event Matches this value with the organization's business unit name in a CSV file Adds the definition to the event (as the Business Unit Name field) So, if you have an event where the Business Unit value is equal to 999999, the lookup will add the Business Unit Name value as Corporate Office to that event. More sophisticated lookups can: Populate a static lookup table from the results of a report. Use a Python script (rather than a lookup table) to define a field. For example, a lookup can use a script to return a server name when given an IP address. Perform a time-based lookup if your lookup table includes a field value that represents time. Let's take a look at an example of a search pipeline that creates a table based on IBM Cognos TM1 file extractions: sourcetype=csv 2014 "Current Forecast" "Direct" "513500" |rename May as "Month" Actual as "Version" "FY 2012" as Year 650693NLR001 as "Business Unit" 100000 as "FCST" "09997_Eliminations Co 2" as "Account" "451200" as "Activity" | eval RFCST= round(FCST) |Table Month, "Business Unit", RFCST The following table shows the results generated:   Now, add the lookup command to our search pipeline to have Splunk convert Business Unit into Business Unit Name: sourcetype=csv 2014 "Current Forecast" "Direct" "513500" |rename May as "Month" Actual as "Version" "FY 2012" as Year 650693NLR001 as "Business Unit" 100000 as "FCST" "09997_Eliminations Co 2" as "Account" "451200"as "Activity" | eval RFCST= round(FCST) |lookup BUtoBUName BU as "Business Unit" OUTPUT BUName as "Business Unit Name" | Table Month, "Business Unit", "Business Unit Name", RFCST The lookup command in our Splunk search pipeline will now add Business Unit Name in the results table:   Configuring a simple field lookup In this section, we will configure a simple Splunk lookup. Defining lookups in Splunk Web You can set up a lookup using the Lookups page (in Splunk Web) or by configuring stanzas in the props.conf and transforms.conf files. Let's take the easier approach first and use the Splunk Web interface. Before we begin, we need to establish our lookup table that will be in the form of an industry standard comma separated file (CSV). Our example is one that converts business unit codes to a more user-friendly business unit name. 
For example, we have the following information: Business unit code Business unit name 999999 Corporate office VA0133SPS001 South-western VA0133NLR001 North-east 685470NLR001 Mid-west In the events data, only business unit codes are included. In an effort to make our Splunk search results more readable, we want to add the business unit name to our results table. To do this, we've converted our information (shown in the preceding table) to a CSV file (named BUtoBUName.csv):   For this example, we've kept our lookup table simple, but lookup tables (files) can be as complex as you need them to be. They can have numerous fields (columns) in them. A Splunk lookup table has a few requirements, as follows: A table must contain a minimum of two columns Each of the columns in the table can have duplicate values You should use (plain) ASCII text and not non-UTF-8 characters Now, from Splunk Web, we can click on Settings and then select Lookups:   From the Lookups page, we can select Lookup table files:   From the Lookup table files page, we can add our new lookup file (BUtoBUName.csv):   By clicking on the New button, we see the Add new page where we can set up our file by doing the following: Select a Destination app (this is a drop-down list and you should select Search). Enter (or browse to) our file under Upload a lookup file. Provide a Destination filename. Then, we click on Save:   Once you click on Save, you should receive the Successfully saved "BUtoBUName" in search" message:   In the previous screenshot, the lookup file is saved by default as private. You will need to adjust permissions to allow other Splunk users to use it. Going back to the Lookups page, we can select Lookup definitions to see the Lookup definitions page:   In the Lookup definitions page, we can click on New to visit the Add new page (shown in the following screenshot) and set up our definition as follows: Destination app: The lookup will be part of the Splunk search app Name: Our file is BUtoBUName Type: Here, we will select File-based Lookup file: The filename is ButoBUName.csv, which we uploaded without the .csv suffix Again, we should see the Successfully saved "BUtoBUName" in search message:   Now, our lookup is ready to be used: Automatic lookups Rather than having to code for a lookup in each of your Splunk searches, you have the ability to configure automatic lookups for a particular source type. To do this from Splunk Web, we can click on Settings and then select Lookups:   From the Lookups page, click on Automatic lookups:   In the Automatic lookups page, click on New:   In the Add New page, we will fill in the required information to set up our lookup: Destination app: For this field, some options are framework, launcher, learned, search, and splunk_datapreview (for our example, select search). Name: This provide a user-friendly name that describes this automatic lookup. Lookup table: This is the name of the lookup table you defined with a CSV file (discussed earlier in this article). Apply to: This is the type that you want this automatic lookup to apply to. The options are sourcetype, source, or host (I've picked sourcetype). Named: This is the name of the type you picked under Apply to. I want my automatic search to apply for all searches with the sourcetype of csv. Lookup input fields: This is simple in my example. In my lookup table, the field to be searched on will be BU and the = field value will be the field in the event results that I am converting; in my case, it was the field 650693NLR001. 
Lookup output fields: This will be the field in the lookup table that I am using to convert to, which in my example is BUName and I want to call it Business Unit Name, so this becomes the = field value. Overwrite field values: This is a checkbox where you can tell Splunk to overwrite existing values in your output fields—I checked it. The Add new page The Splunk Add new page (shown in the following screenshot) is where you enter the lookup information (detailed in the previous section):   Once you have entered your automatic lookup information, you can click on Save and you will receive the Successfully saved "Business Unit to Business Unit Name" in search message:   Now, we can use the lookup in a search. For example, you can run a search with sourcetype=csv, as follows: sourcetype=csv 2014 "Current Forecast" "Direct" "513500" |rename May as "Month" Actual as "Version" "FY 2012" as Year 650693NLR001 as "Business Unit" 100000 as "FCST" "09997_Eliminations Co 2"as "Account" "451200" as "Activity" | eval RFCST= round(FCST) |Table "Business Unit", "Business Unit Name", Month, RFCST Notice in the following screenshot that Business Unit Name is converted to the user-friendly values from our lookup table, and we didn't have to add the lookup command to our search pipeline:   Configuration files In addition to using the Splunk web interface, you can define and configure lookups using the following files: props.conf transforms.conf To set up a lookup with these files (rather than using Splunk web), we can perform the following steps: Edit transforms.conf to define the lookup table. The first step is to edit the transforms.conf configuration file to add the new lookup reference. Although the file exists in the Splunk default folder ($SPLUNK_HOME/etc/system/default), you should edit the file in $SPLUNK_HOME/etc/system/local/ or $SPLUNK_HOME/etc/apps/<app_name>/local/ (if the file doesn't exist here, create it). Whenever you edit a Splunk .conf file, always edit a local version, keeping the original (system directory version) intact. In the current version of Splunk, there are two types of lookup tables: static and external. Static lookups use CSV files, and external (which are dynamic) lookups use Python scripting. You have to decide if your lookup will be static (in a file) or dynamic (use script commands). If you are using a file, you'll use filename; if you are going to use a script, you use external_cmd (both will be set in the transforms.conf file). You can also limit the number of matching entries to apply to an event by setting the max_matches option (this tells Splunk to use the first <integer> (in file order) number of entries). I've decided to leave the default for max_matches, so my transforms.conf file looks like the following: [butobugroup]filename = butobugroup.csv This step is optional. Edit props.conf to apply your lookup table automatically. For both static and external lookups, you stipulate the fields you want to match in the configuration file and the output from the lookup table that you defined in your transforms.conf file. It is okay to have multiple field lookups defined in one source lookup definition, but each lookup should have its own unique lookup name; for example, if you have multiple tables, you can name them LOOKUP-table01, LOOKUP-table02, and so on, or something perhaps more easily understood. 
If you add a lookup to your props.conf file, this lookup is automatically applied to all events from searches that have matching source types (again, as mentioned earlier, if your automatic lookup is very slow, it will also impact the speed of your searches). Restart Splunk to see your changes.

Implementing a lookup using configuration files – an example
To illustrate the use of configuration files to implement an automatic lookup, let's use a simple example. Once again, we want to convert a field from a unique identification code for an organization's business unit to a more user-friendly descriptive name called BU Group. What we will do is match the field bu in a lookup table butobugroup.csv with a field in our events and then add the bugroup (description) to the returned events. The following shows the contents of the butobugroup.csv file:

bu, bugroup
999999, leadership-group
VA0133SPS001, executive-group
650914FAC002, technology-group

You can put this file into $SPLUNK_HOME/etc/apps/<app_name>/lookups/ and carry out the following steps:

Put the butobugroup.csv file into $SPLUNK_HOME/etc/apps/search/lookups/, since we are using the search app.
As we mentioned earlier, we edit the transforms.conf file located at either $SPLUNK_HOME/etc/system/local/ or $SPLUNK_HOME/etc/apps/<app_name>/local/. We add the following two lines:

[butobugroup]
filename = butobugroup.csv

Next, as mentioned earlier in this article, we edit the props.conf file located at either $SPLUNK_HOME/etc/system/local/ or $SPLUNK_HOME/etc/apps/<app_name>/local/. Here, we add the following two lines:

[csv]
LOOKUP-check = butobugroup bu AS 650693NLR001 OUTPUT bugroup

Restart the Splunk server. You can (assuming you are logged in as an admin or have admin privileges) restart the Splunk server through the web interface by going to Settings, then selecting System, and finally Server controls.

Now, you can run a search for sourcetype=csv (as shown here):

sourcetype=csv 2014 "Current Forecast" "Direct" "513500" | rename May as "Month", 650693NLR001 as "Business Unit", 100000 as "FCST" | eval RFCST= round(FCST) | Table "Business Unit", "Business Unit Name", bugroup, Month, RFCST

You will see that the field bugroup can be returned as part of your event results.

Populating lookup tables
Of course, you can create CSV files from external systems (or perhaps even manually), but from time to time, you might have the opportunity to create lookup CSV files (tables) from event data using Splunk. A handy command to accomplish this is outputcsv (which is covered in detail later in this article). The following is a simple example of creating a CSV file from Splunk event data that can be used for a lookup table:

sourcetype=csv "Current Forecast" "Direct" | rename 650693NLR001 as "Business Unit" | Table "Business Unit", "Business Unit Name", bugroup | outputcsv splunk_master

The results are shown in the following screenshot. Of course, the output table isn't quite usable, since the results have duplicates.
Therefore, we can rewrite the Splunk search pipeline introducing the dedup command (as shown here): sourcetype=csv   "Current Forecast" "Direct"   | rename 650693NLR001 as "Business Unit" | dedup "Business Unit" | Table "Business Unit", "Business Unit Name", bugroup | outputcsv splunk_master Then, we can examine the results (now with more desirable results):   Handling duplicates with dedup This command allows us to set the number of duplicate events to be kept based on the values of a field (in other words, we can use this command to drop duplicates from our event results for a selected field). The event returned for the dedup field will be the first event found (if you provide a number directly after the dedup command, it will be interpreted as the number of duplicate events to keep; if you don't specify a number, dedup keeps only the first occurring event and removes all consecutive duplicates). The dedup command also lets you sort by field or list of fields. This will remove all the duplicates and then sort the results based on the specified sort-by field. Adding a sort in conjunction with the dedup command can affect the performance as Splunk performs the dedup operation and then sorts the results as a final step. Here is a search command using dedup: sourcetype=csv   "Current Forecast" "Direct"   | rename 650693NLR001 as "Business Unit" | dedup "Business Unit" sortby bugroup | Table "Business Unit", "Business Unit Name", bugroup | outputcsv splunk_master The result of the preceding command is shown in the following screenshot:   Now, we have our CSV lookup file (outputcsv splunk_master) generated and ready to be used:   Look for your generated output file in $SPLUNK_HOME/var/run/splunk. Dynamic lookups With a Splunk static lookup, your search reads through a file (a table) that was created or updated prior to executing the search. With dynamic lookups, the file is created at the time the search executes. This is possible because Splunk has the ability to execute an external command or script as part of your Splunk search. At the time of writing this book, Splunk only directly supports Python scripts for external lookups. If you are not familiar with Python, its implementation began in 1989 and is a widely used general-purpose, high-level programming language, which is often used as a scripting language (but is also used in a wide range of non-scripting contexts). Keep in mind that any external resources (such as a file) or scripts that you want to use with your lookup will need to be copied to a location where Splunk can find it. These locations are: $SPLUNK_HOME/etc/apps/<app_name>/bin $SPLUNK_HOME/etc/searchscripts The following sections describe the process of using the dynamic lookup example script that ships with Splunk (external_lookup.py). Using Splunk Web Just like with static lookups, Splunk makes it easy to define a dynamic or external lookup using the Splunk web interface. First, click on Settings and then select Lookups:   On the Lookups page, we can select Lookup table files to define a CSV file that contains the input file for our Python script. In the Add new page, we enter the following information: Destination app: For this field, select Search Upload a lookup file: Here, you can browse to the filename (my filename is dnsLookup.csv) Destination filename: Here, enter dnslookup The Add new page is shown in the following screenshot:   Now, click on Save. 
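Before continuing with the rest of the setup, it may help to see roughly what an external lookup script does. The following Python sketch is illustrative only — it is not the external_lookup.py script that ships with Splunk — and it assumes the usual external-lookup hand-off described later in this section: Splunk passes a CSV with host and ip columns to the script and reads the completed CSV back from its output. Error handling is deliberately simplified.

import csv
import socket
import sys

def lookup(host, ip):
    # Fill in whichever of the two fields is missing using DNS
    try:
        if host and not ip:
            # gethostbyname_ex returns (hostname, aliases, ip_list)
            return host, socket.gethostbyname_ex(host)[2][0]
        if ip and not host:
            # gethostbyaddr returns (hostname, aliases, ip_list)
            return socket.gethostbyaddr(ip)[0], ip
    except socket.error:
        pass  # leave the row unchanged if resolution fails
    return host, ip

def main():
    # Read the CSV handed over by Splunk and write the completed CSV back out
    reader = csv.DictReader(sys.stdin)
    writer = csv.DictWriter(sys.stdout, fieldnames=['host', 'ip'])
    writer.writeheader()
    for row in reader:
        host, ip = lookup(row.get('host', ''), row.get('ip', ''))
        writer.writerow({'host': host, 'ip': ip})

if __name__ == '__main__':
    main()

In this sketch, the command-line arguments configured for the lookup (host ip) are treated purely as the names of the fields the script operates on; the data itself is assumed to arrive as CSV, matching the behavior described for external_lookup.py later in this section. With that in mind, let's continue with the rest of the setup.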
The lookup file (shown in the following screenshot) is a text CSV file that needs to (at a minimum) contain the two field names that the Python (py) script accepts as arguments, in this case, host and ip. As mentioned earlier, this file needs to be copied to $SPLUNK_HOME/etc/apps/<app_name>/bin.

Next, from the Lookups page, select Lookup definitions and then click on New. This is where you define your external lookup. Enter the following information:

Type: For this, select External (as this lookup will run an external script)
Command: For this, enter external_lookup.py host ip (this is the name of the py script and its two arguments)
Supported fields: For this, enter host, ip (this indicates the two script input field names)

The following screenshot describes a new lookup definition. Now, click on Save.

Using configuration files instead of Splunk Web
Again, just like with static lookups in Splunk, dynamic lookups can also be configured in the Splunk transforms.conf file:

[myLookup]
external_cmd = external_lookup.py host ip
external_type = python
fields_list = host, ip
max_matches = 200

Let's learn more about the terms here:

[myLookup]: This is the report stanza.
external_cmd: This is the actual runtime command definition. Here, it executes the Python (py) script external_lookup.py, which requires two arguments (or parameters), host and ip.
external_type (optional): This indicates that this is a Python script. Although this is an optional entry in the transforms.conf file, it's a good habit to include it for readability and support.
fields_list: This lists all the fields supported by the external command or script, delimited by a comma and a space.

The next step is to modify the props.conf file, as follows:

[mylookup]
LOOKUP-rdns = dnslookup host ip OUTPUT ip

After updating the Splunk configuration files, you will need to restart Splunk.

External lookups
The external lookup example given uses a Python (py) script named external_lookup.py, which is a DNS lookup script that can return an IP address for a given host name or a host name for a provided IP address.

Explanation
The lookup table field in this example is named ip, so Splunk will mine all of the IP addresses found in the indexed logs' events and add the values of ip from the lookup table into the ip field in the search events. We can notice the following:

If you look at the py script, you will notice that the example uses the MS Windows supported socket.gethostbyname_ex(host) function
The host field has the same name in the lookup table and the events, so you don't need to do anything else

Consider the following search command:

sourcetype=tm1* | lookup dnslookup host | table host, ip

When you run this command, Splunk uses the lookup table to pass the values for the host field as a CSV file (the text CSV file we looked at earlier) into the external command script. The py script then outputs the results (with both the host and ip fields populated) and returns them to Splunk, which populates the ip field in a result table: Output of the py script with both the host and ip fields populated.

Time-based lookups
If your lookup table has a field value that represents time, you can use the time field to set up a Splunk fields lookup. As mentioned earlier, the Splunk transforms.conf file can be modified to add a lookup stanza.
For example, the following screenshot shows a file named MasteringDCHP.csv:   You can add the following code to the transforms.conf file: [MasteringDCHP]filename = MasteringDCHP.csvtime_field = TimeStamptime_format = %d/%m/%y %H:%M:%S $pmax_offset_secs = <integer>min_offset_secs = <integer> The file parameters are defined as follows: [MasteringDCHP]: This is the report stanza filename: This is the name of the CSV file to be used as the lookup table time_field: This is the field in the file that contains the time information and is to be used as the timestamp time_format: This indicates what format the time field is in max_offset_secs and min_offset_secs: This indicates min/max amount of offset time for an event to occur after a lookup entry Be careful with the preceding values; the offset relates to the timestamp in your lookup (CSV) file. Setting a tight (small) offset range might reduce the effectiveness of your lookup results! The last step will be to restart Splunk. An easier way to create a time-based lookup Again, it's a lot easier to use the Splunk Web interface to set up our lookup. Here is the step-by-step process: From Settings, select Lookups, and then Lookup table files: In the Lookup table files page, click on New, configure our lookup file, and then click on Save: You should receive the Successfully saved "MasterDHCP" in search message: Next, select Lookup definitions and from this page, click on New: In the Add new page, we define our lookup table with the following information: Destination app: For this, select search from the drop-down list Name: For this, enter MasterDHCP (this is the name you'll use in your lookup) Type: For this, select File-based (as this lookup table definition is a CSV file) Lookup file: For this, select the name of the file to be used from the drop-down list (ours is MasteringDCHP) Configure time-based lookup: Check this checkbox Name of time field: For this, enter TimeStamp (this is the field name in our file that contains the time information) Time format: For this, enter the string to describe to Splunk the format of our time field (our field uses this format: %d%m%y %H%M%S) You can leave the rest blank and click on Save. You should receive the Successfully saved "MasterDHCP" in search message: Now, we are ready to try our search: sourcetype=dh* | Lookup MasterDHCP IP as "IP" | table DHCPTimeStamp, IP, UserId | sort UserId The following screenshot shows the output:   Seeing double? Lookup table definitions are indicated with the attribute LOOKUP-<class> in the Splunk configuration file, props.conf, or in the web interface under Settings | Lookups | Lookup definitions. If you use the Splunk Web interface (which we've demonstrated throughout this article) to set up or define your lookup table definitions, Splunk will prevent you from creating duplicate table names, as shown in the following screenshot:   However, if you define your lookups using the configuration settings, it is important to try and keep your table definition names unique. If you do give the same name to multiple lookups, the following rules apply: If you have defined lookups with the same stanza (that is, using the same host, source, or source type), the first defined lookup in the configuration file wins and overrides all others. 
If lookups have different stanzas but overlapping events, the following logic is used by Splunk: Events that match the host get the host lookup Events that match the sourcetype get the sourcetype lookup Events that match both only get the host lookup It is a proven practice recommendation to make sure that all of your lookup stanzas have unique names. Command roundup This section lists several important Splunk commands you will use when working with lookups. The lookup command The Splunk lookup command is used to manually invoke field lookups using a Splunk lookup table that is previously defined. You can use Splunk Web (or the transforms.conf file) to define your lookups. If you do not specify OUTPUT or OUTPUTNEW, all fields in the lookup table (excluding the lookup match field) will be used by Splunk as output fields. Conversely, if OUTPUT is specified, the output lookup fields will overwrite existing fields and if OUTPUTNEW is specified, the lookup will not be performed for events in which the output fields already exist. For example, if you have a lookup table specified as iptousername with (at least) two fields, IP and UserId, for each event, Splunk will look up the value of the field IP in the table and for any entries that match, the value of the UserId field in the lookup table will be written to the field user_name in the event. The query is as follows: ... Lookup iptousernameIP as "IP" output UserId as user_name Always strive to perform lookups after any reporting commands in your search pipeline, so that the lookup only needs to match the results of the reporting command and not every individual event. The inputlookup and outputlookup commands The inputlookup command allows you to load search results from a specified static lookup table. It reads in a specified CSV filename (or a table name as specified by the stanza name in transforms.conf). If the append=t (that is, true) command is added, the data from the lookup file is appended to the current set of results (instead of replacing it). The outputlookup command then lets us write the results' events to a specified static lookup table (as long as this output lookup table is defined). So, here is an example of reading in the MasterDHCP lookup table (as specified in transforms.conf) and writing these event results to the lookup table definition NewMasterDHCP: | inputlookup MasterDHCP | outputlookup NewMasterDHCP After running the preceding command, we can see the following output:   Note that we can add the append=t command to the search in the following fashion: | inputlookup MasterDHCP.csv | inputlookup NewMasterDHCP.csv append=t | The inputcsv and outputcsv commands The inputcsv command is similar to the inputlookup command; in this, it loads search results, but this command loads from a specified CSV file. The filename must refer to a relative path in $SPLUNK_HOME/var/run/splunk and if the specified file does not exist and the filename did not have an extension, then a filename with a .csv extension is assumed. The outputcsv command lets us write our result events to a CSV file. 
Here is an example where we read in a CSV file named splunk_master.csv, search for the text phrase FPM, and then write any matching events to a CSV file named FPMBU.csv: | inputcsv splunk_master.csv | search "Business Unit Name"="FPM" | outputcsv FPMBU.csv The following screenshot shows the results from the preceding search command:   The following screenshot shows the resulting file generated as a result of the preceding command:   Here is another example where we read in the same CSV file (splunk_master.csv) and write out only events from 51 to 500: | inputcsv splunk_master start=50 max=500 Events are numbered starting with zero as the first entry (rather than 1). Summary In this article, we defined Splunk lookups and discussed their value. We also went through the two types of lookups, static and dynamic, and saw detailed, working examples of each. Various Splunk commands typically used with the lookup functionality were also presented. Resources for Article: Further resources on this subject: Working with Apps in Splunk [article] Processing Tweets with Apache Hive [article] Indexes [article]

Adding Graded Activities

Packt
16 Dec 2014
9 min read
This article by Rebecca Barrington, author of Moodle Gradebook Second Edition, teaches you how to add assignments and set up how they will be graded, including how to use our custom scales and add outcomes for grading. (For more resources related to this topic, see here.) As with all content within Moodle, we need to select Turn editing on within the course in order to be able to add resources and activities. All graded activities are added through the Add an activity or resource text available within each section of within a Moodle course. This text can be found in the bottom right of each section after editing has been turned on. There are a number of items that can be graded and will appear within the Gradebook. Assignments are the most feature-rich of all the graded activities and have many options available in order to customize how assessments can be graded. They can be used to provide assessment information for students, store grades, and provide feedback. When setting up the assignment, we can choose for students to submit their work electronically—either through file submission or online text, or we can review the assessment offline and use only the grade and feedback features of the assignment. Adding assignments There are many options *within the assignments, and throughout this article we will set up a number of different assignments and you'll learn about some of their most useful features and options. Let's have a go at creating a range of assignments that are ready for grading. Creating an assignment with a scale The first assignment that we will add will *make use of the PMD scale Click on the Turn editing on button. Click on Add an activity or resource. Click on Assignment and then click on Add. In the Assignment name box, type in the name of the assignment (such as Task 1). In the Description box, provide some assignment details. In the Availability section, we need to disable the date options. We will not make use of these options, but they can be very useful. To disable the options, click on the tick next to the Enable text. However, details of these options have *been provided for future* reference. The Allow submissions from section* is mostly relevant when the assignment will be submitted electronically, as students won't be able to submit their work until the date and time indicated here. The Due date section* can be used to indicate when the assignment needs to be submitted by. If students electronically submit their assignment after the date and time indicated here, the submission date and time will be shown in red in order to notify the teacher that it was submitted past the due date. The Cut off date section* enables teachers to set an extension period after the due date where late submissions will continue to be accepted. In the* Submission types section, ensure *that the File submissions checkbox is enabled by adding a tick there. This will enable students to submit their assignment electronically. There are additional options that we can choose as well. With Maximum number of uploaded files, we can indicate how many files a student can upload. Keep this as 1. We can also determine the Maximum submission size option for each file using the drop-down list shown in the following screenshot: Within the Feedback types section, ensure that all options under the Feedback types *section are *selected. Feedback comments enables *us to provide written feedback along with the grade. Feedback files enables us *to upload a file in order to provide feedback to a student. 
Offline grading worksheet will *provide us with the option to download a .csv file that contains core information about the assignment, and this can be used to add grades and feedback while working offline. This completed .csv file can be uploaded and the grades will be added to the assignments within the Gradebook. In the Submission settings section, we have options related to how students will submit their assignment and how they will reattempt submission if required. If Require students click submit button is left as No, students will upload* their assignment* and it will be available *to the teacher for grading. If this option is changed to Yes, students can upload their assignment, but the teacher will see that it is in the draft form. Students will click on Submit to indicate that it is ready to be graded. Require that students accept the submission statement will provide students *with a statement that they need to agree to when they submit their assignment. The default statement is This assignment is my own work, except where I have acknowledged the use of works of other people. The submission statement can be changed by a site administrator by navigating to Site administration | plugins | Activity modules | Assignment settings. The Attempts reopened drop-down list* provides options for the status of the assignment after it has been graded. Students will only be able to resubmit their work when it is open. Therefore this setting will control when and if students are able to submit another version of their assignment. The options available to us are:Never: This option should be selected if students will not be able to submit another piece of work.Manually: This will enable anyone who has the role of a teacher to choose to reopen a submission that enables a student to submit their work again.Automatically until pass: This option works when a pass grade is set within the Gradebook. After grading, if the student is awarded the minimum pass *grade or higher, the submission *will remain closed in order to prevent any changes to the submission. However, if the assignment is graded lower than the assigned pass grade, the submission will automatically reopen in order to enable the student to submit the assignment again.Maximum attempts: The maximum *attempts allowed for this assignment will limit the number of times an assignment is reopened. For example, if this option is set to 3, then a student will only be able to submit their assignment three times. After they have submitted their assignment for a third time, they will not be allowed to submit it again. The default is unlimited, but it can be changed by clicking on the drop-down list. In the Submission settings section, ensure that the options for Require students click on submit button and Require that students accept the submission statement are set to Yes. Also, change the Attempts reopened to Automatically until passed. Within the Grade section, navigate to Grade | Type | Scale and choose the PMD scale. Select Use marking workflow by changing the drop-down list to Yes.Use marking workflow is a new feature of Moodle 2.6* that enables *the grading process to go through a range of stages in order to indicate that the marking is in progress or is complete, is being reviewed, or is ready for release to students. Click on* Save and return to course. Creating an online assignment with a number grade The next *assignment that we will create will have an online* text option that will have a maximum grade of 20. 
The following steps show you how to create an online assignment with a number grade: Enable editing by clicking on Turn editing on. Click on Add an activity or resource. Click on Assignment and then click on Add. In the Assignment name box, type in the name of the assignment (such as Task 2). In the Description box, provide the assignment details. In the Submission types section, ensure that Online text has a tick next to it. This will enable students to type directly into Moodle. When choosing this option, we can also set a maximum word limit by clicking on the tick box next to the Enable text. After enabling this option, we can add a number to the textbox. For this assignment, enable a word limit of 200 words. When using* online text* submission, we have an additional feedback option within the Feedback types section. Under the Comment inline text, click on No and switch to Yes to enable yourself to add written feedback for students within the written text submitted by students. In the Submission settings section, ensure that the options for Require students click submit button and Require that students accept the submission statement are set to Yes. Also, change Attempts reopened to Automatically until passed. Within the Grades section, navigate to Grade | Type | Point and ensure that Maximum points is set to 20. Click *on* Save and return to course. Creating an assignment including outcomes The next assignment that we will *create will add some of the Outcomes: Enable editing by clicking on Turn editing on. Click on Add an activity or resource. Click on Assignment and then click on Add. In the Assignment name box, type in the name of the assignment (such as Task 3). In the Description box, provide the assignment details. In the Submission types box, ensure that Online text and File submissions are selected. Set Maximum number of uploaded files to 2. In the Submission settings section, ensure that the options for Require students to click submit button and Require that students accept the submission statement are amended to Yes. Change Attempts reopened to Manually. Within the Grades section, navigate to Grade | Type | Point and Maximum points is set to 100. In the Outcomes *section, choose the outcomes as Evidence provided and Criteria 1 met. Scroll to the bottom of the screen and click on Save and return to course. Summary In this article, we added a range of assignments that made use of number and scale grades as well as added outcomes to an assignment. Resources for Article: Further resources on this subject: Moodle for Online Communities [article] What's New in Moodle 2.0 [article] Moodle 2.0: What's New in Add a Resource [article]

Ridge Regression

Packt
16 Dec 2014
9 min read
In this article by Patrick R. Nicolas, the author of the book Scala for Machine Learning, we will cover the basics of ridge regression. The purpose of regression is to minimize a loss function, the residual sum of squares (RSS) being the one commonly used. The problem of overfitting can be addressed by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization. (For more resources related to this topic, see here.) Ln roughness penalty Regularization consists of adding a penalty function J(w) to the loss function (or RSS in the case of a regressive classifier) in order to prevent the model parameters (or weights) from reaching high values. A model that fits a training set very well tends to have many features variable with relatively large weights. This process is known as shrinkage. Practically, shrinkage consists of adding a function with model parameters as an argument to the loss function: The penalty function is completely independent from the training set {x,y}. The penalty term is usually expressed as a power to function of the norm of the model parameters (or weights) wd. For a model of D dimension the generic Lp-norm is defined as follows: Notation Regularization applies to parameters or weights associated to an observation. In order to be consistent with our notation w0 being the intercept value, the regularization applies to the parameters w1 …wd. The two most commonly used penalty functions for regularization are L1 and L2. Regularization in machine learning The regularization technique is not specific to the linear or logistic regression. Any algorithm that minimizes the residual sum of squares, such as support vector machine or feed-forward neural network, can be regularized by adding a roughness penalty function to the RSS. The L1 regularization applied to the linear regression is known as the Lasso regularization. The Ridge regression is a linear regression that uses the L2 regularization penalty. You may wonder which regularization makes sense for a given training set. In a nutshell, L2 and L1 regularizations differ in terms of computation efficiency, estimation, and features selection (refer to the 13.3 L1 regularization: basics section in the book Machine Learning: A Probabilistic Perspective, and the Feature selection, L1 vs. L2 regularization, and rotational invariance paper available at http://www.machinelearning.org/proceedings/icml2004/papers/354.pdf). The various differences between the two regularizations are as follows: Model estimation: L1 generates a sparser estimation of the regression parameters than L2. For large non-sparse dataset, L2 has a smaller estimation error than L1. Feature selection: L1 is more effective in reducing the regression weights for features with high value than L2. Therefore, L1 is a reliable features selection tool. Overfitting: Both L1 and L2 reduce the impact of overfitting. However, L1 has a significant advantage in overcoming overfitting (or excessive complexity of a model) for the same reason it is more appropriate for selecting features. Computation: L2 is conducive to a more efficient computation model. The summation of the loss function and L2 penalty w2 is a continuous and differentiable function for which the first and second derivative can be computed (convex minimization). The L1 term is the summation of |wi|, and therefore, not differentiable. Terminology The ridge regression is sometimes called the penalized least squares regression. 
The L2 regularization is also known as the weight decay. Let's implement the ridge regression, and then evaluate the impact of the L2-norm penalty factor. Ridge regression The ridge regression is a multivariate linear regression with a L2 norm penalty term, and can be calculated as follows: The computation of the ridge regression parameters requires the resolution of the system of linear equations similar to the linear regression. Matrix representation of ridge regression closed form is as follows: I is the identity matrix and it is using the QR decomposition, as shown here: Implementation The implementation of the ridge regression adds L2 regularization term to the multiple linear regression computation of the Apache Commons Math library. The methods of RidgeRegression have the same signature as its ordinary least squares counterpart. However, the class has to inherit the abstract base class AbstractMultipleLinearRegression in the Apache Commons Math and override the generation of the QR decomposition to include the penalty term, as shown in the following code: class RidgeRegression[T <% Double](val xt: XTSeries[Array[T]],                                    val y: DblVector,                                   val lambda: Double) {                   extends AbstractMultipleLinearRegression                    with PipeOperator[Array[T], Double] {    private var qr: QRDecomposition = null    private[this] val model: Option[RegressionModel] = …    … } Besides the input time series xt and the labels y, the ridge regression requires the lambda factor of the L2 penalty term. The instantiation of the class train the model. The steps to create the ridge regression models are as follows: Extract the Q and R matrices for the input values, newXSampleData (line 1) Compute the weights using the calculateBeta defined in the base class (line 2) Return the tuple regression weights calculateBeta and the residuals calculateResiduals private val model: Option[(DblVector, Double)] = { this.newXSampleData(xt.toDblMatrix) //1 newYSampleData(y) val _rss = calculateResiduals.toArray.map(x => x*x).sum val wRss = (calculateBeta.toArray, _rss) //2 Some(RegressionModel(wRss._1, wRss._2)) } The QR decomposition in the AbstractMultipleLinearRegression base class does not include the penalty term (line 3); the identity matrix with lambda factor in the diagonal has to be added to the matrix to be decomposed (line 4). override protected def newXSampleData(x: DblMatrix): Unit = { super.newXSampleData(x)   //3 val xtx: RealMatrix = getX val nFeatures = xt(0).size Range(0, nFeatures).foreach(i => xtx.setEntry(i,i,xtx.getEntry(i,i) + lambda)) //4 qr = new QRDecomposition(xtx) } The regression weights are computed by resolving the system of linear equations using substitution on the QR matrices. It overrides the calculateBeta function from the base class: override protected def calculateBeta: RealVector = qr.getSolver().solve(getY()) Test case The objective of the test case is to identify the impact of the L2 penalization on the RSS value, and then compare the predicted values with original values. Let's consider the first test case related to the regression on the daily price variation of the Copper ETF (symbol: CU) using the stock daily volatility and volume as feature. 
The implementation of the extraction of observations is identical as with the least squares regression: val src = DataSource(path, true, true, 1) val price = src |> YahooFinancials.adjClose val volatility = src |> YahooFinancials.volatility val volume = src |> YahooFinancials.volume //1   val _price = price.get.toArray val deltaPrice = XTSeries[Double](_price                                .drop(1)                                .zip(_price.take(_price.size -1))                                .map( z => z._1 - z._2)) //2 val data = volatility.get                      .zip(volume.get)                      .map(z => Array[Double](z._1, z._2)) //3 val features = XTSeries[DblVector](data.take(data.size-1)) val regression = new RidgeRegression[Double](features, deltaPrice, lambda) //4   regression.rss match { case Some(rss) => Display.show(rss, logger) …. The observed data, ETF daily price, and the features (volatility and volume) are extracted from the source src (line 1). The daily price change, deltaPrice, is computed using a combination of Scala take and drop methods (line 2). The features vector is created by zipping volatility and volume (line 3). The model is created by instantiating the RidgeRegression class (line 4). The RSS value, rss, is finally displayed (line 5). The RSS value, rss, is plotted for different values of lambda <= 1.0 in the following graph: Graph of RSS versus Lambda for Copper ETF The residual sum of squares decreased as λ increases. The curve seems to be reaching for a minimum around λ=1. The case of λ = 0 corresponds to the least squares regression. Next, let's plot the RSS value for λ varying between 1 and 100: Graph RSS versus large value Lambda for Copper ETF This time around RSS increases with λ before reaching a maximum for λ > 60. This behavior is consistent with other findings (refer to Lecture 5: Model selection and assessment, a lecture by H. Bravo and R. Irizarry from department of Computer Science, University of Maryland, in 2010, available at http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/selection.pdf). As λ increases, the overfitting gets more expensive, and therefore, the RSS value increases. The regression weights can by simply outputted as follows: regression.weights.get Let's plot the predicted price variation of the Copper ETF using the ridge regression with different value of lambda (λ): Graph of ridge regression on Copper ETF price variation with variable Lambda The original price variation of the Copper ETF Δ = price(t+1)-price(t) is plotted as λ =0. The predicted values for λ = 0.8 is very similar to the original data. The predicted values for λ = 0.8 follows the pattern of the original data with reduction of large variations (peaks and troves). The predicted values for λ = 5 corresponds to a smoothed dataset. The pattern of the original data is preserved but the magnitude of the price variation is significantly reduced. The reader is invited to apply the more elaborate K-fold validation routine and compute precision, recall, and F1 measure to confirm the findings. Summary The ridge regression is a powerful alternative to the more common least squares regression because it reduces the risk of overfitting. Contrary to the Naïve Bayes classifiers, it does not require conditional independence of the model features. Resources for Article: Further resources on this subject: Differences in style between Java and Scala code [Article] Dependency Management in SBT [Article] Introduction to MapReduce [Article]

About MongoDB

Packt
27 Nov 2014
17 min read
In this article by Amol Nayak, the author of MongoDB Cookbook, describes the various features of MongoDB. (For more resources related to this topic, see here.) MongoDB is a document-oriented database and is the most popular and favorite NoSQL database. The rankings given at http://db-engines.com/en/ranking shows us that MongoDB is sitting on the fifth rank overall as of August 2014 and is the first NoSQL product in this list. It is currently being used in production by a huge list of companies in various domains handling terabytes of data efficiently. MongoDB is developed to scale horizontally and cope up with the increasing data volumes. It is very simple to use and get started with, backed by a good support from its company MongoDB and has a vast array open source and proprietary tools build around it to improve developer and administrator's productivity. In this article, we will cover the following recipes: Single node installation of MongoDB with options from the config file Viewing database stats Creating an index and viewing plans of queries Single node installation of MongoDB with options from the config file As we're aware that providing options from the command line does the work, but it starts getting awkward as soon as the number of options we provide increases. We have a nice and clean alternative to providing the startup options from a configuration file rather than as command-line arguments. Getting ready Well, assuming that we have downloaded the MongoDB binaries from the download site, extracted it, and have the bin directory of MongoDB in the operating system's path variable (this is not mandatory but it really becomes convenient after doing it), the binaries can be downloaded from http://www.mongodb.org/downloads after selecting your host operating system. How to do it… The /data/mongo/db directory for the database and /logs/ for the logs should be created and present on your filesystem, with the appropriate permissions to write to it. Let's take a look at the steps in detail: Create a config file, which can have any arbitrary name. In our case, let's say we create the file at /conf/mongo.conf. We will then edit the file and add the following lines of code to it: port = 27000 dbpath = /data/mongo/db logpath = /logs/mongo.log smallfiles = true Start the Mongo server using the following command: > mongod --config /config/mongo.conf How it works… The properties are specified as <property name> = <value>. For all those properties that don't have values, for example, the smallfiles option, the value given is a Boolean value, true. If you need to have a verbose output, you will add v=true (or multiple v's to make it more verbose) to our config file. If you already know what the command-line option is, it is pretty easy to guess the value of the property in the file. It is the same as the command-line option, with just the hyphen removed. Viewing database stats In this recipe, we will see how to get the statistics of a database. Getting ready To find the stats of the database, we need to have a server up and running, and a single node is what should be ok. The data on which we would be operating needs to be imported into the database. Once these steps are completed, we are all set to go ahead with this recipe. How to do it… We will be using the test database for the purpose of this recipe. It already has the postalCodes collection in it. 
Let's take a look at the steps in detail:

Connect to the server using the Mongo shell by typing in the following command from the operating system terminal (it is assumed that the server is listening to port 27017):

$ mongo

On the shell, execute the following command and observe the output:

> db.stats()

Now, execute the following command, but this time with the scale parameter (observe the output):

> db.stats(1024)
{
"db" : "test",
"collections" : 3,
"objects" : 39738,
"avgObjSize" : 143.32699179626553,
"dataSize" : 5562,
"storageSize" : 16388,
"numExtents" : 8,
"indexes" : 2,
"indexSize" : 2243,
"fileSize" : 196608,
"nsSizeMB" : 16,
"dataFileVersion" : {
   "major" : 4,
   "minor" : 5
},
"ok" : 1
}

How it works…

Let us start by looking at the collections field. If you look carefully at the number and also execute the show collections command on the Mongo shell, you will find one extra collection in the stats as compared to what the command lists. The difference is one collection, which is hidden, and its name is system.namespaces. You may execute db.system.namespaces.find() to view its contents.

Getting back to the output of the stats operation on the database, the objects field in the result has an interesting value too. If we find the count of documents in the postalCodes collection, we see that it is 39732. The count shown here is 39738, which means there are six more documents. These six documents come from the system.namespaces and system.indexes collections. Executing a count query on these two collections will confirm it. Note that the test database doesn't contain any other collection apart from postalCodes. The figures will change if the database contains more collections with documents in them.

The scale parameter, which is a parameter to the stats function, divides the number of bytes by the given scale value. In this case, it is 1024, and hence, all the values will be in KB. Let's analyze the output:

> db.stats(1024)
{
"db" : "test",
"collections" : 3,
"objects" : 39738,
"avgObjSize" : 143.32699179626553,
"dataSize" : 5562,
"storageSize" : 16388,
"numExtents" : 8,
"indexes" : 2,
"indexSize" : 2243,
"fileSize" : 196608,
"nsSizeMB" : 16,
"dataFileVersion" : {
   "major" : 4,
   "minor" : 5
},
"ok" : 1
}

The following list shows the meaning of the important fields:

db: This is the name of the database whose stats are being viewed.
collections: This is the total number of collections in the database.
objects: This is the count of documents across all collections in the database. If we find the stats of a collection by executing db.<collection>.stats(), we get the count of documents in that collection. This attribute is the sum of the counts of all the collections in the database.
avgObjSize: This is simply the size (in bytes) of all the objects in all the collections in the database, divided by the count of the documents across all the collections. This value is not affected by the scale provided even though this is a size field.
dataSize: This is the total size of the data held across all the collections in the database. This value is affected by the scale provided.
storageSize: This is the total amount of storage allocated to collections in this database for storing documents. This value is affected by the scale provided.
numExtents: This is the count of the number of extents in the database across all the collections. This is basically the sum of numExtents in the collection stats for collections in this database.
indexes: This is the sum of the number of indexes across all collections in the database.
indexSize: This is the size (in bytes) of all the indexes of all the collections in the database. This value is affected by the scale provided.
fileSize: This is simply the sum of the sizes of all the database files you should find on the filesystem for this database. The files will be named test.0, test.1, and so on for the test database. This value is affected by the scale provided.
nsSizeMB: This is the size, in MB, of the .ns file of the database.

Another thing to note is the value of avgObjSize; there is something odd about it. Unlike the same field in a collection's stats, which is affected by the scale provided, in the database stats this value is always in bytes. This is pretty confusing, and one cannot really be sure why it is not scaled according to the provided scale.

Creating an index and viewing plans of queries

In this recipe, we will look at querying data, analyzing its performance by explaining the query plan, and then optimizing it by creating indexes.

Getting ready

For the creation of indexes, we need to have a server up and running. A simple single node is what we will need. The data with which we will be operating needs to be imported into the database. Once we have this prerequisite, we are good to go.

How to do it…

We will try to write a query that will find all the zip codes in a given state. To do this, perform the following steps:

Execute the following query to view the plan of a query:

> db.postalCodes.find({state:'Maharashtra'}).explain()

Take a note of the cursor, n, nscannedObjects, and millis fields in the result of the explain plan operation.

Let's execute the same query again, but this time, we will limit the results to only 100 results:

> db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()

Again, take a note of the cursor, n, nscannedObjects, and millis fields in the result.

We will now create an index on the state and pincode fields as follows:

> db.postalCodes.ensureIndex({state:1, pincode:1})

Execute the following query:

> db.postalCodes.find({state:'Maharashtra'}).explain()

Again, take a note of the cursor, n, nscannedObjects, millis, and indexOnly fields in the result.

Since we want only the pin codes, we will modify the query as follows and view its plan:

> db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()

Take a note of the cursor, n, nscannedObjects, nscanned, millis, and indexOnly fields in the result.

How it works…

There is a lot to explain here. We will first discuss what we just did and how to analyze the stats. Next, we will discuss some points to be kept in mind for index creation and some gotchas.

Analysis of the plan

Let's look at the first step and analyze the output of the query we executed:

> db.postalCodes.find({state:'Maharashtra'}).explain()

The output on my machine is as follows (I am skipping the nonrelevant fields for now):

{
"cursor" : "BasicCursor",
"n" : 6446,
"nscannedObjects" : 39732,
"nscanned" : 39732,
…
"millis" : 55,
…
}

The value of the cursor field in the result is BasicCursor, which means a full collection scan (all the documents are scanned one after another) happened in order to search for the matching documents in the entire collection. The value of n is 6446, which is the number of results that matched the query.
The nscanned and nscannedObjects fields have a value of 39,732, which is the number of documents in the collection that were scanned to retrieve the results. This is also the total number of documents present in the collection, and all of them were scanned for the result. Finally, millis is the number of milliseconds taken to retrieve the result.

Improving the query execution time

So far, the query doesn't look too good in terms of performance, and there is great scope for improvement. To demonstrate how the limit applied to the query affects the query plan, we can find the query plan again without the index but with the limit clause:

> db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()
{
"cursor" : "BasicCursor",
…
"n" : 100,
"nscannedObjects" : 19951,
"nscanned" : 19951,
…
"millis" : 30,
…
}

The query plan this time around is interesting. Though we still haven't created an index, we see an improvement in the time the query took for execution and in the number of objects scanned to retrieve the results. This is due to the fact that Mongo does not scan the remaining documents once the number of documents specified in the limit function is reached. We can thus conclude that it is recommended that you use the limit function to limit the number of results wherever the maximum number of documents to be accessed is known upfront. This might give better query performance. The word "might" is important, as in the absence of an index, the collection might still be completely scanned if the number of matches is not met.

Improvement using indexes

Moving on, we will create a compound index on state and pincode. The order of the index is ascending in this case (as the value is 1) and is not significant unless we plan to execute a multikey sort. This is a deciding factor as to whether the result can be sorted using only the index or whether Mongo needs to sort it in memory later on, before we return the results. As far as the plan of the query is concerned, we can see that there is a significant improvement:

{
"cursor" : "BtreeCursor state_1_pincode_1",
…
"n" : 6446,
"nscannedObjects" : 6446,
"nscanned" : 6446,
…
"indexOnly" : false,
…
"millis" : 16,
…
}

The cursor field now has the value BtreeCursor state_1_pincode_1, which shows that the index is indeed used now. As expected, the number of results stays the same at 6446. The number of entries scanned in the index and the number of documents scanned in the collection have now come down to the same number of documents as in the result. This is because we used an index that gave us the starting document from which to scan, and then only the required number of documents were scanned. This is similar to using a book's index to find a word rather than scanning the entire book to search for it. The time, millis, has come down too, as expected.

Improvement using covered indexes

This leaves us with one field, indexOnly, and we will see what this means. To know what this value is, we need to look briefly at how indexes operate. Indexes store a subset of fields of the original document in the collection. The fields present in the index are the same as those on which the index is created. The fields, however, are kept sorted in the index in an order specified during the creation of the index. Apart from the fields, there is an additional value stored in the index; this acts as a pointer to the original document in the collection.
Thus, whenever the user executes a query, if the query contains fields on which an index is present, the index is consulted to get a set of matches. The pointer stored with the index entries that match the query is then used to make another IO operation to fetch the complete document from the collection; this document is then returned to the user.

The value of indexOnly, which is false, indicates that the data requested by the user in the query is not entirely present in the index, and an additional IO operation is needed to retrieve the entire document from the collection by following the pointer from the index. Had the value been present in the index itself, an additional operation to retrieve the document from the collection would not be necessary, and the data from the index would be returned. This is called a covered index, and the value of indexOnly, in this case, will be true.

In our case, we just need the pin codes, so why not use projection in our queries to retrieve just what we need? This will also make the query covered by the index, as the index entry has just the state's name and the pin code, and the required data can be served completely without retrieving the original document from the collection. The plan of the query in this case is interesting too. Executing the following query results in the following plan:

db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()
{
"cursor" : "BtreeCursor state_1_pincode_1",
…
"n" : 6446,
"nscannedObjects" : 0,
"nscanned" : 6446,
…
"indexOnly" : true,
…
"millis" : 15,
…
}

The values of the nscannedObjects and indexOnly fields are something to be observed. As expected, since the data we requested in the projection in the find query is the pin code only, which can be served from the index alone, the value of indexOnly is true. In this case, we scanned 6,446 entries in the index, and thus, the nscanned value is 6446. We, however, didn't reach out to any document in the collection on disk, as this query was covered by the index alone, and no additional IO was needed to retrieve the entire document. Hence, the value of nscannedObjects is 0.

As this collection in our case is small, we do not see a significant difference in the execution time of the query. This will be more evident on larger collections. Making use of indexes is great and gives good performance. Making use of covered indexes gives even better performance. Another thing to remember is that, wherever possible, try and use projection to retrieve only the fields we need. The _id field is retrieved every time by default; unless we plan to use it, set _id:0 to not retrieve it if it is not part of the index. Executing a covered query is the most efficient way to query a collection.

Some gotchas of index creation

We will now see some pitfalls in index creation and some facts about how array fields are handled in an index. Some of the operators that do not use the index efficiently are the $where, $nin, and $exists operators. Whenever these operators are used in a query, one should bear in mind a possible performance bottleneck when the data size increases. Similarly, the $in operator must be preferred over the $or operator, as both can more or less be used to achieve the same result. As an exercise, try to find the pin codes in the states of Maharashtra and Gujarat from the postalCodes collection. Write two queries: one using the $or operator and the other using the $in operator. Explain the plan for both these queries.
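One possible shape for these two queries is sketched below; these exact statements are not from the recipe, but they use the same postalCodes collection, state field, and explain() call seen above, and the plans they produce will depend on your data and on the state_1_pincode_1 index created earlier:

> db.postalCodes.find({$or: [{state:'Maharashtra'}, {state:'Gujarat'}]}).explain()
> db.postalCodes.find({state: {$in: ['Maharashtra', 'Gujarat']}}).explain()

Comparing the cursor, nscanned, and millis fields of the two plans is a good way to see how each operator interacts with the index.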
What happens when an array field is used in the index? Mongo creates an index entry for each element present in the array field of a document. So, if there are 10 elements in an array in a document, there will be 10 index entries, one for each element in the array. However, there is a constraint while creating indexes that contain array fields. When creating indexes using multiple fields, not more than one field can be of the array type. This is done to prevent a possible explosion in the number of index entries on adding even a single element to an array used in the index. If we think about it carefully, an index entry is created for each element in the array. If multiple fields of the array type were allowed to be part of an index, we would have a large number of entries in the index, which would be a product of the lengths of these array fields. For example, a document added with two array fields, each of length 10, would add 100 entries to the index, had it been allowed to create one index using these two array fields. This should be good enough for now to scratch the surface of plain vanilla indexes.

Summary

This article provides detailed recipes that describe how to use the different features of MongoDB. MongoDB is a document-oriented, leading NoSQL database, which offers linear scalability, thus making it a good contender for high-volume, high-performance systems across all business domains. It has an edge over the majority of NoSQL solutions for its ease of use, high performance, and rich features. In this article, we learned how to perform a single node installation of MongoDB with options from the config file, how to view database stats, and how to create an index from the shell and view the plans of queries.

Resources for Article:

Further resources on this subject:
Ruby with MongoDB for Web Development [Article]
MongoDB data modeling [Article]
Using Mongoid [Article]


Logistic regression

Packt
27 Nov 2014
9 min read
This article is written by Breck Baldwin and Krishna Dayanidhi, the authors of Natural Language Processing with Java and LingPipe Cookbook. In this article, we will cover logistic regression. (For more resources related to this topic, see here.)

Logistic regression is probably responsible for the majority of industrial classifiers, with the possible exception of naïve Bayes classifiers. It is almost certainly one of the best performing classifiers available, albeit at the cost of slow training and considerable complexity in configuration and tuning. Logistic regression is also known as maximum entropy, neural network classification with a single neuron, and by other names. The classifiers covered so far have been based on the underlying characters or tokens, but logistic regression uses unrestricted feature extraction, which allows arbitrary observations of the situation to be encoded in the classifier. This article closely follows a more complete tutorial at http://alias-i.com/lingpipe/demos/tutorial/logistic-regression/read-me.html.

How logistic regression works

All that logistic regression does is take a vector of feature weights over the data, apply a vector of coefficients, and do some simple math, which results in a probability for each class encountered in training. The complicated bit is in determining what the coefficients should be.

The following are some of the features produced by our training example for 21 tweets annotated for English e and non-English n. There are relatively few features because feature weights are being pushed to 0.0 by our prior, and once a weight is 0.0, the feature is removed. Note that one category, n, is set to 0.0 for all the features; this is a property of the logistic regression process, which fixes one category's features to 0.0 and adjusts all other categories' features with respect to that:

FEATURE    e       n
I        : 0.37    0.0
!        : 0.30    0.0
Disney   : 0.15    0.0
"        : 0.08    0.0
to       : 0.07    0.0
anymore  : 0.06    0.0
isn      : 0.06    0.0
'        : 0.06    0.0
t        : 0.04    0.0
for      : 0.03    0.0
que      : -0.01   0.0
moi      : -0.01   0.0
_        : -0.02   0.0
,        : -0.08   0.0
pra      : -0.09   0.0
?        : -0.09   0.0

Take the string, I luv Disney, which will only have two non-zero features: I=0.37 and Disney=0.15 for e, and zeros for n. Since there is no feature that matches luv, it is ignored. The probability that the tweet is English breaks down to:

vectorMultiply(e,[I,Disney]) = exp(.37*1 + .15*1) = 1.68
vectorMultiply(n,[I,Disney]) = exp(0*1 + 0*1) = 1

We will rescale to a probability by summing the outcomes and dividing by that sum:

p(e|[I,Disney]) = 1.68/(1.68 + 1) = 0.62
p(n|[I,Disney]) = 1/(1.68 + 1) = 0.38

This is how the math works when running a logistic regression model. Training is another issue entirely.

Getting ready

This example assumes the same framework that we have been using all along to get training data from .csv files, train the classifier, and run it from the command line. Setting up to train the classifier is a bit complex because of the number of parameters and objects used in training. The main() method starts with what should be familiar classes and methods:

public static void main(String[] args) throws IOException {
String trainingFile = args.length > 0 ?
args[0]: "data/disney_e_n.csv";List<String[]> training= Util.readAnnotatedCsvRemoveHeader(new File(trainingFile));int numFolds = 0;XValidatingObjectCorpus<Classified<CharSequence>> corpus= Util.loadXValCorpus(training,numFolds);TokenizerFactory tokenizerFactory= IndoEuropeanTokenizerFactory.INSTANCE; Note that we are using XValidatingObjectCorpus when a simpler implementation such as ListCorpus will do. We will not take advantage of any of its cross-validation features, because the numFolds param as 0 will have training visit the entire corpus. We are trying to keep the number of novel classes to a minimum, and we tend to always use this implementation in real-world gigs anyway. Now, we will start to build the configuration for our classifier. The FeatureExtractor<E> interface provides a mapping from data to features; this will be used to train and run the classifier. In this case, we are using a TokenFeatureExtractor() method, which creates features based on the tokens found by the tokenizer supplied during construction. This is similar to what naïve Bayes reasons over: FeatureExtractor<CharSequence> featureExtractor= new TokenFeatureExtractor(tokenizerFactory); The minFeatureCount item is usually set to a number higher than 1, but with small training sets, this is needed to get any performance. The thought behind filtering feature counts is that logistic regression tends to overfit low-count features that, just by chance, exist in one category of training data. As training data grows, the minFeatureCount value is adjusted usually by paying attention to cross-validation performance: int minFeatureCount = 1; The addInterceptFeature Boolean controls whether a category feature exists that models the prevalence of the category in training. The default name of the intercept feature is *&^INTERCEPT%$^&**, and you will see it in the weight vector output if it is being used. By convention, the intercept feature is set to 1.0 for all inputs. The idea is that if a category is just very common or very rare, there should be a feature that captures just this fact, independent of other features that might not be as cleanly distributed. This models the category probability in naïve Bayes in some way, but the logistic regression algorithm will decide how useful it is as it does with all other features: boolean addInterceptFeature = true;boolean noninformativeIntercept = true; These Booleans control what happens to the intercept feature if it is used. Priors, in the following code, are typically not applied to the intercept feature; this is the result if this parameter is true. Set the Boolean to false, and the prior will be applied to the intercept. Next is the RegressionPrior instance, which controls how the model is fit. What you need to know is that priors help prevent logistic regression from overfitting the data by pushing coefficients towards 0. There is a non-informative prior that does not do this with the consequence that if there is a feature that applies to just one category it will be scaled to infinity, because the model keeps fitting better as the coefficient is increased in the numeric estimation. Priors, in this context, function as a way to not be over confident in observations about the world. Another dimension in the RegressionPrior instance is the expected variance of the features. Low variance will push coefficients to zero more aggressively. The prior returned by the static laplace() method tends to work well for NLP problems. 
There is a lot going on, but it can be managed without a deep theoretical understanding. double priorVariance = 2;RegressionPrior prior= RegressionPrior.laplace(priorVariance,noninformativeIntercept); Next, we will control how the algorithm searches for an answer. AnnealingSchedule annealingSchedule= AnnealingSchedule.exponential(0.00025,0.999);double minImprovement = 0.000000001;int minEpochs = 100;int maxEpochs = 2000; AnnealingSchedule is best understood by consulting the Javadoc, but what it does is change how much the coefficients are allowed to vary when fitting the model. The minImprovement parameter sets the amount the model fit has to improve to not terminate the search, because the algorithm has converged. The minEpochs parameter sets a minimal number of iterations, and maxEpochs sets an upper limit if the search does not converge as determined by minImprovement. Next is some code that allows for basic reporting/logging. LogLevel.INFO will report a great deal of information about the progress of the classifier as it tries to converge: PrintWriter progressWriter = new PrintWriter(System.out,true);progressWriter.println("Reading data.");Reporter reporter = Reporters.writer(progressWriter);reporter.setLevel(LogLevel.INFO); Here ends the Getting ready section of one of our most complex classes—next, we will train and run the classifier. How to do it... It has been a bit of work setting up to train and run this class. We will just go through the steps to get it up and running: Note that there is a more complex 14-argument train method as well the one that extends configurability. This is the 10-argument version: LogisticRegressionClassifier<CharSequence> classifier= LogisticRegressionClassifier.<CharSequence>train(corpus,featureExtractor,minFeatureCount,addInterceptFeature,prior,annealingSchedule,minImprovement,minEpochs,maxEpochs,reporter); The train() method, depending on the LogLevel constant, will produce from nothing with LogLevel.NONE to the prodigious output with LogLevel.ALL. While we are not going to use it, we show how to serialize the trained model to disk: AbstractExternalizable.compileTo(classifier, new File("models/myModel.LogisticRegression")); Once trained, we will apply the standard classification loop with: Util.consoleInputPrintClassification(classifier); Run the preceding code in the IDE of your choice or use the command-line command: java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar:lib/opencsv-2.4.jar com.lingpipe.cookbook.chapter3.TrainAndRunLogReg The result is a big dump of information about the training: Reading data.:00 Feature Extractor class=class com.aliasi.tokenizer.TokenFeatureExtractor:00 min feature count=1:00 Extracting Training Data:00 Cold start:00 Regression callback handler=null:00 Logistic Regression Estimation:00 Monitoring convergence=true:00 Number of dimensions=233:00 Number of Outcomes=2:00 Number of Parameters=233:00 Number of Training Instances=21:00 Prior=LaplaceRegressionPrior(Variance=2.0,noninformativeIntercept=true):00 Annealing Schedule=Exponential(initialLearningRate=2.5E-4,base=0.999):00 Minimum Epochs=100:00 Maximum Epochs=2000:00 Minimum Improvement Per Period=1.0E-9:00 Has Informative Prior=true:00 epoch= 0 lr=0.000250000 ll= -20.9648 lp=-232.0139 llp= -252.9787 llp*= -252.9787:00 epoch= 1 lr=0.000249750 ll= -20.9406 lp=-232.0195 llp= -252.9602 llp*= -252.9602 The epoch reporting goes on until either the number of epochs is met or the search converges. 
In the following case, the number of epochs was met: :00 epoch= 1998 lr=0.000033868 ll= -15.4568 lp= -233.8125 llp= -249.2693 llp*= -249.2693 :00 epoch= 1999 lr=0.000033834 ll= -15.4565 lp= -233.8127 llp= -249.2692 llp*= -249.2692 Now, we can play with the classifier a bit: Type a string to be classified. Empty string to quit. I luv Disney Rank Category Score P(Category|Input) 0=e 0.626898085027528 0.626898085027528 1=n 0.373101914972472 0.373101914972472 This should look familiar; it is exactly the same result as the worked example at the start. That's it! You have trained up and used the world's most relevant industrial classifier. However, there's a lot more to harnessing the power of this beast. Summary In this article, we learned how to do logistic regression. Resources for Article: Further resources on this subject: Installing NumPy, SciPy, matplotlib, and IPython [Article] Introspecting Maya, Python, and PyMEL [Article] Understanding the Python regex engine [Article]


Creating reusable actions for agent behaviors with Lua

Packt
27 Nov 2014
18 min read
In this article by David Young, author of Learning Game AI Programming with Lua, we will create reusable actions for agent behaviors. (For more resources related to this topic, see here.) Creating userdata So far we've been using global data to store information about our agents. As we're going to create decision structures that require information about our agents, we'll create a local userdata table variable that contains our specific agent data as well as the agent controller in order to manage animation handling: local userData = {    alive, -- terminal flag    agent, -- Sandbox agent    ammo, -- current ammo    controller, -- Agent animation controller    enemy, -- current enemy, can be nil    health, -- current health    maxHealth -- max Health }; Moving forward, we will encapsulate more and more data as a means of isolating our systems from global variables. A userData table is perfect for storing any arbitrary piece of agent data that the agent doesn't already possess and provides a common storage area for data structures to manipulate agent data. So far, the listed data members are some common pieces of information we'll be storing; when we start creating individual behaviors, we'll access and modify this data. Agent actions Ultimately, any decision logic or structure we create for our agents comes down to deciding what action our agent should perform. Actions themselves are isolated structures that will be constructed from three distinct states: Uninitialized Running Terminated The typical lifespan of an action begins in uninitialized state and will then become initialized through a onetime initialization, and then, it is considered to be running. After an action completes the running phase, it moves to a terminated state where cleanup is performed. Once the cleanup of an action has been completed, actions are once again set to uninitialized until they wait to be reactivated. We'll start defining an action by declaring the three different states in which actions can be as well as a type specifier, so our data structures will know that a specific Lua table should be treated as an action. Remember, even though we use Lua in an object-oriented manner, Lua itself merely creates each instance of an object as a primitive table. It is up to the code we write to correctly interpret different tables as different objects. The use of a Type variable that is moving forward will be used to distinguish one class type from another. Action.lua: Action = {};   Action.Status = {    RUNNING = "RUNNING",    TERMINATED = "TERMINATED",    UNINITIALIZED = "UNINITIALIZED" };   Action.Type = "Action"; Adding data members To create an action, we'll pass three functions that the action will use for the initialization, updating, and cleanup. Additional information such as the name of the action and a userData variable, used for passing information to each callback function, is passed in during the construction time. Moving our systems away from global data and into instanced object-oriented patterns requires each instance of an object to store its own data. As our Action class is generic, we use a custom data member, which is userData, to store action-specific information. Whenever a callback function for the action is executed, the same userData table passed in during the construction time will be passed into each function. The update callback will receive an additional deltaTimeInMillis parameter in order to perform any time specific update logic. 
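Before we flesh out the constructor, here is a minimal, illustrative sketch (not taken from the book) of how a higher-level system might drive a single action through the uninitialized, running, and terminated states described above. The placeholder callbacks and the deltaTimeInMillis value are assumptions made only for this sketch; userData is the table defined at the start of this article, and Initialize, Update, and CleanUp are the Action member functions defined later in this article:

-- Placeholder callbacks, assumed only for this sketch.
local function ExampleInitialize(userData)
    -- One-time setup would go here.
end

local function ExampleUpdate(deltaTimeInMillis, userData)
    -- Per-update work would go here; terminate immediately for the sketch.
    return Action.Status.TERMINATED;
end

local function ExampleCleanUp(userData)
    -- One-time cleanup would go here.
end

local deltaTimeInMillis = 16; -- illustrative frame time in milliseconds

local action = Action.new(
    "example",
    ExampleInitialize,
    ExampleUpdate,
    ExampleCleanUp,
    userData);

-- One pass of a typical agent update:
if (action.status_ == Action.Status.UNINITIALIZED) then
    action:Initialize();
end

local status = action:Update(deltaTimeInMillis);

if (status == Action.Status.TERMINATED) then
    action:CleanUp();
end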
To flush out the Action class' constructor function, we'll store each of the callback functions as well as initialize some common data members: Action.lua: function Action.new(name, initializeFunction, updateFunction,        cleanUpFunction, userData)      local action = {};       -- The Action's data members.    action.cleanUpFunction_ = cleanUpFunction;    action.initializeFunction_ = initializeFunction;    action.updateFunction_ = updateFunction;    action.name_ = name or "";    action.status_ = Action.Status.UNINITIALIZED;    action.type_ = Action.Type;    action.userData_ = userData;           return action; end Initializing an action Initializing an action begins by calling the action's initialize callback and then immediately sets the action into a running state. This transitions the action into a standard update loop that is moving forward: Action.lua: function Action.Initialize(self)    -- Run the initialize function if one is specified.    if (self.status_ == Action.Status.UNINITIALIZED) then        if (self.initializeFunction_) then            self.initializeFunction_(self.userData_);        end    end       -- Set the action to running after initializing.    self.status_ = Action.Status.RUNNING; end Updating an action Once an action has transitioned to a running state, it will receive callbacks to the update function every time the agent itself is updated, until the action decides to terminate. To avoid an infinite loop case, the update function must return a terminated status when a condition is met; otherwise, our agents will never be able to finish the running action. An update function isn't a hard requirement for our actions, as actions terminate immediately by default if no callback function is present. Action.lua: function Action.Update(self, deltaTimeInMillis)    if (self.status_ == Action.Status.TERMINATED) then        -- Immediately return if the Action has already        -- terminated.        return Action.Status.TERMINATED;    elseif (self.status_ == Action.Status.RUNNING) then        if (self.updateFunction_) then            -- Run the update function if one is specified.                      self.status_ = self.updateFunction_(                deltaTimeInMillis, self.userData_);              -- Ensure that a status was returned by the update            -- function.            assert(self.status_);        else            -- If no update function is present move the action            -- into a terminated state.            self.status_ = Action.Status.TERMINATED;        end    end      return self.status_; end Action cleanup Terminating an action is very similar to initializing an action, and it sets the status of the action to uninitialized once the cleanup callback has an opportunity to finish any processing of the action. If a cleanup callback function isn't defined, the action will immediately move to an uninitialized state upon cleanup. During action cleanup, we'll check to make sure the action has fully terminated, and then run a cleanup function if one is specified. 
Action.lua: function Action.CleanUp(self)    if (self.status_ == Action.Status.TERMINATED) then        if (self.cleanUpFunction_) then            self.cleanUpFunction_(self.userData_);        end    end       self.status_ = Action.Status.UNINITIALIZED; end Action member functions Now that we've created the basic, initialize, update, and terminate functionalities, we can update our action constructor with CleanUp, Initialize, and Update member functions: Action.lua: function Action.new(name, initializeFunction, updateFunction,        cleanUpFunction, userData)       ...      -- The Action's accessor functions.    action.CleanUp = Action.CleanUp;    action.Initialize = Action.Initialize;    action.Update = Action.Update;       return action; end Creating actions With a basic action class out of the way, we can start implementing specific action logic that our agents can use. Each action will consist of three callback functions—initialization, update, and cleanup—that we'll use when we instantiate our action instances. The idle action The first action we'll create is the basic and default choice from our agents that are going forward. The idle action wraps the IDLE animation request to our soldier's animation controller. As the animation controller will continue looping our IDLE command until a new command is queued, we'll time our idle action to run for 2 seconds, and then terminate it to allow another action to run: SoldierActions.lua: function SoldierActions_IdleCleanUp(userData)    -- No cleanup is required for idling. end   function SoldierActions_IdleInitialize(userData)    userData.controller:QueueCommand(        userData.agent,        SoldierController.Commands.IDLE);           -- Since idle is a looping animation, cut off the idle    -- Action after 2 seconds.    local sandboxTimeInMillis = Sandbox.GetTimeInMillis(        userData.agent:GetSandbox());    userData.idleEndTime = sandboxTimeInMillis + 2000; end Updating our action requires that we check how much time has passed; if the 2 seconds have gone by, we terminate the action by returning the terminated state; otherwise, we return that the action is still running: SoldierActions.lua: function SoldierActions_IdleUpdate(deltaTimeInMillis, userData)    local sandboxTimeInMillis = Sandbox.GetTimeInMillis(        userData.agent:GetSandbox());    if (sandboxTimeInMillis >= userData.idleEndTime) then        userData.idleEndTime = nil;        return Action.Status.TERMINATED;    end    return Action.Status.RUNNING; end As we'll be using our idle action numerous times, we'll create a wrapper around initializing our action based on our three functions: SoldierLogic.lua: local function IdleAction(userData)    return Action.new(        "idle",        SoldierActions_IdleInitialize,        SoldierActions_IdleUpdate,        SoldierActions_IdleCleanUp,        userData); end The die action Creating a basic death action is very similar to our idle action. In this case, as death in our animation controller is a terminating state, all we need to do is request that the DIE command be immediately executed. From this point, our die action is complete, and it's the responsibility of a higher-level system to stop any additional processing of logic behavior. Typically, our agents will request this state when their health drops to zero. 
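As a rough illustration (this is not the decision structure used in the book), a higher-level selector could pick between these wrappers based on the health value stored in userData. DieAction is the wrapper defined just below, IdleAction is the one we created earlier, and the ChooseAction name is made up for this sketch:

-- Illustrative only: request the die action once health reaches zero,
-- otherwise fall back to idling.
local function ChooseAction(userData)
    if (userData.health <= 0) then
        return DieAction(userData);
    end

    return IdleAction(userData);
end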
In the special case that our agent dies due to falling, the soldier's animation controller will manage the correct animation playback and set the soldier's health to zero: SoldierActions.lua: function SoldierActions_DieCleanUp(userData)    -- No cleanup is required for death. end   function SoldierActions_DieInitialize(userData)    -- Issue a die command and immediately terminate.    userData.controller:ImmediateCommand(        userData.agent,        SoldierController.Commands.DIE);      return Action.Status.TERMINATED; end   function SoldierActions_DieUpdate(deltaTimeInMillis, userData)    return Action.Status.TERMINATED; end Creating a wrapper function to instantiate a death action is identical to our idle action: SoldierLogic.lua: local function DieAction(userData)    return Action.new(        "die",        SoldierActions_DieInitialize,        SoldierActions_DieUpdate,        SoldierActions_DieCleanUp,        userData); end The reload action Reloading is the first action that requires an animation to complete before we can consider the action complete, as the behavior will refill our agent's current ammunition count. As our animation controller is queue-based, the action itself never knows how many commands must be processed before the reload command has finished executing. To account for this during the update loop of our action, we wait till the command queue is empty, as the reload action will be the last command that will be added to the queue. Once the queue is empty, we can terminate the action and allow the cleanup function to award the ammo: SoldierActions.lua: function SoldierActions_ReloadCleanUp(userData)    userData.ammo = userData.maxAmmo; end   function SoldierActions_ReloadInitialize(userData)    userData.controller:QueueCommand(        userData.agent,        SoldierController.Commands.RELOAD);    return Action.Status.RUNNING; end   function SoldierActions_ReloadUpdate(deltaTimeInMillis, userData)    if (userData.controller:QueueLength() > 0) then        return Action.Status.RUNNING;    end       return Action.Status.TERMINATED; end SoldierLogic.lua: local function ReloadAction(userData)    return Action.new(        "reload",        SoldierActions_ReloadInitialize,        SoldierActions_ReloadUpdate,        SoldierActions_ReloadCleanUp,        userData); end The shoot action Shooting is the first action that directly interacts with another agent. In order to apply damage to another agent, we need to modify how the soldier's shots deal with impacts. When the soldier shot bullets out of his rifle, we added a callback function to handle the cleanup of particles; now, we'll add an additional functionality in order to decrement an agent's health if the particle impacts an agent: Soldier.lua: local function ParticleImpact(sandbox, collision)    Sandbox.RemoveObject(sandbox, collision.objectA);       local particleImpact = Core.CreateParticle(        sandbox, "BulletImpact");    Core.SetPosition(particleImpact, collision.pointA);    Core.SetParticleDirection(        particleImpact, collision.normalOnB);      table.insert(        impactParticles,        { particle = particleImpact, ttl = 2.0 } );       if (Agent.IsAgent(collision.objectB)) then        -- Deal 5 damage per shot.        Agent.SetHealth(            collision.objectB,            Agent.GetHealth(collision.objectB) - 5);    end end Creating the shooting action requires more than just queuing up a shoot command to the animation controller. 
As the SHOOT command loops, we'll queue an IDLE command immediately afterward so that the shoot action will terminate after a single bullet is fired. To have a chance at actually shooting an enemy agent, though, we first need to orient our agent to face toward its enemy. During the normal update loop of the action, we will forcefully set the agent to point in the enemy's direction. Forcefully setting the agent's forward direction during an action will allow our soldier to shoot but creates a visual artifact where the agent will pop to the correct forward direction. See whether you can modify the shoot action's update to interpolate to the correct forward direction for better visual results. SoldierActions.lua: function SoldierActions_ShootCleanUp(userData)    -- No cleanup is required for shooting. end   function SoldierActions_ShootInitialize(userData)    userData.controller:QueueCommand(        userData.agent,        SoldierController.Commands.SHOOT);    userData.controller:QueueCommand(        userData.agent,        SoldierController.Commands.IDLE);       return Action.Status.RUNNING; end   function SoldierActions_ShootUpdate(deltaTimeInMillis, userData)    -- Point toward the enemy so the Agent's rifle will shoot    -- correctly.    local forwardToEnemy = userData.enemy:GetPosition() –        userData.agent:GetPosition();    Agent.SetForward(userData.agent, forwardToEnemy);      if (userData.controller:QueueLength() > 0) then        return Action.Status.RUNNING;    end      -- Subtract a single bullet per shot.    userData.ammo = userData.ammo - 1;    return Action.Status.TERMINATED; end SoldierLogic.lua: local function ShootAction(userData)    return Action.new(        "shoot",        SoldierActions_ShootInitialize,        SoldierActions_ShootUpdate,        SoldierActions_ShootCleanUp,        userData); end The random move action Randomly moving is an action that chooses a random point on the navmesh to be moved to. This action is very similar to other actions that move, except that this action doesn't perform the moving itself. Instead, the random move action only chooses a valid point to move to and requires the move action to perform the movement: SoldierActions.lua: function SoldierActions_RandomMoveCleanUp(userData)   end   function SoldierActions_RandomMoveInitialize(userData)    local sandbox = userData.agent:GetSandbox();      local endPoint = Sandbox.RandomPoint(sandbox, "default");    local path = Sandbox.FindPath(        sandbox,        "default",        userData.agent:GetPosition(),        endPoint);       while #path == 0 do        endPoint = Sandbox.RandomPoint(sandbox, "default");        path = Sandbox.FindPath(            sandbox,            "default",            userData.agent:GetPosition(),            endPoint);    end       userData.agent:SetPath(path);    userData.agent:SetTarget(endPoint);    userData.movePosition = endPoint;       return Action.Status.TERMINATED; end   function SoldierActions_RandomMoveUpdate(userData)    return Action.Status.TERMINATED; end SoldierLogic.lua: local function RandomMoveAction(userData)    return Action.new(        "randomMove",        SoldierActions_RandomMoveInitialize,        SoldierActions_RandomMoveUpdate,        SoldierActions_RandomMoveCleanUp,        userData); end The move action Our movement action is similar to an idle action, as the agent's walk animation will loop infinitely. In order for the agent to complete a move action, though, the agent must reach within a certain distance of its target position or timeout. 
In this case, we can use 1.5 meters, as that's close enough to the target position to terminate the move action and half a second to indicate how long the move action can run for: SoldierActions.lua: function SoldierActions_MoveToCleanUp(userData)    userData.moveEndTime = nil; end   function SoldierActions_MoveToInitialize(userData)    userData.controller:QueueCommand(        userData.agent,        SoldierController.Commands.MOVE);       -- Since movement is a looping animation, cut off the move    -- Action after 0.5 seconds.    local sandboxTimeInMillis =        Sandbox.GetTimeInMillis(userData.agent:GetSandbox());    userData.moveEndTime = sandboxTimeInMillis + 500;      return Action.Status.RUNNING; end When applying the move action onto our agents, the indirect soldier controller will manage all animation playback and steer our agent along their path. The agent moving to a random position Setting a time limit for the move action will still allow our agents to move to their final target position, but gives other actions a chance to execute in case the situation has changed. Movement paths can be long, and it is undesirable to not handle situations such as death until the move action has terminated: SoldierActions.lua: function SoldierActions_MoveToUpdate(deltaTimeInMillis, userData)    -- Terminate the action after the allotted 0.5 seconds. The    -- decision structure will simply repath if the Agent needs    -- to move again.    local sandboxTimeInMillis =        Sandbox.GetTimeInMillis(userData.agent:GetSandbox());  if (sandboxTimeInMillis >= userData.moveEndTime) then        userData.moveEndTime = nil;        return Action.Status.TERMINATED;    end      path = userData.agent:GetPath();    if (#path ~= 0) then        offset = Vector.new(0, 0.05, 0);        DebugUtilities_DrawPath(            path, false, offset, DebugUtilities.Orange);        Core.DrawCircle(            path[#path] + offset, 1.5, DebugUtilities.Orange);    end      -- Terminate movement is the Agent is close enough to the    -- target.  if (Vector.Distance(userData.agent:GetPosition(),        userData.agent:GetTarget()) < 1.5) then          Agent.RemovePath(userData.agent);        return Action.Status.TERMINATED;    end      return Action.Status.RUNNING; end SoldierLogic.lua: local function MoveAction(userData)    return Action.new(        "move",        SoldierActions_MoveToInitialize,        SoldierActions_MoveToUpdate,        SoldierActions_MoveToCleanUp,        userData); end Summary In this article, we have taken a look at creating userdata and reuasable actions. Resources for Article: Further resources on this subject: Using Sprites for Animation [Article] Installing Gideros [Article] CryENGINE 3: Breaking Ground with Sandbox [Article]


Machine Learning Examples Applicable to Businesses

Packt
25 Nov 2014
7 min read
The purpose of this article by Michele Usuelli, author of the book R Machine Learning Essentials, is to show how machine learning helps in solving a business problem. (For more resources related to this topic, see here.)

Predicting the output

The past marketing campaign targeted part of the customer base. Among 1,000 other clients, how do we identify the 100 that are keenest to subscribe? We can build a model that learns from the data and estimates which clients are more similar to the ones that subscribed in the previous campaign. For each client, the model estimates a score that is higher if the client is more likely to subscribe. There are different machine learning models for determining the scores, and we use two well-performing techniques, as follows:

Logistic regression: This is a variation of linear regression used to predict a binary output
Random forest: This is an ensemble based on decision trees that works well in the presence of many features

In the end, we need to choose one of the two techniques. There are cross-validation methods that allow us to estimate model accuracy. Starting from that, we can measure the accuracy of both options and pick the one that performs better. After choosing the most appropriate machine learning algorithm, we can optimize it using cross-validation. However, in order to avoid overcomplicating the model building, we don't perform any feature selection or parameter optimization.

These are the steps to build and evaluate the models:

Load the randomForest package containing the random forest algorithm:

library('randomForest')

Define the formula specifying the output and the variable names. The formula is in the format output ~ feature1 + feature2 + ...:

arrayFeatures <- names(dtBank)
arrayFeatures <- arrayFeatures[arrayFeatures != 'output']
formulaAll <- paste('output', '~')
formulaAll <- paste(formulaAll, arrayFeatures[1])
for(nameFeature in arrayFeatures[-1]){
formulaAll <- paste(formulaAll, '+', nameFeature)
}
formulaAll <- formula(formulaAll)

Initialize the table containing all the testing sets:

dtTestBinded <- data.table()

Define the number of iterations:

nIter <- 10

Start a for loop:

for(iIter in 1:nIter){

Define the training and the test datasets:

indexTrain <- sample(
x = c(TRUE, FALSE),
size = nrow(dtBank),
replace = T,
prob = c(0.8, 0.2)
)
dtTrain <- dtBank[indexTrain]
dtTest <- dtBank[!indexTrain]

Select a subset from the test set in such a way that we have the same number of output == 0 and output == 1 rows. First, we split dtTest into two parts (dtTest0 and dtTest1) on the basis of the output and count the number of rows of each part (n0 and n1). Then, as dtTest0 has more rows, we randomly select n1 rows from it. In the end, we redefine dtTest by binding dtTest0 and dtTest1, as follows:

dtTest1 <- dtTest[output == 1]
dtTest0 <- dtTest[output == 0]
n0 <- nrow(dtTest0)
n1 <- nrow(dtTest1)
dtTest0 <- dtTest0[sample(x = 1:n0, size = n1)]
dtTest <- rbind(dtTest0, dtTest1)

Build the random forest model using randomForest. The formula argument defines the relationship between variables and the data argument defines the training dataset. In order to avoid overcomplicating the model, all the other parameters are left at their defaults:

modelRf <- randomForest(
formula = formulaAll,
data = dtTrain
)

Build the logistic regression model using glm, which is a function used to build Generalized Linear Models (GLMs). GLMs are a generalization of linear regression, and they allow us to define a link function that connects the linear predictor with the outputs.
The input is the same as the random forest, with the addition of family = binomial(logit) defining that the regression is logistic: modelLr <- glm(formula = formulaAll,data = dtTest,family = binomial(logit)) Predict the output of the random forest. The function is predict and its main arguments are object defining the model and newdata defining the test set, as follows: dtTest[, outputRf := predict(object = modelRf, newdata = dtTest, type='response')] Predict the output of the logistic regression, using predict similar to the random forest. The other argument is type='response' and it is necessary in the case of the logistic regression: dtTest[, outputLr := predict(object = modelLr, newdata = dtTest, type='response')] Add the new test set to dtTestBinded: dtTestBinded <- rbind(dtTestBinded, dtTest) End the for loop: } We built dtTestBinded containing the output column that defines which clients subscribed and the scores estimated by the models. Comparing the scores with the real output, we can validate the model performances. In order to explore dtTestBinded, we can build a chart showing how the scores of the non-subscribing clients are distributed. Then, we add the distribution of the subscribing clients to the chart and compare them. In this way, we can see the difference between the scores of the two groups. Since we use the same chart for the random forest and for the logistic regression, we define a function building the chart by following the given steps: Define the function and its input that includes the data table and the name of the score column: plotDistributions <- function(dtTestBinded, colPred){ Compute the distribution density for the clients that didn't subscribe. With output == 0, we extract the clients not subscribing, and using density, we define a density object. The adjust parameter defines the smoothing bandwidth that is a parameter of the way we build the curve starting from the data. The bandwidth can be interpreted as the level of detail: densityLr0 <- dtTestBinded[   output == 0,   density(get(colPred), adjust = 0.5)   ] Compute the distribution density for the clients that subscribed: densityLr1 <- dtTestBinded[   output == 1,   density(get(colPred), adjust = 0.5)   ] Define the colors in the chart using rgb. The colors are transparent red and transparent blue: col0 <- rgb(1, 0, 0, 0.3)col1 <- rgb(0, 0, 1, 0.3) Build the plot with the density of the clients not subscribing. Here, polygon is a function that adds the area to the chart: plot(densityLr0, xlim = c(0, 1), main = 'density')polygon(densityLr0, col = col0, border = 'black') Add the clients that subscribed to the chart: polygon(densityLr1, col = col1, border = 'black') Add the legend: legend(   'top',   c('0', '1'),   pch = 16,   col = c(col0, col1)) End the function: return()} Now, we can use plotDistributions on the random forest output: par(mfrow = c(1, 1))plotDistributions(dtTestBinded, 'outputRf') The histogram obtained is as follows:   The x-axis represents the score and the y-axis represents the density that is proportional to the number of clients that subscribed for similar scores. Since we don't have a client for each possible score, assuming a level of detail of 0.01, the density curve is smoothed in the sense that the density of each score is the average between the data with a similar score. The red and blue areas represent the non-subscribing and subscribing clients respectively. As can be easily noticed, the violet area comes from the overlapping of the two curves. 
For each score, we can identify which density is higher. Given that red represents the non-subscribing clients and blue the subscribing ones, if the higher curve at a given score is blue, a client with that score is more likely to subscribe, and vice versa. For the random forest, most of the non-subscribing client scores are between 0 and 0.2 and the density peak is around 0.05. The subscribing clients' scores are more spread out, although higher, and their peak is around 0.1. The two distributions overlap a lot, so it's not easy to identify which clients will subscribe starting from their scores. However, if the marketing campaign targets all customers with a score higher than 0.3, they will likely belong to the blue cluster. In conclusion, using the random forest, we are able to identify a small set of customers that will very likely subscribe.

Summary

In this article, you learned how to predict your output using proper machine learning techniques.

Resources for Article:

Further resources on this subject:
Using R for Statistics, Research, and Graphics [article]
Machine Learning in Bioinformatics [article]
Learning Data Analytics with R and Hadoop [article]

No to nodistinct

Packt
25 Nov 2014
4 min read
This article is written by Stephen Redmond, the author of Mastering QlikView. There is a great skill in creating the right expression to calculate the right answer. Being able to do this in all circumstances relies on having a good knowledge of creating advanced expressions. Of course, the best path to mastery in this subject is actually getting out and doing it, but there is a great argument here for regularly practicing with dummy or test datasets. (For more resources related to this topic, see here.) When presented with a problem that needs to be solved, all the QlikView masters will not necessarily know immediately how to answer it. What they will have though is a very good idea of where to start, that is, what to try and what not to try. This is what I hope to impart to you here. Knowing how to create many advanced expressions will arm you to know where to apply them—and where not to apply them. This is one area of QlikView that is alien to many people. For some reason, they fear the whole idea of concepts such as Aggr. However, the reality is that these concepts are actually very simple and supremely logical. Once you get your head around them, you will wonder what all the fuss was about. No to nodistinct The Aggr function has as an optional clause, that is, the possibility of stating that the aggregation will be either distinct or nodistinct. The default option is distinct, and as such, is rarely ever stated. In this default operation, the aggregation will only produce distinct results for every combination of dimensions—just as you would expect from a normal chart or straight table. The nodistinct option only makes sense within a chart, one that has more dimensions than are in the Aggr statement. In this case, the granularity of the chart is lower than the granularity of Aggr, and therefore, QlikView will only calculate that Aggr for the first occurrence of lower granularity dimensions and will return null for the other rows. If we specify nodistinct, the same result will be calculated across all of the lower granularity dimensions. This can be difficult to understand without seeing an example, so let's look at a common use case for this option. We will start with a dataset: ProductSales:Load * Inline [Product, Territory, Year, SalesProduct A, Territory A, 2013, 100Product B, Territory A, 2013, 110Product A, Territory B, 2013, 120Product B, Territory B, 2013, 130Product A, Territory A, 2014, 140Product B, Territory A, 2014, 150Product A, Territory B, 2014, 160Product B, Territory B, 2014, 170]; We will build a report from this data using a pivot table: Now, we want to bring the value in the Total column into a new column under each year, perhaps to calculate a percentage for each year. 
We might think that, because the total is the sum for each Product and Territory, we might use an Aggr in the following manner: Sum(Aggr(Sum(Sales), Product, Territory)) However, as stated previously, because the chart includes an additional dimension (Year) than Aggr, the expression will only be calculated for the first occurrence of each of the lower granularity dimensions (in this case, for Year = 2013): The commonly suggested fix for this is to use Aggr without Sum and with nodistinct as shown: Aggr(NoDistinct Sum(Sales), Product, Territory) This will allow the Aggr expression to be calculated across all the Year dimension values, and at first, it will appear to solve the problem: The problem occurs when we decide to have a total row on this chart: As there is no aggregation function surrounding Aggr, it does not total correctly at the Product or Territory dimensions. We can't add an aggregation function, such as Sum, because it will break one of the other totals. However, there is something different that we can do; something that doesn't involve Aggr at all! We can use our old friend Total: Sum(Total<Product, Territory> Sales) This will calculate correctly at all the levels: There might be other use cases for using a nodistinct clause in Aggr, but they should be reviewed to see whether a simpler Total will work instead. Summary We discussed an important function, the Aggr function. We now know that the Aggr function is extremely useful, but we don't need to apply it in all circumstances where we have vertical calculations. Resources for Article: Further resources on this subject: Common QlikView script errors [article] Introducing QlikView elements [article] Creating sheet objects and starting new list using Qlikview 11 [article]
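For readers who think in Python, here is a rough pandas analogue of what the recommended Sum(Total&lt;Product, Territory&gt; Sales) expression computes: the Product/Territory total repeated on every Year row. It is a conceptual sketch only, not QlikView code; the column names simply mirror the inline table loaded above.

import pandas as pd

sales = pd.DataFrame({
    'Product':   ['Product A', 'Product B', 'Product A', 'Product B'] * 2,
    'Territory': ['Territory A', 'Territory A', 'Territory B', 'Territory B'] * 2,
    'Year':      [2013] * 4 + [2014] * 4,
    'Sales':     [100, 110, 120, 130, 140, 150, 160, 170],
})

# groupby(...).transform('sum') broadcasts the group total back onto every row,
# which is the behaviour the Total<...> qualifier gives us inside the pivot table.
sales['ProductTerritoryTotal'] = (
    sales.groupby(['Product', 'Territory'])['Sales'].transform('sum'))
sales['ShareOfTotal'] = sales['Sales'] / sales['ProductTerritoryTotal']
print(sales)

The broadcast total sits next to each yearly value, so the per-year percentage falls out of a simple division, just as it does in the chart.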

Understanding the HBase Ecosystem

Packt
24 Nov 2014
11 min read
This article by Shashwat Shriparv, author of the book, Learning HBase, will introduce you to the world of HBase. (For more resources related to this topic, see here.)

HBase is a horizontally scalable, distributed, open source, sorted map database. It runs on top of the Hadoop file system, that is, the Hadoop Distributed File System (HDFS). HBase is a NoSQL, nonrelational database that doesn't always require a predefined schema. It can be seen as a scalable, flexible, multidimensional spreadsheet into which any structure of data fits, with on-the-fly addition of new column fields and no rigidly defined column structure required before data can be inserted or queried. In other words, HBase is a column-based database that runs on top of the Hadoop distributed file system and supports features such as linear scalability (scale out), automatic failover, automatic sharding, and a more flexible schema.

HBase is modeled on Google BigTable, a compressed, high-performance, proprietary data store built on the Google filesystem. HBase was developed as a Hadoop subproject to support storage of structured data, and it can take advantage of most distributed filesystems (typically, the Hadoop Distributed File System, known as HDFS).

The following is key information about HBase and its features:

Developed by: Apache
Written in: Java
Type: Column oriented
License: Apache License
Lacking features of relational databases: SQL support, relations, primary, foreign, and unique key constraints, normalization
Website: http://hbase.apache.org
Distributions: Apache, Cloudera
Download link: http://mirrors.advancedhosters.com/apache/hbase/
Mailing lists: the user list, user-subscribe@hbase.apache.org; the developer list, dev-subscribe@hbase.apache.org
Blog: http://blogs.apache.org/hbase/

HBase layout on top of Hadoop

The following figure represents the layout of HBase on top of Hadoop. There is more than one ZooKeeper in the setup, which provides high availability of the master status, and a RegionServer may contain multiple regions. The RegionServers run on the machines where DataNodes run; there can be as many RegionServers as DataNodes. A RegionServer can host multiple HRegions; an HRegion has one HLog and multiple HFiles with their associated MemStores.

HBase can be seen as a master-slave database where the master is called HMaster. HMaster is responsible for coordination between the client application and the HRegionServers, and also for monitoring and recording metadata changes and management. The slaves are called HRegionServers, which serve the actual tables in the form of regions. These regions are the basic building blocks of HBase tables and hold the distributed table data. So, HMaster and the RegionServers work in coordination to serve the HBase tables and the HBase cluster. Usually, HMaster is co-hosted with the Hadoop NameNode daemon process on a server and communicates with the DataNode daemons for reading and writing data on HDFS. The RegionServers run, or are co-hosted, on the Hadoop DataNodes.
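Before comparing HBase with relational databases, the sorted map description above can be made concrete. The following is a short sketch in plain Python, used purely as a conceptual model of rows, column families, column qualifiers, and versioned cells; it is not HBase client code, and all names and values in it are made up.

# Model the sparse, multidimensional, sorted map: row key -> column family ->
# column qualifier -> {timestamp: value}. Each write adds a new timestamped version.
import time

table = {}   # row key -> {column family -> {qualifier -> {timestamp: value}}}

def put(row_key, family, qualifier, value):
    """Store a new version of a cell, keyed by its write timestamp."""
    cell = (table.setdefault(row_key, {})
                 .setdefault(family, {})
                 .setdefault(qualifier, {}))
    cell[time.time_ns()] = value

def get_latest(row_key, family, qualifier):
    """Return the most recent version of a cell, or None if the cell is absent."""
    versions = table.get(row_key, {}).get(family, {}).get(qualifier, {})
    return versions[max(versions)] if versions else None

put('row-001', 'info', 'name', 'Alice')
put('row-001', 'info', 'name', 'Alice B.')      # a newer version of the same cell
put('row-002', 'stats', 'visits', 7)            # sparse: different columns per row

# HBase keeps rows sorted by row key; scanning in key order looks like this.
for row_key in sorted(table):
    print(row_key, table[row_key])
print(get_latest('row-001', 'info', 'name'))    # -> 'Alice B.'

Notice that each row only stores the columns it actually has, which is exactly the sparseness HBase relies on to keep wide tables cheap.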
Comparing architectural differences between RDBMs and HBase Let's list the major differences between relational databases and HBase: Relational databases HBase Uses tables as databases Uses regions as databases Filesystems supported are FAT, NTFS, and EXT Filesystem supported is HDFS The technique used to store logs is commit logs The technique used to store logs is Write-Ahead Logs (WAL) The reference system used is coordinate system The reference system used is ZooKeeper Uses the primary key Uses the row key Partitioning is supported Sharding is supported Use of rows, columns, and cells Use of rows, column families, columns, and cells HBase features Let's see the major features of HBase that make it one of the most useful databases for the current and future industry: Automatic failover and load balancing: HBase runs on top of HDFS, which is internally distributed and automatically recovered using multiple block allocation and replications. It works with multiple HMasters and region servers. This failover is also facilitated using HBase and RegionServer replication. Automatic sharding: An HBase table is made up of regions that are hosted by RegionServers and these regions are distributed throughout the RegionServers on different DataNodes. HBase provides automatic and manual splitting of these regions to smaller subregions, once it reaches a threshold size to reduce I/O time and overhead. Hadoop/HDFS integration: It's important to note that HBase can run on top of other filesystems as well. While HDFS is the most common choice as it supports data distribution and high availability using distributed Hadoop, for which we just need to set some configuration parameters and enable HBase to communicate to Hadoop, an out-of-the-box underlying distribution is provided by HDFS. Real-time, random big data access: HBase uses log-structured merge-tree (LSM-tree) as data storage architecture internally, which merges smaller files to larger files periodically to reduce disk seeks. MapReduce: HBase has a built-in support of Hadoop MapReduce framework for fast and parallel processing of data stored in HBase. You can search for the Package org.apache.hadoop.hbase.mapreduce for more details. Java API for client access: HBase has a solid Java API support (client/server) for easy development and programming. Thrift and a RESTtful web service: HBase not only provides a thrift and RESTful gateway but also web service gateways for integrating and accessing HBase besides Java code (HBase Java APIs) for accessing and working with HBase. Support for exporting metrics via the Hadoop metrics subsystem: HBase provides Java Management Extensions (JMX) and exporting matrix for monitoring purposes with tools such as Ganglia and Nagios. Distributed: HBase works when used with HDFS. It provides coordination with Hadoop so that distribution of tables, high availability, and consistency is supported by it. Linear scalability (scale out): Scaling of HBase is not scale up but scale out, which means that we don't need to make servers more powerful but we add more machines to its cluster. We can add more nodes to the cluster on the fly. As soon as a new RegionServer node is up, the cluster can begin rebalancing, start the RegionServer on the new node, and it is scaled up, it is as simple as that. Column oriented: HBase stores each column separately in contrast with most of the relational databases, which uses stores or are row-based storage. So in HBase, columns are stored contiguously and not the rows. 
More about row- and column-oriented databases will follow. HBase shell support: HBase provides a command-line tool to interact with HBase and perform simple operations such as creating tables, adding data, and scanning data. This also provides full-fledged command-line tool using which we can interact with HBase and perform operations such as creating table, adding data, removing data, and a few other administrative commands. Sparse, multidimensional, sorted map database: HBase is a sparse, multidimensional, sorted map-based database, which supports multiple versions of the same record. Snapshot support: HBase supports taking snapshots of metadata for getting the previous or correct state form of data. HBase in the Hadoop ecosystem Let's see where HBase sits in the Hadoop ecosystem. In the Hadoop ecosystem, HBase provides a persistent, structured, schema-based data store. The following figure illustrates the Hadoop ecosystem: HBase can work as a separate entity on the local filesystem (which is not really effective as no distribution is provided) as well as in coordination with Hadoop as a separate but connected entity. As we know, Hadoop provides two services, a distributed files system (HDFS) for storage and a MapReduce framework for processing in a parallel mode. When there was a need to store structured data (data in the form of tables, rows and columns), which most of the programmers are already familiar with, the programmers were finding it difficult to process the data that was stored on HDFS as an unstructured flat file format. This led to the evolution of HBase, which provided a way to store data in a structural way. Consider that we have got a CSV file stored on HDFS and we need to query from it. We would need to write a Java code for this, which wouldn't be a good option. It would be better if we could specify the data key and fetch the data from that file. So, what we can do here is create a schema or table with the same structure of CSV file to store the data of the CSV file in the HBase table and query using HBase APIs, or HBase shell using key. Data representation in HBase Let's look into the representation of rows and columns in HBase table: An HBase table is divided into rows, column families, columns, and cells. Row keys are unique keys to identify a row, column families are groups of columns, columns are fields of the table, and the cell contains the actual value or the data. So, we have been through the introduction of HBase; now, let's see what Hadoop and its components are in brief. It is assumed here that you are already familiar with Hadoop; if not, following a brief introduction about Hadoop will help you to understand it. Hadoop Hadoop is an underlying technology of HBase, providing high availability, fault tolerance, and distribution. It is an Apache-sponsored, free, open source, Java-based programming framework which supports large dataset storage. It provides distributed file system and MapReduce, which is a distributed programming framework. It provides a scalable, reliable, distributed storage and development environment. Hadoop makes it possible to run applications on a system with tens to tens of thousands of nodes. The underlying distributed file system provides large-scale storage, rapid data access. It has the following submodules: Hadoop Common: This is the core component that supports the other Hadoop modules. It is like the master components facilitating communication and coordination between different Hadoop modules. 
Hadoop distributed file system: This is the underlying distributed file system, which is abstracted on the top of the local filesystem that provides high throughput of read and write operations of data on Hadoop. Hadoop YARN: This is the new framework that is shipped with newer releases of Hadoop. It provides job scheduling and job and resource management. Hadoop MapReduce: This is the Hadoop-based processing system that provides parallel processing of large data and datasets. Other Hadoop subprojects are HBase, Hive, Ambari, Avro, Cassandra (Cassandra isn't a Hadoop subproject, it's a related project; they solve similar problems in different ways), Mahout, Pig, Spark, ZooKeeper (ZooKeeper isn't a Hadoop subproject. It's a dependency shared by many distributed systems), and so on. All of these have different usability and the combination of all these subprojects forms the Hadoop ecosystem. Core daemons of Hadoop The following are the core daemons of Hadoop: NameNode: This stores and manages all metadata about the data present on the cluster, so it is the single point of contact to Hadoop. In the new release of Hadoop, we have an option of more than one NameNode for high availability. JobTracker: This runs on the NameNode and performs the MapReduce of the jobs submitted to the cluster. SecondaryNameNode: This maintains the backup of metadata present on the NameNode, and also records the file system changes. DataNode: This will contain the actual data. TaskTracker: This will perform tasks on the local data assigned by the JobTracker. The preceding are the daemons in the case of Hadoop v1 or earlier. In newer versions of Hadoop, we have ResourceManager instead of JobTracker, the node manager instead of TaskTrackers, and the YARN framework instead of a simple MapReduce framework. The following is the comparison between daemons in Hadoop 1 and Hadoop 2: Hadoop 1 Hadoop 2 HDFS NameNode Secondary NameNode DataNode   NameNode (more than one active/standby) Checkpoint node DataNode Processing MapReduce v1 JobTracker TaskTracker   YARN (MRv2) ResourceManager NodeManager Application Master Comparing HBase with Hadoop As we now know what HBase and what Hadoop are, let's have a comparison between HDFS and HBase for better understanding: Hadoop/HDFS HBase This provide filesystem for distributed storage This provides tabular column-oriented data storage This is optimized for storage of huge-sized files with no random read/write of these files This is optimized for tabular data with random read/write facility This uses flat files This uses key-value pairs of data The data model is not flexible Provides a flexible data model This uses file system and processing framework This uses tabular storage with built-in Hadoop MapReduce support This is mostly optimized for write-once read-many This is optimized for both read/write many Summary So in this article, we discussed the introductory aspects of HBase and it's features. We have also discussed HBase's components and their place in the HBase ecosystem. Resources for Article: Further resources on this subject: The HBase's Data Storage [Article] HBase Administration, Performance Tuning [Article] Comparative Study of NoSQL Products [Article]
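As a concrete illustration of the Thrift gateway mentioned in the feature list above, the following is a minimal sketch that talks to HBase from Python using the third-party happybase client. It assumes an HBase Thrift server is reachable on localhost and that happybase is installed; the table, row keys, and columns are made-up examples, not code from the book.

# Connect through the Thrift gateway, create a table with one column family,
# write a row, read it back, and scan by row-key prefix.
import happybase

connection = happybase.Connection('localhost')        # Thrift server, default port 9090
connection.create_table('users', {'info': dict()})    # one column family called 'info'

users = connection.table('users')
users.put(b'row-001', {b'info:name': b'Alice',
                       b'info:email': b'alice@example.com'})

print(users.row(b'row-001'))                           # read a single row back

# Rows come back in sorted row-key order, restricted here to a key prefix.
for row_key, data in users.scan(row_prefix=b'row-'):
    print(row_key, data)

connection.close()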

The plot function

Packt
18 Nov 2014
17 min read
In this article by L. Felipe Martins, the author of the book, IPython Notebook Essentials, has discussed about the plot() function, which is an important aspect of matplotlib, an IPython library for production of publication-quality graphs. (For more resources related to this topic, see here.) The plot() function is the workhorse of the matplotlib library. In this section, we will explore the line-plotting and formatting capabilities included in this function. To make things a bit more concrete, let's consider the formula for logistic growth, as follows: This model is frequently used to represent growth that shows an initial exponential phase, and then is eventually limited by some factor. The examples are the population in an environment with limited resources and new products and/or technological innovations, which initially attract a small and quickly growing market but eventually reach a saturation point. A common strategy to understand a mathematical model is to investigate how it changes as the parameters defining it are modified. Let's say, we want to see what happens to the shape of the curve when the parameter b changes. To be able to do what we want more efficiently, we are going to use a function factory. This way, we can quickly create logistic models with arbitrary values for r, a, b, and c. Run the following code in a cell: def make_logistic(r, a, b, c):    def f_logistic(t):        return a / (b + c * exp(-r * t))    return f_logistic The function factory pattern takes advantage of the fact that functions are first-class objects in Python. This means that functions can be treated as regular objects: they can be assigned to variables, stored in lists in dictionaries, and play the role of arguments and/or return values in other functions. In our example, we define the make_logistic() function, whose output is itself a Python function. Notice how the f_logistic() function is defined inside the body of make_logistic() and then returned in the last line. Let's now use the function factory to create three functions representing logistic curves, as follows: r = 0.15 a = 20.0 c = 15.0 b1, b2, b3 = 2.0, 3.0, 4.0 logistic1 = make_logistic(r, a, b1, c) logistic2 = make_logistic(r, a, b2, c) logistic3 = make_logistic(r, a, b3, c) In the preceding code, we first fix the values of r, a, and c, and define three logistic curves for different values of b. The important point to notice is that logistic1, logistic2, and logistic3 are functions. So, for example, we can use logistic1(2.5) to compute the value of the first logistic curve at the time 2.5. We can now plot the functions using the following code: tmax = 40 tvalues = linspace(0, tmax, 300) plot(tvalues, logistic1(tvalues)) plot(tvalues, logistic2(tvalues)) plot(tvalues, logistic3(tvalues)) The first line in the preceding code sets the maximum time value, tmax, to be 40. Then, we define the set of times at which we want the functions evaluated with the assignment, as follows: tvalues = linspace(0, tmax, 300) The linspace() function is very convenient to generate points for plotting. The preceding code creates an array of 300 equally spaced points in the interval from 0 to tmax. Note that, contrary to other functions, such as range() and arange(), the right endpoint of the interval is included by default. (To exclude the right endpoint, use the endpoint=False option.) After defining the array of time values, the plot() function is called to graph the curves. 
In its most basic form, it plots a single curve in a default color and line style. In this usage, the two arguments are two arrays. The first array gives the horizontal coordinates of the points being plotted, and the second array gives the vertical coordinates. A typical example will be the following function call: plot(x,y) The variables x and y must refer to NumPy arrays (or any Python iterable values that can be converted into an array) and must have the same dimensions. The points plotted have coordinates as follows: x[0], y[0] x[1], y[1] x[2], y[2] … The preceding command will produce the following plot, displaying the three logistic curves: You may have noticed that before the graph is displayed, there is a line of text output that looks like the following: [<matplotlib.lines.Line2D at 0x7b57c50>] This is the return value of the last call to the plot() function, which is a list (or with a single element) of objects of the Line2D type. One way to prevent the output from being shown is to enter None as the last row in the cell. Alternatively, we can assign the return value of the last call in the cell to a dummy variable: _dummy_ = plot(tvalues, logistic3(tvalues)) The plot() function supports plotting several curves in the same function call. We need to change the contents of the cell that are shown in the following code and run it again: tmax = 40 tvalues = linspace(0, tmax, 300) plot(tvalues, logistic1(tvalues),      tvalues, logistic2(tvalues),      tvalues, logistic3(tvalues)) This form saves some typing but turns out to be a little less flexible when it comes to customizing line options. Notice that the text output produced now is a list with three elements: [<matplotlib.lines.Line2D at 0x9bb6cc0>, <matplotlib.lines.Line2D at 0x9bb6ef0>, <matplotlib.lines.Line2D at 0x9bb9518>] This output can be useful in some instances. For now, we will stick with using one call to plot() for each curve, since it produces code that is clearer and more flexible. Let's now change the line options in the plot and set the plot bounds. Change the contents of the cell to read as follows: plot(tvalues, logistic1(tvalues),      linewidth=1.5, color='DarkGreen', linestyle='-') plot(tvalues, logistic2(tvalues),      linewidth=2.0, color='#8B0000', linestyle=':') plot(tvalues, logistic3(tvalues),      linewidth=3.5, color=(0.0, 0.0, 0.5), linestyle='--') axis([0, tmax, 0, 11.]) None Running the preceding command lines will produce the following plots: The options set in the preceding code are as follows: The first curve is plotted with a line width of 1.5, with the HTML color of DarkGreen, and a filled-line style The second curve is plotted with a line width of 2.0, colored with the RGB value given by the hexadecimal string '#8B0000', and a dotted-line style The third curve is plotted with a line width of 3.0, colored with the RGB components, (0.0, 0.0, 0.5), and a dashed-line style Notice that there are different ways of specifying a fixed color: a HTML color name, a hexadecimal string, or a tuple of floating-point values. In the last case, the entries in the tuple represent the intensity of the red, green, and blue colors, respectively, and must be floating-point values between 0.0 and 1.0. A complete list of HTML name colors can be found at http://www.w3schools.com/html/html_colornames.asp. Editor's Tip: For more insights on colors, check out https://dgtl.link/colors Line styles are specified by a symbolic string. 
The allowed values are shown in the following table: Symbol string Line style '-' Solid (the default) '--' Dashed ':' Dotted '-.' Dash-dot 'None', '', or '' Not displayed After the calls to plot(), we set the graph bounds with the function call: axis([0, tmax, 0, 11.]) The argument to axis() is a four-element list that specifies, in this order, the maximum and minimum values of the horizontal coordinates, and the maximum and minimum values of the vertical coordinates. It may seem non-intuitive that the bounds for the variables are set after the plots are drawn. In the interactive mode, matplotlib remembers the state of the graph being constructed, and graphics objects are updated in the background after each command is issued. The graph is only rendered when all computations in the cell are done so that all previously specified options take effect. Note that starting a new cell clears all the graph data. This interactive behavior is part of the matplotlib.pyplot module, which is one of the components imported by pylab. Besides drawing a line connecting the data points, it is also possible to draw markers at specified points. Change the graphing commands indicated in the following code snippet, and then run the cell again: plot(tvalues, logistic1(tvalues),      linewidth=1.5, color='DarkGreen', linestyle='-',      marker='o', markevery=50, markerfacecolor='GreenYellow',      markersize=10.0) plot(tvalues, logistic2(tvalues),      linewidth=2.0, color='#8B0000', linestyle=':',      marker='s', markevery=50, markerfacecolor='Salmon',      markersize=10.0) plot(tvalues, logistic3(tvalues),      linewidth=2.0, color=(0.0, 0.0, 0.5), linestyle='--',      marker = '*', markevery=50, markerfacecolor='SkyBlue',      markersize=12.0) axis([0, tmax, 0, 11.]) None Now, the graph will look as shown in the following figure: The only difference from the previous code is that now we added options to draw markers. The following are the options we use: The marker option specifies the shape of the marker. Shapes are given as symbolic strings. In the preceding examples, we use 'o' for a circular marker, 's' for a square, and '*' for a star. A complete list of available markers can be found at http://matplotlib.org/api/markers_api.html#module-matplotlib.markers. The markevery option specifies a stride within the data points for the placement of markers. In our example, we place a marker after every 50 data points. The markercolor option specifies the color of the marker. The markersize option specifies the size of the marker. The size is given in pixels. There are a large number of other options that can be applied to lines in matplotlib. A complete list is available at http://matplotlib.org/api/artist_api.html#module-matplotlib.lines. Adding a title, labels, and a legend The next step is to add a title and labels for the axes. Just before the None line, add the following three lines of code to the cell that creates the graph: title('Logistic growth: a={:5.2f}, c={:5.2f}, r={:5.2f}'.format(a, c, r)) xlabel('$t$') ylabel('$N(t)=a/(b+ce^{-rt})$') In the first line, we call the title() function to set the title of the plot. The argument can be any Python string. In our example, we use a formatted string: title('Logistic growth: a={:5.2f}, b={:5.2f}, r={:5.2f}'.format(a, c, r)) We use the format() method of the string class. The formats are placed between braces, as in {:5.2f}, which specifies a floating-point format with five spaces and two digits of precision. 
Each of the format specifiers is then associated sequentially with one of the data arguments of the method. A full documentation covering the details of string formatting is available at https://docs.python.org/2/library/string.html. The axis labels are set in the calls: xlabel('$t$') ylabel('$N(t)=a/(b+ce^{-rt})$') As in the title() functions, the xlabel() and ylabel() functions accept any Python string. Note that in the '$t$' and '$N(t)=a/(b+ce^{-rt}$' strings, we use LaTeX to format the mathematical formulas. This is indicated by the dollar signs, $...$, in the string. After the addition of a title and labels, our graph looks like the following: Next, we need a way to identify each of the curves in the picture. One way to do that is to use a legend, which is indicated as follows: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)]) The legend() function accepts a list of strings. Each string is associated with a curve in the order they are added to the plot. Notice that we are again using formatted strings. Unfortunately, the preceding code does not produce great results. The legend, by default, is placed in the top-right corner of the plot, which, in this case, hides part of the graph. This is easily fixed using the loc option in the legend function, as shown in the following code: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)], loc='upper left') Running this code, we obtain the final version of our logistic growth plot, as follows: The legend location can be any of the strings: 'best', 'upper right', 'upper left', 'lower left', 'lower right', 'right', 'center left', 'center right', 'lower center', 'upper center', and 'center'. It is also possible to specify the location of the legend precisely with the bbox_to_anchor option. To see how this works, modify the code for the legend as follows: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)], bbox_to_anchor=(0.9,0.35)) Notice that the bbox_to_anchor option, by default, uses a coordinate system that is not the same as the one we specified for the plot. The x and y coordinates of the box in the preceding example are interpreted as a fraction of the width and height, respectively, of the whole figure. A little trial-and-error is necessary to place the legend box precisely where we want it. Note that the legend box can be placed outside the plot area. For example, try the coordinates (1.32,1.02). The legend() function is quite flexible and has quite a few other options that are documented at http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend. Text and annotations In this subsection, we will show how to add annotations to plots in matplotlib. We will build a plot demonstrating the fact that the tangent to a curve must be horizontal at the highest and lowest points. We start by defining the function associated with the curve and the set of values at which we want the curve to be plotted, which is shown in the following code: f = lambda x: (x**3 - 6*x**2 + 9*x + 3) / (1 + 0.25*x**2) xvalues = linspace(0, 5, 200) The first line in the preceding code uses a lambda expression to define the f() function. We use this approach here because the formula for the function is a simple, one-line expression. 
The general form of a lambda expression is as follows: lambda <arguments> : <return expression> This expression by itself creates an anonymous function that can be used in any place that a function object is expected. Note that the return value must be a single expression and cannot contain any statements. The formula for the function may seem unusual, but it was chosen by trial-and-error and a little bit of calculus so that it produces a nice graph in the interval from 0 to 5. The xvalues array is defined to contain 200 equally spaced points on this interval. Let's create an initial plot of our curve, as shown in the following code: plot(xvalues, f(xvalues), lw=2, color='FireBrick') axis([0, 5, -1, 8]) grid() xlabel('$x$') ylabel('$f(x)$') title('Extreme values of a function') None # Prevent text output Most of the code in this segment is explained in the previous section. The only new bit is that we use the grid() function to draw a grid. Used with no arguments, the grid coincides with the tick marks on the plot. As everything else in matplotlib, grids are highly customizable. Check the documentation at http://matplotlib.org/1.3.1/api/pyplot_api.html#matplotlib.pyplot.grid. When the preceding code is executed, the following plot is produced: Note that the curve has a highest point (maximum) and a lowest point (minimum). These are collectively called the extreme values of the function (on the displayed interval, this function actually grows without bounds as x becomes large). We would like to locate these on the plot with annotations. We will first store the relevant points as follows: x_min = 3.213 f_min = f(x_min) x_max = 0.698 f_max = f(x_max) p_min = array([x_min, f_min]) p_max = array([x_max, f_max]) print p_min print p_max The variables, x_min and f_min, are defined to be (approximately) the coordinates of the lowest point in the graph. Analogously, x_max and f_max represent the highest point. Don't be concerned with how these points were found. For the purposes of graphing, even a rough approximation by trial-and-error would suffice. Now, add the following code to the cell that draws the plot, right below the title() command, as shown in the following code: arrow_props = dict(facecolor='DimGray', width=3, shrink=0.05,              headwidth=7) delta = array([0.1, 0.1]) offset = array([1.0, .85]) annotate('Maximum', xy=p_max+delta, xytext=p_max+offset,          arrowprops=arrow_props, verticalalignment='bottom',          horizontalalignment='left', fontsize=13) annotate('Minimum', xy=p_min-delta, xytext=p_min-offset,          arrowprops=arrow_props, verticalalignment='top',          horizontalalignment='right', fontsize=13) Run the cell to produce the plot shown in the following diagram: In the code, start by assigning the variables arrow_props, delta, and offset, which will be used to set the arguments in the calls to annotate(). The annotate() function adds a textual annotation to the graph with an optional arrow indicating the point being annotated. The first argument of the function is the text of the annotation. The next two arguments give the locations of the arrow and the text: xy: This is the point being annotated and will correspond to the tip of the arrow. We want this to be the maximum/minimum points, p_min and p_max, but we add/subtract the delta vector so that the tip is a bit removed from the actual point. xytext: This is the point where the text will be placed as well as the base of the arrow. We specify this as offsets from p_min and p_max using the offset vector. 
All other arguments of annotate() are formatting options: arrowprops: This is a Python dictionary containing the arrow properties. We predefine the dictionary, arrow_props, and use it here. Arrows can be quite sophisticated in matplotlib, and you are directed to the documentation for details. verticalalignment and horizontalalignment: These specify how the arrow should be aligned with the text. fontsize: This signifies the size of the text. Text is also highly configurable, and the reader is directed to the documentation for details. The annotate() function has a huge number of options; for complete details of what is available, users should consult the documentation at http://matplotlib.org/1.3.1/api/pyplot_api.html#matplotlib.pyplot.annotate for the full details. We now want to add a comment for what is being demonstrated by the plot by adding an explanatory textbox. Add the following code to the cell right after the calls to annotate(): bbox_props = dict(boxstyle='round', lw=2, fc='Beige') text(2, 6, 'Maximum and minimum pointsnhave horizontal tangents',      bbox=bbox_props, fontsize=12, verticalalignment='top') The text()function is used to place text at an arbitrary position of the plot. The first two arguments are the position of the textbox, and the third argument is a string containing the text to be displayed. Notice the use of 'n' to indicate a line break. The other arguments are configuration options. The bbox argument is a dictionary with the options for the box. If omitted, the text will be displayed without any surrounding box. In the example code, the box is a rectangle with rounded corners, with a border width of 2 pixels and the face color, beige. As a final detail, let's add the tangent lines at the extreme points. Add the following code: plot([x_min-0.75, x_min+0.75], [f_min, f_min],      color='RoyalBlue', lw=3) plot([x_max-0.75, x_max+0.75], [f_max, f_max],      color='RoyalBlue', lw=3) Since the tangents are segments of straight lines, we simply give the coordinates of the endpoints. The reason to add the code for the tangents at the top of the cell is that this causes them to be plotted first so that the graph of the function is drawn at the top of the tangents. This is the final result: The examples we have seen so far only scratch the surface of what is possible with matplotlib. The reader should read the matplotlib documentation for more examples. Summary In this article, we learned how to use matplotlib to produce presentation-quality plots. We covered two-dimensional plots and how to set plot options, and annotate and configure plots. You also learned how to add labels, titles, and legends. Edited on July 27, 2018 to replace a broken reference link. Resources for Article: Further resources on this subject: Installing NumPy, SciPy, matplotlib, and IPython [Article] SciPy for Computational Geometry [Article] Fast Array Operations with NumPy [Article]
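As a closing aid, the following is a condensed, self-contained sketch that pulls the plotting steps above into one runnable script. It reuses the parameter values already given in the text (r=0.15, a=20, c=15, and b=2, 3, 4) and swaps the pylab-style names for explicit numpy and matplotlib imports; nothing else is new.

# Three logistic curves with line styles, markers, axis bounds, labels, and a legend.
import numpy as np
import matplotlib.pyplot as plt

def make_logistic(r, a, b, c):
    def f_logistic(t):
        return a / (b + c * np.exp(-r * t))
    return f_logistic

r, a, c = 0.15, 20.0, 15.0
tvalues = np.linspace(0, 40, 300)

fig, ax = plt.subplots(figsize=(7, 5))
for b, style in zip((2.0, 3.0, 4.0), ('-', ':', '--')):
    f = make_logistic(r, a, b, c)
    ax.plot(tvalues, f(tvalues), linestyle=style, linewidth=2,
            marker='o', markevery=50, label='b={:5.2f}'.format(b))

ax.axis([0, 40, 0, 11])
ax.set_title('Logistic growth: a={:5.2f}, c={:5.2f}, r={:5.2f}'.format(a, c, r))
ax.set_xlabel('$t$')
ax.set_ylabel('$N(t)=a/(b+ce^{-rt})$')
ax.legend(loc='upper left')
plt.show()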

The HBase's Data Storage

Packt
13 Nov 2014
9 min read
In this article by Nishant Garg author of HBase Essentials, we will look at HBase's data storage from its architectural view point. (For more resources related to this topic, see here.) For most of the developers or users, the preceding topics are not of big interest, but for an administrator, it really makes sense to understand how underlying data is stored or replicated within HBase. Administrators are the people who deal with HBase, starting from its installation to cluster management (performance tuning, monitoring, failure, recovery, data security and so on). Let's start with data storage in HBase first. Data storage In HBase, tables are split into smaller chunks that are distributed across multiple servers. These smaller chunks are called regions and the servers that host regions are called RegionServers. The master process handles the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. In HBase implementation, the HRegionServer and HRegion classes represent the region server and the region, respectively. HRegionServer contains the set of HRegion instances available to the client and handles two types of files for data storage: HLog (the write-ahead log file, also known as WAL) HFile (the real data storage file) In HBase, there is a system-defined catalog table called hbase:meta that keeps the list of all the regions for user-defined tables. In older versions prior to 0.96.0, HBase had two catalog tables called-ROOT- and .META. The -ROOT- table was used to keep track of the location of the .META table. Version 0.96.0 onwards, the -ROOT- table is removed. The .META table is renamed as hbase:meta. Now, the location of .META is stored in Zookeeper. The following is the structure of the hbase:meta table. Key—the region key of the format ([table],[region start key],[region id]). A region with an empty start key is the first region in a table. The values are as follows: info:regioninfo(serialized the HRegionInfo instance for this region) info:server(server:port of the RegionServer containing this region) info:serverstartcode(start time of the RegionServer process that contains this region) When the table is split, two new columns will be created as info:splitA and info:splitB. These columns represent the two newly created regions. The values for these columns are also serialized as HRegionInfo instances. Once the split process is complete, the row that contains the old region information is deleted. In the case of data reading, the client application first connects to ZooKeeper and looks up the location of the hbase:meta table. For the next client, the HTable instance queries the hbase:meta table and finds out the region that contains the rows of interest and also locate the region server that is serving the identified region. The information about the region and region server is then cached by the client application for future interactions and avoids the lookup process. If the region is reassigned by the load balancer process or if the region server has expired, fresh lookup is done on the hbase:meta catalog table to get the new location of the user table region and cache is updated accordingly. At the object level, the HRegionServer class is responsible to create a connection with the region by creating HRegion objects. This HRegion instance sets up a store instance that has one or more StoreFile instances (wrapped around HFile) and MemStore. MemStore accumulates the data edits as it happens and buffers them into the memory. 
This is also important for accessing the recent edits of table data. As shown in the preceding diagram, the HRegionServer instance (the region server) contains the map of HRegion instances (regions) and also has an HLog instance that represents the WAL. There is a single block cache instance at the region-server level, which holds data from all the regions hosted on that region server. A block cache instance is created at the time of the region server startup and it can have an implementation of LruBlockCache, SlabCache, or BucketCache. The block cache also supports multilevel caching; that is, a block cache might have first-level cache, L1, as LruBlockCache and second-level cache, L2, as SlabCache or BucketCache. All these cache implementations have their own way of managing the memory; for example, LruBlockCache is like a data structure and resides on the JVM heap whereas the other two types of implementation also use memory outside of the JVM heap. HLog (the write-ahead log – WAL) In the case of writing the data, when the client calls HTable.put(Put), the data is first written to the write-ahead log file (which contains actual data and sequence number together represented by the HLogKey class) and also written in MemStore. Writing data directly into MemStrore can be dangerous as it is a volatile in-memory buffer and always open to the risk of losing data in case of a server failure. Once MemStore is full, the contents of the MemStore are flushed to the disk by creating a new HFile on the HDFS. While inserting data from the HBase shell, the flush command can be used to write the in-memory (memstore) data to the store files. If there is a server failure, the WAL can effectively retrieve the log to get everything up to where the server was prior to the crash failure. Hence, the WAL guarantees that the data is never lost. Also, as another level of assurance, the actual write-ahead log resides on the HDFS, which is a replicated filesystem. Any other server having a replicated copy can open the log. The HLog class represents the WAL. When an HRegion object is instantiated, the single HLog instance is passed on as a parameter to the constructor of HRegion. In the case of an update operation, it saves the data directly to the shared WAL and also keeps track of the changes by incrementing the sequence numbers for each edits. WAL uses a Hadoop SequenceFile, which stores records as sets of key-value pairs. Here, the HLogKey instance represents the key, and the key-value represents the rowkey, column family, column qualifier, timestamp, type, and value along with the region and table name where data needs to be stored. Also, the structure starts with two fixed-length numbers that indicate the size and value of the key. The following diagram shows the structure of a key-value pair: The WALEdit class instance takes care of atomicity at the log level by wrapping each update. For example, in the case of a multicolumn update for a row, each column is represented as a separate KeyValue instance. If the server fails after updating few columns to the WAL, it ends up with only a half-persisted row and the remaining updates are not persisted. Atomicity is guaranteed by wrapping all updates that comprise multiple columns into a single WALEdit instance and writing it in a single operation. For durability, a log writer's sync() method is called, which gets the acknowledgement from the low-level filesystem on each update. 
This method also takes care of writing the WAL to the replication servers (from one datanode to another). The log flush time can be set to as low as you want, or even be kept in sync for every edit to ensure high durability but at the cost of performance. To take care of the size of the write ahead log file, the LogRoller instance runs as a background thread and takes care of rolling log files at certain intervals (the default is 60 minutes). Rolling of the log file can also be controlled based on the size and hbase.regionserver.logroll.multiplier. It rotates the log file when it becomes 90 percent of the block size, if set to 0.9. HFile (the real data storage file) HFile represents the real data storage file. The files contain a variable number of data blocks and fixed number of file info blocks and trailer blocks. The index blocks records the offsets of the data and meta blocks. Each data block contains a magic header and a number of serialized KeyValue instances. The default size of the block is 64 KB and can be as large as the block size. Hence, the default block size for files in HDFS is 64 MB, which is 1,024 times the HFile default block size but there is no correlation between these two blocks. Each key-value in the HFile is represented as a low-level byte array. Within the HBase root directory, we have different files available at different levels. Write-ahead log files represented by the HLog instances are created in a directory called WALs under the root directory defined by the hbase.rootdir property in hbase-site.xml. This WALs directory also contains a subdirectory for each HRegionServer. In each subdirectory, there are several write-ahead log files (because of log rotation). All regions from that region server share the same HLog files. In HBase, every table also has its own directory created under the data/default directory. This data/default directory is located under the root directory defined by the hbase.rootdir property in hbase-site.xml. Each table directory contains a file called .tableinfo within the .tabledesc folder. This .tableinfo file stores the metadata information about the table, such as table and column family schemas, and is represented as the serialized HTableDescriptor class. Each table directory also has a separate directory for every region comprising the table, and the name of this directory is created using the MD5 hash portion of a region name. The region directory also has a .regioninfo file that contains the serialized information of the HRegionInfo instance for the given region. Once the region exceeds the maximum configured region size, it splits and a matching split directory is created within the region directory. This size is configured using the hbase.hregion.max.filesize property or the configuration done at the column-family level using the HColumnDescriptor instance. In the case of multiple flushes by the MemStore, the number of files might get increased on this disk. The compaction process running in the background combines the files to the largest configured file size and also triggers region split. Summary In this article, we have learned about the internals of HBase and how it stores the data. Resources for Article: Further resources on this subject: Big Data Analysis [Article] Advanced Hadoop MapReduce Administration [Article] HBase Administration, Performance Tuning [Article]
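The write path described above (append to the WAL first, buffer in the MemStore, then flush to an immutable store file) can be summarized with a small conceptual model. The following plain Python sketch is an illustration only, not HBase source code; the class, the flush threshold, and the in-memory "file" format are simplified assumptions.

# Every edit is logged before it is buffered; a full buffer becomes a sorted,
# immutable "store file", mimicking HFile creation on a MemStore flush.
class RegionSketch:
    def __init__(self, flush_threshold=3):
        self.wal = []                  # stands in for the shared HLog / WAL
        self.memstore = {}             # recent edits, sorted on flush
        self.store_files = []          # stands in for HFiles on HDFS
        self.flush_threshold = flush_threshold
        self.sequence_id = 0

    def put(self, row_key, value):
        self.sequence_id += 1
        self.wal.append((self.sequence_id, row_key, value))   # durability first
        self.memstore[row_key] = value                          # then the memory buffer
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}

region = RegionSketch()
for i in range(5):
    region.put('row-%03d' % i, 'value-%d' % i)

print(len(region.wal), 'WAL entries,', len(region.store_files), 'store file(s),',
      len(region.memstore), 'edit(s) still in the MemStore')

Running it shows every edit recorded in the log even though some edits are still only in the MemStore, which is exactly why the WAL makes the volatile buffer safe to lose.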

Postmodel Workflow

Packt
04 Nov 2014
23 min read
 This article written by Trent Hauck, the author of scikit-learn Cookbook, Packt Publishing, will cover the following recipes: K-fold cross validation Automatic cross validation Cross validation with ShuffleSplit Stratified k-fold Poor man's grid search Brute force grid search Using dummy estimators to compare results (For more resources related to this topic, see here.) Even though by design the articles are unordered, you could argue by virtue of the art of data science, we've saved the best for last. For the most part, each recipe within this article is applicable to the various models we've worked with. In some ways, you can think about this article as tuning the parameters and features. Ultimately, we need to choose some criteria to determine the "best" model. We'll use various measures to define best. Then in the Cross validation with ShuffleSplit recipe, we will randomize the evaluation across subsets of the data to help avoid overfitting. K-fold cross validation In this recipe, we'll create, quite possibly, the most important post-model validation exercise—cross validation. We'll talk about k-fold cross validation in this recipe. There are several varieties of cross validation, each with slightly different randomization schemes. K-fold is perhaps one of the most well-known randomization schemes. Getting ready We'll create some data and then fit a classifier on the different folds. It's probably worth mentioning that if you can keep a holdout set, then that would be best. For example, we have a dataset where N = 1000. If we hold out 200 data points, then use cross validation between the other 800 points to determine the best parameters. How to do it... First, we'll create some fake data, then we'll examine the parameters, and finally, we'll look at the size of the resulting dataset: >>> N = 1000>>> holdout = 200>>> from sklearn.datasets import make_regression>>> X, y = make_regression(1000, shuffle=True) Now that we have the data, let's hold out 200 points, and then go through the fold scheme like we normally would: >>> X_h, y_h = X[:holdout], y[:holdout]>>> X_t, y_t = X[holdout:], y[holdout:]>>> from sklearn.cross_validation import KFold K-fold gives us the option of choosing how many folds we want, if we want the values to be indices or Booleans, if want to shuffle the dataset, and finally, the random state (this is mainly for reproducibility). Indices will actually be removed in later versions. It's assumed to be True. Let's create the cross validation object: >>> kfold = KFold(len(y_t), n_folds=4) Now, we can iterate through the k-fold object: >>> output_string = "Fold: {}, N_train: {}, N_test: {}">>> for i, (train, test) in enumerate(kfold):       print output_string.format(i, len(y_t[train]),       len(y_t[test]))Fold: 0, N_train: 600, N_test: 200Fold: 1, N_train: 600, N_test: 200Fold: 2, N_train: 600, N_test: 200Fold: 3, N_train: 600, N_test: 200 Each iteration should return the same split size. How it works... It's probably clear, but k-fold works by iterating through the folds and holds out 1/n_folds * N, where N for us was len(y_t). From a Python perspective, the cross validation objects have an iterator that can be accessed by using the in operator. Often times, it's useful to write a wrapper around a cross validation object that will iterate a subset of the data. For example, we may have a dataset that has repeated measures for data points or we may have a dataset with patients and each patient having measures. 
We're going to mix it up and use pandas for this part: >>> import numpy as np>>> import pandas as pd>>> patients = np.repeat(np.arange(0, 100, dtype=np.int8), 8)>>> measurements = pd.DataFrame({'patient_id': patients,                   'ys': np.random.normal(0, 1, 800)}) Now that we have the data, we only want to hold out certain customers instead of data points: >>> custids = np.unique(measurements.patient_id)>>> customer_kfold = KFold(custids.size, n_folds=4)>>> output_string = "Fold: {}, N_train: {}, N_test: {}">>> for i, (train, test) in enumerate(customer_kfold):       train_cust_ids = custids[train]       training = measurements[measurements.patient_id.isin(                 train_cust_ids)]       testing = measurements[~measurements.patient_id.isin(                 train_cust_ids)]       print output_string.format(i, len(training), len(testing))Fold: 0, N_train: 600, N_test: 200Fold: 1, N_train: 600, N_test: 200Fold: 2, N_train: 600, N_test: 200Fold: 3, N_train: 600, N_test: 200 Automatic cross validation We've looked at the using cross validation iterators that scikit-learn comes with, but we can also use a helper function to perform cross validation for use automatically. This is similar to how other objects in scikit-learn are wrapped by helper functions, pipeline for instance. Getting ready First, we'll need to create a sample classifier; this can really be anything, a decision tree, a random forest, whatever. For us, it'll be a random forest. We'll then create a dataset and use the cross validation functions. How to do it... First import the ensemble module and we'll get started: >>> from sklearn import ensemble>>> rf = ensemble.RandomForestRegressor(max_features='auto') Okay, so now, let's create some regression data: >>> from sklearn import datasets>>> X, y = datasets.make_regression(10000, 10) Now that we have the data, we can import the cross_validation module and get access to the functions we'll use: >>> from sklearn import cross_validation>>> scores = cross_validation.cross_val_score(rf, X, y)>>> print scores[ 0.86823874 0.86763225 0.86986129] How it works... For the most part, this will delegate to the cross validation objects. One nice thing is that, the function will handle performing the cross validation in parallel. We can activate verbose mode play by play: >>> scores = cross_validation.cross_val_score(rf, X, y, verbose=3, cv=4)[CV] no parameters to be set[CV] no parameters to be set, score=0.872866 - 0.7s[CV] no parameters to be set[CV] no parameters to be set, score=0.873679 - 0.6s[CV] no parameters to be set[CV] no parameters to be set, score=0.878018 - 0.7s[CV] no parameters to be set[CV] no parameters to be set, score=0.871598 - 0.6s[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 0.7s[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 2.6s finished As we can see, during each iteration, we scored the function. We also get an idea of how long the model runs. It's also worth knowing that we can score our function predicated on which kind of model we're trying to fit. Cross validation with ShuffleSplit ShuffleSplit is one of the simplest cross validation techniques. This cross validation technique will simply take a sample of the data for the number of iterations specified. Getting ready ShuffleSplit is another cross validation technique that is very simple. We'll specify the total elements in the dataset, and it will take care of the rest. We'll walk through an example of estimating the mean of a univariate dataset. 
This is somewhat similar to resampling, but it'll illustrate one reason why we want to use cross validation while showing cross validation. How to do it... First, we need to create the dataset. We'll use NumPy to create a dataset, where we know the underlying mean. We'll sample half of the dataset to estimate the mean and see how close it is to the underlying mean: >>> import numpy as np>>> true_loc = 1000>>> true_scale = 10>>> N = 1000>>> dataset = np.random.normal(true_loc, true_scale, N)>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.hist(dataset, color='k', alpha=.65, histtype='stepfilled');>>> ax.set_title("Histogram of dataset");>>> f.savefig("978-1-78398-948-5_06_06.png") NumPy will give the following output: Now, let's take the first half of the data and guess the mean: >>> from sklearn import cross_validation>>> holdout_set = dataset[:500]>>> fitting_set = dataset[500:]>>> estimate = fitting_set[:N/2].mean()>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.set_title("True Mean vs Regular Estimate")>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,             alpha=.65, label='true mean')>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,             alpha=.65, label='regular estimate')>>> ax.set_xlim(999, 1001)>>> ax.legend()>>> f.savefig("978-1-78398-948-5_06_07.png") We'll get the following output: Now, we can use ShuffleSplit to fit the estimator on several smaller datasets: >>> from sklearn.cross_validation import ShuffleSplit>>> shuffle_split = ShuffleSplit(len(fitting_set))>>> mean_p = []>>> for train, _ in shuffle_split:       mean_p.append(fitting_set[train].mean())       shuf_estimate = np.mean(mean_p)>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,             alpha=.65, label='true mean')>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,             alpha=.65, label='regular estimate')>>> ax.vlines(shuf_estimate, 0, 1, color='b', linestyles='-', lw=5,             alpha=.65, label='shufflesplit estimate')>>> ax.set_title("All Estimates")>>> ax.set_xlim(999, 1001)>>> ax.legend(loc=3) The output will be as follows: As we can see, we got an estimate that was similar to what we expected, but we were able to take many samples to get that estimate. Stratified k-fold In this recipe, we'll quickly look at stratified k-fold valuation. We've walked through different recipes where the class representation was unbalanced in some manner. Stratified k-fold is nice because its scheme is specifically designed to maintain the class proportions. Getting ready We're going to create a small dataset. In this dataset, we will then use stratified k-fold validation. We want it small so that we can see the variation. For larger samples. it probably won't be as big of a deal. We'll then plot the class proportions at each step to illustrate how the class proportions are maintained: >>> from sklearn import datasets>>> X, y = datasets.make_classification(n_samples=int(1e3), weights=[1./11]) Let's check the overall class weight distribution: >>> y.mean()0.90300000000000002 Roughly, 90.5 percent of the samples are 1, with the balance 0. How to do it... Let's create a stratified k-fold object and iterate it through each fold. We'll measure the proportion of verse that are 1. After that we'll plot the proportion of classes by the split number to see how and if it changes. 
Stratified k-fold

In this recipe, we'll quickly look at stratified k-fold validation. We've walked through different recipes where the class representation was unbalanced in some manner. Stratified k-fold is nice because its scheme is specifically designed to maintain the class proportions.

Getting ready

We're going to create a small dataset and then use stratified k-fold validation on it. We want it small so that we can see the variation; for larger samples, it probably won't be as big of a deal. We'll then plot the class proportions at each step to illustrate how they are maintained:

>>> from sklearn import datasets
>>> X, y = datasets.make_classification(n_samples=int(1e3), weights=[1./11])

Let's check the overall class weight distribution:

>>> y.mean()
0.90300000000000002

Roughly 90 percent of the samples are 1, with the balance 0.

How to do it...

Let's create a stratified k-fold object and iterate through each fold. We'll measure the proportion of the training labels that are 1. After that, we'll plot the proportion of classes by the split number to see how and if it changes. This code will hopefully illustrate how stratification is beneficial. We'll also compare it against a basic ShuffleSplit:

>>> from sklearn import cross_validation
>>> n_folds = 50
>>> strat_kfold = cross_validation.StratifiedKFold(y,
                  n_folds=n_folds)
>>> shuff_split = cross_validation.ShuffleSplit(n=len(y),
                  n_iter=n_folds)
>>> kfold_y_props = []
>>> shuff_y_props = []
>>> for (k_train, k_test), (s_train, s_test) in zip(strat_kfold,
        shuff_split):
        kfold_y_props.append(y[k_train].mean())
        shuff_y_props.append(y[s_train].mean())

Now, let's plot the proportions over each fold:

>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.plot(range(n_folds), shuff_y_props, label="ShuffleSplit",
            color='k')
>>> ax.plot(range(n_folds), kfold_y_props, label="Stratified",
            color='k', ls='--')
>>> ax.set_title("Comparing class proportions.")
>>> ax.legend(loc='best')

In the resulting plot, we can see that the class proportion in each stratified k-fold training set is stable across folds, while the ShuffleSplit proportions bounce around.

How it works...

Stratified k-fold works by looking at the y values: it first computes the overall proportion of each class, then splits the training and test sets so that those proportions are preserved. This generalizes to more than two classes:

>>> import numpy as np
>>> three_classes = np.random.choice([1,2,3], p=[.1, .4, .5],
                    size=1000)
>>> for train, test in cross_validation.StratifiedKFold(three_classes, 5):
        print np.bincount(three_classes[train])
[  0  90 314 395]
[  0  90 314 395]
[  0  90 314 395]
[  0  91 315 395]
[  0  91 315 396]

As we can see, each training fold contains roughly the same number of samples from each class, in proportion to the overall dataset.
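If we want to put a number on how much stratification helps, we can compare the spread of the two series collected above. This is just a quick check, not part of the original recipe, and it reuses the kfold_y_props and shuff_y_props lists from the preceding code:

>>> import numpy as np
>>> # standard deviation of the class-1 proportion across training folds
>>> print "stratified:   ", np.std(kfold_y_props)
>>> print "shuffle split:", np.std(shuff_y_props)
>>> # the stratified spread should be essentially zero; the shuffled one larger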
Poor man's grid search

In this recipe, we're going to introduce grid search with basic Python, though we will use sklearn for the models and matplotlib for the visualization.

Getting ready

In this recipe, we will perform the following tasks:
1. Design a basic search grid in the parameter space.
2. Iterate through the grid and check the loss/score function at each point in the parameter space for the dataset.
3. Choose the point in the parameter space that minimizes/maximizes the evaluation function.

Also, the model we'll fit is a basic decision tree classifier. Our parameter space will be two dimensional to help us with the visualization: one dimension is the split criterion, {gini, entropy}, and the other is the maximum number of features considered at each split, {auto, log2, None}. The parameter space will then be the Cartesian product of those two sets. We'll see in a bit how we can iterate through this space with itertools.

Let's create the dataset and then get started:

>>> from sklearn import datasets
>>> X, y = datasets.make_classification(n_samples=2000, n_features=10)

How to do it...

Earlier we said that we'd use grid search to tune two parameters: criteria and max_features. We need to represent those as Python sets and then use itertools.product to iterate through them:

>>> criteria = {'gini', 'entropy'}
>>> max_features = {'auto', 'log2', None}
>>> import itertools as it
>>> parameter_space = it.product(criteria, max_features)

Great! So now that we have the parameter space, let's iterate through it and check the accuracy of each model as specified by the parameters. Then, we'll store that accuracy so that we can compare the different parameter settings. We'll also use a 50/50 train/test split:

import numpy as np
train_set = np.random.choice([True, False], size=len(y))
from sklearn.tree import DecisionTreeClassifier
accuracies = {}
for criterion, max_feature in parameter_space:
    dt = DecisionTreeClassifier(criterion=criterion,
                                max_features=max_feature)
    dt.fit(X[train_set], y[train_set])
    accuracies[(criterion, max_feature)] = (dt.predict(X[~train_set])
                                            == y[~train_set]).mean()

>>> accuracies
{('entropy', None): 0.974609375, ('entropy', 'auto'): 0.9736328125,
('entropy', 'log2'): 0.962890625, ('gini', None): 0.9677734375,
('gini', 'auto'): 0.9638671875, ('gini', 'log2'): 0.96875}

So now we have the accuracy for each point in the parameter space. Let's visualize the performance:

>>> from matplotlib import pyplot as plt
>>> from matplotlib import cm
>>> cmap = cm.RdBu_r
>>> f, ax = plt.subplots(figsize=(7, 4))
>>> ax.set_xticklabels([''] + list(criteria))
>>> ax.set_yticklabels([''] + list(max_features))
>>> plot_array = []
>>> for max_feature in max_features:
        m = []
        for criterion in criteria:
            m.append(accuracies[(criterion, max_feature)])
        plot_array.append(m)
>>> colors = ax.matshow(plot_array, vmin=np.min(accuracies.values())
             - 0.001, vmax=np.max(accuracies.values()) + 0.001,
             cmap=cmap)
>>> f.colorbar(colors)

The resulting heatmap makes it fairly easy to see which combination performed best. Hopefully, you can see how this process can be taken further with a brute force method.

How it works...

This works fairly simply; we just have to perform the following steps:
1. Choose a set of parameters.
2. Iterate through them and find the accuracy of each combination.
3. Find the best performer by visual inspection.
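Visual inspection works fine for a two-by-three grid, but we can also pick the winner programmatically from the accuracies dictionary built above. The following lines are just a convenience, not part of the original recipe:

>>> # the key with the highest accuracy is the best (criterion, max_features) pair
>>> best_params = max(accuracies, key=accuracies.get)
>>> print best_params, accuracies[best_params]
('entropy', None) 0.974609375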
Brute force grid search

In this recipe, we'll do an exhaustive grid search through scikit-learn. This is basically the same thing we did in the previous recipe, but we'll utilize built-in methods. We'll also walk through an example of performing randomized optimization, which is an alternative to brute force search: with brute force, we're essentially trading computer cycles to make sure that we search the entire space. The search space was fairly small in the last recipe. However, you could imagine a model that has several steps, first imputation to fix missing data, then PCA to reduce the dimensionality, and finally classification. Your parameter space could get very large very fast; therefore, it can be advantageous to search only a part of that space.

Getting ready

To get started, we'll need to perform the following steps:
1. Create some classification data.
2. Create a LogisticRegression object that will be the model we're fitting.
3. Create the search objects, GridSearchCV and RandomizedSearchCV.

How to do it...

Run the following code to create some classification data:

>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(1000, n_features=5)

Now, we'll create our logistic regression object:

>>> from sklearn.linear_model import LogisticRegression
>>> lr = LogisticRegression(class_weight='auto')

We need to specify the parameters we want to search. For GridSearchCV, we can just specify the ranges that we care about, but for RandomizedSearchCV, we need to specify a distribution over the same space from which to sample:

>>> lr.fit(X, y)
LogisticRegression(C=1.0, class_weight={0: 0.25, 1: 0.75},
                   dual=False, fit_intercept=True,
                   intercept_scaling=1, penalty='l2',
                   random_state=None, tol=0.0001)
>>> grid_search_params = {'penalty': ['l1', 'l2'],
                          'C': [1, 2, 3, 4]}

The only change we need to make is to describe the C parameter as a probability distribution. We'll keep it simple right now, though we will use scipy to describe the distribution:

>>> import scipy.stats as st
>>> import numpy as np
>>> random_search_params = {'penalty': ['l1', 'l2'],
                            'C': st.randint(1, 4)}

How it works...

Now, we'll fit the classifier. This works by passing lr to the parameter search objects:

>>> from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
>>> gs = GridSearchCV(lr, grid_search_params)

GridSearchCV implements the same API as the other models:

>>> gs.fit(X, y)
GridSearchCV(cv=None, estimator=LogisticRegression(C=1.0,
             class_weight='auto', dual=False, fit_intercept=True,
             intercept_scaling=1, penalty='l2', random_state=None,
             tol=0.0001), fit_params={}, iid=True, loss_func=None,
             n_jobs=1, param_grid={'penalty': ['l1', 'l2'], 'C':
             [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True,
             score_func=None, scoring=None, verbose=0)

As we can see in the param_grid parameter, our penalty and C candidates are both lists.

To access the scores, we can use the grid_scores_ attribute of the grid search. We also want to find the optimal set of parameters, and we can look at the marginal performance of the grid search:

>>> gs.grid_scores_
[mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 1},
 mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 2},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 2},
 mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 3},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 3},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l1', 'C': 4},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 4}]

We might want to get the maximum score:

>>> gs.grid_scores_[1][1]
0.90100000000000002
>>> max(gs.grid_scores_, key=lambda x: x[1])
mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1}

The parameters obtained this way are the best choices for our logistic regression.
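The random_search_params dictionary we defined above can be used in much the same way. The following is a sketch rather than part of the original recipe: the n_iter value is an arbitrary choice, and best_params_ and best_score_ are the standard attributes both search objects expose for retrieving the winner:

>>> # randomized search samples parameter settings instead of trying them all
>>> rs = RandomizedSearchCV(lr, random_search_params, n_iter=5)
>>> rs.fit(X, y)
>>> print rs.best_params_, rs.best_score_
>>> # the grid search object offers the same shortcut, equivalent to the max() above
>>> print gs.best_params_, gs.best_score_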
Using dummy estimators to compare results

This recipe is about creating fake estimators; this isn't pretty or exciting stuff, but it is worthwhile to have a reference point for the model you'll eventually build.

Getting ready

In this recipe, we'll perform the following tasks:
1. Create some random data.
2. Fit the various dummy estimators.

We'll perform these two steps for regression data and classification data.

How to do it...

First, we'll create the random data:

>>> from sklearn.datasets import make_regression, make_classification
# classification is for later
>>> X, y = make_regression()
>>> from sklearn import dummy
>>> dumdum = dummy.DummyRegressor()
>>> dumdum.fit(X, y)
DummyRegressor(constant=None, strategy='mean')

By default, the estimator will predict by just taking the mean of the target values:

>>> dumdum.predict(X)[:5]
array([ 2.23297907, 2.23297907, 2.23297907, 2.23297907, 2.23297907])

There are two other strategies we can try. We can predict a supplied constant (refer to constant=None in the preceding output), and we can also predict the median value. Supplying a constant will only be considered if strategy is "constant". Let's have a look:

>>> predictors = [("mean", None),
                  ("median", None),
                  ("constant", 10)]
>>> for strategy, constant in predictors:
        dumdum = dummy.DummyRegressor(strategy=strategy,
                                      constant=constant)
        dumdum.fit(X, y)
        print "strategy: {}".format(strategy), ",".join(map(str,
              dumdum.predict(X)[:5]))
strategy: mean 2.23297906733,2.23297906733,2.23297906733,2.23297906733,2.23297906733
strategy: median 20.38535248,20.38535248,20.38535248,20.38535248,20.38535248
strategy: constant 10.0,10.0,10.0,10.0,10.0

We actually have four options for classifiers. These strategies are similar to the continuous case, just slanted toward classification problems:

>>> predictors = [("constant", 0),
                  ("stratified", None),
                  ("uniform", None),
                  ("most_frequent", None)]

We'll also need to create some classification data:

>>> X, y = make_classification()
>>> for strategy, constant in predictors:
        dumdum = dummy.DummyClassifier(strategy=strategy,
                                       constant=constant)
        dumdum.fit(X, y)
        print "strategy: {}".format(strategy), ",".join(map(str,
              dumdum.predict(X)[:5]))
strategy: constant 0,0,0,0,0
strategy: stratified 1,0,0,1,0
strategy: uniform 0,0,0,1,1
strategy: most_frequent 1,1,1,1,1

How it works...

It's always good to test your models against the simplest models, and that's exactly what the dummy estimators give you. For example, imagine a fraud model where only 5 percent of the dataset is fraud. We can probably fit a pretty good-looking model just by never guessing any fraud. We can create that model by using the most_frequent strategy, as in the following code, which also gives a good example of why class imbalance causes problems:

>>> X, y = make_classification(20000, weights=[.95, .05])
>>> dumdum = dummy.DummyClassifier(strategy='most_frequent')
>>> dumdum.fit(X, y)
DummyClassifier(constant=None, random_state=None,
                strategy='most_frequent')
>>> from sklearn.metrics import accuracy_score
>>> print accuracy_score(y, dumdum.predict(X))
0.94575

We were actually correct very often, but that's not the point. The point is that this is our baseline. If we cannot create a model for fraud that is more accurate than this, then it isn't worth our time.

Summary

This article showed how to evaluate a basic model with different cross validation strategies, tune it with both a hand-rolled and a built-in grid search (including randomized search), and benchmark it against dummy estimators, so that we can achieve better results than we could with the basic model alone.