

Supervised learning

Packt
19 Dec 2014
50 min read
In this article by Dan Toomey, author of the book R for Data Science, we will learn about the supervised learning, which involves the use of a target variable and a number of predictor variables that are put into a model to enable the system to predict the target. This is also known as predictive modeling. (For more resources related to this topic, see here.) As mentioned, in supervised learning we have a target variable and a number of possible predictor variables. The objective is to associate the predictor variables in such a way so as to accurately predict the target variable. We are using some portion of observed data to learn how our model behaves and then testing that model on the remaining observations for accuracy. We will go over the following supervised learning techniques: Decision trees Regression Neural networks Instance based learning (k-NN) Ensemble learning Support vector machines Bayesian learning Bayesian inference Random forests Decision tree For decision tree machine learning, we develop a logic tree that can be used to predict our target value based on a number of predictor variables. The tree has logical points, such as if the month is December, follow the tree logic to the left; otherwise, follow the tree logic to the right. The last leaf of the tree has a predicted value. For this example, we will use the weather data in the rattle package. We will develop a decision tree to be used to determine whether it will rain tomorrow or not based on several variables. Let's load the rattle package as follows: > library(rattle) We can see a summary of the weather data. This shows that we have some real data over a year from Australia: > summary(weather)      Date                     Location     MinTemp     Min.   :2007-11-01   Canberra     :366   Min.   :-5.300 1st Qu.:2008-01-31   Adelaide     : 0   1st Qu.: 2.300 Median :2008-05-01   Albany       : 0   Median : 7.450 Mean   :2008-05-01   Albury       : 0   Mean   : 7.266 3rd Qu.:2008-07-31   AliceSprings : 0   3rd Qu.:12.500 Max.   :2008-10-31   BadgerysCreek: 0   Max.   :20.900                      (Other)     : 0                      MaxTemp         Rainfall       Evaporation       Sunshine     Min.   : 7.60   Min.   : 0.000   Min.  : 0.200   Min.   : 0.000 1st Qu.:15.03   1st Qu.: 0.000   1st Qu.: 2.200   1st Qu.: 5.950 Median :19.65   Median : 0.000   Median : 4.200   Median : 8.600 Mean   :20.55   Mean   : 1.428   Mean   : 4.522   Mean   : 7.909 3rd Qu.:25.50   3rd Qu.: 0.200   3rd Qu.: 6.400   3rd Qu.:10.500 Max.   :35.80   Max.   :39.800   Max.   :13.800   Max.   :13.600                                                    NA's   :3       WindGustDir   WindGustSpeed   WindDir9am   WindDir3pm NW     : 73   Min.   :13.00   SE     : 47   WNW   : 61 NNW   : 44   1st Qu.:31.00   SSE   : 40   NW     : 61 E     : 37   Median :39.00   NNW   : 36   NNW   : 47 WNW   : 35   Mean   :39.84   N     : 31   N     : 30 ENE   : 30   3rd Qu.:46.00   NW     : 30   ESE   : 27 (Other):144   Max.   :98.00   (Other):151   (Other):139 NA's   : 3   NA's   :2       NA's   : 31   NA's   : 1 WindSpeed9am     WindSpeed3pm   Humidity9am     Humidity3pm   Min.   : 0.000   Min.   : 0.00   Min.   :36.00   Min.   :13.00 1st Qu.: 6.000   1st Qu.:11.00   1st Qu.:64.00   1st Qu.:32.25 Median : 7.000   Median :17.00   Median :72.00   Median :43.00 Mean   : 9.652   Mean   :17.99   Mean   :72.04   Mean   :44.52 3rd Qu.:13.000   3rd Qu.:24.00   3rd Qu.:81.00   3rd Qu.:55.00 Max.   :41.000   Max.   :52.00   Max.   :99.00   Max.   
:96.00 NA's   :7                                                       Pressure9am     Pressure3pm       Cloud9am       Cloud3pm   Min.   : 996.5   Min.   : 996.8   Min.   :0.000   Min.   :0.000 1st Qu.:1015.4   1st Qu.:1012.8   1st Qu.:1.000   1st Qu.:1.000 Median :1020.1   Median :1017.4   Median :3.500   Median :4.000 Mean   :1019.7   Mean   :1016.8   Mean   :3.891   Mean   :4.025 3rd Qu.:1024.5   3rd Qu.:1021.5   3rd Qu.:7.000   3rd Qu.:7.000 Max.   :1035.7   Max.   :1033.2   Max.   :8.000   Max.   :8.000 Temp9am         Temp3pm         RainToday RISK_MM Min.   : 0.100   Min.   : 5.10   No :300   Min.   : 0.000 1st Qu.: 7.625   1st Qu.:14.15   Yes: 66   1st Qu.: 0.000 Median :12.550   Median :18.55             Median : 0.000 Mean   :12.358   Mean   :19.23             Mean   : 1.428 3rd Qu.:17.000   3rd Qu.:24.00             3rd Qu.: 0.200 Max.   :24.700   Max.   :34.50           Max.   :39.800                                                            RainTomorrow No :300     Yes: 66       We will be using the rpart function to develop a decision tree. The rpart function looks like this: rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...) The various parameters of the rpart function are described in the following table: Parameter Description formula This is the formula used for the prediction. data This is the data matrix. weights These are the optional weights to be applied. subset This is the optional subset of rows of data to be used. na.action This specifies the action to be taken when y, the target value, is missing. method This is the method to be used to interpret the data. It should be one of these: anova, poisson, class, or exp. If not specified, the algorithm decides based on the layout of the data. … These are the additional parameters to be used to control the behavior of the algorithm.  
Let's create a subset as follows: > weather2 <- subset(weather,select=-c(RISK_MM)) > install.packages("rpart") >library(rpart) > model <- rpart(formula=RainTomorrow ~ .,data=weather2, method="class") > summary(model) Call: rpart(formula = RainTomorrow ~ ., data = weather2, method = "class") n= 366   CPn split       rel error     xerror   xstd 1 0.19696970     0 1.0000000 1.0000000 0.1114418 2 0.09090909      1 0.8030303 0.9696970 0.1101055 3 0.01515152     2 0.7121212 1.0151515 0.1120956 4 0.01000000     7 0.6363636 0.9090909 0.1073129   Variable importance Humidity3pm WindGustSpeed     Sunshine WindSpeed3pm       Temp3pm            24           14          12             8             6 Pressure3pm       MaxTemp       MinTemp   Pressure9am       Temp9am            6             5             4             4             4 Evaporation         Date   Humidity9am     Cloud3pm     Cloud9am             3             3             2             2             1      Rainfall            1 Node number 1: 366 observations,   complexity param=0.1969697 predicted class=No   expected loss=0.1803279 P(node) =1    class counts:   300   66    probabilities: 0.820 0.180 left son=2 (339 obs) right son=3 (27 obs) Primary splits:    Humidity3pm < 71.5   to the left, improve=18.31013, (0 missing)    Pressure3pm < 1011.9 to the right, improve=17.35280, (0 missing)    Cloud3pm   < 6.5     to the left, improve=16.14203, (0 missing)    Sunshine   < 6.45   to the right, improve=15.36364, (3 missing)    Pressure9am < 1016.35 to the right, improve=12.69048, (0 missing) Surrogate splits:    Sunshine < 0.45   to the right, agree=0.945, adj=0.259, (0 split) (many more)… As you can tell, the model is complicated. The summary shows the progression of the model development using more and more of the data to fine-tune the tree. We will be using the rpart.plot package to display the decision tree in a readable manner as follows: > library(rpart.plot) > fancyRpartPlot(model,main="Rain Tomorrow",sub="Chapter 12") This is the output of the fancyRpartPlot function Now, we can follow the logic of the decision tree easily. For example, if the humidity is over 72, we are predicting it will rain. Regression We can use a regression to predict our target value by producing a regression model from our predictor variables. We will be using the forest fire data from http://archive.ics.uci.edu. We will load the data and get the following summary: > forestfires <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv") > summary(forestfires)        X               Y           month     day         FFMC     Min.   :1.000   Min.   :2.0   aug   :184   fri:85 Min.   :18.70 1st Qu.:3.000   1st Qu.:4.0   sep   :172   mon:74   1st Qu.:90.20 Median :4.000   Median :4.0   mar   : 54   sat:84   Median :91.60 Mean   :4.669   Mean   :4.3   jul   : 32   sun:95   Mean   :90.64 3rd Qu.:7.000   3rd Qu.:5.0  feb   : 20   thu:61   3rd Qu.:92.90 Max.   :9.000   Max.   :9.0   jun   : 17   tue:64   Max.   :96.20                                (Other): 38   wed:54                      DMC             DC             ISI             temp     Min.   : 1.1   Min.   : 7.9   Min.   : 0.000   Min.   : 2.20 1st Qu.: 68.6   1st Qu.:437.7   1st Qu.: 6.500   1st Qu.:15.50 Median :108.3   Median :664.2   Median : 8.400   Median :19.30 Mean   :110.9   Mean   :547.9   Mean   : 9.022   Mean   :18.89 3rd Qu.:142.4   3rd Qu.:713.9   3rd Qu.:10.800   3rd Qu.:22.80 Max.   :291.3   Max.   :860.6   Max.   :56.100   Max.   
:33.30                                                                         RH             wind           rain             area       Min.   : 15.00   Min.   :0.400   Min.   :0.00000   Min.   :   0.00 1st Qu.: 33.00   1st Qu.:2.700   1st Qu.:0.00000   1st Qu.:   0.00 Median : 42.00   Median :4.000   Median :0.00000   Median :   0.52 Mean   : 44.29   Mean   :4.018   Mean   :0.02166   Mean   : 12.85 3rd Qu.: 53.00   3rd Qu.:4.900   3rd Qu.:0.00000   3rd Qu.:   6.57 Max.   :100.00   Max.   :9.400   Max.   :6.40000   Max.   :1090.84 I will just use the month, temperature, wind, and rain data to come up with a model of the area (size) of the fires using the lm function. The lm function looks like this: lm(formula, data, subset, weights, na.action,    method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,    singular.ok = TRUE, contrasts = NULL, offset, ...) The various parameters of the lm function are described in the following table: Parameter Description formula This is the formula to be used for the model data This is the dataset subset This is the subset of dataset to be used weights These are the weights to apply to factors … These are the additional parameters to be added to the function Let's load the data as follows: > model <- lm(formula = area ~ month + temp + wind + rain, data=forestfires) Looking at the generated model, we see the following output: > summary(model) Call: lm(formula = area ~ month + temp + wind + rain, data = forestfires) Residuals:    Min     1Q Median     3Q     Max -33.20 -14.93   -9.10   -1.66 1063.59 Coefficients:            Estimate Std. Error t value Pr(>|t|) (Intercept) -17.390     24.532 -0.709   0.4787 monthaug     -10.342     22.761 -0.454   0.6498 monthdec     11.534     30.896   0.373   0.7091 monthfeb       2.607     25.796   0.101   0.9196 monthjan       5.988     50.493   0.119   0.9056 monthjul     -8.822    25.068 -0.352   0.7251 monthjun     -15.469     26.974 -0.573   0.5666 monthmar     -6.630     23.057 -0.288   0.7738 monthmay       6.603     50.053   0.132   0.8951 monthnov     -8.244     67.451 -0.122   0.9028 monthoct     -8.268    27.237 -0.304   0.7616 monthsep     -1.070     22.488 -0.048   0.9621 temp           1.569     0.673   2.332   0.0201 * wind           1.581     1.711   0.924   0.3557 rain         -3.179     9.595 -0.331   0.7406 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1   Residual standard error: 63.99 on 502 degrees of freedom Multiple R-squared: 0.01692, Adjusted R-squared: -0.0105 F-statistic: 0.617 on 14 and 502 DF, p-value: 0.8518 Surprisingly, the month has a significant effect on the size of the fires. I would have guessed that whether or not the fires occurred in August or similar months would have effected any discernable difference. Also, the temperature has such a minimal effect. Further, the model is using the month data as categorical. If we redevelop the model (without temperature), we have a better fit (notice the multiple R-squared value drops to 0.006 from 0.01), as shown here: > model <- lm(formula = area ~ month + wind + rain, data=forestfires) > summary(model)   Call: lm(formula = area ~ month + wind + rain, data = forestfires)   Residuals:    Min     1Q Median     3Q     Max -22.17 -14.39 -10.46   -3.87 1072.43   Coefficients:           Estimate Std. 
Error t value Pr(>|t|) (Intercept)   4.0126   22.8496   0.176   0.861 monthaug     4.3132   21.9724   0.196   0.844 monthdec     1.3259   30.7188   0.043   0.966 monthfeb     -1.6631   25.8441 -0.064   0.949 monthjan     -6.1034   50.4475 -0.121   0.904 monthjul     6.4648   24.3021   0.266   0.790 monthjun     -2.4944   26.5099 -0.094   0.925 monthmar     -4.8431   23.1458 -0.209   0.834 monthmay     10.5754   50.2441   0.210   0.833 monthnov     -8.7169   67.7479 -0.129   0.898 monthoct     -0.9917   27.1767 -0.036   0.971 monthsep     10.2110   22.0579   0.463   0.644 wind         1.0454     1.7026   0.614   0.540 rain         -1.8504     9.6207 -0.192   0.848   Residual standard error: 64.27 on 503 degrees of freedom Multiple R-squared: 0.006269, Adjusted R-squared: -0.01941 F-statistic: 0.2441 on 13 and 503 DF, p-value: 0.9971 From the results, we can see R-squared of close to 0 and p-value almost 1; this is a very good fit. If you plot the model, you will get a series of graphs. The plot of the residuals versus fitted values is the most revealing, as shown in the following graph: > plot(model) You can see from the graph that the regression model is very accurate: Neural network In a neural network, it is assumed that there is a complex relationship between the predictor variables and the target variable. The network allows the expression of each of these relationships. For this model, we will use the liver disorder data from http://archive.ics.uci.edu. The data has a few hundred observations from patients with liver disorders. The variables are various measures of blood for each patient as shown here: > bupa <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data") > colnames(bupa) <- c("mcv","alkphos","alamine","aspartate","glutamyl","drinks","selector") > summary(bupa)      mcv           alkphos         alamine     Min.   : 65.00   Min.   : 23.00   Min.   : 4.00 1st Qu.: 87.00   1st Qu.: 57.00   1st Qu.: 19.00 Median : 90.00   Median : 67.00   Median : 26.00 Mean   : 90.17   Mean   : 69.81   Mean   : 30.36 3rd Qu.: 93.00   3rd Qu.: 80.00   3rd Qu.: 34.00 Max.   :103.00   Max.   :138.00   Max.   :155.00    aspartate       glutamyl         drinks     Min.   : 5.00   Min.   : 5.00   Min.  : 0.000 1st Qu.:19.00   1st Qu.: 15.00   1st Qu.: 0.500 Median :23.00   Median : 24.50   Median : 3.000 Mean   :24.64   Mean   : 38.31   Mean   : 3.465 3rd Qu.:27.00   3rd Qu.: 46.25   3rd Qu.: 6.000 Max.   :82.00   Max.   :297.00   Max. :20.000    selector   Min.   :1.000 1st Qu.:1.000 Median :2.000 Mean   :1.581 3rd Qu.:2.000 Max.   :2.000 We generate a neural network using the neuralnet function. The neuralnet function looks like this: neuralnet(formula, data, hidden = 1, threshold = 0.01,                stepmax = 1e+05, rep = 1, startweights = NULL,          learningrate.limit = NULL,          learningrate.factor = list(minus = 0.5, plus = 1.2),          learningrate=NULL, lifesign = "none",          lifesign.step = 1000, algorithm = "rprop+",          err.fct = "sse", act.fct = "logistic",          linear.output = TRUE, exclude = NULL,          constant.weights = NULL, likelihood = FALSE) The various parameters of the neuralnet function are described in the following table: Parameter Description formula This is the formula to converge. data This is the data matrix of predictor values. hidden This is the number of hidden neurons in each layer. stepmax This is the maximum number of steps in each repetition. Default is 1+e5. 
rep This is the number of repetitions. Let's generate the neural network as follows: > nn <- neuralnet(selector~mcv+alkphos+alamine+aspartate+glutamyl+drinks, data=bupa, linear.output=FALSE, hidden=2) We can see how the model was developed via the result.matrix variable in the following output: > nn$result.matrix                                      1 error                 100.005904355153 reached.threshold       0.005904330743 steps                 43.000000000000 Intercept.to.1layhid1   0.880621509705 mcv.to.1layhid1       -0.496298308044 alkphos.to.1layhid1     2.294158313786 alamine.to.1layhid1     1.593035613921 aspartate.to.1layhid1 -0.407602506759 glutamyl.to.1layhid1   -0.257862634340 drinks.to.1layhid1     -0.421390527261 Intercept.to.1layhid2   0.806928998059 mcv.to.1layhid2       -0.531926150470 alkphos.to.1layhid2     0.554627946150 alamine.to.1layhid2     1.589755874579 aspartate.to.1layhid2 -0.182482440722 glutamyl.to.1layhid2   1.806513419058 drinks.to.1layhid2     0.215346602241 Intercept.to.selector   4.485455617018 1layhid.1.to.selector   3.328527160621 1layhid.2.to.selector   2.616395644587 The process took 43 steps to come up with the neural network once the threshold was under 0.01 (0.005 in this case). You can see the relationships between the predictor values. Looking at the network developed, we can see the hidden layers of relationship among the predictor variables. For example, sometimes mcv combines at one ratio and on other times at another ratio, depending on its value. Let's load the neural network as follows: > plot(nn) Instance-based learning R programming has a nearest neighbor algorithm (k-NN). The k-NN algorithm takes the predictor values and organizes them so that a new observation is applied to the organization developed and the algorithm selects the result (prediction) that is most applicable based on nearness of the predictor values in the new observation. The nearest neighbor function is knn. The knn function call looks like this: knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE) The various parameters of the knn function are described in the following table: Parameter Description train This is the training data. test This is the test data. cl This is the factor of true classifications. k This is the Number of neighbors to consider. l This is the minimum vote for a decision. prob This is a Boolean flag to return proportion of winning votes. use.all This is a Boolean variable for tie handling. TRUE means use all votes of max distance I am using the auto MPG dataset in the example of using knn. First, we load the dataset : > data <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", na.string="?") > colnames(data) <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","model.year","origin","car.name") > summary(data)      mpg         cylinders     displacement     horsepower Min.   : 9.00  Min.   :3.000   Min.   : 68.0   150   : 22 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   90     : 20 Median :23.00   Median :4.000   Median :148.5   88     : 19 Mean   :23.51   Mean   :5.455   Mean   :193.4   110   : 18 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100   : 17 Max.   :46.60   Max.   :8.000   Max.   :455.0   75     : 14                                                  (Other):288      weight     acceleration     model.year       origin     Min.   :1613   Min. : 8.00   Min.   :70.00   Min.   
:1.000 1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000 Median :2804   Median :15.50   Median :76.00   Median :1.000 Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573 3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000 Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000                                                                           car.name ford pinto   : 6 amc matador   : 5 ford maverick : 5 toyota corolla: 5 amc gremlin   : 4 amc hornet   : 4 (Other)       :369   There are close to 400 observations in the dataset. We need to split the data into a training set and a test set. We will use 75 percent for training. We use the createDataPartition function in the caret package to select the training rows. Then, we create a test dataset and a training dataset using the partitions as follows: > library(caret) > training <- createDataPartition(data$mpg, p=0.75, list=FALSE) > trainingData <- data[training,] > testData <- data[-training,] > model <- knn(train=trainingData, test=testData, cl=trainingData$mpg) NAs introduced by coercion The error message means that some numbers in the dataset have a bad format. The bad numbers were automatically converted to NA values. Then the inclusion of the NA values caused the function to fail, as NA values are not expected in this function call. First, there are some missing items in the dataset loaded. We need to eliminate those NA values as follows: > completedata <- data[complete.cases(data),] After looking over the data several times, I guessed that the car name fields were being parsed as numerical data when there was a number in the name, such as Buick Skylark 320. I removed the car name column from the test and we end up with the following valid results; > drops <- c("car.name") > completeData2 <- completedata[,!(names(completedata) %in% drops)] > training <- createDataPartition(completeData2$mpg, p=0.75, list=FALSE) > trainingData <- completeData2[training,] > testData <- completeData2[-training,] > model <- knn(train=trainingData, test=testData, cl=trainingData$mpg) We can see the results of the model by plotting using the following command. However, the graph doesn't give us much information to work on. > plot(model) We can use a different kknn function to compare our model with the test data. I like this version a little better as you can plainly specify the formula for the model. Let's use the kknn function as follows: > library(kknn) > model <- kknn(formula = formula(mpg~.), train = trainingData, test = testData, k = 3, distance = 1) > fit <- fitted(model) > plot(testData$mpg, fit) > abline(a=0, b=1, col=3) I added a simple slope to highlight how well the model fits the training data. It looks like as we progress to higher MPG values, our model has a higher degree of variance. I think that means we are missing predictor variables, especially for the later model, high MPG series of cars. That would make sense as government mandate and consumer demand for high efficiency vehicles changed the mpg for vehicles. Here is the graph generated by the previous code: Ensemble learning Ensemble learning is the process of using multiple learning methods to obtain better predictions. For example, we could use a regression and k-NN, combine the results, and end up with a better prediction. We could average the results of both or provide heavier weight towards one or another of the algorithms, whichever appears to be a better predictor. 
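A minimal sketch of this idea in R, assuming the trainingData and testData frames from the k-NN example above are still loaded (the choice of predictors and the equal 50/50 weights are illustrative only, not a listing from the book), could look like this: > form <- mpg ~ cylinders + displacement + weight + acceleration + model.year > lm.model <- lm(form, data = trainingData)   # base learner 1: linear regression > lm.pred <- predict(lm.model, newdata = testData) > knn.model <- kknn(form, train = trainingData, test = testData, k = 3)   # base learner 2: k-NN > knn.pred <- fitted(knn.model) > ensemble.pred <- 0.5 * lm.pred + 0.5 * knn.pred   # equal-weight average of the two predictions > mean((testData$mpg - ensemble.pred)^2)   # ensemble mean squared error > mean((testData$mpg - lm.pred)^2)         # regression alone, for comparison > mean((testData$mpg - knn.pred)^2)        # k-NN alone, for comparison Comparing the three mean squared errors shows whether the blend beats either base learner on its own; if one model is consistently stronger, shifting more weight towards it (or learning the weights on a validation set) is the usual next step.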
Support vector machines We covered support vector machines (SVM), but I will run through an example here. As a reminder, SVM is concerned with binary data. We will use the spam dataset from Hewlett Packard (part of the kernlab package). First, let's load the data as follows: > library(kernlab) > data("spam") > summary(spam)      make           address           all             num3d         Min.   :0.0000   Min.   : 0.000   Min.   :0.0000   Min.   : 0.00000 1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.: 0.00000 Median :0.0000   Median : 0.000   Median :0.0000   Median : 0.00000 Mean   :0.1046   Mean   : 0.213   Mean   :0.2807   Mean   : 0.06542 3rd Qu.:0.0000   3rd Qu.: 0.000   3rd Qu.:0.4200   3rd Qu.: 0.00000 Max.   :4.5400   Max.   :14.280   Max.   :5.1000   Max.   :42.81000 … There are 58 variables with close to 5000 observations, as shown here: > table(spam$type) nonspam   spam    2788   1813 Now, we break up the data into a training set and a test set as follows: > index <- 1:nrow(spam) > testindex <- sample(index, trunc(length(index)/3)) > testset <- spam[testindex,] > trainingset <- spam[-testindex,] Now, we can produce our SVM model using the svm function. The svm function looks like this: svm(formula, data = NULL, ..., subset, na.action =na.omit, scale = TRUE) The various parameters of the svm function are described in the following table: Parameter Description formula This is the formula model data This is the dataset subset This is the subset of the dataset to be used na.action This contains what action to take with NA values scale This determines whether to scale the data Let's use the svm function to produce a SVM model as follows: > library(e1071) > model <- svm(type ~ ., data = trainingset, method = "C-classification", kernel = "radial", cost = 10, gamma = 0.1) > summary(model) Call: svm(formula = type ~ ., data = trainingset, method = "C-classification",    kernel = "radial", cost = 10, gamma = 0.1) Parameters:    SVM-Type: C-classification SVM-Kernel: radial        cost: 10      gamma: 0.1 Number of Support Vectors: 1555 ( 645 910 ) Number of Classes: 2 Levels: nonspam spam We can test the model against our test dataset and look at the results as follows: > pred <- predict(model, testset) > table(pred, testset$type) pred     nonspam spam nonspam     891 104 spam         38 500 Note, the e1071 package is not compatible with the current version of R. Given its usefulness I would expect the package to be updated to support the user base. So, using SVM, we have a 90 percent ((891+500) / (891+104+38+500)) accuracy rate of prediction. Bayesian learning With Bayesian learning, we have an initial premise in a model that is adjusted with new information. We can use the MCMCregress method in the MCMCpack package to use Bayesian regression on learning data and apply the model against test data. Let's load the MCMCpack package as follows: > install.packages("MCMCpack") > library(MCMCpack) We are going to be using the transplant data on transplants available at http://lib.stat.cmu.edu/datasets/stanford. (The dataset on the site is part of the web page, so I copied into a local CSV file.) The data shows expected transplant success factor, the actual transplant success factor, and the number of transplants over a time period. So, there is a good progression over time as to the success of the program. We can read the dataset as follows: > transplants <- read.csv("transplant.csv") > summary(transplants)    expected         actual       transplants   Min.   : 0.057   Min.   
: 0.000   Min.   : 1.00 1st Qu.: 0.722   1st Qu.: 0.500   1st Qu.: 9.00 Median : 1.654   Median : 2.000   Median : 18.00 Mean   : 2.379   Mean   : 2.382   Mean   : 27.83 3rd Qu.: 3.402   3rd Qu.: 3.000   3rd Qu.: 40.00 Max.   :12.131   Max.   :18.000   Max.   :152.00 We use Bayesian regression against the data— note that we are modifying the model as we progress with new information using the MCMCregress function. The MCMCregress function looks like this: MCMCregress(formula, data = NULL, burnin = 1000, mcmc = 10000,    thin = 1, verbose = 0, seed = NA, beta.start = NA,    b0 = 0, B0 = 0, c0 = 0.001, d0 = 0.001, sigma.mu = NA, sigma.var = NA,    marginal.likelihood = c("none", "Laplace", "Chib95"), ...) The various parameters of the MCMCregress function are described in the following table: Parameter Description formula This is the formula of model data This is the dataset to be used for model … These are the additional parameters for the function Let's use the Bayesian regression against the data as follows: > model <- MCMCregress(expected ~ actual + transplants, data=transplants) > summary(model) Iterations = 1001:11000 Thinning interval = 1 Number of chains = 1 Sample size per chain = 10000 1. Empirical mean and standard deviation for each variable,    plus standard error of the mean:                Mean     SD Naive SE Time-series SE (Intercept) 0.00484 0.08394 0.0008394     0.0008388 actual     0.03413 0.03214 0.0003214     0.0003214 transplants 0.08238 0.00336 0.0000336     0.0000336 sigma2     0.44583 0.05698 0.0005698     0.0005857 2. Quantiles for each variable:                2.5%     25%     50%     75%   97.5% (Intercept) -0.15666 -0.05216 0.004786 0.06092 0.16939 actual     -0.02841 0.01257 0.034432 0.05541 0.09706 transplants 0.07574 0.08012 0.082393 0.08464 0.08890 sigma2       0.34777 0.40543 0.441132 0.48005 0.57228 The plot of the data shows the range of results, as shown in the following graph. Look at this in contrast to a simple regression with one result. > plot(model) Random forests Random forests is an algorithm that constructs a multitude of decision trees for the model of the data and selects the best of the lot as the final result. We can use the randomForest function in the kernlab package for this function. The randomForest function looks like this: randomForest(formula, data=NULL, ..., subset, na.action=na.fail) The various parameters of the randomForest function are described in the following table: Parameter Description formula This is the formula of model data This is the dataset to be used subset This is the subset of the dataset to be used na.action This is the action to take with NA values For an example of random forest, we will use the spam data, as in the section Support vector machines. First, let's load the package and library as follows: > install.packages("randomForest") > library(randomForest) Now, we will generate the model with the following command (this may take a while): > fit <- randomForest(type ~ ., data=spam) Let's look at the results to see how it went: > fit Call: randomForest(formula = type ~ ., data = spam)                Type of random forest: classification                      Number of trees: 500 No. 
of variables tried at each split: 7        OOB estimate of error rate: 4.48% Confusion matrix:         nonspam spam class.error nonspam   2713   75 0.02690100 spam       131 1682 0.07225593 We can look at the relative importance of the data variables in the final model, as shown here: > head(importance(fit))        MeanDecreaseGini make           7.967392 address       12.654775 all           25.116662 num3d           1.729008 our           67.365754 over           17.579765 Ordering the data shows a couple of the factors to be critical to the determination. For example, the presence of the exclamation character in the e-mail is shown as a dominant indicator of spam mail: charExclamation   256.584207 charDollar       200.3655348 remove           168.7962949 free              142.8084662 capitalAve       137.1152451 capitalLong       120.1520829 your             116.6134519 Unsupervised learning With unsupervised learning, we do not have a target variable. We have a number of predictor variables that we look into to determine if there is a pattern. We will go over the following unsupervised learning techniques: Cluster analysis Density estimation Expectation-maximization algorithm Hidden Markov models Blind signal separation Cluster analysis Cluster analysis is the process of organizing data into groups (clusters) that are similar to each other. For our example, we will use the wheat seed data available at http://www.uci.edu, as shown here: > wheat <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", sep="t") Let's look at the raw data: > head(wheat) X15.26 X14.84 X0.871 X5.763 X3.312 X2.221 X5.22 X1 1 14.88 14.57 0.8811 5.554 3.333 1.018 4.956 1 2 14.29 14.09 0.9050 5.291 3.337 2.699 4.825 1 3 13.84 13.94 0.8955 5.324 3.379 2.259 4.805 1 4 16.14 14.99 0.9034 5.658 3.562 1.355 5.175 1 5 14.38 14.21 0.8951 5.386 3.312 2.462 4.956 1 6 14.69 14.49 0.8799 5.563 3.259 3.586 5.219 1 We need to apply column names so we can see the data better: > colnames(wheat) <- c("area", "perimeter", "compactness", "length", "width", "asymmetry", "groove", "undefined") > head(wheat)    area perimeter compactness length width asymmetry groove undefined 1 14.88     14.57     0.8811 5.554 3.333     1.018 4.956         1 2 14.29     14.09     0.9050 5.291 3.337     2.699 4.825         1 3 13.84     13.94     0.8955 5.324 3.379     2.259 4.805         1 4 16.14     14.99     0.9034 5.658 3.562     1.355 5.175         1 5 14.38     14.21     0.8951 5.386 3.312     2.462 4.956         1 6 14.69     14.49     0.8799 5.563 3.259     3.586 5.219         1 The last column is not defined in the data description, so I am removing it: > wheat <- subset(wheat, select = -c(undefined) ) > head(wheat)    area perimeter compactness length width asymmetry groove 1 14.88     14.57     0.8811 5.554 3.333     1.018 4.956 2 14.29     14.09     0.9050 5.291 3.337     2.699 4.825 3 13.84     13.94     0.8955 5.324 3.379     2.259 4.805 4 16.14     14.99     0.9034 5.658 3.562     1.355 5.175 5 14.38     14.21     0.8951 5.386 3.312     2.462 4.956 6 14.69    14.49     0.8799 5.563 3.259     3.586 5.219 Now, we can finally produce the cluster using the kmeans function. 
The kmeans function looks like this: kmeans(x, centers, iter.max = 10, nstart = 1,        algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",                      "MacQueen"), trace=FALSE) The various parameters of the kmeans function are described in the following table: Parameter Description x This is the dataset centers This is the number of centers to coerce data towards … These are the additional parameters of the function Let's produce the cluster using the kmeans function: > fit <- kmeans(wheat, 5) Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1) Unfortunately, there are some rows with missing data, so let's fix this using the following command: > wheat <- wheat[complete.cases(wheat),] Let's look at the data to get some idea of the factors using the following command: > plot(wheat) If we try looking at five clusters, we end up with a fairly good set of clusters with an 85 percent fit, as shown here: > fit <- kmeans(wheat, 5) > fit K-means clustering with 5 clusters of sizes 29, 33, 56, 69, 15 Cluster means:      area perimeter compactness   length   width asymmetry   groove 1 16.45345 15.35310   0.8768000 5.882655 3.462517 3.913207 5.707655 2 18.95455 16.38879   0.8868000 6.247485 3.744697 2.723545 6.119455 3 14.10536 14.20143   0.8777750 5.480214 3.210554 2.368075 5.070000 4 11.94870 13.27000   0.8516652 5.229304 2.870101 4.910145 5.093333 5 19.58333 16.64600   0.8877267 6.315867 3.835067 5.081533 6.144400 Clustering vector: ... Within cluster sum of squares by cluster: [1] 48.36785 30.16164 121.63840 160.96148 25.81297 (between_SS / total_SS = 85.4 %) If we push to 10 clusters, the performance increases to 92 percent. Density estimation Density estimation is used to provide an estimate of the probability density function of a random variable. For this example, we will use sunspot data from Vincent arlbuck site. Not clear if sunspots are truly random. Let's load our data as follows: > sunspots <- read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.month.csv") > summary(sunspots)        X             time     sunspot.month   Min.   :   1   Min.   :1749   Min.   : 0.00 1st Qu.: 795   1st Qu.:1815   1st Qu.: 15.70 Median :1589   Median :1881   Median : 42.00 Mean   :1589   Mean   :1881   Mean   : 51.96 3rd Qu.:2383   3rd Qu.:1948   3rd Qu.: 76.40 Max.   :3177   Max.   :2014   Max.   :253.80 > head(sunspots) X     time sunspot.month 1 1 1749.000         58.0 2 2 1749.083         62.6 3 3 1749.167         70.0 4 4 1749.250         55.7 5 5 1749.333         85.0 6 6 1749.417        83.5 We will now estimate the density using the following command: > d <- density(sunspots$sunspot.month) > d Call: density.default(x = sunspots$sunspot.month) Data: sunspots$sunspot.month (3177 obs.); Bandwidth 'bw' = 7.916        x               y           Min.   :-23.75   Min.   :1.810e-07 1st Qu.: 51.58   1st Qu.:1.586e-04 Median :126.90   Median :1.635e-03 Mean   :126.90   Mean   :3.316e-03 3rd Qu.:202.22   3rd Qu.:5.714e-03 Max.   :277.55   Max.   :1.248e-02 A plot is very useful for this function, so let's generate one using the following command: > plot(d) It is interesting to see such a wide variation; maybe the data is pretty random after all. We can use the density to estimate additional periods as follows: > N<-1000 > sunspots.new <- rnorm(N, sample(sunspots$sunspot.month, size=N, replace=TRUE)) > lines(density(sunspots.new), col="blue") It looks like our density estimate is very accurate. 
Expectation-maximization Expectation-maximization (EM) is an unsupervised clustering approach that adjusts the data for optimal values. When using EM, we have to have some preconception of the shape of the data/model that will be targeted. This example reiterates the example on the Wikipedia page, with comments. The example tries to model the iris species from the other data points. Let's load the data as shown here: > iris <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data") > colnames(iris) <- c("SepalLength","SepalWidth","PetalLength","PetalWidth","Species") > modelName = "EEE" Each observation has sepal length, width, petal length, width, and species, as shown here: > head(iris) SepalLength SepalWidth PetalLength PetalWidth     Species 1         5.1       3.5         1.4       0.2 Iris-setosa 2         4.9       3.0         1.4       0.2 Iris-setosa 3         4.7       3.2         1.3       0.2 Iris-setosa 4         4.6       3.1         1.5       0.2 Iris-setosa 5         5.0       3.6         1.4       0.2 Iris-setosa 6         5.4       3.9         1.7       0.4 Iris-setosa We are estimating the species from the other points, so let's separate the data as follows: > data = iris[,-5] > z = unmap(iris[,5]) Let's set up our mstep for EM, given the data, categorical data (z) relating to each data point, and our model type name: > msEst <- mstep(modelName, data, z) We use the parameters defined in the mstep to produce our model, as shown here: > em(modelName, data, msEst$parameters) $z                [,1]         [,2]         [,3] [1,] 1.000000e+00 4.304299e-22 1.699870e-42 … [150,] 8.611281e-34 9.361398e-03 9.906386e-01 $parameters$pro [1] 0.3333333 0.3294048 0.3372619 $parameters$mean              [,1]     [,2]     [,3] SepalLength 5.006 5.941844 6.574697 SepalWidth 3.418 2.761270 2.980150 PetalLength 1.464 4.257977 5.538926 PetalWidth 0.244 1.319109 2.024576 $parameters$variance$d [1] 4 $parameters$variance$G [1] 3 $parameters$variance$sigma , , 1            SepalLength SepalWidth PetalLength PetalWidth SepalLength 0.26381739 0.09030470 0.16940062 0.03937152 SepalWidth   0.09030470 0.11251902 0.05133876 0.03082280 PetalLength 0.16940062 0.05133876 0.18624355 0.04183377 PetalWidth   0.03937152 0.03082280 0.04183377 0.03990165 , , 2 , , 3 … (there was little difference in the 3 sigma values) Covariance $parameters$variance$Sigma            SepalLength SepalWidth PetalLength PetalWidth SepalLength 0.26381739 0.09030470 0.16940062 0.03937152 SepalWidth   0.09030470 0.11251902 0.05133876 0.03082280 PetalLength 0.16940062 0.05133876 0.18624355 0.04183377 PetalWidth   0.03937152 0.03082280 0.04183377 0.03990165 $parameters$variance$cholSigma             SepalLength SepalWidth PetalLength PetalWidth SepalLength -0.5136316 -0.1758161 -0.32980960 -0.07665323 SepalWidth   0.0000000 0.2856706 -0.02326832 0.06072001 PetalLength   0.0000000 0.0000000 -0.27735855 -0.06477412 PetalWidth   0.0000000 0.0000000 0.00000000 0.16168899 attr(,"info") iterations       error 4.000000e+00 1.525131e-06 There is quite a lot of output from the em function. The highlights for me were the three sigma ranges were the same and the error from the function was very small. So, I think we have a very good estimation of species using just the four data points. Hidden Markov models The hidden Markov models (HMM) is the idea of observing data assuming it has been produced by a Markov model. The problem is to discover what that model is. I am using the Python example on Wikipedia for HMM. 
For an HMM, we need states (assumed to be hidden from observer), symbols, transition matrix between states, emission (output) states, and probabilities for all. The Python information presented is as follows: states = ('Rainy', 'Sunny') observations = ('walk', 'shop', 'clean') start_probability = {'Rainy': 0.6, 'Sunny': 0.4} transition_probability = {    'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},    'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},    } emission_probability = {    'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},    'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},    } trans <- matrix(c('Rainy', : {'Rainy': 0.7, 'Sunny': 0.3},    'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},    } We convert these to use in R for the initHmm function by using the following command: > hmm <- initHMM(c("Rainy","Sunny"), c('walk', 'shop', 'clean'), c(.6,.4), matrix(c(.7,.3,.4,.6),2), matrix(c(.1,.4,.5,.6,.3,.1),3)) > hmm $States [1] "Rainy" "Sunny" $Symbols [1] "walk" "shop" "clean" $startProbs Rainy Sunny 0.6   0.4 $transProbs        to from   Rainy Sunny Rainy   0.7   0.4 Sunny   0.3   0.6 $emissionProbs        symbols states walk shop clean Rainy 0.1 0.5   0.3 Sunny 0.4 0.6   0.1 The model is really a placeholder for all of the setup information needed for HMM. We can then use the model to predict based on observations, as follows: > future <- forward(hmm, c("walk","shop","clean")) > future        index states         1         2         3 Rainy -2.813411 -3.101093 -4.139551 Sunny -1.832581 -2.631089 -5.096193 The result is a matrix of probabilities. For example, it is more likely to be Sunny when we observe walk. Blind signal separation Blind signal separation is the process of identifying sources of signals from a mixed signal. Primary component analysis is one method of doing this. An example is a cocktail party where you are trying to listen to one speaker. For this example, I am using the decathlon dataset in the FactoMineR package, as shown here: > library(FactoMineR) > data(decathlon) Let's look at the data to get some idea of what is available: > summary(decathlon) 100m           Long.jump     Shot.put       High.jump Min.   :10.44   Min.   :6.61   Min.   :12.68   Min.   :1.850 1st Qu.:10.85   1st Qu.:7.03   1st Qu.:13.88   1st Qu.:1.920 Median :10.98   Median :7.30   Median :14.57   Median :1.950 Mean   :11.00   Mean   :7.26   Mean   :14.48   Mean   :1.977 3rd Qu.:11.14   3rd Qu.:7.48   3rd Qu.:14.97   3rd Qu.:2.040 Max.   :11.64   Max.   :7.96   Max.   :16.36   Max.   :2.150 400m           110m.hurdle       Discus       Pole.vault   Min.   :46.81   Min.   :13.97   Min.   :37.92   Min.   :4.200 1st Qu.:48.93   1st Qu.:14.21   1st Qu.:41.90   1st Qu.:4.500 Median :49.40   Median :14.48   Median :44.41   Median :4.800 Mean   :49.62   Mean   :14.61 Mean   :44.33   Mean   :4.762 3rd Qu.:50.30   3rd Qu.:14.98   3rd Qu.:46.07   3rd Qu.:4.920 Max.   :53.20   Max.   :15.67   Max.   :51.65   Max.   :5.400 Javeline       1500m           Rank           Points   Min.   :50.31   Min.   :262.1   Min.   : 1.00   Min.   :7313 1st Qu.:55.27   1st Qu.:271.0   1st Qu.: 6.00   1st Qu.:7802 Median :58.36   Median :278.1   Median :11.00   Median :8021 Mean   :58.32   Mean   :279.0   Mean   :12.12   Mean   :8005 3rd Qu.:60.89   3rd Qu.:285.1   3rd Qu.:18.00   3rd Qu.:8122 Max.   :70.52   Max.   :317.0   Max.   :28.00   Max.   
:8893    Competition Decastar:13 OlympicG:28 The output looks like performance data from a series of events at a track meet: > head(decathlon)        100m   Long.jump Shot.put High.jump 400m 110m.hurdle Discus SEBRLE 11.04     7.58   14.83     2.07 49.81       14.69 43.75 CLAY   10.76     7.40   14.26     1.86 49.37       14.05 50.72 KARPOV 11.02     7.30   14.77     2.04 48.37       14.09 48.95 BERNARD 11.02     7.23   14.25     1.92 48.93       14.99 40.87 YURKOV 11.34     7.09   15.19     2.10 50.42       15.31 46.26 WARNERS 11.11     7.60   14.31     1.98 48.68       14.23 41.10        Pole.vault Javeline 1500m Rank Points Competition SEBRLE       5.02   63.19 291.7   1   8217   Decastar CLAY         4.92   60.15 301.5   2   8122   Decastar KARPOV       4.92   50.31 300.2   3   8099   Decastar BERNARD       5.32   62.77 280.1   4   8067   Decastar YURKOV       4.72   63.44 276.4   5   8036   Decastar WARNERS       4.92   51.77 278.1   6   8030   Decastar Further, this is performance of specific individuals in track meets. We run the PCA function by passing the dataset to use, whether to scale the data or not, and the type of graphs: > res.pca = PCA(decathlon[,1:10], scale.unit=TRUE, ncp=5, graph=T) This produces two graphs: Individual factors map Variables factor map The individual factors map lays out the performance of the individuals. For example, we see Karpov who is high in both dimensions versus Bourginon who is performing badly (on the left in the following chart): The variables factor map shows the correlation of performance between events. For example, doing well in the 400 meters run is negatively correlated with the performance in the long jump; if you did well in one, you likely did well in the other as well. Here is the variables factor map of our data: Questions Factual Which supervised learning technique(s) do you lean towards as your "go to" solution? Why are the density plots for Bayesian results off-center? When, how, and why? How would you decide on the number of clusters to use? Find a good rule of thumb to decide the number of hidden layers in a neural net. Challenges Investigate other blind signal separation techniques, such as ICA. Use other methods, such as poisson, in the rpart function (especially if you have a natural occurring dataset). Summary In this article, we looked into various methods of machine learning, including both supervised and unsupervised learning. With supervised learning, we have a target variable we are trying to estimate. With unsupervised, we only have a possible set of predictor variables and are looking for patterns. In supervised learning, we looked into using a number of methods, including decision trees, regression, neural networks, support vector machines, and Bayesian learning. In unsupervised learning, we used cluster analysis, density estimation, hidden Markov models, and blind signal separation. Resources for Article: Further resources on this subject: Machine Learning in Bioinformatics [article] Data visualization [article] Introduction to S4 Classes [article]


Navigation Mesh Generation

Packt
19 Dec 2014
9 min read
In this article by Curtis Bennett and Dan Violet Sagmiller, authors of the book Unity AI Programming Essentials, we will learn about navigation meshes in Unity. Navigation mesh generation controls how AI characters are able to travel around a game level and is one of the most important topics in game AI. In this article, we will provide an overview of navigation meshes and look at the algorithm for generating them. Then, we'll look at different options of customizing our navigation meshes better. To do this, we will be using RAIN 2.1.5, a popular AI plugin for Unity by Rival Theory, available for free at http://rivaltheory.com/rain/download/. In this article, you will learn about: How navigation mesh generation works and the algorithm behind it Advanced options for customizing navigation meshes Creating advanced navigation meshes with RAIN (For more resources related to this topic, see here.) An overview of a navigation mesh To use navigation meshes, also referred to as NavMeshes, effectively the first things we need to know are what exactly navigation meshes are and how they are created. A navigation mesh is a definition of the area an AI character could travel to in a level. It is a mesh, but it is not intended to be rendered or seen by the player, instead it is used by the AI system. A NavMesh usually does not cover all the area in a level (if it did we wouldn't need one) since it's just the area a character can walk. The mesh is also almost always a simplified version of the geometry. For instance, you could have a cave floor in a game with thousands of polygons along the bottom showing different details in the rock, but for the navigation mesh the areas would just be a handful of very large polys giving a simplified view of the level. The purpose of navigation mesh is to provide this simplified representation to the rest of the AI system a way to find a path between two points on a level for a character. This is its purpose; let's discuss how they are created. It used to be a common practice in the games industry to create navigation meshes manually. A designer or artist would take the completed level geometry and create one using standard polygon mesh modelling tools and save it out. As you might imagine, this allowed for nice, custom, efficient meshes, but was also a big time sink, since every time the level changed the navigation mesh would need to be manually edited and updated. In recent years, there has been more research in automatic navigation mesh generation. There are many approaches to automatic navigation mesh generation, but the most popular is Recast, originally developed and designed by Mikko Monomen. Recast takes in level geometry and a set of parameters defining the character, such as the size of the character and how big of steps it can take, and then does a multipass approach to filter and create the final NavMesh. The most important phase of this is voxelizing the level based on an inputted cell size. This means the level geometry is divided into voxels (cubes) creating a version of the level geometry where everything is partitioned into different boxes called cells. Then the geometry in each of these cells is analyzed and simplified based on its intersection with the sides of the boxes and is culled based on things such as the slope of the geometry or how big a step height is between geometry. This simplified geometry is then merged and triangulated to make a final navigation mesh that can be used by the AI system. 
The source code and more information on the original C++ implementation of Recast is available at https://github.com/memononen/recastnavigation. Advanced NavMesh parameters Now that we understand how navigation mesh generations works, let's look at the different parameters you can set to generate them in more detail. We'll look at how to do these with RAIN: Open Unity and create a new scene and a floor and some blocks for walls. Download RAIN from http://rivaltheory.com/rain/download/ and import it into your scene. Then go to RAIN | Create Navigation Mesh. Also right-click on the RAIN menu and choose Show Advanced Settings. The setup should look something like the following screenshot: Now let's look at some of the important parameters: Size: This is the overall size of the navigation mesh. You'll want the navigation mesh to cover your entire level and use this parameter instead of trying to scale up the navigation mesh through the Scale transform in the Inspector window. For our demo here, set the Size parameter to 20. Walkable Radius: This is an important parameter to define the character size of the mesh. Remember, each mesh will be matched to the size of a particular character, and this is the radius of the character. You can visualize the radius for a character by adding a Unity Sphere Collider script to your object (by going to Component | Physics | Sphere Collider) and adjusting the radius of the collider. Cell Size: This is also a very important parameter. During the voxel step of the Recast algorithm, this sets the size of the cubes to inspect the geometry. The smaller the size, the more detailed and finer mesh, but longer the processing time for Recast. A large cell size makes computation fast but loses detail. For example, here is a NavMesh from our demo with a cell size of 0.01: You can see the finer detail here. Here is the navigation mesh generated with a cell size of 0.1: Note the difference between the two screenshots. In the former, walking through the two walls lower down in our picture is possible, but in the latter with a larger cell size, there is no path even though the character radius is the same. Problems like this become greater with larger cell sizes. The following is a navigation mesh with a cell size of 1: As you can see, the detail becomes jumbled and the mesh itself becomes unusable. With such differing results, the big question is how large should a cell size be for a level? The answer is that it depends on the required result. However, one important consideration is that as the processing time to generate one is done during development and not at runtime even if it takes several minutes to generate a good mesh, it can be worth it to get a good result in the game. Setting a small cell size on a large level can cause mesh processing to take a significant amount of time and consume a lot of memory. It is a good practice to save the scene before attempting to generate a complex navigation mesh. The Size, Walkable Radius, and Cell Size parameters are the most important parameters when generating the navigation mesh, but there are more that are used to customize the mesh further: Max Slope: This is the largest slope that a character can walk on. This is how much a piece of geometry that is tilted can still be walked on. If you take the wall and rotate it, you can see it is walkable: The preceding is a screenshot of a walkable object with slope. Step Height: This is how high a character can step from one object to another. 
For example, if you have steps between two blocks, as shown in the following screenshot, this would define how far in height the blocks can be apart and whether the area is still considered walkable: This is a screenshot of the navigation mesh with step height set to connect adjacent blocks. Walkable Height: This is the vertical height that is needed for the character to walk. For example, in the previous illustration, the second block is not walkable underneath because of the walkable height. If you raise it to a least one unit off the ground and set the walkable height to 1, the area underneath would become walkable:   You can see a screenshot of the navigation mesh with walkable height set to allow going under the higher block. These are the most important parameters. There are some other parameters related to the visualization and to cull objects. We will look at culling more in the next section. Culling areas Being able to set up areas as walkable or not is an important part of creating a level. To demo this, let's divide the level into two parts and create a bridge between the two. Take our demo and duplicate the floor and pull it down. Then transform one of the walls to a bridge. Then, add two other pieces of geometry to mark areas that are dangerous to walk on, like lava. Here is an example setup: This is a basic scene with a bridge to cross. If you recreate the navigation mesh now, all of the geometry will be covered and the bridge won't be recognized. To fix this, you can create a new tag called Lava and tag the geometry under the bridge with it. Then, in the navigation meshes' RAIN component, add Lava to the unwalkable tags. If you then regenerate the mesh, only the bridge is walkable. This is a screenshot of a navigation mesh areas under bridge culled: Using layers and the walkable tag you can customize navigation meshes. Summary Navigation meshes are an important part of game AI. In this article, we looked at the different parameters to customize navigation meshes. We looked at things such as setting the character size and walkable slopes and discussed the importance of the cell size parameter. We then saw how to customize our mesh by tagging different areas as not walkable. This should be a good start for designing navigation meshes for your games. Resources for Article: Further resources on this subject: Components in Unity [article] Enemy and Friendly AIs [article] Introduction to AI [article]


Evolving the data model

Packt
19 Dec 2014
11 min read
In this article by C. Y. Kan, author of the book Cassandra Data Modeling and Analysis, we will see the techniques of how to evolve an existing Cassandra data model in detail. Meanwhile, the techniques of modeling by query will be demonstrated as well. (For more resources related to this topic, see here.) The Stock Screener Application is good enough to retrieve and analyze a single stock at one time. However, scanning just a single stock looks very limited in practical use. A slight improvement can be made here; it can handle a bunch of stocks instead of one. This bunch of stocks will be stored as Watch List in the Cassandra database. Accordingly, the Stock Screener Application will be modified to analyze the stocks in the Watch List, and therefore it will produce alerts for each of the stocks being watched based on the same screening rule. For the produced alerts, saving them in Cassandra will be beneficial for backtesting trading strategies and continuous improvement of the Stock Screener Application. They can be reviewed from time to time without having to review them on the fly. Backtesting is a jargon used to refer to testing a trading strategy, investment strategy, or a predictive model using existing historical data. It is also a special type of cross-validation applied to time series data. In addition, when the number of the stocks in the Watch List grows to a few hundred, it will be difficult for a user of the Stock Screener Application to recall what the stocks are by simply referring to their stock codes. Hence, it would be nice to have the name of the stocks added to the produced alerts to make them more descriptive and user-friendly. Finally, we might have an interest in finding out how many alerts were generated on a particular stock over a specified period of time and how many alerts were generated on a particular date. We will use CQL to write queries to answer these two questions. By doing so, the modeling by query technique can be demonstrated. The enhancement approach The enhancement approach consists of four change requests in total. First, we will conduct changes in the data model and then the code will be enhanced to provide the new features. Afterwards, we will test run the enhanced Stock Screener Application again. The parts of the Stock Screener Application that require modifications are highlighted in the following figure. It is remarkable that two new components are added to the Stock Screener Application. The first component, Watch List, governs Data Mapper and Archiver to collect stock quote data of those stocks in the Watch List from Yahoo! Finance. The second component is Query. It provides two Queries on Alert List for backtesting purposes: Watch List Watch List is a very simple table that merely stores the stock code of its constituents. It is rather intuitive for a relational database developer to define the stock code as the primary key, isn't it? Nevertheless, remember that in Cassandra, the primary key is used to determine the node that stores the row. As Watch List is expected to not be a very long list, it would be more appropriate to put all of its rows on the same node for faster retrieval. But how can we do that? We can create an additional column, say watch_list_code, for this particular purpose. The new table is called watchlist and will be created in the packtcdma keyspace. 
The CQL statement is shown in chapter06_001.py: # -*- coding: utf-8 -*- # program: chapter06_001.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to create watchlist def create_watchlist(ss):    ## create watchlist table if not exists    ss.execute('CREATE TABLE IF NOT EXISTS watchlist (' +                'watch_list_code varchar,' +                'symbol varchar,' +                'PRIMARY KEY (watch_list_code, symbol))')       ## insert AAPL, AMZN, and GS into watchlist    ss.execute("INSERT INTO watchlist (watch_list_code, " +                "symbol) VALUES ('WS01', 'AAPL')")    ss.execute("INSERT INTO watchlist (watch_list_code, " +                "symbol) VALUES ('WS01', 'AMZN')")    ss.execute("INSERT INTO watchlist (watch_list_code, " +                "symbol) VALUES ('WS01', 'GS')") ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## create watchlist table create_watchlist(session) ## close Cassandra connection cluster.shutdown() The create_watchlist function creates the table. Note that the watchlist table has a compound primary key made of watch_list_code and symbol. A Watch List called WS01 is also created, which contains three stocks, AAPL, AMZN, and GS. Alert List It is produced by a Python program and enumerates the date when the close price was above its 10-day SMA, that is, the signal and the close price at that time. Note that there were no stock code and stock name. We will create a table called alertlist to store the alerts with the code and name of the stock. The inclusion of the stock name is to meet the requirement of making the Stock Screener Application more user-friendly. Also, remember that joins are not allowed and denormalization is really the best practice in Cassandra. This means that we do not mind repeatedly storing (duplicating) the stock name in the tables that will be queried. A rule of thumb is one table for one query; as simple as that. The alertlist table is created by the CQL statement, as shown in chapter06_002.py: # -*- coding: utf-8 -*- # program: chapter06_002.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to create alertlist def create_alertlist(ss):    ## execute CQL statement to create alertlist table if not exists    ss.execute('CREATE TABLE IF NOT EXISTS alertlist (' +                'symbol varchar,' +                'price_time timestamp,' +                'stock_name varchar,' +                'signal_price float,' +                'PRIMARY KEY (symbol, price_time))') ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## create alertlist table create_alertlist(session) ## close Cassandra connection cluster.shutdown() The primary key is also a compound primary key that consists of symbol and price_time. Adding the descriptive stock name Until now, the packtcdma keyspace has three tables, which are alertlist, quote, and watchlist. To add the descriptive stock name, one can think of only adding a column of stock name to alertlist only. As seen in the previous section, this has been done. So, do we need to add a column for quote and watchlist? It is, in fact, a design decision that depends on whether these two tables will be serving user queries. 
What a user query means is that the table will be used to retrieve rows for a query raised by a user. If a user wants to know the close price of Apple Inc. on June 30, 2014, it is a user query. On the other hand, if the Stock Screener Application uses a query to retrieve rows for its internal processing, it is not a user query. Therefore, if we want quote and watchlist to return rows for user queries, they need the stock name column; otherwise, they do not need it. The watchlist table is only for internal use by the current design, and so it need not have the stock name column. Of course, if in future, the Stock Screener Application allows a user to maintain Watch List, the stock name should also be added to the watchlist table. However, for quote, it is a bit tricky. As the stock name should be retrieved from the Data Feed Provider, which is Yahoo! Finance in our case, the most suitable time to get it is when the corresponding stock quote data is retrieved. Hence, a new column called stock_name is added to quote, as shown in chapter06_003.py: # -*- coding: utf-8 -*- # program: chapter06_003.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to add stock_name column def add_stockname_to_quote(ss):    ## add stock_name to quote    ss.execute('ALTER TABLE quote ' +                'ADD stock_name varchar') ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## add stock_name column add_stockname_to_quote(session) ## close Cassandra connection cluster.shutdown() It is quite self-explanatory. Here, we use the ALTER TABLE statement to add the stock_name column of the varchar data type to quote. Queries on alerts As mentioned previously, we are interested in two questions: How many alerts were generated on a stock over a specified period of time? How many alerts were generated on a particular date? For the first question, alertlist is sufficient to provide an answer. However, alertlist cannot answer the second question because its primary key is composed of symbol and price_time. We need to create another table specifically for that question. This is an example of modeling by query. Basically, the structure of the new table for the second question should resemble the structure of alertlist. We give that table a name, alert_by_date, and create it as shown in chapter06_004.py: # -*- coding: utf-8 -*- # program: chapter06_004.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to create alert_by_date table def create_alertbydate(ss):    ## create alert_by_date table if not exists    ss.execute('CREATE TABLE IF NOT EXISTS alert_by_date (' +               'symbol varchar,' +                'price_time timestamp,' +                'stock_name varchar,' +                'signal_price float,' +                'PRIMARY KEY (price_time, symbol))') ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## create alert_by_date table create_alertbydate(session) ## close Cassandra connection cluster.shutdown() When compared to alertlist in chapter06_002.py, alert_by_date only swaps the order of the columns in the compound primary key. One might think that a secondary index can be created on alertlist to achieve the same effect. 
Nonetheless, in Cassandra, a secondary index cannot be created on columns that are already engaged in the primary key. Always be aware of this constraint. We now finish the modifications on the data model. It is time for us to enhance the application logic in the next section. Summary This article extends the Stock Screener Application by a number of enhancements. We made changes to the data model to demonstrate the modeling by query techniques and how denormalization can help us achieve a high-performance application. Resources for Article: Further resources on this subject: An overview of architecture and modeling in Cassandra [Article] About Cassandra [Article] Basic Concepts and Architecture of Cassandra [Article]

Mastering Splunk: Lookups

Packt
17 Dec 2014
24 min read
In this article, by James Miller, author of the book Mastering Splunk, we will discuss Splunk lookups and workflows. The topics that will be covered in this article are as follows: The value of a lookup Design lookups File lookups Script lookups (For more resources related to this topic, see here.) Lookups Machines constantly generate data, usually in a raw form that is most efficient for processing by machines, but not easily understood by "human" data consumers. Splunk has the ability to identify unique identifiers and/or result or status codes within the data. This gives you the ability to enhance the readability of the data by adding descriptions or names as new search result fields. These fields contain information from an external source such as a static table (a CSV file) or the dynamic result of a Python command or a Python-based script. Splunk's lookups can use information within returned events or time information to determine how to add other fields from your previously defined external data sources. To illustrate, here is an example of a Splunk static lookup that: Uses the Business Unit value in an event Matches this value with the organization's business unit name in a CSV file Adds the definition to the event (as the Business Unit Name field) So, if you have an event where the Business Unit value is equal to 999999, the lookup will add the Business Unit Name value as Corporate Office to that event. More sophisticated lookups can: Populate a static lookup table from the results of a report. Use a Python script (rather than a lookup table) to define a field. For example, a lookup can use a script to return a server name when given an IP address. Perform a time-based lookup if your lookup table includes a field value that represents time. Let's take a look at an example of a search pipeline that creates a table based on IBM Cognos TM1 file extractions: sourcetype=csv 2014 "Current Forecast" "Direct" "513500" |rename May as "Month" Actual as "Version" "FY 2012" as Year 650693NLR001 as "Business Unit" 100000 as "FCST" "09997_Eliminations Co 2" as "Account" "451200" as "Activity" | eval RFCST= round(FCST) |Table Month, "Business Unit", RFCST The following table shows the results generated:   Now, add the lookup command to our search pipeline to have Splunk convert Business Unit into Business Unit Name: sourcetype=csv 2014 "Current Forecast" "Direct" "513500" |rename May as "Month" Actual as "Version" "FY 2012" as Year 650693NLR001 as "Business Unit" 100000 as "FCST" "09997_Eliminations Co 2" as "Account" "451200"as "Activity" | eval RFCST= round(FCST) |lookup BUtoBUName BU as "Business Unit" OUTPUT BUName as "Business Unit Name" | Table Month, "Business Unit", "Business Unit Name", RFCST The lookup command in our Splunk search pipeline will now add Business Unit Name in the results table:   Configuring a simple field lookup In this section, we will configure a simple Splunk lookup. Defining lookups in Splunk Web You can set up a lookup using the Lookups page (in Splunk Web) or by configuring stanzas in the props.conf and transforms.conf files. Let's take the easier approach first and use the Splunk Web interface. Before we begin, we need to establish our lookup table that will be in the form of an industry standard comma separated file (CSV). Our example is one that converts business unit codes to a more user-friendly business unit name. 
For example, we have the following information: Business unit code Business unit name 999999 Corporate office VA0133SPS001 South-western VA0133NLR001 North-east 685470NLR001 Mid-west In the events data, only business unit codes are included. In an effort to make our Splunk search results more readable, we want to add the business unit name to our results table. To do this, we've converted our information (shown in the preceding table) to a CSV file (named BUtoBUName.csv):   For this example, we've kept our lookup table simple, but lookup tables (files) can be as complex as you need them to be. They can have numerous fields (columns) in them. A Splunk lookup table has a few requirements, as follows: A table must contain a minimum of two columns Each of the columns in the table can have duplicate values You should use (plain) ASCII text and not non-UTF-8 characters Now, from Splunk Web, we can click on Settings and then select Lookups:   From the Lookups page, we can select Lookup table files:   From the Lookup table files page, we can add our new lookup file (BUtoBUName.csv):   By clicking on the New button, we see the Add new page where we can set up our file by doing the following: Select a Destination app (this is a drop-down list and you should select Search). Enter (or browse to) our file under Upload a lookup file. Provide a Destination filename. Then, we click on Save:   Once you click on Save, you should receive the Successfully saved "BUtoBUName" in search" message:   In the previous screenshot, the lookup file is saved by default as private. You will need to adjust permissions to allow other Splunk users to use it. Going back to the Lookups page, we can select Lookup definitions to see the Lookup definitions page:   In the Lookup definitions page, we can click on New to visit the Add new page (shown in the following screenshot) and set up our definition as follows: Destination app: The lookup will be part of the Splunk search app Name: Our file is BUtoBUName Type: Here, we will select File-based Lookup file: The filename is ButoBUName.csv, which we uploaded without the .csv suffix Again, we should see the Successfully saved "BUtoBUName" in search message:   Now, our lookup is ready to be used: Automatic lookups Rather than having to code for a lookup in each of your Splunk searches, you have the ability to configure automatic lookups for a particular source type. To do this from Splunk Web, we can click on Settings and then select Lookups:   From the Lookups page, click on Automatic lookups:   In the Automatic lookups page, click on New:   In the Add New page, we will fill in the required information to set up our lookup: Destination app: For this field, some options are framework, launcher, learned, search, and splunk_datapreview (for our example, select search). Name: This provide a user-friendly name that describes this automatic lookup. Lookup table: This is the name of the lookup table you defined with a CSV file (discussed earlier in this article). Apply to: This is the type that you want this automatic lookup to apply to. The options are sourcetype, source, or host (I've picked sourcetype). Named: This is the name of the type you picked under Apply to. I want my automatic search to apply for all searches with the sourcetype of csv. Lookup input fields: This is simple in my example. In my lookup table, the field to be searched on will be BU and the = field value will be the field in the event results that I am converting; in my case, it was the field 650693NLR001. 
Lookup output fields: This will be the field in the lookup table that I am using to convert to, which in my example is BUName and I want to call it Business Unit Name, so this becomes the = field value. Overwrite field values: This is a checkbox where you can tell Splunk to overwrite existing values in your output fields—I checked it. The Add new page The Splunk Add new page (shown in the following screenshot) is where you enter the lookup information (detailed in the previous section):   Once you have entered your automatic lookup information, you can click on Save and you will receive the Successfully saved "Business Unit to Business Unit Name" in search message:   Now, we can use the lookup in a search. For example, you can run a search with sourcetype=csv, as follows: sourcetype=csv 2014 "Current Forecast" "Direct" "513500" |rename May as "Month" Actual as "Version" "FY 2012" as Year 650693NLR001 as "Business Unit" 100000 as "FCST" "09997_Eliminations Co 2"as "Account" "451200" as "Activity" | eval RFCST= round(FCST) |Table "Business Unit", "Business Unit Name", Month, RFCST Notice in the following screenshot that Business Unit Name is converted to the user-friendly values from our lookup table, and we didn't have to add the lookup command to our search pipeline:   Configuration files In addition to using the Splunk web interface, you can define and configure lookups using the following files: props.conf transforms.conf To set up a lookup with these files (rather than using Splunk web), we can perform the following steps: Edit transforms.conf to define the lookup table. The first step is to edit the transforms.conf configuration file to add the new lookup reference. Although the file exists in the Splunk default folder ($SPLUNK_HOME/etc/system/default), you should edit the file in $SPLUNK_HOME/etc/system/local/ or $SPLUNK_HOME/etc/apps/<app_name>/local/ (if the file doesn't exist here, create it). Whenever you edit a Splunk .conf file, always edit a local version, keeping the original (system directory version) intact. In the current version of Splunk, there are two types of lookup tables: static and external. Static lookups use CSV files, and external (which are dynamic) lookups use Python scripting. You have to decide if your lookup will be static (in a file) or dynamic (use script commands). If you are using a file, you'll use filename; if you are going to use a script, you use external_cmd (both will be set in the transforms.conf file). You can also limit the number of matching entries to apply to an event by setting the max_matches option (this tells Splunk to use the first <integer> (in file order) number of entries). I've decided to leave the default for max_matches, so my transforms.conf file looks like the following: [butobugroup]filename = butobugroup.csv This step is optional. Edit props.conf to apply your lookup table automatically. For both static and external lookups, you stipulate the fields you want to match in the configuration file and the output from the lookup table that you defined in your transforms.conf file. It is okay to have multiple field lookups defined in one source lookup definition, but each lookup should have its own unique lookup name; for example, if you have multiple tables, you can name them LOOKUP-table01, LOOKUP-table02, and so on, or something perhaps more easily understood. 
If you add a lookup to your props.conf file, this lookup is automatically applied to all events from searches that have matching source types (again, as mentioned earlier, if your automatic lookup is very slow, it will also impact the speed of your searches). Restart Splunk to see your changes.

Implementing a lookup using configuration files – an example
To illustrate the use of configuration files to implement an automatic lookup, let's use a simple example. Once again, we want to convert a field from a unique identification code for an organization's business unit to a more user-friendly descriptive name called BU Group. What we will do is match the field bu in a lookup table butobugroup.csv with a field in our events and then add the bugroup (description) to the returned events. The following shows the contents of the butobugroup.csv file:

bu, bugroup
999999, leadership-group
VA0133SPS001, executive-group
650914FAC002, technology-group

You can put this file into $SPLUNK_HOME/etc/apps/<app_name>/lookups/ and carry out the following steps:

Put the butobugroup.csv file into $SPLUNK_HOME/etc/apps/search/lookups/, since we are using the search app.
As we mentioned earlier, we edit the transforms.conf file located at either $SPLUNK_HOME/etc/system/local/ or $SPLUNK_HOME/etc/apps/<app_name>/local/. We add the following two lines:

[butobugroup]
filename = butobugroup.csv

Next, as mentioned earlier in this article, we edit the props.conf file located at either $SPLUNK_HOME/etc/system/local/ or $SPLUNK_HOME/etc/apps/<app_name>/local/. Here, we add the following two lines:

[csv]
LOOKUP-check = butobugroup bu AS 650693NLR001 OUTPUT bugroup

Restart the Splunk server. You can (assuming you are logged in as an admin or have admin privileges) restart the Splunk server through the web interface by going to Settings, then selecting System, and finally Server controls.

Now, you can run a search for sourcetype=csv (as shown here):

sourcetype=csv 2014 "Current Forecast" "Direct" "513500" | rename May as "Month", 650693NLR001 as "Business Unit", 100000 as "FCST" | eval RFCST= round(FCST) | Table "Business Unit", "Business Unit Name", bugroup, Month, RFCST

You will see that the field bugroup can be returned as part of your event results.

Populating lookup tables
Of course, you can create CSV files from external systems (or perhaps even manually), but from time to time, you might have the opportunity to create lookup CSV files (tables) from event data using Splunk. A handy command to accomplish this is outputcsv (which is covered in detail later in this article). The following is a simple example of creating a CSV file from Splunk event data that can be used for a lookup table:

sourcetype=csv "Current Forecast" "Direct" | rename 650693NLR001 as "Business Unit" | Table "Business Unit", "Business Unit Name", bugroup | outputcsv splunk_master

The results are shown in the following screenshot. Of course, the output table isn't quite usable, since the results have duplicates.
Therefore, we can rewrite the Splunk search pipeline introducing the dedup command (as shown here): sourcetype=csv   "Current Forecast" "Direct"   | rename 650693NLR001 as "Business Unit" | dedup "Business Unit" | Table "Business Unit", "Business Unit Name", bugroup | outputcsv splunk_master Then, we can examine the results (now with more desirable results):   Handling duplicates with dedup This command allows us to set the number of duplicate events to be kept based on the values of a field (in other words, we can use this command to drop duplicates from our event results for a selected field). The event returned for the dedup field will be the first event found (if you provide a number directly after the dedup command, it will be interpreted as the number of duplicate events to keep; if you don't specify a number, dedup keeps only the first occurring event and removes all consecutive duplicates). The dedup command also lets you sort by field or list of fields. This will remove all the duplicates and then sort the results based on the specified sort-by field. Adding a sort in conjunction with the dedup command can affect the performance as Splunk performs the dedup operation and then sorts the results as a final step. Here is a search command using dedup: sourcetype=csv   "Current Forecast" "Direct"   | rename 650693NLR001 as "Business Unit" | dedup "Business Unit" sortby bugroup | Table "Business Unit", "Business Unit Name", bugroup | outputcsv splunk_master The result of the preceding command is shown in the following screenshot:   Now, we have our CSV lookup file (outputcsv splunk_master) generated and ready to be used:   Look for your generated output file in $SPLUNK_HOME/var/run/splunk. Dynamic lookups With a Splunk static lookup, your search reads through a file (a table) that was created or updated prior to executing the search. With dynamic lookups, the file is created at the time the search executes. This is possible because Splunk has the ability to execute an external command or script as part of your Splunk search. At the time of writing this book, Splunk only directly supports Python scripts for external lookups. If you are not familiar with Python, its implementation began in 1989 and is a widely used general-purpose, high-level programming language, which is often used as a scripting language (but is also used in a wide range of non-scripting contexts). Keep in mind that any external resources (such as a file) or scripts that you want to use with your lookup will need to be copied to a location where Splunk can find it. These locations are: $SPLUNK_HOME/etc/apps/<app_name>/bin $SPLUNK_HOME/etc/searchscripts The following sections describe the process of using the dynamic lookup example script that ships with Splunk (external_lookup.py). Using Splunk Web Just like with static lookups, Splunk makes it easy to define a dynamic or external lookup using the Splunk web interface. First, click on Settings and then select Lookups:   On the Lookups page, we can select Lookup table files to define a CSV file that contains the input file for our Python script. In the Add new page, we enter the following information: Destination app: For this field, select Search Upload a lookup file: Here, you can browse to the filename (my filename is dnsLookup.csv) Destination filename: Here, enter dnslookup The Add new page is shown in the following screenshot:   Now, click on Save. 
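Before continuing with the rest of the setup, it may help to see roughly what an external lookup script does. The following Python sketch is illustrative only — it is not the external_lookup.py script that ships with Splunk — and it assumes the usual external-lookup hand-off described later in this section: Splunk passes a CSV with host and ip columns to the script and reads the completed CSV back from its output. Error handling is deliberately simplified.

import csv
import socket
import sys

def lookup(host, ip):
    # Fill in whichever of the two fields is missing using DNS
    try:
        if host and not ip:
            # gethostbyname_ex returns (hostname, aliases, ip_list)
            return host, socket.gethostbyname_ex(host)[2][0]
        if ip and not host:
            # gethostbyaddr returns (hostname, aliases, ip_list)
            return socket.gethostbyaddr(ip)[0], ip
    except socket.error:
        pass  # leave the row unchanged if resolution fails
    return host, ip

def main():
    # Read the CSV handed over by Splunk and write the completed CSV back out
    reader = csv.DictReader(sys.stdin)
    writer = csv.DictWriter(sys.stdout, fieldnames=['host', 'ip'])
    writer.writeheader()
    for row in reader:
        host, ip = lookup(row.get('host', ''), row.get('ip', ''))
        writer.writerow({'host': host, 'ip': ip})

if __name__ == '__main__':
    main()

In this sketch, the command-line arguments configured for the lookup (host ip) are treated purely as the names of the fields the script operates on; the data itself is assumed to arrive as CSV, matching the behavior described for external_lookup.py later in this section. With that in mind, let's continue with the rest of the setup.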
The lookup file (shown in the following screenshot) is a text CSV file that needs to (at a minimum) contain the two field names that the Python (py) script accepts as arguments, in this case, host and ip. As mentioned earlier, this file needs to be copied to $SPLUNK_HOME/etc/apps/<app_name>/bin.

Next, from the Lookups page, select Lookup definitions and then click on New. This is where you define your external lookup. Enter the following information:

Type: For this, select External (as this lookup will run an external script)
Command: For this, enter external_lookup.py host ip (this is the name of the py script and its two arguments)
Supported fields: For this, enter host, ip (this indicates the two script input field names)

The following screenshot describes a new lookup definition. Now, click on Save.

Using configuration files instead of Splunk Web
Again, just like with static lookups in Splunk, dynamic lookups can also be configured in the Splunk transforms.conf file:

[myLookup]
external_cmd = external_lookup.py host ip
external_type = python
fields_list = host, ip
max_matches = 200

Let's learn more about the terms here:

[myLookup]: This is the report stanza.
external_cmd: This is the actual runtime command definition. Here, it executes the Python (py) script external_lookup.py, which requires two arguments (or parameters), host and ip.
external_type (optional): This indicates that this is a Python script. Although this is an optional entry in the transforms.conf file, it's a good habit to include it for readability and support.
fields_list: This lists all the fields supported by the external command or script, delimited by a comma and a space.

The next step is to modify the props.conf file, as follows:

[mylookup]
LOOKUP-rdns = dnslookup host ip OUTPUT ip

After updating the Splunk configuration files, you will need to restart Splunk.

External lookups
The external lookup example given uses a Python (py) script named external_lookup.py, which is a DNS lookup script that can return an IP address for a given host name or a host name for a provided IP address.

Explanation
The lookup table field in this example is named ip, so Splunk will mine all of the IP addresses found in the indexed logs' events and add the values of ip from the lookup table into the ip field in the search events. We can notice the following:

If you look at the py script, you will notice that the example uses the MS Windows supported socket.gethostbyname_ex(host) function
The host field has the same name in the lookup table and the events, so you don't need to do anything else

Consider the following search command:

sourcetype=tm1* | lookup dnslookup host | table host, ip

When you run this command, Splunk uses the lookup table to pass the values for the host field as a CSV file (the text CSV file we looked at earlier) into the external command script. The py script then outputs the results (with both the host and ip fields populated) and returns them to Splunk, which populates the ip field in a result table: Output of the py script with both the host and ip fields populated.

Time-based lookups
If your lookup table has a field value that represents time, you can use the time field to set up a Splunk fields lookup. As mentioned earlier, the Splunk transforms.conf file can be modified to add a lookup stanza.
For example, the following screenshot shows a file named MasteringDCHP.csv:   You can add the following code to the transforms.conf file: [MasteringDCHP]filename = MasteringDCHP.csvtime_field = TimeStamptime_format = %d/%m/%y %H:%M:%S $pmax_offset_secs = <integer>min_offset_secs = <integer> The file parameters are defined as follows: [MasteringDCHP]: This is the report stanza filename: This is the name of the CSV file to be used as the lookup table time_field: This is the field in the file that contains the time information and is to be used as the timestamp time_format: This indicates what format the time field is in max_offset_secs and min_offset_secs: This indicates min/max amount of offset time for an event to occur after a lookup entry Be careful with the preceding values; the offset relates to the timestamp in your lookup (CSV) file. Setting a tight (small) offset range might reduce the effectiveness of your lookup results! The last step will be to restart Splunk. An easier way to create a time-based lookup Again, it's a lot easier to use the Splunk Web interface to set up our lookup. Here is the step-by-step process: From Settings, select Lookups, and then Lookup table files: In the Lookup table files page, click on New, configure our lookup file, and then click on Save: You should receive the Successfully saved "MasterDHCP" in search message: Next, select Lookup definitions and from this page, click on New: In the Add new page, we define our lookup table with the following information: Destination app: For this, select search from the drop-down list Name: For this, enter MasterDHCP (this is the name you'll use in your lookup) Type: For this, select File-based (as this lookup table definition is a CSV file) Lookup file: For this, select the name of the file to be used from the drop-down list (ours is MasteringDCHP) Configure time-based lookup: Check this checkbox Name of time field: For this, enter TimeStamp (this is the field name in our file that contains the time information) Time format: For this, enter the string to describe to Splunk the format of our time field (our field uses this format: %d%m%y %H%M%S) You can leave the rest blank and click on Save. You should receive the Successfully saved "MasterDHCP" in search message: Now, we are ready to try our search: sourcetype=dh* | Lookup MasterDHCP IP as "IP" | table DHCPTimeStamp, IP, UserId | sort UserId The following screenshot shows the output:   Seeing double? Lookup table definitions are indicated with the attribute LOOKUP-<class> in the Splunk configuration file, props.conf, or in the web interface under Settings | Lookups | Lookup definitions. If you use the Splunk Web interface (which we've demonstrated throughout this article) to set up or define your lookup table definitions, Splunk will prevent you from creating duplicate table names, as shown in the following screenshot:   However, if you define your lookups using the configuration settings, it is important to try and keep your table definition names unique. If you do give the same name to multiple lookups, the following rules apply: If you have defined lookups with the same stanza (that is, using the same host, source, or source type), the first defined lookup in the configuration file wins and overrides all others. 
If lookups have different stanzas but overlapping events, the following logic is used by Splunk: Events that match the host get the host lookup Events that match the sourcetype get the sourcetype lookup Events that match both only get the host lookup It is a proven practice recommendation to make sure that all of your lookup stanzas have unique names. Command roundup This section lists several important Splunk commands you will use when working with lookups. The lookup command The Splunk lookup command is used to manually invoke field lookups using a Splunk lookup table that is previously defined. You can use Splunk Web (or the transforms.conf file) to define your lookups. If you do not specify OUTPUT or OUTPUTNEW, all fields in the lookup table (excluding the lookup match field) will be used by Splunk as output fields. Conversely, if OUTPUT is specified, the output lookup fields will overwrite existing fields and if OUTPUTNEW is specified, the lookup will not be performed for events in which the output fields already exist. For example, if you have a lookup table specified as iptousername with (at least) two fields, IP and UserId, for each event, Splunk will look up the value of the field IP in the table and for any entries that match, the value of the UserId field in the lookup table will be written to the field user_name in the event. The query is as follows: ... Lookup iptousernameIP as "IP" output UserId as user_name Always strive to perform lookups after any reporting commands in your search pipeline, so that the lookup only needs to match the results of the reporting command and not every individual event. The inputlookup and outputlookup commands The inputlookup command allows you to load search results from a specified static lookup table. It reads in a specified CSV filename (or a table name as specified by the stanza name in transforms.conf). If the append=t (that is, true) command is added, the data from the lookup file is appended to the current set of results (instead of replacing it). The outputlookup command then lets us write the results' events to a specified static lookup table (as long as this output lookup table is defined). So, here is an example of reading in the MasterDHCP lookup table (as specified in transforms.conf) and writing these event results to the lookup table definition NewMasterDHCP: | inputlookup MasterDHCP | outputlookup NewMasterDHCP After running the preceding command, we can see the following output:   Note that we can add the append=t command to the search in the following fashion: | inputlookup MasterDHCP.csv | inputlookup NewMasterDHCP.csv append=t | The inputcsv and outputcsv commands The inputcsv command is similar to the inputlookup command; in this, it loads search results, but this command loads from a specified CSV file. The filename must refer to a relative path in $SPLUNK_HOME/var/run/splunk and if the specified file does not exist and the filename did not have an extension, then a filename with a .csv extension is assumed. The outputcsv command lets us write our result events to a CSV file. 
Here is an example where we read in a CSV file named splunk_master.csv, search for the text phrase FPM, and then write any matching events to a CSV file named FPMBU.csv: | inputcsv splunk_master.csv | search "Business Unit Name"="FPM" | outputcsv FPMBU.csv The following screenshot shows the results from the preceding search command:   The following screenshot shows the resulting file generated as a result of the preceding command:   Here is another example where we read in the same CSV file (splunk_master.csv) and write out only events from 51 to 500: | inputcsv splunk_master start=50 max=500 Events are numbered starting with zero as the first entry (rather than 1). Summary In this article, we defined Splunk lookups and discussed their value. We also went through the two types of lookups, static and dynamic, and saw detailed, working examples of each. Various Splunk commands typically used with the lookup functionality were also presented. Resources for Article: Further resources on this subject: Working with Apps in Splunk [article] Processing Tweets with Apache Hive [article] Indexes [article]

Adding Graded Activities

Packt
16 Dec 2014
9 min read
This article by Rebecca Barrington, author of Moodle Gradebook Second Edition, teaches you how to add assignments and set up how they will be graded, including how to use our custom scales and add outcomes for grading. (For more resources related to this topic, see here.) As with all content within Moodle, we need to select Turn editing on within the course in order to be able to add resources and activities. All graded activities are added through the Add an activity or resource text available within each section of within a Moodle course. This text can be found in the bottom right of each section after editing has been turned on. There are a number of items that can be graded and will appear within the Gradebook. Assignments are the most feature-rich of all the graded activities and have many options available in order to customize how assessments can be graded. They can be used to provide assessment information for students, store grades, and provide feedback. When setting up the assignment, we can choose for students to submit their work electronically—either through file submission or online text, or we can review the assessment offline and use only the grade and feedback features of the assignment. Adding assignments There are many options *within the assignments, and throughout this article we will set up a number of different assignments and you'll learn about some of their most useful features and options. Let's have a go at creating a range of assignments that are ready for grading. Creating an assignment with a scale The first assignment that we will add will *make use of the PMD scale Click on the Turn editing on button. Click on Add an activity or resource. Click on Assignment and then click on Add. In the Assignment name box, type in the name of the assignment (such as Task 1). In the Description box, provide some assignment details. In the Availability section, we need to disable the date options. We will not make use of these options, but they can be very useful. To disable the options, click on the tick next to the Enable text. However, details of these options have *been provided for future* reference. The Allow submissions from section* is mostly relevant when the assignment will be submitted electronically, as students won't be able to submit their work until the date and time indicated here. The Due date section* can be used to indicate when the assignment needs to be submitted by. If students electronically submit their assignment after the date and time indicated here, the submission date and time will be shown in red in order to notify the teacher that it was submitted past the due date. The Cut off date section* enables teachers to set an extension period after the due date where late submissions will continue to be accepted. In the* Submission types section, ensure *that the File submissions checkbox is enabled by adding a tick there. This will enable students to submit their assignment electronically. There are additional options that we can choose as well. With Maximum number of uploaded files, we can indicate how many files a student can upload. Keep this as 1. We can also determine the Maximum submission size option for each file using the drop-down list shown in the following screenshot: Within the Feedback types section, ensure that all options under the Feedback types *section are *selected. Feedback comments enables *us to provide written feedback along with the grade. Feedback files enables us *to upload a file in order to provide feedback to a student. 
Offline grading worksheet will *provide us with the option to download a .csv file that contains core information about the assignment, and this can be used to add grades and feedback while working offline. This completed .csv file can be uploaded and the grades will be added to the assignments within the Gradebook. In the Submission settings section, we have options related to how students will submit their assignment and how they will reattempt submission if required. If Require students click submit button is left as No, students will upload* their assignment* and it will be available *to the teacher for grading. If this option is changed to Yes, students can upload their assignment, but the teacher will see that it is in the draft form. Students will click on Submit to indicate that it is ready to be graded. Require that students accept the submission statement will provide students *with a statement that they need to agree to when they submit their assignment. The default statement is This assignment is my own work, except where I have acknowledged the use of works of other people. The submission statement can be changed by a site administrator by navigating to Site administration | plugins | Activity modules | Assignment settings. The Attempts reopened drop-down list* provides options for the status of the assignment after it has been graded. Students will only be able to resubmit their work when it is open. Therefore this setting will control when and if students are able to submit another version of their assignment. The options available to us are:Never: This option should be selected if students will not be able to submit another piece of work.Manually: This will enable anyone who has the role of a teacher to choose to reopen a submission that enables a student to submit their work again.Automatically until pass: This option works when a pass grade is set within the Gradebook. After grading, if the student is awarded the minimum pass *grade or higher, the submission *will remain closed in order to prevent any changes to the submission. However, if the assignment is graded lower than the assigned pass grade, the submission will automatically reopen in order to enable the student to submit the assignment again.Maximum attempts: The maximum *attempts allowed for this assignment will limit the number of times an assignment is reopened. For example, if this option is set to 3, then a student will only be able to submit their assignment three times. After they have submitted their assignment for a third time, they will not be allowed to submit it again. The default is unlimited, but it can be changed by clicking on the drop-down list. In the Submission settings section, ensure that the options for Require students click on submit button and Require that students accept the submission statement are set to Yes. Also, change the Attempts reopened to Automatically until passed. Within the Grade section, navigate to Grade | Type | Scale and choose the PMD scale. Select Use marking workflow by changing the drop-down list to Yes.Use marking workflow is a new feature of Moodle 2.6* that enables *the grading process to go through a range of stages in order to indicate that the marking is in progress or is complete, is being reviewed, or is ready for release to students. Click on* Save and return to course. Creating an online assignment with a number grade The next *assignment that we will create will have an online* text option that will have a maximum grade of 20. 
The following steps show you how to create an online assignment with a number grade: Enable editing by clicking on Turn editing on. Click on Add an activity or resource. Click on Assignment and then click on Add. In the Assignment name box, type in the name of the assignment (such as Task 2). In the Description box, provide the assignment details. In the Submission types section, ensure that Online text has a tick next to it. This will enable students to type directly into Moodle. When choosing this option, we can also set a maximum word limit by clicking on the tick box next to the Enable text. After enabling this option, we can add a number to the textbox. For this assignment, enable a word limit of 200 words. When using* online text* submission, we have an additional feedback option within the Feedback types section. Under the Comment inline text, click on No and switch to Yes to enable yourself to add written feedback for students within the written text submitted by students. In the Submission settings section, ensure that the options for Require students click submit button and Require that students accept the submission statement are set to Yes. Also, change Attempts reopened to Automatically until passed. Within the Grades section, navigate to Grade | Type | Point and ensure that Maximum points is set to 20. Click *on* Save and return to course. Creating an assignment including outcomes The next assignment that we will *create will add some of the Outcomes: Enable editing by clicking on Turn editing on. Click on Add an activity or resource. Click on Assignment and then click on Add. In the Assignment name box, type in the name of the assignment (such as Task 3). In the Description box, provide the assignment details. In the Submission types box, ensure that Online text and File submissions are selected. Set Maximum number of uploaded files to 2. In the Submission settings section, ensure that the options for Require students to click submit button and Require that students accept the submission statement are amended to Yes. Change Attempts reopened to Manually. Within the Grades section, navigate to Grade | Type | Point and Maximum points is set to 100. In the Outcomes *section, choose the outcomes as Evidence provided and Criteria 1 met. Scroll to the bottom of the screen and click on Save and return to course. Summary In this article, we added a range of assignments that made use of number and scale grades as well as added outcomes to an assignment. Resources for Article: Further resources on this subject: Moodle for Online Communities [article] What's New in Moodle 2.0 [article] Moodle 2.0: What's New in Add a Resource [article]

Ridge Regression

Packt
16 Dec 2014
9 min read
In this article by Patrick R. Nicolas, the author of the book Scala for Machine Learning, we will cover the basics of ridge regression. The purpose of regression is to minimize a loss function, the residual sum of squares (RSS) being the one commonly used. The problem of overfitting can be addressed by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization. (For more resources related to this topic, see here.) Ln roughness penalty Regularization consists of adding a penalty function J(w) to the loss function (or RSS in the case of a regressive classifier) in order to prevent the model parameters (or weights) from reaching high values. A model that fits a training set very well tends to have many features variable with relatively large weights. This process is known as shrinkage. Practically, shrinkage consists of adding a function with model parameters as an argument to the loss function: The penalty function is completely independent from the training set {x,y}. The penalty term is usually expressed as a power to function of the norm of the model parameters (or weights) wd. For a model of D dimension the generic Lp-norm is defined as follows: Notation Regularization applies to parameters or weights associated to an observation. In order to be consistent with our notation w0 being the intercept value, the regularization applies to the parameters w1 …wd. The two most commonly used penalty functions for regularization are L1 and L2. Regularization in machine learning The regularization technique is not specific to the linear or logistic regression. Any algorithm that minimizes the residual sum of squares, such as support vector machine or feed-forward neural network, can be regularized by adding a roughness penalty function to the RSS. The L1 regularization applied to the linear regression is known as the Lasso regularization. The Ridge regression is a linear regression that uses the L2 regularization penalty. You may wonder which regularization makes sense for a given training set. In a nutshell, L2 and L1 regularizations differ in terms of computation efficiency, estimation, and features selection (refer to the 13.3 L1 regularization: basics section in the book Machine Learning: A Probabilistic Perspective, and the Feature selection, L1 vs. L2 regularization, and rotational invariance paper available at http://www.machinelearning.org/proceedings/icml2004/papers/354.pdf). The various differences between the two regularizations are as follows: Model estimation: L1 generates a sparser estimation of the regression parameters than L2. For large non-sparse dataset, L2 has a smaller estimation error than L1. Feature selection: L1 is more effective in reducing the regression weights for features with high value than L2. Therefore, L1 is a reliable features selection tool. Overfitting: Both L1 and L2 reduce the impact of overfitting. However, L1 has a significant advantage in overcoming overfitting (or excessive complexity of a model) for the same reason it is more appropriate for selecting features. Computation: L2 is conducive to a more efficient computation model. The summation of the loss function and L2 penalty w2 is a continuous and differentiable function for which the first and second derivative can be computed (convex minimization). The L1 term is the summation of |wi|, and therefore, not differentiable. Terminology The ridge regression is sometimes called the penalized least squares regression. 
The L2 regularization is also known as the weight decay. Let's implement the ridge regression, and then evaluate the impact of the L2-norm penalty factor. Ridge regression The ridge regression is a multivariate linear regression with a L2 norm penalty term, and can be calculated as follows: The computation of the ridge regression parameters requires the resolution of the system of linear equations similar to the linear regression. Matrix representation of ridge regression closed form is as follows: I is the identity matrix and it is using the QR decomposition, as shown here: Implementation The implementation of the ridge regression adds L2 regularization term to the multiple linear regression computation of the Apache Commons Math library. The methods of RidgeRegression have the same signature as its ordinary least squares counterpart. However, the class has to inherit the abstract base class AbstractMultipleLinearRegression in the Apache Commons Math and override the generation of the QR decomposition to include the penalty term, as shown in the following code: class RidgeRegression[T <% Double](val xt: XTSeries[Array[T]],                                    val y: DblVector,                                   val lambda: Double) {                   extends AbstractMultipleLinearRegression                    with PipeOperator[Array[T], Double] {    private var qr: QRDecomposition = null    private[this] val model: Option[RegressionModel] = …    … } Besides the input time series xt and the labels y, the ridge regression requires the lambda factor of the L2 penalty term. The instantiation of the class train the model. The steps to create the ridge regression models are as follows: Extract the Q and R matrices for the input values, newXSampleData (line 1) Compute the weights using the calculateBeta defined in the base class (line 2) Return the tuple regression weights calculateBeta and the residuals calculateResiduals private val model: Option[(DblVector, Double)] = { this.newXSampleData(xt.toDblMatrix) //1 newYSampleData(y) val _rss = calculateResiduals.toArray.map(x => x*x).sum val wRss = (calculateBeta.toArray, _rss) //2 Some(RegressionModel(wRss._1, wRss._2)) } The QR decomposition in the AbstractMultipleLinearRegression base class does not include the penalty term (line 3); the identity matrix with lambda factor in the diagonal has to be added to the matrix to be decomposed (line 4). override protected def newXSampleData(x: DblMatrix): Unit = { super.newXSampleData(x)   //3 val xtx: RealMatrix = getX val nFeatures = xt(0).size Range(0, nFeatures).foreach(i => xtx.setEntry(i,i,xtx.getEntry(i,i) + lambda)) //4 qr = new QRDecomposition(xtx) } The regression weights are computed by resolving the system of linear equations using substitution on the QR matrices. It overrides the calculateBeta function from the base class: override protected def calculateBeta: RealVector = qr.getSolver().solve(getY()) Test case The objective of the test case is to identify the impact of the L2 penalization on the RSS value, and then compare the predicted values with original values. Let's consider the first test case related to the regression on the daily price variation of the Copper ETF (symbol: CU) using the stock daily volatility and volume as feature. 
The implementation of the extraction of observations is identical as with the least squares regression: val src = DataSource(path, true, true, 1) val price = src |> YahooFinancials.adjClose val volatility = src |> YahooFinancials.volatility val volume = src |> YahooFinancials.volume //1   val _price = price.get.toArray val deltaPrice = XTSeries[Double](_price                                .drop(1)                                .zip(_price.take(_price.size -1))                                .map( z => z._1 - z._2)) //2 val data = volatility.get                      .zip(volume.get)                      .map(z => Array[Double](z._1, z._2)) //3 val features = XTSeries[DblVector](data.take(data.size-1)) val regression = new RidgeRegression[Double](features, deltaPrice, lambda) //4   regression.rss match { case Some(rss) => Display.show(rss, logger) …. The observed data, ETF daily price, and the features (volatility and volume) are extracted from the source src (line 1). The daily price change, deltaPrice, is computed using a combination of Scala take and drop methods (line 2). The features vector is created by zipping volatility and volume (line 3). The model is created by instantiating the RidgeRegression class (line 4). The RSS value, rss, is finally displayed (line 5). The RSS value, rss, is plotted for different values of lambda <= 1.0 in the following graph: Graph of RSS versus Lambda for Copper ETF The residual sum of squares decreased as λ increases. The curve seems to be reaching for a minimum around λ=1. The case of λ = 0 corresponds to the least squares regression. Next, let's plot the RSS value for λ varying between 1 and 100: Graph RSS versus large value Lambda for Copper ETF This time around RSS increases with λ before reaching a maximum for λ > 60. This behavior is consistent with other findings (refer to Lecture 5: Model selection and assessment, a lecture by H. Bravo and R. Irizarry from department of Computer Science, University of Maryland, in 2010, available at http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/selection.pdf). As λ increases, the overfitting gets more expensive, and therefore, the RSS value increases. The regression weights can by simply outputted as follows: regression.weights.get Let's plot the predicted price variation of the Copper ETF using the ridge regression with different value of lambda (λ): Graph of ridge regression on Copper ETF price variation with variable Lambda The original price variation of the Copper ETF Δ = price(t+1)-price(t) is plotted as λ =0. The predicted values for λ = 0.8 is very similar to the original data. The predicted values for λ = 0.8 follows the pattern of the original data with reduction of large variations (peaks and troves). The predicted values for λ = 5 corresponds to a smoothed dataset. The pattern of the original data is preserved but the magnitude of the price variation is significantly reduced. The reader is invited to apply the more elaborate K-fold validation routine and compute precision, recall, and F1 measure to confirm the findings. Summary The ridge regression is a powerful alternative to the more common least squares regression because it reduces the risk of overfitting. Contrary to the Naïve Bayes classifiers, it does not require conditional independence of the model features. Resources for Article: Further resources on this subject: Differences in style between Java and Scala code [Article] Dependency Management in SBT [Article] Introduction to MapReduce [Article]

About MongoDB

Packt
27 Nov 2014
17 min read
In this article by Amol Nayak, the author of MongoDB Cookbook, describes the various features of MongoDB. (For more resources related to this topic, see here.) MongoDB is a document-oriented database and is the most popular and favorite NoSQL database. The rankings given at http://db-engines.com/en/ranking shows us that MongoDB is sitting on the fifth rank overall as of August 2014 and is the first NoSQL product in this list. It is currently being used in production by a huge list of companies in various domains handling terabytes of data efficiently. MongoDB is developed to scale horizontally and cope up with the increasing data volumes. It is very simple to use and get started with, backed by a good support from its company MongoDB and has a vast array open source and proprietary tools build around it to improve developer and administrator's productivity. In this article, we will cover the following recipes: Single node installation of MongoDB with options from the config file Viewing database stats Creating an index and viewing plans of queries Single node installation of MongoDB with options from the config file As we're aware that providing options from the command line does the work, but it starts getting awkward as soon as the number of options we provide increases. We have a nice and clean alternative to providing the startup options from a configuration file rather than as command-line arguments. Getting ready Well, assuming that we have downloaded the MongoDB binaries from the download site, extracted it, and have the bin directory of MongoDB in the operating system's path variable (this is not mandatory but it really becomes convenient after doing it), the binaries can be downloaded from http://www.mongodb.org/downloads after selecting your host operating system. How to do it… The /data/mongo/db directory for the database and /logs/ for the logs should be created and present on your filesystem, with the appropriate permissions to write to it. Let's take a look at the steps in detail: Create a config file, which can have any arbitrary name. In our case, let's say we create the file at /conf/mongo.conf. We will then edit the file and add the following lines of code to it: port = 27000 dbpath = /data/mongo/db logpath = /logs/mongo.log smallfiles = true Start the Mongo server using the following command: > mongod --config /config/mongo.conf How it works… The properties are specified as <property name> = <value>. For all those properties that don't have values, for example, the smallfiles option, the value given is a Boolean value, true. If you need to have a verbose output, you will add v=true (or multiple v's to make it more verbose) to our config file. If you already know what the command-line option is, it is pretty easy to guess the value of the property in the file. It is the same as the command-line option, with just the hyphen removed. Viewing database stats In this recipe, we will see how to get the statistics of a database. Getting ready To find the stats of the database, we need to have a server up and running, and a single node is what should be ok. The data on which we would be operating needs to be imported into the database. Once these steps are completed, we are all set to go ahead with this recipe. How to do it… We will be using the test database for the purpose of this recipe. It already has the postalCodes collection in it. 
Let's take a look at the steps in detail:

Connect to the server using the Mongo shell by typing in the following command from the operating system terminal (it is assumed that the server is listening to port 27017):

$ mongo

On the shell, execute the following command and observe the output:

> db.stats()

Now, execute the following command, but this time with the scale parameter (observe the output):

> db.stats(1024)
{
"db" : "test",
"collections" : 3,
"objects" : 39738,
"avgObjSize" : 143.32699179626553,
"dataSize" : 5562,
"storageSize" : 16388,
"numExtents" : 8,
"indexes" : 2,
"indexSize" : 2243,
"fileSize" : 196608,
"nsSizeMB" : 16,
"dataFileVersion" : {
   "major" : 4,
   "minor" : 5
},
"ok" : 1
}

How it works…

Let us start by looking at the collections field. If you look carefully at the number and also execute the show collections command on the Mongo shell, you will find one extra collection in the stats as compared to what the command lists. The difference is one collection, which is hidden, and its name is system.namespaces. You may execute db.system.namespaces.find() to view its contents.

Getting back to the output of the stats operation on the database, the objects field in the result has an interesting value too. If we find the count of documents in the postalCodes collection, we see that it is 39732. The count shown here is 39738, which means there are six more documents. These six documents come from the system.namespaces and system.indexes collections. Executing a count query on these two collections will confirm it. Note that the test database doesn't contain any other collection apart from postalCodes. The figures will change if the database contains more collections with documents in them.

The scale parameter, which is a parameter to the stats function, divides the number of bytes by the given scale value. In this case, it is 1024, and hence, all the values will be in KB. Let's analyze the output:

> db.stats(1024)
{
"db" : "test",
"collections" : 3,
"objects" : 39738,
"avgObjSize" : 143.32699179626553,
"dataSize" : 5562,
"storageSize" : 16388,
"numExtents" : 8,
"indexes" : 2,
"indexSize" : 2243,
"fileSize" : 196608,
"nsSizeMB" : 16,
"dataFileVersion" : {
   "major" : 4,
   "minor" : 5
},
"ok" : 1
}

The following list shows the meaning of the important fields:

db: This is the name of the database whose stats are being viewed.
collections: This is the total number of collections in the database.
objects: This is the count of documents across all collections in the database. If we find the stats of a collection by executing db.<collection>.stats(), we get the count of documents in that collection. This attribute is the sum of the counts of all the collections in the database.
avgObjSize: This is simply the size (in bytes) of all the objects in all the collections in the database, divided by the count of the documents across all the collections. This value is not affected by the scale provided even though this is a size field.
dataSize: This is the total size of the data held across all the collections in the database. This value is affected by the scale provided.
storageSize: This is the total amount of storage allocated to collections in this database for storing documents. This value is affected by the scale provided.
numExtents: This is the count of the number of extents in the database across all the collections. This is basically the sum of numExtents in the collection stats for collections in this database.
indexes: This is the sum of the number of indexes across all collections in the database.
indexSize: This is the size (in bytes) of all the indexes of all the collections in the database. This value is affected by the scale provided.
fileSize: This is simply the sum of the sizes of all the database files you should find on the filesystem for this database. The files will be named test.0, test.1, and so on for the test database. This value is affected by the scale provided.
nsSizeMB: This is the size, in MB, of the .ns file of the database.

Another thing to note is the value of avgObjSize; there is something odd about it. Unlike the same field in a collection's stats, which is affected by the scale provided, in the database stats this value is always in bytes. This is pretty confusing, and one cannot really be sure why it is not scaled according to the provided scale.

Creating an index and viewing plans of queries

In this recipe, we will look at querying data, analyzing its performance by explaining the query plan, and then optimizing it by creating indexes.

Getting ready

For the creation of indexes, we need to have a server up and running. A simple single node is what we will need. The data with which we will be operating needs to be imported into the database. Once we have this prerequisite, we are good to go.

How to do it…

We will try to write a query that will find all the zip codes in a given state. To do this, perform the following steps:

Execute the following query to view the plan of a query:

> db.postalCodes.find({state:'Maharashtra'}).explain()

Take a note of the cursor, n, nscannedObjects, and millis fields in the result of the explain plan operation.

Let's execute the same query again, but this time, we will limit the results to only 100 results:

> db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()

Again, take a note of the cursor, n, nscannedObjects, and millis fields in the result.

We will now create an index on the state and pincode fields as follows:

> db.postalCodes.ensureIndex({state:1, pincode:1})

Execute the following query:

> db.postalCodes.find({state:'Maharashtra'}).explain()

Again, take a note of the cursor, n, nscannedObjects, millis, and indexOnly fields in the result.

Since we want only the pin codes, we will modify the query as follows and view its plan:

> db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()

Take a note of the cursor, n, nscannedObjects, nscanned, millis, and indexOnly fields in the result.

How it works…

There is a lot to explain here. We will first discuss what we just did and how to analyze the stats. Next, we will discuss some points to be kept in mind for index creation and some gotchas.

Analysis of the plan

Let's look at the first step and analyze the output of the query we executed:

> db.postalCodes.find({state:'Maharashtra'}).explain()

The output on my machine is as follows (I am skipping the nonrelevant fields for now):

{
"cursor" : "BasicCursor",
"n" : 6446,
"nscannedObjects" : 39732,
"nscanned" : 39732,
…
"millis" : 55,
…
}

The value of the cursor field in the result is BasicCursor, which means a full collection scan (all the documents are scanned one after another) happened in order to search for the matching documents in the entire collection. The value of n is 6446, which is the number of results that matched the query.
The nscanned and nscannedObjects fields have a value of 39,732, which is the number of documents in the collection that were scanned to retrieve the results. This is also the total number of documents present in the collection, and all of them were scanned for the result. Finally, millis is the number of milliseconds taken to retrieve the result.

Improving the query execution time

So far, the query doesn't look too good in terms of performance, and there is great scope for improvement. To demonstrate how the limit applied to the query affects the query plan, we can find the query plan again without the index but with the limit clause:

> db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()
{
"cursor" : "BasicCursor",
…
"n" : 100,
"nscannedObjects" : 19951,
"nscanned" : 19951,
…
"millis" : 30,
…
}

The query plan this time around is interesting. Though we still haven't created an index, we see an improvement in the time the query took for execution and in the number of objects scanned to retrieve the results. This is due to the fact that Mongo does not scan the remaining documents once the number of documents specified in the limit function is reached. We can thus conclude that it is recommended that you use the limit function to limit the number of results wherever the maximum number of documents to be accessed is known upfront. This might give better query performance. The word "might" is important, as in the absence of an index, the collection might still be completely scanned if the number of matches is not met.

Improvement using indexes

Moving on, we will create a compound index on state and pincode. The order of the index is ascending in this case (as the value is 1) and is not significant unless we plan to execute a multikey sort. This is a deciding factor as to whether the result can be sorted using only the index or whether Mongo needs to sort it in memory later on, before we return the results. As far as the plan of the query is concerned, we can see that there is a significant improvement:

{
"cursor" : "BtreeCursor state_1_pincode_1",
…
"n" : 6446,
"nscannedObjects" : 6446,
"nscanned" : 6446,
…
"indexOnly" : false,
…
"millis" : 16,
…
}

The cursor field now has the value BtreeCursor state_1_pincode_1, which shows that the index is indeed used now. As expected, the number of results stays the same at 6446. The number of entries scanned in the index and the number of documents scanned in the collection have now come down to the same number of documents as in the result. This is because we used an index that gave us the starting document from which to scan, and then only the required number of documents were scanned. This is similar to using a book's index to find a word rather than scanning the entire book to search for it. The time, millis, has come down too, as expected.

Improvement using covered indexes

This leaves us with one field, indexOnly, and we will see what this means. To know what this value is, we need to look briefly at how indexes operate. Indexes store a subset of fields of the original document in the collection. The fields present in the index are the same as those on which the index is created. The fields, however, are kept sorted in the index in an order specified during the creation of the index. Apart from the fields, there is an additional value stored in the index; this acts as a pointer to the original document in the collection.
Thus, whenever the user executes a query, if the query contains fields on which an index is present, the index is consulted to get a set of matches. The pointer stored with the index entries that match the query is then used to make another IO operation to fetch the complete document from the collection; this document is then returned to the user.

The value of indexOnly, which is false, indicates that the data requested by the user in the query is not entirely present in the index, and an additional IO operation is needed to retrieve the entire document from the collection by following the pointer from the index. Had the value been present in the index itself, an additional operation to retrieve the document from the collection would not be necessary, and the data from the index would be returned. This is called a covered index, and the value of indexOnly, in this case, will be true.

In our case, we just need the pin codes, so why not use projection in our queries to retrieve just what we need? This will also make the query covered by the index, as the index entry has just the state's name and the pin code, and the required data can be served completely without retrieving the original document from the collection. The plan of the query in this case is interesting too. Executing the following query results in the following plan:

db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()
{
"cursor" : "BtreeCursor state_1_pincode_1",
…
"n" : 6446,
"nscannedObjects" : 0,
"nscanned" : 6446,
…
"indexOnly" : true,
…
"millis" : 15,
…
}

The values of the nscannedObjects and indexOnly fields are something to be observed. As expected, since the data we requested in the projection in the find query is the pin code only, which can be served from the index alone, the value of indexOnly is true. In this case, we scanned 6,446 entries in the index, and thus, the nscanned value is 6446. We, however, didn't reach out to any document in the collection on disk, as this query was covered by the index alone, and no additional IO was needed to retrieve the entire document. Hence, the value of nscannedObjects is 0.

As this collection in our case is small, we do not see a significant difference in the execution time of the query. This will be more evident on larger collections. Making use of indexes is great and gives good performance. Making use of covered indexes gives even better performance. Another thing to remember is that, wherever possible, try and use projection to retrieve only the fields we need. The _id field is retrieved every time by default; unless we plan to use it, set _id:0 to not retrieve it if it is not part of the index. Executing a covered query is the most efficient way to query a collection.

Some gotchas of index creation

We will now see some pitfalls in index creation and some facts about how array fields are handled in an index. Some of the operators that do not use the index efficiently are the $where, $nin, and $exists operators. Whenever these operators are used in a query, one should bear in mind a possible performance bottleneck when the data size increases. Similarly, the $in operator must be preferred over the $or operator, as both can more or less be used to achieve the same result. As an exercise, try to find the pin codes in the states of Maharashtra and Gujarat from the postalCodes collection. Write two queries: one using the $or operator and the other using the $in operator. Explain the plan for both these queries.
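One possible shape for these two queries is sketched below; these exact statements are not from the recipe, but they use the same postalCodes collection, state field, and explain() call seen above, and the plans they produce will depend on your data and on the state_1_pincode_1 index created earlier:

> db.postalCodes.find({$or: [{state:'Maharashtra'}, {state:'Gujarat'}]}).explain()
> db.postalCodes.find({state: {$in: ['Maharashtra', 'Gujarat']}}).explain()

Comparing the cursor, nscanned, and millis fields of the two plans is a good way to see how each operator interacts with the index.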
What happens when an array field is used in the index? Mongo creates an index entry for each element present in the array field of a document. So, if there are 10 elements in an array in a document, there will be 10 index entries, one for each element in the array. However, there is a constraint while creating indexes that contain array fields. When creating indexes using multiple fields, not more than one field can be of the array type. This is done to prevent a possible explosion in the number of index entries on adding even a single element to an array used in the index. If we think about it carefully, an index entry is created for each element in the array. If multiple fields of the array type were allowed to be part of an index, we would have a large number of entries in the index, which would be a product of the lengths of these array fields. For example, a document added with two array fields, each of length 10, would add 100 entries to the index, had it been allowed to create one index using these two array fields. This should be good enough for now to scratch the surface of plain vanilla indexes.

Summary

This article provides detailed recipes that describe how to use the different features of MongoDB. MongoDB is a document-oriented, leading NoSQL database, which offers linear scalability, thus making it a good contender for high-volume, high-performance systems across all business domains. It has an edge over the majority of NoSQL solutions for its ease of use, high performance, and rich features. In this article, we learned how to perform a single node installation of MongoDB with options from the config file, how to view database stats, and how to create an index from the shell and view the plans of queries.

Resources for Article:

Further resources on this subject:
Ruby with MongoDB for Web Development [Article]
MongoDB data modeling [Article]
Using Mongoid [Article]


Logistic regression

Packt
27 Nov 2014
9 min read
This article is written by Breck Baldwin and Krishna Dayanidhi, the authors of Natural Language Processing with Java and LingPipe Cookbook. In this article, we will cover logistic regression. (For more resources related to this topic, see here.)

Logistic regression is probably responsible for the majority of industrial classifiers, with the possible exception of naïve Bayes classifiers. It is almost certainly one of the best performing classifiers available, albeit at the cost of slow training and considerable complexity in configuration and tuning. Logistic regression is also known as maximum entropy, neural network classification with a single neuron, and by other names. The classifiers covered so far have been based on the underlying characters or tokens, but logistic regression uses unrestricted feature extraction, which allows arbitrary observations of the situation to be encoded in the classifier. This article closely follows a more complete tutorial at http://alias-i.com/lingpipe/demos/tutorial/logistic-regression/read-me.html.

How logistic regression works

All that logistic regression does is take a vector of feature weights over the data, apply a vector of coefficients, and do some simple math, which results in a probability for each class encountered in training. The complicated bit is in determining what the coefficients should be.

The following are some of the features produced by our training example for 21 tweets annotated for English e and non-English n. There are relatively few features because feature weights are being pushed to 0.0 by our prior, and once a weight is 0.0, the feature is removed. Note that one category, n, is set to 0.0 for all the features; this is a property of the logistic regression process, which fixes one category's features to 0.0 and adjusts all other categories' features with respect to that:

FEATURE    e       n
I        : 0.37    0.0
!        : 0.30    0.0
Disney   : 0.15    0.0
"        : 0.08    0.0
to       : 0.07    0.0
anymore  : 0.06    0.0
isn      : 0.06    0.0
'        : 0.06    0.0
t        : 0.04    0.0
for      : 0.03    0.0
que      : -0.01   0.0
moi      : -0.01   0.0
_        : -0.02   0.0
,        : -0.08   0.0
pra      : -0.09   0.0
?        : -0.09   0.0

Take the string, I luv Disney, which will only have two non-zero features: I=0.37 and Disney=0.15 for e, and zeros for n. Since there is no feature that matches luv, it is ignored. The probability that the tweet is English breaks down to:

vectorMultiply(e,[I,Disney]) = exp(.37*1 + .15*1) = 1.68
vectorMultiply(n,[I,Disney]) = exp(0*1 + 0*1) = 1

We will rescale to a probability by summing the outcomes and dividing by that sum:

p(e|[I,Disney]) = 1.68/(1.68 + 1) = 0.62
p(n|[I,Disney]) = 1/(1.68 + 1) = 0.38

This is how the math works when running a logistic regression model. Training is another issue entirely.

Getting ready

This example assumes the same framework that we have been using all along to get training data from .csv files, train the classifier, and run it from the command line. Setting up to train the classifier is a bit complex because of the number of parameters and objects used in training. The main() method starts with what should be familiar classes and methods:

public static void main(String[] args) throws IOException {
String trainingFile = args.length > 0 ?
args[0]: "data/disney_e_n.csv";List<String[]> training= Util.readAnnotatedCsvRemoveHeader(new File(trainingFile));int numFolds = 0;XValidatingObjectCorpus<Classified<CharSequence>> corpus= Util.loadXValCorpus(training,numFolds);TokenizerFactory tokenizerFactory= IndoEuropeanTokenizerFactory.INSTANCE; Note that we are using XValidatingObjectCorpus when a simpler implementation such as ListCorpus will do. We will not take advantage of any of its cross-validation features, because the numFolds param as 0 will have training visit the entire corpus. We are trying to keep the number of novel classes to a minimum, and we tend to always use this implementation in real-world gigs anyway. Now, we will start to build the configuration for our classifier. The FeatureExtractor<E> interface provides a mapping from data to features; this will be used to train and run the classifier. In this case, we are using a TokenFeatureExtractor() method, which creates features based on the tokens found by the tokenizer supplied during construction. This is similar to what naïve Bayes reasons over: FeatureExtractor<CharSequence> featureExtractor= new TokenFeatureExtractor(tokenizerFactory); The minFeatureCount item is usually set to a number higher than 1, but with small training sets, this is needed to get any performance. The thought behind filtering feature counts is that logistic regression tends to overfit low-count features that, just by chance, exist in one category of training data. As training data grows, the minFeatureCount value is adjusted usually by paying attention to cross-validation performance: int minFeatureCount = 1; The addInterceptFeature Boolean controls whether a category feature exists that models the prevalence of the category in training. The default name of the intercept feature is *&^INTERCEPT%$^&**, and you will see it in the weight vector output if it is being used. By convention, the intercept feature is set to 1.0 for all inputs. The idea is that if a category is just very common or very rare, there should be a feature that captures just this fact, independent of other features that might not be as cleanly distributed. This models the category probability in naïve Bayes in some way, but the logistic regression algorithm will decide how useful it is as it does with all other features: boolean addInterceptFeature = true;boolean noninformativeIntercept = true; These Booleans control what happens to the intercept feature if it is used. Priors, in the following code, are typically not applied to the intercept feature; this is the result if this parameter is true. Set the Boolean to false, and the prior will be applied to the intercept. Next is the RegressionPrior instance, which controls how the model is fit. What you need to know is that priors help prevent logistic regression from overfitting the data by pushing coefficients towards 0. There is a non-informative prior that does not do this with the consequence that if there is a feature that applies to just one category it will be scaled to infinity, because the model keeps fitting better as the coefficient is increased in the numeric estimation. Priors, in this context, function as a way to not be over confident in observations about the world. Another dimension in the RegressionPrior instance is the expected variance of the features. Low variance will push coefficients to zero more aggressively. The prior returned by the static laplace() method tends to work well for NLP problems. 
There is a lot going on, but it can be managed without a deep theoretical understanding. double priorVariance = 2;RegressionPrior prior= RegressionPrior.laplace(priorVariance,noninformativeIntercept); Next, we will control how the algorithm searches for an answer. AnnealingSchedule annealingSchedule= AnnealingSchedule.exponential(0.00025,0.999);double minImprovement = 0.000000001;int minEpochs = 100;int maxEpochs = 2000; AnnealingSchedule is best understood by consulting the Javadoc, but what it does is change how much the coefficients are allowed to vary when fitting the model. The minImprovement parameter sets the amount the model fit has to improve to not terminate the search, because the algorithm has converged. The minEpochs parameter sets a minimal number of iterations, and maxEpochs sets an upper limit if the search does not converge as determined by minImprovement. Next is some code that allows for basic reporting/logging. LogLevel.INFO will report a great deal of information about the progress of the classifier as it tries to converge: PrintWriter progressWriter = new PrintWriter(System.out,true);progressWriter.println("Reading data.");Reporter reporter = Reporters.writer(progressWriter);reporter.setLevel(LogLevel.INFO); Here ends the Getting ready section of one of our most complex classes—next, we will train and run the classifier. How to do it... It has been a bit of work setting up to train and run this class. We will just go through the steps to get it up and running: Note that there is a more complex 14-argument train method as well the one that extends configurability. This is the 10-argument version: LogisticRegressionClassifier<CharSequence> classifier= LogisticRegressionClassifier.<CharSequence>train(corpus,featureExtractor,minFeatureCount,addInterceptFeature,prior,annealingSchedule,minImprovement,minEpochs,maxEpochs,reporter); The train() method, depending on the LogLevel constant, will produce from nothing with LogLevel.NONE to the prodigious output with LogLevel.ALL. While we are not going to use it, we show how to serialize the trained model to disk: AbstractExternalizable.compileTo(classifier, new File("models/myModel.LogisticRegression")); Once trained, we will apply the standard classification loop with: Util.consoleInputPrintClassification(classifier); Run the preceding code in the IDE of your choice or use the command-line command: java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar:lib/opencsv-2.4.jar com.lingpipe.cookbook.chapter3.TrainAndRunLogReg The result is a big dump of information about the training: Reading data.:00 Feature Extractor class=class com.aliasi.tokenizer.TokenFeatureExtractor:00 min feature count=1:00 Extracting Training Data:00 Cold start:00 Regression callback handler=null:00 Logistic Regression Estimation:00 Monitoring convergence=true:00 Number of dimensions=233:00 Number of Outcomes=2:00 Number of Parameters=233:00 Number of Training Instances=21:00 Prior=LaplaceRegressionPrior(Variance=2.0,noninformativeIntercept=true):00 Annealing Schedule=Exponential(initialLearningRate=2.5E-4,base=0.999):00 Minimum Epochs=100:00 Maximum Epochs=2000:00 Minimum Improvement Per Period=1.0E-9:00 Has Informative Prior=true:00 epoch= 0 lr=0.000250000 ll= -20.9648 lp=-232.0139 llp= -252.9787 llp*= -252.9787:00 epoch= 1 lr=0.000249750 ll= -20.9406 lp=-232.0195 llp= -252.9602 llp*= -252.9602 The epoch reporting goes on until either the number of epochs is met or the search converges. 
In the following case, the number of epochs was met: :00 epoch= 1998 lr=0.000033868 ll= -15.4568 lp= -233.8125 llp= -249.2693 llp*= -249.2693 :00 epoch= 1999 lr=0.000033834 ll= -15.4565 lp= -233.8127 llp= -249.2692 llp*= -249.2692 Now, we can play with the classifier a bit: Type a string to be classified. Empty string to quit. I luv Disney Rank Category Score P(Category|Input) 0=e 0.626898085027528 0.626898085027528 1=n 0.373101914972472 0.373101914972472 This should look familiar; it is exactly the same result as the worked example at the start. That's it! You have trained up and used the world's most relevant industrial classifier. However, there's a lot more to harnessing the power of this beast. Summary In this article, we learned how to do logistic regression. Resources for Article: Further resources on this subject: Installing NumPy, SciPy, matplotlib, and IPython [Article] Introspecting Maya, Python, and PyMEL [Article] Understanding the Python regex engine [Article]


Creating reusable actions for agent behaviors with Lua

Packt
27 Nov 2014
18 min read
In this article by David Young, author of Learning Game AI Programming with Lua, we will create reusable actions for agent behaviors. (For more resources related to this topic, see here.) Creating userdata So far we've been using global data to store information about our agents. As we're going to create decision structures that require information about our agents, we'll create a local userdata table variable that contains our specific agent data as well as the agent controller in order to manage animation handling: local userData = {    alive, -- terminal flag    agent, -- Sandbox agent    ammo, -- current ammo    controller, -- Agent animation controller    enemy, -- current enemy, can be nil    health, -- current health    maxHealth -- max Health }; Moving forward, we will encapsulate more and more data as a means of isolating our systems from global variables. A userData table is perfect for storing any arbitrary piece of agent data that the agent doesn't already possess and provides a common storage area for data structures to manipulate agent data. So far, the listed data members are some common pieces of information we'll be storing; when we start creating individual behaviors, we'll access and modify this data. Agent actions Ultimately, any decision logic or structure we create for our agents comes down to deciding what action our agent should perform. Actions themselves are isolated structures that will be constructed from three distinct states: Uninitialized Running Terminated The typical lifespan of an action begins in uninitialized state and will then become initialized through a onetime initialization, and then, it is considered to be running. After an action completes the running phase, it moves to a terminated state where cleanup is performed. Once the cleanup of an action has been completed, actions are once again set to uninitialized until they wait to be reactivated. We'll start defining an action by declaring the three different states in which actions can be as well as a type specifier, so our data structures will know that a specific Lua table should be treated as an action. Remember, even though we use Lua in an object-oriented manner, Lua itself merely creates each instance of an object as a primitive table. It is up to the code we write to correctly interpret different tables as different objects. The use of a Type variable that is moving forward will be used to distinguish one class type from another. Action.lua: Action = {};   Action.Status = {    RUNNING = "RUNNING",    TERMINATED = "TERMINATED",    UNINITIALIZED = "UNINITIALIZED" };   Action.Type = "Action"; Adding data members To create an action, we'll pass three functions that the action will use for the initialization, updating, and cleanup. Additional information such as the name of the action and a userData variable, used for passing information to each callback function, is passed in during the construction time. Moving our systems away from global data and into instanced object-oriented patterns requires each instance of an object to store its own data. As our Action class is generic, we use a custom data member, which is userData, to store action-specific information. Whenever a callback function for the action is executed, the same userData table passed in during the construction time will be passed into each function. The update callback will receive an additional deltaTimeInMillis parameter in order to perform any time specific update logic. 
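Before we flesh out the constructor, here is a minimal, illustrative sketch (not taken from the book) of how a higher-level system might drive a single action through the uninitialized, running, and terminated states described above. The placeholder callbacks and the deltaTimeInMillis value are assumptions made only for this sketch; userData is the table defined at the start of this article, and Initialize, Update, and CleanUp are the Action member functions defined later in this article:

-- Placeholder callbacks, assumed only for this sketch.
local function ExampleInitialize(userData)
    -- One-time setup would go here.
end

local function ExampleUpdate(deltaTimeInMillis, userData)
    -- Per-update work would go here; terminate immediately for the sketch.
    return Action.Status.TERMINATED;
end

local function ExampleCleanUp(userData)
    -- One-time cleanup would go here.
end

local deltaTimeInMillis = 16; -- illustrative frame time in milliseconds

local action = Action.new(
    "example",
    ExampleInitialize,
    ExampleUpdate,
    ExampleCleanUp,
    userData);

-- One pass of a typical agent update:
if (action.status_ == Action.Status.UNINITIALIZED) then
    action:Initialize();
end

local status = action:Update(deltaTimeInMillis);

if (status == Action.Status.TERMINATED) then
    action:CleanUp();
end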
To flush out the Action class' constructor function, we'll store each of the callback functions as well as initialize some common data members: Action.lua: function Action.new(name, initializeFunction, updateFunction,        cleanUpFunction, userData)      local action = {};       -- The Action's data members.    action.cleanUpFunction_ = cleanUpFunction;    action.initializeFunction_ = initializeFunction;    action.updateFunction_ = updateFunction;    action.name_ = name or "";    action.status_ = Action.Status.UNINITIALIZED;    action.type_ = Action.Type;    action.userData_ = userData;           return action; end Initializing an action Initializing an action begins by calling the action's initialize callback and then immediately sets the action into a running state. This transitions the action into a standard update loop that is moving forward: Action.lua: function Action.Initialize(self)    -- Run the initialize function if one is specified.    if (self.status_ == Action.Status.UNINITIALIZED) then        if (self.initializeFunction_) then            self.initializeFunction_(self.userData_);        end    end       -- Set the action to running after initializing.    self.status_ = Action.Status.RUNNING; end Updating an action Once an action has transitioned to a running state, it will receive callbacks to the update function every time the agent itself is updated, until the action decides to terminate. To avoid an infinite loop case, the update function must return a terminated status when a condition is met; otherwise, our agents will never be able to finish the running action. An update function isn't a hard requirement for our actions, as actions terminate immediately by default if no callback function is present. Action.lua: function Action.Update(self, deltaTimeInMillis)    if (self.status_ == Action.Status.TERMINATED) then        -- Immediately return if the Action has already        -- terminated.        return Action.Status.TERMINATED;    elseif (self.status_ == Action.Status.RUNNING) then        if (self.updateFunction_) then            -- Run the update function if one is specified.                      self.status_ = self.updateFunction_(                deltaTimeInMillis, self.userData_);              -- Ensure that a status was returned by the update            -- function.            assert(self.status_);        else            -- If no update function is present move the action            -- into a terminated state.            self.status_ = Action.Status.TERMINATED;        end    end      return self.status_; end Action cleanup Terminating an action is very similar to initializing an action, and it sets the status of the action to uninitialized once the cleanup callback has an opportunity to finish any processing of the action. If a cleanup callback function isn't defined, the action will immediately move to an uninitialized state upon cleanup. During action cleanup, we'll check to make sure the action has fully terminated, and then run a cleanup function if one is specified. 
Action.lua: function Action.CleanUp(self)    if (self.status_ == Action.Status.TERMINATED) then        if (self.cleanUpFunction_) then            self.cleanUpFunction_(self.userData_);        end    end       self.status_ = Action.Status.UNINITIALIZED; end Action member functions Now that we've created the basic, initialize, update, and terminate functionalities, we can update our action constructor with CleanUp, Initialize, and Update member functions: Action.lua: function Action.new(name, initializeFunction, updateFunction,        cleanUpFunction, userData)       ...      -- The Action's accessor functions.    action.CleanUp = Action.CleanUp;    action.Initialize = Action.Initialize;    action.Update = Action.Update;       return action; end Creating actions With a basic action class out of the way, we can start implementing specific action logic that our agents can use. Each action will consist of three callback functions—initialization, update, and cleanup—that we'll use when we instantiate our action instances. The idle action The first action we'll create is the basic and default choice from our agents that are going forward. The idle action wraps the IDLE animation request to our soldier's animation controller. As the animation controller will continue looping our IDLE command until a new command is queued, we'll time our idle action to run for 2 seconds, and then terminate it to allow another action to run: SoldierActions.lua: function SoldierActions_IdleCleanUp(userData)    -- No cleanup is required for idling. end   function SoldierActions_IdleInitialize(userData)    userData.controller:QueueCommand(        userData.agent,        SoldierController.Commands.IDLE);           -- Since idle is a looping animation, cut off the idle    -- Action after 2 seconds.    local sandboxTimeInMillis = Sandbox.GetTimeInMillis(        userData.agent:GetSandbox());    userData.idleEndTime = sandboxTimeInMillis + 2000; end Updating our action requires that we check how much time has passed; if the 2 seconds have gone by, we terminate the action by returning the terminated state; otherwise, we return that the action is still running: SoldierActions.lua: function SoldierActions_IdleUpdate(deltaTimeInMillis, userData)    local sandboxTimeInMillis = Sandbox.GetTimeInMillis(        userData.agent:GetSandbox());    if (sandboxTimeInMillis >= userData.idleEndTime) then        userData.idleEndTime = nil;        return Action.Status.TERMINATED;    end    return Action.Status.RUNNING; end As we'll be using our idle action numerous times, we'll create a wrapper around initializing our action based on our three functions: SoldierLogic.lua: local function IdleAction(userData)    return Action.new(        "idle",        SoldierActions_IdleInitialize,        SoldierActions_IdleUpdate,        SoldierActions_IdleCleanUp,        userData); end The die action Creating a basic death action is very similar to our idle action. In this case, as death in our animation controller is a terminating state, all we need to do is request that the DIE command be immediately executed. From this point, our die action is complete, and it's the responsibility of a higher-level system to stop any additional processing of logic behavior. Typically, our agents will request this state when their health drops to zero. 
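As a rough illustration (this is not the decision structure used in the book), a higher-level selector could pick between these wrappers based on the health value stored in userData. DieAction is the wrapper defined just below, IdleAction is the one we created earlier, and the ChooseAction name is made up for this sketch:

-- Illustrative only: request the die action once health reaches zero,
-- otherwise fall back to idling.
local function ChooseAction(userData)
    if (userData.health <= 0) then
        return DieAction(userData);
    end

    return IdleAction(userData);
end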
In the special case that our agent dies due to falling, the soldier's animation controller will manage the correct animation playback and set the soldier's health to zero: SoldierActions.lua: function SoldierActions_DieCleanUp(userData)    -- No cleanup is required for death. end   function SoldierActions_DieInitialize(userData)    -- Issue a die command and immediately terminate.    userData.controller:ImmediateCommand(        userData.agent,        SoldierController.Commands.DIE);      return Action.Status.TERMINATED; end   function SoldierActions_DieUpdate(deltaTimeInMillis, userData)    return Action.Status.TERMINATED; end Creating a wrapper function to instantiate a death action is identical to our idle action: SoldierLogic.lua: local function DieAction(userData)    return Action.new(        "die",        SoldierActions_DieInitialize,        SoldierActions_DieUpdate,        SoldierActions_DieCleanUp,        userData); end The reload action Reloading is the first action that requires an animation to complete before we can consider the action complete, as the behavior will refill our agent's current ammunition count. As our animation controller is queue-based, the action itself never knows how many commands must be processed before the reload command has finished executing. To account for this during the update loop of our action, we wait till the command queue is empty, as the reload action will be the last command that will be added to the queue. Once the queue is empty, we can terminate the action and allow the cleanup function to award the ammo: SoldierActions.lua: function SoldierActions_ReloadCleanUp(userData)    userData.ammo = userData.maxAmmo; end   function SoldierActions_ReloadInitialize(userData)    userData.controller:QueueCommand(        userData.agent,        SoldierController.Commands.RELOAD);    return Action.Status.RUNNING; end   function SoldierActions_ReloadUpdate(deltaTimeInMillis, userData)    if (userData.controller:QueueLength() > 0) then        return Action.Status.RUNNING;    end       return Action.Status.TERMINATED; end SoldierLogic.lua: local function ReloadAction(userData)    return Action.new(        "reload",        SoldierActions_ReloadInitialize,        SoldierActions_ReloadUpdate,        SoldierActions_ReloadCleanUp,        userData); end The shoot action Shooting is the first action that directly interacts with another agent. In order to apply damage to another agent, we need to modify how the soldier's shots deal with impacts. When the soldier shot bullets out of his rifle, we added a callback function to handle the cleanup of particles; now, we'll add an additional functionality in order to decrement an agent's health if the particle impacts an agent: Soldier.lua: local function ParticleImpact(sandbox, collision)    Sandbox.RemoveObject(sandbox, collision.objectA);       local particleImpact = Core.CreateParticle(        sandbox, "BulletImpact");    Core.SetPosition(particleImpact, collision.pointA);    Core.SetParticleDirection(        particleImpact, collision.normalOnB);      table.insert(        impactParticles,        { particle = particleImpact, ttl = 2.0 } );       if (Agent.IsAgent(collision.objectB)) then        -- Deal 5 damage per shot.        Agent.SetHealth(            collision.objectB,            Agent.GetHealth(collision.objectB) - 5);    end end Creating the shooting action requires more than just queuing up a shoot command to the animation controller. 
As the SHOOT command loops, we'll queue an IDLE command immediately afterward so that the shoot action will terminate after a single bullet is fired. To have a chance at actually shooting an enemy agent, though, we first need to orient our agent to face toward its enemy. During the normal update loop of the action, we will forcefully set the agent to point in the enemy's direction. Forcefully setting the agent's forward direction during an action will allow our soldier to shoot but creates a visual artifact where the agent will pop to the correct forward direction. See whether you can modify the shoot action's update to interpolate to the correct forward direction for better visual results. SoldierActions.lua: function SoldierActions_ShootCleanUp(userData)    -- No cleanup is required for shooting. end   function SoldierActions_ShootInitialize(userData)    userData.controller:QueueCommand(        userData.agent,        SoldierController.Commands.SHOOT);    userData.controller:QueueCommand(        userData.agent,        SoldierController.Commands.IDLE);       return Action.Status.RUNNING; end   function SoldierActions_ShootUpdate(deltaTimeInMillis, userData)    -- Point toward the enemy so the Agent's rifle will shoot    -- correctly.    local forwardToEnemy = userData.enemy:GetPosition() –        userData.agent:GetPosition();    Agent.SetForward(userData.agent, forwardToEnemy);      if (userData.controller:QueueLength() > 0) then        return Action.Status.RUNNING;    end      -- Subtract a single bullet per shot.    userData.ammo = userData.ammo - 1;    return Action.Status.TERMINATED; end SoldierLogic.lua: local function ShootAction(userData)    return Action.new(        "shoot",        SoldierActions_ShootInitialize,        SoldierActions_ShootUpdate,        SoldierActions_ShootCleanUp,        userData); end The random move action Randomly moving is an action that chooses a random point on the navmesh to be moved to. This action is very similar to other actions that move, except that this action doesn't perform the moving itself. Instead, the random move action only chooses a valid point to move to and requires the move action to perform the movement: SoldierActions.lua: function SoldierActions_RandomMoveCleanUp(userData)   end   function SoldierActions_RandomMoveInitialize(userData)    local sandbox = userData.agent:GetSandbox();      local endPoint = Sandbox.RandomPoint(sandbox, "default");    local path = Sandbox.FindPath(        sandbox,        "default",        userData.agent:GetPosition(),        endPoint);       while #path == 0 do        endPoint = Sandbox.RandomPoint(sandbox, "default");        path = Sandbox.FindPath(            sandbox,            "default",            userData.agent:GetPosition(),            endPoint);    end       userData.agent:SetPath(path);    userData.agent:SetTarget(endPoint);    userData.movePosition = endPoint;       return Action.Status.TERMINATED; end   function SoldierActions_RandomMoveUpdate(userData)    return Action.Status.TERMINATED; end SoldierLogic.lua: local function RandomMoveAction(userData)    return Action.new(        "randomMove",        SoldierActions_RandomMoveInitialize,        SoldierActions_RandomMoveUpdate,        SoldierActions_RandomMoveCleanUp,        userData); end The move action Our movement action is similar to an idle action, as the agent's walk animation will loop infinitely. In order for the agent to complete a move action, though, the agent must reach within a certain distance of its target position or timeout. 
In this case, we can use 1.5 meters, as that's close enough to the target position to terminate the move action and half a second to indicate how long the move action can run for: SoldierActions.lua: function SoldierActions_MoveToCleanUp(userData)    userData.moveEndTime = nil; end   function SoldierActions_MoveToInitialize(userData)    userData.controller:QueueCommand(        userData.agent,        SoldierController.Commands.MOVE);       -- Since movement is a looping animation, cut off the move    -- Action after 0.5 seconds.    local sandboxTimeInMillis =        Sandbox.GetTimeInMillis(userData.agent:GetSandbox());    userData.moveEndTime = sandboxTimeInMillis + 500;      return Action.Status.RUNNING; end When applying the move action onto our agents, the indirect soldier controller will manage all animation playback and steer our agent along their path. The agent moving to a random position Setting a time limit for the move action will still allow our agents to move to their final target position, but gives other actions a chance to execute in case the situation has changed. Movement paths can be long, and it is undesirable to not handle situations such as death until the move action has terminated: SoldierActions.lua: function SoldierActions_MoveToUpdate(deltaTimeInMillis, userData)    -- Terminate the action after the allotted 0.5 seconds. The    -- decision structure will simply repath if the Agent needs    -- to move again.    local sandboxTimeInMillis =        Sandbox.GetTimeInMillis(userData.agent:GetSandbox());  if (sandboxTimeInMillis >= userData.moveEndTime) then        userData.moveEndTime = nil;        return Action.Status.TERMINATED;    end      path = userData.agent:GetPath();    if (#path ~= 0) then        offset = Vector.new(0, 0.05, 0);        DebugUtilities_DrawPath(            path, false, offset, DebugUtilities.Orange);        Core.DrawCircle(            path[#path] + offset, 1.5, DebugUtilities.Orange);    end      -- Terminate movement is the Agent is close enough to the    -- target.  if (Vector.Distance(userData.agent:GetPosition(),        userData.agent:GetTarget()) < 1.5) then          Agent.RemovePath(userData.agent);        return Action.Status.TERMINATED;    end      return Action.Status.RUNNING; end SoldierLogic.lua: local function MoveAction(userData)    return Action.new(        "move",        SoldierActions_MoveToInitialize,        SoldierActions_MoveToUpdate,        SoldierActions_MoveToCleanUp,        userData); end Summary In this article, we have taken a look at creating userdata and reuasable actions. Resources for Article: Further resources on this subject: Using Sprites for Animation [Article] Installing Gideros [Article] CryENGINE 3: Breaking Ground with Sandbox [Article]


Machine Learning Examples Applicable to Businesses

Packt
25 Nov 2014
7 min read
The purpose of this article by Michele Usuelli, author of the book R Machine Learning Essentials, is to show how machine learning helps in solving a business problem. (For more resources related to this topic, see here.)

Predicting the output

The past marketing campaign targeted part of the customer base. Among 1,000 other clients, how do we identify the 100 that are keenest to subscribe? We can build a model that learns from the data and estimates which clients are more similar to the ones that subscribed in the previous campaign. For each client, the model estimates a score that is higher if the client is more likely to subscribe. There are different machine learning models for determining the scores, and we use two well-performing techniques, as follows:

Logistic regression: This is a variation of linear regression used to predict a binary output
Random forest: This is an ensemble based on decision trees that works well in the presence of many features

In the end, we need to choose one of the two techniques. There are cross-validation methods that allow us to estimate model accuracy. Starting from that, we can measure the accuracy of both options and pick the one that performs better. After choosing the most appropriate machine learning algorithm, we can optimize it using cross-validation. However, in order to avoid overcomplicating the model building, we don't perform any feature selection or parameter optimization.

These are the steps to build and evaluate the models:

Load the randomForest package containing the random forest algorithm:

library('randomForest')

Define the formula specifying the output and the variable names. The formula is in the format output ~ feature1 + feature2 + ...:

arrayFeatures <- names(dtBank)
arrayFeatures <- arrayFeatures[arrayFeatures != 'output']
formulaAll <- paste('output', '~')
formulaAll <- paste(formulaAll, arrayFeatures[1])
for(nameFeature in arrayFeatures[-1]){
formulaAll <- paste(formulaAll, '+', nameFeature)
}
formulaAll <- formula(formulaAll)

Initialize the table containing all the testing sets:

dtTestBinded <- data.table()

Define the number of iterations:

nIter <- 10

Start a for loop:

for(iIter in 1:nIter){

Define the training and the test datasets:

indexTrain <- sample(
x = c(TRUE, FALSE),
size = nrow(dtBank),
replace = T,
prob = c(0.8, 0.2)
)
dtTrain <- dtBank[indexTrain]
dtTest <- dtBank[!indexTrain]

Select a subset from the test set in such a way that we have the same number of output == 0 and output == 1 rows. First, we split dtTest into two parts (dtTest0 and dtTest1) on the basis of the output and count the number of rows of each part (n0 and n1). Then, as dtTest0 has more rows, we randomly select n1 rows from it. In the end, we redefine dtTest by binding dtTest0 and dtTest1, as follows:

dtTest1 <- dtTest[output == 1]
dtTest0 <- dtTest[output == 0]
n0 <- nrow(dtTest0)
n1 <- nrow(dtTest1)
dtTest0 <- dtTest0[sample(x = 1:n0, size = n1)]
dtTest <- rbind(dtTest0, dtTest1)

Build the random forest model using randomForest. The formula argument defines the relationship between variables and the data argument defines the training dataset. In order to avoid overcomplicating the model, all the other parameters are left at their defaults:

modelRf <- randomForest(
formula = formulaAll,
data = dtTrain
)

Build the logistic regression model using glm, which is a function used to build Generalized Linear Models (GLMs). GLMs are a generalization of linear regression, and they allow us to define a link function that connects the linear predictor with the outputs.
The input is the same as the random forest, with the addition of family = binomial(logit) defining that the regression is logistic: modelLr <- glm(formula = formulaAll,data = dtTest,family = binomial(logit)) Predict the output of the random forest. The function is predict and its main arguments are object defining the model and newdata defining the test set, as follows: dtTest[, outputRf := predict(object = modelRf, newdata = dtTest, type='response')] Predict the output of the logistic regression, using predict similar to the random forest. The other argument is type='response' and it is necessary in the case of the logistic regression: dtTest[, outputLr := predict(object = modelLr, newdata = dtTest, type='response')] Add the new test set to dtTestBinded: dtTestBinded <- rbind(dtTestBinded, dtTest) End the for loop: } We built dtTestBinded containing the output column that defines which clients subscribed and the scores estimated by the models. Comparing the scores with the real output, we can validate the model performances. In order to explore dtTestBinded, we can build a chart showing how the scores of the non-subscribing clients are distributed. Then, we add the distribution of the subscribing clients to the chart and compare them. In this way, we can see the difference between the scores of the two groups. Since we use the same chart for the random forest and for the logistic regression, we define a function building the chart by following the given steps: Define the function and its input that includes the data table and the name of the score column: plotDistributions <- function(dtTestBinded, colPred){ Compute the distribution density for the clients that didn't subscribe. With output == 0, we extract the clients not subscribing, and using density, we define a density object. The adjust parameter defines the smoothing bandwidth that is a parameter of the way we build the curve starting from the data. The bandwidth can be interpreted as the level of detail: densityLr0 <- dtTestBinded[   output == 0,   density(get(colPred), adjust = 0.5)   ] Compute the distribution density for the clients that subscribed: densityLr1 <- dtTestBinded[   output == 1,   density(get(colPred), adjust = 0.5)   ] Define the colors in the chart using rgb. The colors are transparent red and transparent blue: col0 <- rgb(1, 0, 0, 0.3)col1 <- rgb(0, 0, 1, 0.3) Build the plot with the density of the clients not subscribing. Here, polygon is a function that adds the area to the chart: plot(densityLr0, xlim = c(0, 1), main = 'density')polygon(densityLr0, col = col0, border = 'black') Add the clients that subscribed to the chart: polygon(densityLr1, col = col1, border = 'black') Add the legend: legend(   'top',   c('0', '1'),   pch = 16,   col = c(col0, col1)) End the function: return()} Now, we can use plotDistributions on the random forest output: par(mfrow = c(1, 1))plotDistributions(dtTestBinded, 'outputRf') The histogram obtained is as follows:   The x-axis represents the score and the y-axis represents the density that is proportional to the number of clients that subscribed for similar scores. Since we don't have a client for each possible score, assuming a level of detail of 0.01, the density curve is smoothed in the sense that the density of each score is the average between the data with a similar score. The red and blue areas represent the non-subscribing and subscribing clients respectively. As can be easily noticed, the violet area comes from the overlapping of the two curves. 
For each score, we can identify which density is higher. Given that red represents the non-subscribing clients and blue the subscribing ones, if the higher curve at a given score is blue, a client with that score is more likely to subscribe, and vice versa. For the random forest, most of the non-subscribing client scores are between 0 and 0.2 and the density peak is around 0.05. The subscribing clients' scores are more spread out, although higher, and their peak is around 0.1. The two distributions overlap a lot, so it's not easy to identify which clients will subscribe starting from their scores. However, if the marketing campaign targets all customers with a score higher than 0.3, they will likely belong to the blue cluster. In conclusion, using the random forest, we are able to identify a small set of customers that will very likely subscribe.

Summary

In this article, you learned how to predict your output using proper machine learning techniques.

Resources for Article:

Further resources on this subject:
Using R for Statistics, Research, and Graphics [article]
Machine Learning in Bioinformatics [article]
Learning Data Analytics with R and Hadoop [article]

No to nodistinct

Packt
25 Nov 2014
4 min read
This article is written by Stephen Redmond, the author of Mastering QlikView. There is a great skill in creating the right expression to calculate the right answer. Being able to do this in all circumstances relies on having a good knowledge of creating advanced expressions. Of course, the best path to mastery in this subject is actually getting out and doing it, but there is a great argument here for regularly practicing with dummy or test datasets. (For more resources related to this topic, see here.) When presented with a problem that needs to be solved, all the QlikView masters will not necessarily know immediately how to answer it. What they will have though is a very good idea of where to start, that is, what to try and what not to try. This is what I hope to impart to you here. Knowing how to create many advanced expressions will arm you to know where to apply them—and where not to apply them. This is one area of QlikView that is alien to many people. For some reason, they fear the whole idea of concepts such as Aggr. However, the reality is that these concepts are actually very simple and supremely logical. Once you get your head around them, you will wonder what all the fuss was about. No to nodistinct The Aggr function has as an optional clause, that is, the possibility of stating that the aggregation will be either distinct or nodistinct. The default option is distinct, and as such, is rarely ever stated. In this default operation, the aggregation will only produce distinct results for every combination of dimensions—just as you would expect from a normal chart or straight table. The nodistinct option only makes sense within a chart, one that has more dimensions than are in the Aggr statement. In this case, the granularity of the chart is lower than the granularity of Aggr, and therefore, QlikView will only calculate that Aggr for the first occurrence of lower granularity dimensions and will return null for the other rows. If we specify nodistinct, the same result will be calculated across all of the lower granularity dimensions. This can be difficult to understand without seeing an example, so let's look at a common use case for this option. We will start with a dataset: ProductSales:Load * Inline [Product, Territory, Year, SalesProduct A, Territory A, 2013, 100Product B, Territory A, 2013, 110Product A, Territory B, 2013, 120Product B, Territory B, 2013, 130Product A, Territory A, 2014, 140Product B, Territory A, 2014, 150Product A, Territory B, 2014, 160Product B, Territory B, 2014, 170]; We will build a report from this data using a pivot table: Now, we want to bring the value in the Total column into a new column under each year, perhaps to calculate a percentage for each year. 
We might think that, because the total is the sum for each Product and Territory, we might use an Aggr in the following manner: Sum(Aggr(Sum(Sales), Product, Territory)) However, as stated previously, because the chart includes an additional dimension (Year) than Aggr, the expression will only be calculated for the first occurrence of each of the lower granularity dimensions (in this case, for Year = 2013): The commonly suggested fix for this is to use Aggr without Sum and with nodistinct as shown: Aggr(NoDistinct Sum(Sales), Product, Territory) This will allow the Aggr expression to be calculated across all the Year dimension values, and at first, it will appear to solve the problem: The problem occurs when we decide to have a total row on this chart: As there is no aggregation function surrounding Aggr, it does not total correctly at the Product or Territory dimensions. We can't add an aggregation function, such as Sum, because it will break one of the other totals. However, there is something different that we can do; something that doesn't involve Aggr at all! We can use our old friend Total: Sum(Total<Product, Territory> Sales) This will calculate correctly at all the levels: There might be other use cases for using a nodistinct clause in Aggr, but they should be reviewed to see whether a simpler Total will work instead. Summary We discussed an important function, the Aggr function. We now know that the Aggr function is extremely useful, but we don't need to apply it in all circumstances where we have vertical calculations. Resources for Article: Further resources on this subject: Common QlikView script errors [article] Introducing QlikView elements [article] Creating sheet objects and starting new list using Qlikview 11 [article]
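For readers who think in Python, here is a rough pandas analogue of what the recommended Sum(Total&lt;Product, Territory&gt; Sales) expression computes: the Product/Territory total repeated on every Year row. It is a conceptual sketch only, not QlikView code; the column names simply mirror the inline table loaded above.

import pandas as pd

sales = pd.DataFrame({
    'Product':   ['Product A', 'Product B', 'Product A', 'Product B'] * 2,
    'Territory': ['Territory A', 'Territory A', 'Territory B', 'Territory B'] * 2,
    'Year':      [2013] * 4 + [2014] * 4,
    'Sales':     [100, 110, 120, 130, 140, 150, 160, 170],
})

# groupby(...).transform('sum') broadcasts the group total back onto every row,
# which is the behaviour the Total<...> qualifier gives us inside the pivot table.
sales['ProductTerritoryTotal'] = (
    sales.groupby(['Product', 'Territory'])['Sales'].transform('sum'))
sales['ShareOfTotal'] = sales['Sales'] / sales['ProductTerritoryTotal']
print(sales)

The broadcast total sits next to each yearly value, so the per-year percentage falls out of a simple division, just as it does in the chart.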

Understanding the HBase Ecosystem

Packt
24 Nov 2014
11 min read
This article by Shashwat Shriparv, author of the book, Learning HBase, will introduce you to the world of HBase. (For more resources related to this topic, see here.)

HBase is a horizontally scalable, distributed, open source, sorted map database. It runs on top of the Hadoop file system, that is, the Hadoop Distributed File System (HDFS). HBase is a NoSQL, nonrelational database that doesn't always require a predefined schema. It can be seen as a scalable, flexible, multidimensional spreadsheet into which any structure of data fits, with on-the-fly addition of new column fields and no rigidly defined column structure required before data can be inserted or queried. In other words, HBase is a column-based database that runs on top of the Hadoop distributed file system and supports features such as linear scalability (scale out), automatic failover, automatic sharding, and a more flexible schema.

HBase is modeled on Google BigTable, a compressed, high-performance, proprietary data store built on the Google filesystem. HBase was developed as a Hadoop subproject to support storage of structured data, and it can take advantage of most distributed filesystems (typically, the Hadoop Distributed File System, known as HDFS).

The following is key information about HBase and its features:

Developed by: Apache
Written in: Java
Type: Column oriented
License: Apache License
Lacking features of relational databases: SQL support, relations, primary, foreign, and unique key constraints, normalization
Website: http://hbase.apache.org
Distributions: Apache, Cloudera
Download link: http://mirrors.advancedhosters.com/apache/hbase/
Mailing lists: the user list, user-subscribe@hbase.apache.org; the developer list, dev-subscribe@hbase.apache.org
Blog: http://blogs.apache.org/hbase/

HBase layout on top of Hadoop

The following figure represents the layout of HBase on top of Hadoop. There is more than one ZooKeeper in the setup, which provides high availability of the master status, and a RegionServer may contain multiple regions. The RegionServers run on the machines where DataNodes run; there can be as many RegionServers as DataNodes. A RegionServer can host multiple HRegions; an HRegion has one HLog and multiple HFiles with their associated MemStores.

HBase can be seen as a master-slave database where the master is called HMaster. HMaster is responsible for coordination between the client application and the HRegionServers, and also for monitoring and recording metadata changes and management. The slaves are called HRegionServers, which serve the actual tables in the form of regions. These regions are the basic building blocks of HBase tables and hold the distributed table data. So, HMaster and the RegionServers work in coordination to serve the HBase tables and the HBase cluster. Usually, HMaster is co-hosted with the Hadoop NameNode daemon process on a server and communicates with the DataNode daemons for reading and writing data on HDFS. The RegionServers run, or are co-hosted, on the Hadoop DataNodes.
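Before comparing HBase with relational databases, the sorted map description above can be made concrete. The following is a short sketch in plain Python, used purely as a conceptual model of rows, column families, column qualifiers, and versioned cells; it is not HBase client code, and all names and values in it are made up.

# Model the sparse, multidimensional, sorted map: row key -> column family ->
# column qualifier -> {timestamp: value}. Each write adds a new timestamped version.
import time

table = {}   # row key -> {column family -> {qualifier -> {timestamp: value}}}

def put(row_key, family, qualifier, value):
    """Store a new version of a cell, keyed by its write timestamp."""
    cell = (table.setdefault(row_key, {})
                 .setdefault(family, {})
                 .setdefault(qualifier, {}))
    cell[time.time_ns()] = value

def get_latest(row_key, family, qualifier):
    """Return the most recent version of a cell, or None if the cell is absent."""
    versions = table.get(row_key, {}).get(family, {}).get(qualifier, {})
    return versions[max(versions)] if versions else None

put('row-001', 'info', 'name', 'Alice')
put('row-001', 'info', 'name', 'Alice B.')      # a newer version of the same cell
put('row-002', 'stats', 'visits', 7)            # sparse: different columns per row

# HBase keeps rows sorted by row key; scanning in key order looks like this.
for row_key in sorted(table):
    print(row_key, table[row_key])
print(get_latest('row-001', 'info', 'name'))    # -> 'Alice B.'

Notice that each row only stores the columns it actually has, which is exactly the sparseness HBase relies on to keep wide tables cheap.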
Comparing architectural differences between RDBMs and HBase Let's list the major differences between relational databases and HBase: Relational databases HBase Uses tables as databases Uses regions as databases Filesystems supported are FAT, NTFS, and EXT Filesystem supported is HDFS The technique used to store logs is commit logs The technique used to store logs is Write-Ahead Logs (WAL) The reference system used is coordinate system The reference system used is ZooKeeper Uses the primary key Uses the row key Partitioning is supported Sharding is supported Use of rows, columns, and cells Use of rows, column families, columns, and cells HBase features Let's see the major features of HBase that make it one of the most useful databases for the current and future industry: Automatic failover and load balancing: HBase runs on top of HDFS, which is internally distributed and automatically recovered using multiple block allocation and replications. It works with multiple HMasters and region servers. This failover is also facilitated using HBase and RegionServer replication. Automatic sharding: An HBase table is made up of regions that are hosted by RegionServers and these regions are distributed throughout the RegionServers on different DataNodes. HBase provides automatic and manual splitting of these regions to smaller subregions, once it reaches a threshold size to reduce I/O time and overhead. Hadoop/HDFS integration: It's important to note that HBase can run on top of other filesystems as well. While HDFS is the most common choice as it supports data distribution and high availability using distributed Hadoop, for which we just need to set some configuration parameters and enable HBase to communicate to Hadoop, an out-of-the-box underlying distribution is provided by HDFS. Real-time, random big data access: HBase uses log-structured merge-tree (LSM-tree) as data storage architecture internally, which merges smaller files to larger files periodically to reduce disk seeks. MapReduce: HBase has a built-in support of Hadoop MapReduce framework for fast and parallel processing of data stored in HBase. You can search for the Package org.apache.hadoop.hbase.mapreduce for more details. Java API for client access: HBase has a solid Java API support (client/server) for easy development and programming. Thrift and a RESTtful web service: HBase not only provides a thrift and RESTful gateway but also web service gateways for integrating and accessing HBase besides Java code (HBase Java APIs) for accessing and working with HBase. Support for exporting metrics via the Hadoop metrics subsystem: HBase provides Java Management Extensions (JMX) and exporting matrix for monitoring purposes with tools such as Ganglia and Nagios. Distributed: HBase works when used with HDFS. It provides coordination with Hadoop so that distribution of tables, high availability, and consistency is supported by it. Linear scalability (scale out): Scaling of HBase is not scale up but scale out, which means that we don't need to make servers more powerful but we add more machines to its cluster. We can add more nodes to the cluster on the fly. As soon as a new RegionServer node is up, the cluster can begin rebalancing, start the RegionServer on the new node, and it is scaled up, it is as simple as that. Column oriented: HBase stores each column separately in contrast with most of the relational databases, which uses stores or are row-based storage. So in HBase, columns are stored contiguously and not the rows. 
More about row- and column-oriented databases will follow. HBase shell support: HBase provides a command-line tool to interact with HBase and perform simple operations such as creating tables, adding data, and scanning data. This also provides full-fledged command-line tool using which we can interact with HBase and perform operations such as creating table, adding data, removing data, and a few other administrative commands. Sparse, multidimensional, sorted map database: HBase is a sparse, multidimensional, sorted map-based database, which supports multiple versions of the same record. Snapshot support: HBase supports taking snapshots of metadata for getting the previous or correct state form of data. HBase in the Hadoop ecosystem Let's see where HBase sits in the Hadoop ecosystem. In the Hadoop ecosystem, HBase provides a persistent, structured, schema-based data store. The following figure illustrates the Hadoop ecosystem: HBase can work as a separate entity on the local filesystem (which is not really effective as no distribution is provided) as well as in coordination with Hadoop as a separate but connected entity. As we know, Hadoop provides two services, a distributed files system (HDFS) for storage and a MapReduce framework for processing in a parallel mode. When there was a need to store structured data (data in the form of tables, rows and columns), which most of the programmers are already familiar with, the programmers were finding it difficult to process the data that was stored on HDFS as an unstructured flat file format. This led to the evolution of HBase, which provided a way to store data in a structural way. Consider that we have got a CSV file stored on HDFS and we need to query from it. We would need to write a Java code for this, which wouldn't be a good option. It would be better if we could specify the data key and fetch the data from that file. So, what we can do here is create a schema or table with the same structure of CSV file to store the data of the CSV file in the HBase table and query using HBase APIs, or HBase shell using key. Data representation in HBase Let's look into the representation of rows and columns in HBase table: An HBase table is divided into rows, column families, columns, and cells. Row keys are unique keys to identify a row, column families are groups of columns, columns are fields of the table, and the cell contains the actual value or the data. So, we have been through the introduction of HBase; now, let's see what Hadoop and its components are in brief. It is assumed here that you are already familiar with Hadoop; if not, following a brief introduction about Hadoop will help you to understand it. Hadoop Hadoop is an underlying technology of HBase, providing high availability, fault tolerance, and distribution. It is an Apache-sponsored, free, open source, Java-based programming framework which supports large dataset storage. It provides distributed file system and MapReduce, which is a distributed programming framework. It provides a scalable, reliable, distributed storage and development environment. Hadoop makes it possible to run applications on a system with tens to tens of thousands of nodes. The underlying distributed file system provides large-scale storage, rapid data access. It has the following submodules: Hadoop Common: This is the core component that supports the other Hadoop modules. It is like the master components facilitating communication and coordination between different Hadoop modules. 
Hadoop distributed file system: This is the underlying distributed file system, which is abstracted on the top of the local filesystem that provides high throughput of read and write operations of data on Hadoop. Hadoop YARN: This is the new framework that is shipped with newer releases of Hadoop. It provides job scheduling and job and resource management. Hadoop MapReduce: This is the Hadoop-based processing system that provides parallel processing of large data and datasets. Other Hadoop subprojects are HBase, Hive, Ambari, Avro, Cassandra (Cassandra isn't a Hadoop subproject, it's a related project; they solve similar problems in different ways), Mahout, Pig, Spark, ZooKeeper (ZooKeeper isn't a Hadoop subproject. It's a dependency shared by many distributed systems), and so on. All of these have different usability and the combination of all these subprojects forms the Hadoop ecosystem. Core daemons of Hadoop The following are the core daemons of Hadoop: NameNode: This stores and manages all metadata about the data present on the cluster, so it is the single point of contact to Hadoop. In the new release of Hadoop, we have an option of more than one NameNode for high availability. JobTracker: This runs on the NameNode and performs the MapReduce of the jobs submitted to the cluster. SecondaryNameNode: This maintains the backup of metadata present on the NameNode, and also records the file system changes. DataNode: This will contain the actual data. TaskTracker: This will perform tasks on the local data assigned by the JobTracker. The preceding are the daemons in the case of Hadoop v1 or earlier. In newer versions of Hadoop, we have ResourceManager instead of JobTracker, the node manager instead of TaskTrackers, and the YARN framework instead of a simple MapReduce framework. The following is the comparison between daemons in Hadoop 1 and Hadoop 2: Hadoop 1 Hadoop 2 HDFS NameNode Secondary NameNode DataNode   NameNode (more than one active/standby) Checkpoint node DataNode Processing MapReduce v1 JobTracker TaskTracker   YARN (MRv2) ResourceManager NodeManager Application Master Comparing HBase with Hadoop As we now know what HBase and what Hadoop are, let's have a comparison between HDFS and HBase for better understanding: Hadoop/HDFS HBase This provide filesystem for distributed storage This provides tabular column-oriented data storage This is optimized for storage of huge-sized files with no random read/write of these files This is optimized for tabular data with random read/write facility This uses flat files This uses key-value pairs of data The data model is not flexible Provides a flexible data model This uses file system and processing framework This uses tabular storage with built-in Hadoop MapReduce support This is mostly optimized for write-once read-many This is optimized for both read/write many Summary So in this article, we discussed the introductory aspects of HBase and it's features. We have also discussed HBase's components and their place in the HBase ecosystem. Resources for Article: Further resources on this subject: The HBase's Data Storage [Article] HBase Administration, Performance Tuning [Article] Comparative Study of NoSQL Products [Article]
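As a concrete illustration of the Thrift gateway mentioned in the feature list above, the following is a minimal sketch that talks to HBase from Python using the third-party happybase client. It assumes an HBase Thrift server is reachable on localhost and that happybase is installed; the table, row keys, and columns are made-up examples, not code from the book.

# Connect through the Thrift gateway, create a table with one column family,
# write a row, read it back, and scan by row-key prefix.
import happybase

connection = happybase.Connection('localhost')        # Thrift server, default port 9090
connection.create_table('users', {'info': dict()})    # one column family called 'info'

users = connection.table('users')
users.put(b'row-001', {b'info:name': b'Alice',
                       b'info:email': b'alice@example.com'})

print(users.row(b'row-001'))                           # read a single row back

# Rows come back in sorted row-key order, restricted here to a key prefix.
for row_key, data in users.scan(row_prefix=b'row-'):
    print(row_key, data)

connection.close()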

The plot function

Packt
18 Nov 2014
17 min read
In this article by L. Felipe Martins, the author of the book, IPython Notebook Essentials, has discussed about the plot() function, which is an important aspect of matplotlib, an IPython library for production of publication-quality graphs. (For more resources related to this topic, see here.) The plot() function is the workhorse of the matplotlib library. In this section, we will explore the line-plotting and formatting capabilities included in this function. To make things a bit more concrete, let's consider the formula for logistic growth, as follows: This model is frequently used to represent growth that shows an initial exponential phase, and then is eventually limited by some factor. The examples are the population in an environment with limited resources and new products and/or technological innovations, which initially attract a small and quickly growing market but eventually reach a saturation point. A common strategy to understand a mathematical model is to investigate how it changes as the parameters defining it are modified. Let's say, we want to see what happens to the shape of the curve when the parameter b changes. To be able to do what we want more efficiently, we are going to use a function factory. This way, we can quickly create logistic models with arbitrary values for r, a, b, and c. Run the following code in a cell: def make_logistic(r, a, b, c):    def f_logistic(t):        return a / (b + c * exp(-r * t))    return f_logistic The function factory pattern takes advantage of the fact that functions are first-class objects in Python. This means that functions can be treated as regular objects: they can be assigned to variables, stored in lists in dictionaries, and play the role of arguments and/or return values in other functions. In our example, we define the make_logistic() function, whose output is itself a Python function. Notice how the f_logistic() function is defined inside the body of make_logistic() and then returned in the last line. Let's now use the function factory to create three functions representing logistic curves, as follows: r = 0.15 a = 20.0 c = 15.0 b1, b2, b3 = 2.0, 3.0, 4.0 logistic1 = make_logistic(r, a, b1, c) logistic2 = make_logistic(r, a, b2, c) logistic3 = make_logistic(r, a, b3, c) In the preceding code, we first fix the values of r, a, and c, and define three logistic curves for different values of b. The important point to notice is that logistic1, logistic2, and logistic3 are functions. So, for example, we can use logistic1(2.5) to compute the value of the first logistic curve at the time 2.5. We can now plot the functions using the following code: tmax = 40 tvalues = linspace(0, tmax, 300) plot(tvalues, logistic1(tvalues)) plot(tvalues, logistic2(tvalues)) plot(tvalues, logistic3(tvalues)) The first line in the preceding code sets the maximum time value, tmax, to be 40. Then, we define the set of times at which we want the functions evaluated with the assignment, as follows: tvalues = linspace(0, tmax, 300) The linspace() function is very convenient to generate points for plotting. The preceding code creates an array of 300 equally spaced points in the interval from 0 to tmax. Note that, contrary to other functions, such as range() and arange(), the right endpoint of the interval is included by default. (To exclude the right endpoint, use the endpoint=False option.) After defining the array of time values, the plot() function is called to graph the curves. 
In its most basic form, it plots a single curve in a default color and line style. In this usage, the two arguments are two arrays. The first array gives the horizontal coordinates of the points being plotted, and the second array gives the vertical coordinates. A typical example will be the following function call: plot(x,y) The variables x and y must refer to NumPy arrays (or any Python iterable values that can be converted into an array) and must have the same dimensions. The points plotted have coordinates as follows: x[0], y[0] x[1], y[1] x[2], y[2] … The preceding command will produce the following plot, displaying the three logistic curves: You may have noticed that before the graph is displayed, there is a line of text output that looks like the following: [<matplotlib.lines.Line2D at 0x7b57c50>] This is the return value of the last call to the plot() function, which is a list (or with a single element) of objects of the Line2D type. One way to prevent the output from being shown is to enter None as the last row in the cell. Alternatively, we can assign the return value of the last call in the cell to a dummy variable: _dummy_ = plot(tvalues, logistic3(tvalues)) The plot() function supports plotting several curves in the same function call. We need to change the contents of the cell that are shown in the following code and run it again: tmax = 40 tvalues = linspace(0, tmax, 300) plot(tvalues, logistic1(tvalues),      tvalues, logistic2(tvalues),      tvalues, logistic3(tvalues)) This form saves some typing but turns out to be a little less flexible when it comes to customizing line options. Notice that the text output produced now is a list with three elements: [<matplotlib.lines.Line2D at 0x9bb6cc0>, <matplotlib.lines.Line2D at 0x9bb6ef0>, <matplotlib.lines.Line2D at 0x9bb9518>] This output can be useful in some instances. For now, we will stick with using one call to plot() for each curve, since it produces code that is clearer and more flexible. Let's now change the line options in the plot and set the plot bounds. Change the contents of the cell to read as follows: plot(tvalues, logistic1(tvalues),      linewidth=1.5, color='DarkGreen', linestyle='-') plot(tvalues, logistic2(tvalues),      linewidth=2.0, color='#8B0000', linestyle=':') plot(tvalues, logistic3(tvalues),      linewidth=3.5, color=(0.0, 0.0, 0.5), linestyle='--') axis([0, tmax, 0, 11.]) None Running the preceding command lines will produce the following plots: The options set in the preceding code are as follows: The first curve is plotted with a line width of 1.5, with the HTML color of DarkGreen, and a filled-line style The second curve is plotted with a line width of 2.0, colored with the RGB value given by the hexadecimal string '#8B0000', and a dotted-line style The third curve is plotted with a line width of 3.0, colored with the RGB components, (0.0, 0.0, 0.5), and a dashed-line style Notice that there are different ways of specifying a fixed color: a HTML color name, a hexadecimal string, or a tuple of floating-point values. In the last case, the entries in the tuple represent the intensity of the red, green, and blue colors, respectively, and must be floating-point values between 0.0 and 1.0. A complete list of HTML name colors can be found at http://www.w3schools.com/html/html_colornames.asp. Editor's Tip: For more insights on colors, check out https://dgtl.link/colors Line styles are specified by a symbolic string. 
The allowed values are shown in the following table: Symbol string Line style '-' Solid (the default) '--' Dashed ':' Dotted '-.' Dash-dot 'None', '', or '' Not displayed After the calls to plot(), we set the graph bounds with the function call: axis([0, tmax, 0, 11.]) The argument to axis() is a four-element list that specifies, in this order, the maximum and minimum values of the horizontal coordinates, and the maximum and minimum values of the vertical coordinates. It may seem non-intuitive that the bounds for the variables are set after the plots are drawn. In the interactive mode, matplotlib remembers the state of the graph being constructed, and graphics objects are updated in the background after each command is issued. The graph is only rendered when all computations in the cell are done so that all previously specified options take effect. Note that starting a new cell clears all the graph data. This interactive behavior is part of the matplotlib.pyplot module, which is one of the components imported by pylab. Besides drawing a line connecting the data points, it is also possible to draw markers at specified points. Change the graphing commands indicated in the following code snippet, and then run the cell again: plot(tvalues, logistic1(tvalues),      linewidth=1.5, color='DarkGreen', linestyle='-',      marker='o', markevery=50, markerfacecolor='GreenYellow',      markersize=10.0) plot(tvalues, logistic2(tvalues),      linewidth=2.0, color='#8B0000', linestyle=':',      marker='s', markevery=50, markerfacecolor='Salmon',      markersize=10.0) plot(tvalues, logistic3(tvalues),      linewidth=2.0, color=(0.0, 0.0, 0.5), linestyle='--',      marker = '*', markevery=50, markerfacecolor='SkyBlue',      markersize=12.0) axis([0, tmax, 0, 11.]) None Now, the graph will look as shown in the following figure: The only difference from the previous code is that now we added options to draw markers. The following are the options we use: The marker option specifies the shape of the marker. Shapes are given as symbolic strings. In the preceding examples, we use 'o' for a circular marker, 's' for a square, and '*' for a star. A complete list of available markers can be found at http://matplotlib.org/api/markers_api.html#module-matplotlib.markers. The markevery option specifies a stride within the data points for the placement of markers. In our example, we place a marker after every 50 data points. The markercolor option specifies the color of the marker. The markersize option specifies the size of the marker. The size is given in pixels. There are a large number of other options that can be applied to lines in matplotlib. A complete list is available at http://matplotlib.org/api/artist_api.html#module-matplotlib.lines. Adding a title, labels, and a legend The next step is to add a title and labels for the axes. Just before the None line, add the following three lines of code to the cell that creates the graph: title('Logistic growth: a={:5.2f}, c={:5.2f}, r={:5.2f}'.format(a, c, r)) xlabel('$t$') ylabel('$N(t)=a/(b+ce^{-rt})$') In the first line, we call the title() function to set the title of the plot. The argument can be any Python string. In our example, we use a formatted string: title('Logistic growth: a={:5.2f}, b={:5.2f}, r={:5.2f}'.format(a, c, r)) We use the format() method of the string class. The formats are placed between braces, as in {:5.2f}, which specifies a floating-point format with five spaces and two digits of precision. 
Each of the format specifiers is then associated sequentially with one of the data arguments of the method. A full documentation covering the details of string formatting is available at https://docs.python.org/2/library/string.html. The axis labels are set in the calls: xlabel('$t$') ylabel('$N(t)=a/(b+ce^{-rt})$') As in the title() functions, the xlabel() and ylabel() functions accept any Python string. Note that in the '$t$' and '$N(t)=a/(b+ce^{-rt}$' strings, we use LaTeX to format the mathematical formulas. This is indicated by the dollar signs, $...$, in the string. After the addition of a title and labels, our graph looks like the following: Next, we need a way to identify each of the curves in the picture. One way to do that is to use a legend, which is indicated as follows: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)]) The legend() function accepts a list of strings. Each string is associated with a curve in the order they are added to the plot. Notice that we are again using formatted strings. Unfortunately, the preceding code does not produce great results. The legend, by default, is placed in the top-right corner of the plot, which, in this case, hides part of the graph. This is easily fixed using the loc option in the legend function, as shown in the following code: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)], loc='upper left') Running this code, we obtain the final version of our logistic growth plot, as follows: The legend location can be any of the strings: 'best', 'upper right', 'upper left', 'lower left', 'lower right', 'right', 'center left', 'center right', 'lower center', 'upper center', and 'center'. It is also possible to specify the location of the legend precisely with the bbox_to_anchor option. To see how this works, modify the code for the legend as follows: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)], bbox_to_anchor=(0.9,0.35)) Notice that the bbox_to_anchor option, by default, uses a coordinate system that is not the same as the one we specified for the plot. The x and y coordinates of the box in the preceding example are interpreted as a fraction of the width and height, respectively, of the whole figure. A little trial-and-error is necessary to place the legend box precisely where we want it. Note that the legend box can be placed outside the plot area. For example, try the coordinates (1.32,1.02). The legend() function is quite flexible and has quite a few other options that are documented at http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend. Text and annotations In this subsection, we will show how to add annotations to plots in matplotlib. We will build a plot demonstrating the fact that the tangent to a curve must be horizontal at the highest and lowest points. We start by defining the function associated with the curve and the set of values at which we want the curve to be plotted, which is shown in the following code: f = lambda x: (x**3 - 6*x**2 + 9*x + 3) / (1 + 0.25*x**2) xvalues = linspace(0, 5, 200) The first line in the preceding code uses a lambda expression to define the f() function. We use this approach here because the formula for the function is a simple, one-line expression. 
The general form of a lambda expression is as follows: lambda <arguments> : <return expression> This expression by itself creates an anonymous function that can be used in any place that a function object is expected. Note that the return value must be a single expression and cannot contain any statements. The formula for the function may seem unusual, but it was chosen by trial-and-error and a little bit of calculus so that it produces a nice graph in the interval from 0 to 5. The xvalues array is defined to contain 200 equally spaced points on this interval. Let's create an initial plot of our curve, as shown in the following code: plot(xvalues, f(xvalues), lw=2, color='FireBrick') axis([0, 5, -1, 8]) grid() xlabel('$x$') ylabel('$f(x)$') title('Extreme values of a function') None # Prevent text output Most of the code in this segment is explained in the previous section. The only new bit is that we use the grid() function to draw a grid. Used with no arguments, the grid coincides with the tick marks on the plot. As everything else in matplotlib, grids are highly customizable. Check the documentation at http://matplotlib.org/1.3.1/api/pyplot_api.html#matplotlib.pyplot.grid. When the preceding code is executed, the following plot is produced: Note that the curve has a highest point (maximum) and a lowest point (minimum). These are collectively called the extreme values of the function (on the displayed interval, this function actually grows without bounds as x becomes large). We would like to locate these on the plot with annotations. We will first store the relevant points as follows: x_min = 3.213 f_min = f(x_min) x_max = 0.698 f_max = f(x_max) p_min = array([x_min, f_min]) p_max = array([x_max, f_max]) print p_min print p_max The variables, x_min and f_min, are defined to be (approximately) the coordinates of the lowest point in the graph. Analogously, x_max and f_max represent the highest point. Don't be concerned with how these points were found. For the purposes of graphing, even a rough approximation by trial-and-error would suffice. Now, add the following code to the cell that draws the plot, right below the title() command, as shown in the following code: arrow_props = dict(facecolor='DimGray', width=3, shrink=0.05,              headwidth=7) delta = array([0.1, 0.1]) offset = array([1.0, .85]) annotate('Maximum', xy=p_max+delta, xytext=p_max+offset,          arrowprops=arrow_props, verticalalignment='bottom',          horizontalalignment='left', fontsize=13) annotate('Minimum', xy=p_min-delta, xytext=p_min-offset,          arrowprops=arrow_props, verticalalignment='top',          horizontalalignment='right', fontsize=13) Run the cell to produce the plot shown in the following diagram: In the code, start by assigning the variables arrow_props, delta, and offset, which will be used to set the arguments in the calls to annotate(). The annotate() function adds a textual annotation to the graph with an optional arrow indicating the point being annotated. The first argument of the function is the text of the annotation. The next two arguments give the locations of the arrow and the text: xy: This is the point being annotated and will correspond to the tip of the arrow. We want this to be the maximum/minimum points, p_min and p_max, but we add/subtract the delta vector so that the tip is a bit removed from the actual point. xytext: This is the point where the text will be placed as well as the base of the arrow. We specify this as offsets from p_min and p_max using the offset vector. 
All other arguments of annotate() are formatting options: arrowprops: This is a Python dictionary containing the arrow properties. We predefine the dictionary, arrow_props, and use it here. Arrows can be quite sophisticated in matplotlib, and you are directed to the documentation for details. verticalalignment and horizontalalignment: These specify how the arrow should be aligned with the text. fontsize: This signifies the size of the text. Text is also highly configurable, and the reader is directed to the documentation for details. The annotate() function has a huge number of options; for complete details of what is available, users should consult the documentation at http://matplotlib.org/1.3.1/api/pyplot_api.html#matplotlib.pyplot.annotate for the full details. We now want to add a comment for what is being demonstrated by the plot by adding an explanatory textbox. Add the following code to the cell right after the calls to annotate(): bbox_props = dict(boxstyle='round', lw=2, fc='Beige') text(2, 6, 'Maximum and minimum pointsnhave horizontal tangents',      bbox=bbox_props, fontsize=12, verticalalignment='top') The text()function is used to place text at an arbitrary position of the plot. The first two arguments are the position of the textbox, and the third argument is a string containing the text to be displayed. Notice the use of 'n' to indicate a line break. The other arguments are configuration options. The bbox argument is a dictionary with the options for the box. If omitted, the text will be displayed without any surrounding box. In the example code, the box is a rectangle with rounded corners, with a border width of 2 pixels and the face color, beige. As a final detail, let's add the tangent lines at the extreme points. Add the following code: plot([x_min-0.75, x_min+0.75], [f_min, f_min],      color='RoyalBlue', lw=3) plot([x_max-0.75, x_max+0.75], [f_max, f_max],      color='RoyalBlue', lw=3) Since the tangents are segments of straight lines, we simply give the coordinates of the endpoints. The reason to add the code for the tangents at the top of the cell is that this causes them to be plotted first so that the graph of the function is drawn at the top of the tangents. This is the final result: The examples we have seen so far only scratch the surface of what is possible with matplotlib. The reader should read the matplotlib documentation for more examples. Summary In this article, we learned how to use matplotlib to produce presentation-quality plots. We covered two-dimensional plots and how to set plot options, and annotate and configure plots. You also learned how to add labels, titles, and legends. Edited on July 27, 2018 to replace a broken reference link. Resources for Article: Further resources on this subject: Installing NumPy, SciPy, matplotlib, and IPython [Article] SciPy for Computational Geometry [Article] Fast Array Operations with NumPy [Article]
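As a closing aid, the following is a condensed, self-contained sketch that pulls the plotting steps above into one runnable script. It reuses the parameter values already given in the text (r=0.15, a=20, c=15, and b=2, 3, 4) and swaps the pylab-style names for explicit numpy and matplotlib imports; nothing else is new.

# Three logistic curves with line styles, markers, axis bounds, labels, and a legend.
import numpy as np
import matplotlib.pyplot as plt

def make_logistic(r, a, b, c):
    def f_logistic(t):
        return a / (b + c * np.exp(-r * t))
    return f_logistic

r, a, c = 0.15, 20.0, 15.0
tvalues = np.linspace(0, 40, 300)

fig, ax = plt.subplots(figsize=(7, 5))
for b, style in zip((2.0, 3.0, 4.0), ('-', ':', '--')):
    f = make_logistic(r, a, b, c)
    ax.plot(tvalues, f(tvalues), linestyle=style, linewidth=2,
            marker='o', markevery=50, label='b={:5.2f}'.format(b))

ax.axis([0, 40, 0, 11])
ax.set_title('Logistic growth: a={:5.2f}, c={:5.2f}, r={:5.2f}'.format(a, c, r))
ax.set_xlabel('$t$')
ax.set_ylabel('$N(t)=a/(b+ce^{-rt})$')
ax.legend(loc='upper left')
plt.show()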

The HBase's Data Storage

Packt
13 Nov 2014
9 min read
In this article by Nishant Garg author of HBase Essentials, we will look at HBase's data storage from its architectural view point. (For more resources related to this topic, see here.) For most of the developers or users, the preceding topics are not of big interest, but for an administrator, it really makes sense to understand how underlying data is stored or replicated within HBase. Administrators are the people who deal with HBase, starting from its installation to cluster management (performance tuning, monitoring, failure, recovery, data security and so on). Let's start with data storage in HBase first. Data storage In HBase, tables are split into smaller chunks that are distributed across multiple servers. These smaller chunks are called regions and the servers that host regions are called RegionServers. The master process handles the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. In HBase implementation, the HRegionServer and HRegion classes represent the region server and the region, respectively. HRegionServer contains the set of HRegion instances available to the client and handles two types of files for data storage: HLog (the write-ahead log file, also known as WAL) HFile (the real data storage file) In HBase, there is a system-defined catalog table called hbase:meta that keeps the list of all the regions for user-defined tables. In older versions prior to 0.96.0, HBase had two catalog tables called-ROOT- and .META. The -ROOT- table was used to keep track of the location of the .META table. Version 0.96.0 onwards, the -ROOT- table is removed. The .META table is renamed as hbase:meta. Now, the location of .META is stored in Zookeeper. The following is the structure of the hbase:meta table. Key—the region key of the format ([table],[region start key],[region id]). A region with an empty start key is the first region in a table. The values are as follows: info:regioninfo(serialized the HRegionInfo instance for this region) info:server(server:port of the RegionServer containing this region) info:serverstartcode(start time of the RegionServer process that contains this region) When the table is split, two new columns will be created as info:splitA and info:splitB. These columns represent the two newly created regions. The values for these columns are also serialized as HRegionInfo instances. Once the split process is complete, the row that contains the old region information is deleted. In the case of data reading, the client application first connects to ZooKeeper and looks up the location of the hbase:meta table. For the next client, the HTable instance queries the hbase:meta table and finds out the region that contains the rows of interest and also locate the region server that is serving the identified region. The information about the region and region server is then cached by the client application for future interactions and avoids the lookup process. If the region is reassigned by the load balancer process or if the region server has expired, fresh lookup is done on the hbase:meta catalog table to get the new location of the user table region and cache is updated accordingly. At the object level, the HRegionServer class is responsible to create a connection with the region by creating HRegion objects. This HRegion instance sets up a store instance that has one or more StoreFile instances (wrapped around HFile) and MemStore. MemStore accumulates the data edits as it happens and buffers them into the memory. 
This is also important for accessing the recent edits of table data. As shown in the preceding diagram, the HRegionServer instance (the region server) contains the map of HRegion instances (regions) and also has an HLog instance that represents the WAL. There is a single block cache instance at the region-server level, which holds data from all the regions hosted on that region server. A block cache instance is created at the time of the region server startup and it can have an implementation of LruBlockCache, SlabCache, or BucketCache. The block cache also supports multilevel caching; that is, a block cache might have first-level cache, L1, as LruBlockCache and second-level cache, L2, as SlabCache or BucketCache. All these cache implementations have their own way of managing the memory; for example, LruBlockCache is like a data structure and resides on the JVM heap whereas the other two types of implementation also use memory outside of the JVM heap. HLog (the write-ahead log – WAL) In the case of writing the data, when the client calls HTable.put(Put), the data is first written to the write-ahead log file (which contains actual data and sequence number together represented by the HLogKey class) and also written in MemStore. Writing data directly into MemStrore can be dangerous as it is a volatile in-memory buffer and always open to the risk of losing data in case of a server failure. Once MemStore is full, the contents of the MemStore are flushed to the disk by creating a new HFile on the HDFS. While inserting data from the HBase shell, the flush command can be used to write the in-memory (memstore) data to the store files. If there is a server failure, the WAL can effectively retrieve the log to get everything up to where the server was prior to the crash failure. Hence, the WAL guarantees that the data is never lost. Also, as another level of assurance, the actual write-ahead log resides on the HDFS, which is a replicated filesystem. Any other server having a replicated copy can open the log. The HLog class represents the WAL. When an HRegion object is instantiated, the single HLog instance is passed on as a parameter to the constructor of HRegion. In the case of an update operation, it saves the data directly to the shared WAL and also keeps track of the changes by incrementing the sequence numbers for each edits. WAL uses a Hadoop SequenceFile, which stores records as sets of key-value pairs. Here, the HLogKey instance represents the key, and the key-value represents the rowkey, column family, column qualifier, timestamp, type, and value along with the region and table name where data needs to be stored. Also, the structure starts with two fixed-length numbers that indicate the size and value of the key. The following diagram shows the structure of a key-value pair: The WALEdit class instance takes care of atomicity at the log level by wrapping each update. For example, in the case of a multicolumn update for a row, each column is represented as a separate KeyValue instance. If the server fails after updating few columns to the WAL, it ends up with only a half-persisted row and the remaining updates are not persisted. Atomicity is guaranteed by wrapping all updates that comprise multiple columns into a single WALEdit instance and writing it in a single operation. For durability, a log writer's sync() method is called, which gets the acknowledgement from the low-level filesystem on each update. 
This method also takes care of writing the WAL to the replication servers (from one datanode to another). The log flush time can be set to as low as you want, or even be kept in sync for every edit to ensure high durability but at the cost of performance. To take care of the size of the write ahead log file, the LogRoller instance runs as a background thread and takes care of rolling log files at certain intervals (the default is 60 minutes). Rolling of the log file can also be controlled based on the size and hbase.regionserver.logroll.multiplier. It rotates the log file when it becomes 90 percent of the block size, if set to 0.9. HFile (the real data storage file) HFile represents the real data storage file. The files contain a variable number of data blocks and fixed number of file info blocks and trailer blocks. The index blocks records the offsets of the data and meta blocks. Each data block contains a magic header and a number of serialized KeyValue instances. The default size of the block is 64 KB and can be as large as the block size. Hence, the default block size for files in HDFS is 64 MB, which is 1,024 times the HFile default block size but there is no correlation between these two blocks. Each key-value in the HFile is represented as a low-level byte array. Within the HBase root directory, we have different files available at different levels. Write-ahead log files represented by the HLog instances are created in a directory called WALs under the root directory defined by the hbase.rootdir property in hbase-site.xml. This WALs directory also contains a subdirectory for each HRegionServer. In each subdirectory, there are several write-ahead log files (because of log rotation). All regions from that region server share the same HLog files. In HBase, every table also has its own directory created under the data/default directory. This data/default directory is located under the root directory defined by the hbase.rootdir property in hbase-site.xml. Each table directory contains a file called .tableinfo within the .tabledesc folder. This .tableinfo file stores the metadata information about the table, such as table and column family schemas, and is represented as the serialized HTableDescriptor class. Each table directory also has a separate directory for every region comprising the table, and the name of this directory is created using the MD5 hash portion of a region name. The region directory also has a .regioninfo file that contains the serialized information of the HRegionInfo instance for the given region. Once the region exceeds the maximum configured region size, it splits and a matching split directory is created within the region directory. This size is configured using the hbase.hregion.max.filesize property or the configuration done at the column-family level using the HColumnDescriptor instance. In the case of multiple flushes by the MemStore, the number of files might get increased on this disk. The compaction process running in the background combines the files to the largest configured file size and also triggers region split. Summary In this article, we have learned about the internals of HBase and how it stores the data. Resources for Article: Further resources on this subject: Big Data Analysis [Article] Advanced Hadoop MapReduce Administration [Article] HBase Administration, Performance Tuning [Article]
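The write path described above (append to the WAL first, buffer in the MemStore, then flush to an immutable store file) can be summarized with a small conceptual model. The following plain Python sketch is an illustration only, not HBase source code; the class, the flush threshold, and the in-memory "file" format are simplified assumptions.

# Every edit is logged before it is buffered; a full buffer becomes a sorted,
# immutable "store file", mimicking HFile creation on a MemStore flush.
class RegionSketch:
    def __init__(self, flush_threshold=3):
        self.wal = []                  # stands in for the shared HLog / WAL
        self.memstore = {}             # recent edits, sorted on flush
        self.store_files = []          # stands in for HFiles on HDFS
        self.flush_threshold = flush_threshold
        self.sequence_id = 0

    def put(self, row_key, value):
        self.sequence_id += 1
        self.wal.append((self.sequence_id, row_key, value))   # durability first
        self.memstore[row_key] = value                          # then the memory buffer
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}

region = RegionSketch()
for i in range(5):
    region.put('row-%03d' % i, 'value-%d' % i)

print(len(region.wal), 'WAL entries,', len(region.store_files), 'store file(s),',
      len(region.memstore), 'edit(s) still in the MemStore')

Running it shows every edit recorded in the log even though some edits are still only in the MemStore, which is exactly why the WAL makes the volatile buffer safe to lose.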

Postmodel Workflow

Packt
04 Nov 2014
23 min read
 This article written by Trent Hauck, the author of scikit-learn Cookbook, Packt Publishing, will cover the following recipes: K-fold cross validation Automatic cross validation Cross validation with ShuffleSplit Stratified k-fold Poor man's grid search Brute force grid search Using dummy estimators to compare results (For more resources related to this topic, see here.) Even though by design the articles are unordered, you could argue by virtue of the art of data science, we've saved the best for last. For the most part, each recipe within this article is applicable to the various models we've worked with. In some ways, you can think about this article as tuning the parameters and features. Ultimately, we need to choose some criteria to determine the "best" model. We'll use various measures to define best. Then in the Cross validation with ShuffleSplit recipe, we will randomize the evaluation across subsets of the data to help avoid overfitting. K-fold cross validation In this recipe, we'll create, quite possibly, the most important post-model validation exercise—cross validation. We'll talk about k-fold cross validation in this recipe. There are several varieties of cross validation, each with slightly different randomization schemes. K-fold is perhaps one of the most well-known randomization schemes. Getting ready We'll create some data and then fit a classifier on the different folds. It's probably worth mentioning that if you can keep a holdout set, then that would be best. For example, we have a dataset where N = 1000. If we hold out 200 data points, then use cross validation between the other 800 points to determine the best parameters. How to do it... First, we'll create some fake data, then we'll examine the parameters, and finally, we'll look at the size of the resulting dataset: >>> N = 1000>>> holdout = 200>>> from sklearn.datasets import make_regression>>> X, y = make_regression(1000, shuffle=True) Now that we have the data, let's hold out 200 points, and then go through the fold scheme like we normally would: >>> X_h, y_h = X[:holdout], y[:holdout]>>> X_t, y_t = X[holdout:], y[holdout:]>>> from sklearn.cross_validation import KFold K-fold gives us the option of choosing how many folds we want, if we want the values to be indices or Booleans, if want to shuffle the dataset, and finally, the random state (this is mainly for reproducibility). Indices will actually be removed in later versions. It's assumed to be True. Let's create the cross validation object: >>> kfold = KFold(len(y_t), n_folds=4) Now, we can iterate through the k-fold object: >>> output_string = "Fold: {}, N_train: {}, N_test: {}">>> for i, (train, test) in enumerate(kfold):       print output_string.format(i, len(y_t[train]),       len(y_t[test]))Fold: 0, N_train: 600, N_test: 200Fold: 1, N_train: 600, N_test: 200Fold: 2, N_train: 600, N_test: 200Fold: 3, N_train: 600, N_test: 200 Each iteration should return the same split size. How it works... It's probably clear, but k-fold works by iterating through the folds and holds out 1/n_folds * N, where N for us was len(y_t). From a Python perspective, the cross validation objects have an iterator that can be accessed by using the in operator. Often times, it's useful to write a wrapper around a cross validation object that will iterate a subset of the data. For example, we may have a dataset that has repeated measures for data points or we may have a dataset with patients and each patient having measures. 
We're going to mix it up and use pandas for this part: >>> import numpy as np>>> import pandas as pd>>> patients = np.repeat(np.arange(0, 100, dtype=np.int8), 8)>>> measurements = pd.DataFrame({'patient_id': patients,                   'ys': np.random.normal(0, 1, 800)}) Now that we have the data, we only want to hold out certain customers instead of data points: >>> custids = np.unique(measurements.patient_id)>>> customer_kfold = KFold(custids.size, n_folds=4)>>> output_string = "Fold: {}, N_train: {}, N_test: {}">>> for i, (train, test) in enumerate(customer_kfold):       train_cust_ids = custids[train]       training = measurements[measurements.patient_id.isin(                 train_cust_ids)]       testing = measurements[~measurements.patient_id.isin(                 train_cust_ids)]       print output_string.format(i, len(training), len(testing))Fold: 0, N_train: 600, N_test: 200Fold: 1, N_train: 600, N_test: 200Fold: 2, N_train: 600, N_test: 200Fold: 3, N_train: 600, N_test: 200 Automatic cross validation We've looked at the using cross validation iterators that scikit-learn comes with, but we can also use a helper function to perform cross validation for use automatically. This is similar to how other objects in scikit-learn are wrapped by helper functions, pipeline for instance. Getting ready First, we'll need to create a sample classifier; this can really be anything, a decision tree, a random forest, whatever. For us, it'll be a random forest. We'll then create a dataset and use the cross validation functions. How to do it... First import the ensemble module and we'll get started: >>> from sklearn import ensemble>>> rf = ensemble.RandomForestRegressor(max_features='auto') Okay, so now, let's create some regression data: >>> from sklearn import datasets>>> X, y = datasets.make_regression(10000, 10) Now that we have the data, we can import the cross_validation module and get access to the functions we'll use: >>> from sklearn import cross_validation>>> scores = cross_validation.cross_val_score(rf, X, y)>>> print scores[ 0.86823874 0.86763225 0.86986129] How it works... For the most part, this will delegate to the cross validation objects. One nice thing is that, the function will handle performing the cross validation in parallel. We can activate verbose mode play by play: >>> scores = cross_validation.cross_val_score(rf, X, y, verbose=3, cv=4)[CV] no parameters to be set[CV] no parameters to be set, score=0.872866 - 0.7s[CV] no parameters to be set[CV] no parameters to be set, score=0.873679 - 0.6s[CV] no parameters to be set[CV] no parameters to be set, score=0.878018 - 0.7s[CV] no parameters to be set[CV] no parameters to be set, score=0.871598 - 0.6s[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 0.7s[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 2.6s finished As we can see, during each iteration, we scored the function. We also get an idea of how long the model runs. It's also worth knowing that we can score our function predicated on which kind of model we're trying to fit. Cross validation with ShuffleSplit ShuffleSplit is one of the simplest cross validation techniques. This cross validation technique will simply take a sample of the data for the number of iterations specified. Getting ready ShuffleSplit is another cross validation technique that is very simple. We'll specify the total elements in the dataset, and it will take care of the rest. We'll walk through an example of estimating the mean of a univariate dataset. 
This is somewhat similar to resampling, but it'll illustrate one reason why we want to use cross validation while showing cross validation. How to do it... First, we need to create the dataset. We'll use NumPy to create a dataset, where we know the underlying mean. We'll sample half of the dataset to estimate the mean and see how close it is to the underlying mean: >>> import numpy as np>>> true_loc = 1000>>> true_scale = 10>>> N = 1000>>> dataset = np.random.normal(true_loc, true_scale, N)>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.hist(dataset, color='k', alpha=.65, histtype='stepfilled');>>> ax.set_title("Histogram of dataset");>>> f.savefig("978-1-78398-948-5_06_06.png") NumPy will give the following output: Now, let's take the first half of the data and guess the mean: >>> from sklearn import cross_validation>>> holdout_set = dataset[:500]>>> fitting_set = dataset[500:]>>> estimate = fitting_set[:N/2].mean()>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.set_title("True Mean vs Regular Estimate")>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,             alpha=.65, label='true mean')>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,             alpha=.65, label='regular estimate')>>> ax.set_xlim(999, 1001)>>> ax.legend()>>> f.savefig("978-1-78398-948-5_06_07.png") We'll get the following output: Now, we can use ShuffleSplit to fit the estimator on several smaller datasets: >>> from sklearn.cross_validation import ShuffleSplit>>> shuffle_split = ShuffleSplit(len(fitting_set))>>> mean_p = []>>> for train, _ in shuffle_split:       mean_p.append(fitting_set[train].mean())       shuf_estimate = np.mean(mean_p)>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,             alpha=.65, label='true mean')>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,             alpha=.65, label='regular estimate')>>> ax.vlines(shuf_estimate, 0, 1, color='b', linestyles='-', lw=5,             alpha=.65, label='shufflesplit estimate')>>> ax.set_title("All Estimates")>>> ax.set_xlim(999, 1001)>>> ax.legend(loc=3) The output will be as follows: As we can see, we got an estimate that was similar to what we expected, but we were able to take many samples to get that estimate. Stratified k-fold In this recipe, we'll quickly look at stratified k-fold valuation. We've walked through different recipes where the class representation was unbalanced in some manner. Stratified k-fold is nice because its scheme is specifically designed to maintain the class proportions. Getting ready We're going to create a small dataset. In this dataset, we will then use stratified k-fold validation. We want it small so that we can see the variation. For larger samples. it probably won't be as big of a deal. We'll then plot the class proportions at each step to illustrate how the class proportions are maintained: >>> from sklearn import datasets>>> X, y = datasets.make_classification(n_samples=int(1e3), weights=[1./11]) Let's check the overall class weight distribution: >>> y.mean()0.90300000000000002 Roughly, 90.5 percent of the samples are 1, with the balance 0. How to do it... Let's create a stratified k-fold object and iterate it through each fold. We'll measure the proportion of verse that are 1. After that we'll plot the proportion of classes by the split number to see how and if it changes. 
Stratified k-fold

In this recipe, we'll quickly look at stratified k-fold validation. We've walked through different recipes where the class representation was unbalanced in some manner. Stratified k-fold is nice because its scheme is specifically designed to maintain the class proportions.

Getting ready

We're going to create a small dataset and then use stratified k-fold validation on it. We want it small so that we can see the variation; for larger samples, it probably won't be as big of a deal. We'll then plot the class proportions at each step to illustrate how they are maintained:

>>> from sklearn import datasets
>>> X, y = datasets.make_classification(n_samples=int(1e3), weights=[1./11])

Let's check the overall class weight distribution:

>>> y.mean()
0.90300000000000002

Roughly 90 percent of the samples are 1, with the balance 0.

How to do it...

Let's create a stratified k-fold object and iterate through each fold. We'll measure the proportion of the training labels that are 1. After that, we'll plot the proportion of classes by the split number to see how and if it changes. This code will hopefully illustrate how stratification is beneficial. We'll also compare it against a basic ShuffleSplit:

>>> from sklearn import cross_validation
>>> n_folds = 50
>>> strat_kfold = cross_validation.StratifiedKFold(y,
                  n_folds=n_folds)
>>> shuff_split = cross_validation.ShuffleSplit(n=len(y),
                  n_iter=n_folds)
>>> kfold_y_props = []
>>> shuff_y_props = []
>>> for (k_train, k_test), (s_train, s_test) in zip(strat_kfold,
        shuff_split):
        kfold_y_props.append(y[k_train].mean())
        shuff_y_props.append(y[s_train].mean())

Now, let's plot the proportions over each fold:

>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.plot(range(n_folds), shuff_y_props, label="ShuffleSplit",
            color='k')
>>> ax.plot(range(n_folds), kfold_y_props, label="Stratified",
            color='k', ls='--')
>>> ax.set_title("Comparing class proportions.")
>>> ax.legend(loc='best')

In the resulting plot, we can see that the class proportion in each stratified k-fold training set is stable across folds, while the ShuffleSplit proportions bounce around.

How it works...

Stratified k-fold works by looking at the y values: it first computes the overall proportion of each class, then splits the training and test sets so that those proportions are preserved. This generalizes to more than two classes:

>>> import numpy as np
>>> three_classes = np.random.choice([1,2,3], p=[.1, .4, .5],
                    size=1000)
>>> for train, test in cross_validation.StratifiedKFold(three_classes, 5):
        print np.bincount(three_classes[train])
[  0  90 314 395]
[  0  90 314 395]
[  0  90 314 395]
[  0  91 315 395]
[  0  91 315 396]

As we can see, each training fold contains roughly the same number of samples from each class, in proportion to the overall dataset.
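If we want to put a number on how much stratification helps, we can compare the spread of the two series collected above. This is just a quick check, not part of the original recipe, and it reuses the kfold_y_props and shuff_y_props lists from the preceding code:

>>> import numpy as np
>>> # standard deviation of the class-1 proportion across training folds
>>> print "stratified:   ", np.std(kfold_y_props)
>>> print "shuffle split:", np.std(shuff_y_props)
>>> # the stratified spread should be essentially zero; the shuffled one larger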
Poor man's grid search

In this recipe, we're going to introduce grid search with basic Python, though we will use sklearn for the models and matplotlib for the visualization.

Getting ready

In this recipe, we will perform the following tasks:
1. Design a basic search grid in the parameter space.
2. Iterate through the grid and check the loss/score function at each point in the parameter space for the dataset.
3. Choose the point in the parameter space that minimizes/maximizes the evaluation function.

Also, the model we'll fit is a basic decision tree classifier. Our parameter space will be two dimensional to help us with the visualization: one dimension is the split criterion, {gini, entropy}, and the other is the maximum number of features considered at each split, {auto, log2, None}. The parameter space will then be the Cartesian product of those two sets. We'll see in a bit how we can iterate through this space with itertools.

Let's create the dataset and then get started:

>>> from sklearn import datasets
>>> X, y = datasets.make_classification(n_samples=2000, n_features=10)

How to do it...

Earlier we said that we'd use grid search to tune two parameters: criteria and max_features. We need to represent those as Python sets and then use itertools.product to iterate through them:

>>> criteria = {'gini', 'entropy'}
>>> max_features = {'auto', 'log2', None}
>>> import itertools as it
>>> parameter_space = it.product(criteria, max_features)

Great! So now that we have the parameter space, let's iterate through it and check the accuracy of each model as specified by the parameters. Then, we'll store that accuracy so that we can compare the different parameter settings. We'll also use a 50/50 train/test split:

import numpy as np
train_set = np.random.choice([True, False], size=len(y))
from sklearn.tree import DecisionTreeClassifier
accuracies = {}
for criterion, max_feature in parameter_space:
    dt = DecisionTreeClassifier(criterion=criterion,
                                max_features=max_feature)
    dt.fit(X[train_set], y[train_set])
    accuracies[(criterion, max_feature)] = (dt.predict(X[~train_set])
                                            == y[~train_set]).mean()

>>> accuracies
{('entropy', None): 0.974609375, ('entropy', 'auto'): 0.9736328125,
('entropy', 'log2'): 0.962890625, ('gini', None): 0.9677734375,
('gini', 'auto'): 0.9638671875, ('gini', 'log2'): 0.96875}

So now we have the accuracy for each point in the parameter space. Let's visualize the performance:

>>> from matplotlib import pyplot as plt
>>> from matplotlib import cm
>>> cmap = cm.RdBu_r
>>> f, ax = plt.subplots(figsize=(7, 4))
>>> ax.set_xticklabels([''] + list(criteria))
>>> ax.set_yticklabels([''] + list(max_features))
>>> plot_array = []
>>> for max_feature in max_features:
        m = []
        for criterion in criteria:
            m.append(accuracies[(criterion, max_feature)])
        plot_array.append(m)
>>> colors = ax.matshow(plot_array, vmin=np.min(accuracies.values())
             - 0.001, vmax=np.max(accuracies.values()) + 0.001,
             cmap=cmap)
>>> f.colorbar(colors)

The resulting heatmap makes it fairly easy to see which combination performed best. Hopefully, you can see how this process can be taken further with a brute force method.

How it works...

This works fairly simply; we just have to perform the following steps:
1. Choose a set of parameters.
2. Iterate through them and find the accuracy of each combination.
3. Find the best performer by visual inspection.
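Visual inspection works fine for a two-by-three grid, but we can also pick the winner programmatically from the accuracies dictionary built above. The following lines are just a convenience, not part of the original recipe:

>>> # the key with the highest accuracy is the best (criterion, max_features) pair
>>> best_params = max(accuracies, key=accuracies.get)
>>> print best_params, accuracies[best_params]
('entropy', None) 0.974609375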
Brute force grid search

In this recipe, we'll do an exhaustive grid search through scikit-learn. This is basically the same thing we did in the previous recipe, but we'll utilize built-in methods. We'll also walk through an example of performing randomized optimization, which is an alternative to brute force search: with brute force, we're essentially trading computer cycles to make sure that we search the entire space. The search space was fairly small in the last recipe. However, you could imagine a model that has several steps, first imputation to fix missing data, then PCA to reduce the dimensionality, and finally classification. Your parameter space could get very large very fast; therefore, it can be advantageous to search only a part of that space.

Getting ready

To get started, we'll need to perform the following steps:
1. Create some classification data.
2. Create a LogisticRegression object that will be the model we're fitting.
3. Create the search objects, GridSearchCV and RandomizedSearchCV.

How to do it...

Run the following code to create some classification data:

>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(1000, n_features=5)

Now, we'll create our logistic regression object:

>>> from sklearn.linear_model import LogisticRegression
>>> lr = LogisticRegression(class_weight='auto')

We need to specify the parameters we want to search. For GridSearchCV, we can just specify the ranges that we care about, but for RandomizedSearchCV, we need to specify a distribution over the same space from which to sample:

>>> lr.fit(X, y)
LogisticRegression(C=1.0, class_weight={0: 0.25, 1: 0.75},
                   dual=False, fit_intercept=True,
                   intercept_scaling=1, penalty='l2',
                   random_state=None, tol=0.0001)
>>> grid_search_params = {'penalty': ['l1', 'l2'],
                          'C': [1, 2, 3, 4]}

The only change we need to make is to describe the C parameter as a probability distribution. We'll keep it simple right now, though we will use scipy to describe the distribution:

>>> import scipy.stats as st
>>> import numpy as np
>>> random_search_params = {'penalty': ['l1', 'l2'],
                            'C': st.randint(1, 4)}

How it works...

Now, we'll fit the classifier. This works by passing lr to the parameter search objects:

>>> from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
>>> gs = GridSearchCV(lr, grid_search_params)

GridSearchCV implements the same API as the other models:

>>> gs.fit(X, y)
GridSearchCV(cv=None, estimator=LogisticRegression(C=1.0,
             class_weight='auto', dual=False, fit_intercept=True,
             intercept_scaling=1, penalty='l2', random_state=None,
             tol=0.0001), fit_params={}, iid=True, loss_func=None,
             n_jobs=1, param_grid={'penalty': ['l1', 'l2'], 'C':
             [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True,
             score_func=None, scoring=None, verbose=0)

As we can see in the param_grid parameter, our penalty and C candidates are both lists.

To access the scores, we can use the grid_scores_ attribute of the grid search. We also want to find the optimal set of parameters, and we can look at the marginal performance of the grid search:

>>> gs.grid_scores_
[mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 1},
 mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 2},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 2},
 mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 3},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 3},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l1', 'C': 4},
 mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 4}]

We might want to get the maximum score:

>>> gs.grid_scores_[1][1]
0.90100000000000002
>>> max(gs.grid_scores_, key=lambda x: x[1])
mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1}

The parameters obtained this way are the best choices for our logistic regression.
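The random_search_params dictionary we defined above can be used in much the same way. The following is a sketch rather than part of the original recipe: the n_iter value is an arbitrary choice, and best_params_ and best_score_ are the standard attributes both search objects expose for retrieving the winner:

>>> # randomized search samples parameter settings instead of trying them all
>>> rs = RandomizedSearchCV(lr, random_search_params, n_iter=5)
>>> rs.fit(X, y)
>>> print rs.best_params_, rs.best_score_
>>> # the grid search object offers the same shortcut, equivalent to the max() above
>>> print gs.best_params_, gs.best_score_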
Using dummy estimators to compare results

This recipe is about creating fake estimators; this isn't pretty or exciting stuff, but it is worthwhile to have a reference point for the model you'll eventually build.

Getting ready

In this recipe, we'll perform the following tasks:
1. Create some random data.
2. Fit the various dummy estimators.

We'll perform these two steps for regression data and classification data.

How to do it...

First, we'll create the random data:

>>> from sklearn.datasets import make_regression, make_classification
# classification is for later
>>> X, y = make_regression()
>>> from sklearn import dummy
>>> dumdum = dummy.DummyRegressor()
>>> dumdum.fit(X, y)
DummyRegressor(constant=None, strategy='mean')

By default, the estimator will predict by just taking the mean of the target values:

>>> dumdum.predict(X)[:5]
array([ 2.23297907, 2.23297907, 2.23297907, 2.23297907, 2.23297907])

There are two other strategies we can try. We can predict a supplied constant (refer to constant=None in the preceding output), and we can also predict the median value. Supplying a constant will only be considered if strategy is "constant". Let's have a look:

>>> predictors = [("mean", None),
                  ("median", None),
                  ("constant", 10)]
>>> for strategy, constant in predictors:
        dumdum = dummy.DummyRegressor(strategy=strategy,
                                      constant=constant)
        dumdum.fit(X, y)
        print "strategy: {}".format(strategy), ",".join(map(str,
              dumdum.predict(X)[:5]))
strategy: mean 2.23297906733,2.23297906733,2.23297906733,2.23297906733,2.23297906733
strategy: median 20.38535248,20.38535248,20.38535248,20.38535248,20.38535248
strategy: constant 10.0,10.0,10.0,10.0,10.0

We actually have four options for classifiers. These strategies are similar to the continuous case, just slanted toward classification problems:

>>> predictors = [("constant", 0),
                  ("stratified", None),
                  ("uniform", None),
                  ("most_frequent", None)]

We'll also need to create some classification data:

>>> X, y = make_classification()
>>> for strategy, constant in predictors:
        dumdum = dummy.DummyClassifier(strategy=strategy,
                                       constant=constant)
        dumdum.fit(X, y)
        print "strategy: {}".format(strategy), ",".join(map(str,
              dumdum.predict(X)[:5]))
strategy: constant 0,0,0,0,0
strategy: stratified 1,0,0,1,0
strategy: uniform 0,0,0,1,1
strategy: most_frequent 1,1,1,1,1

How it works...

It's always good to test your models against the simplest models, and that's exactly what the dummy estimators give you. For example, imagine a fraud model where only 5 percent of the dataset is fraud. We can probably fit a pretty good-looking model just by never guessing any fraud. We can create that model by using the most_frequent strategy, as in the following code, which also gives a good example of why class imbalance causes problems:

>>> X, y = make_classification(20000, weights=[.95, .05])
>>> dumdum = dummy.DummyClassifier(strategy='most_frequent')
>>> dumdum.fit(X, y)
DummyClassifier(constant=None, random_state=None,
                strategy='most_frequent')
>>> from sklearn.metrics import accuracy_score
>>> print accuracy_score(y, dumdum.predict(X))
0.94575

We were actually correct very often, but that's not the point. The point is that this is our baseline. If we cannot create a model for fraud that is more accurate than this, then it isn't worth our time.

Summary

This article showed how to evaluate a basic model with different cross validation strategies, tune it with both a hand-rolled and a built-in grid search (including randomized search), and benchmark it against dummy estimators, so that we can achieve better results than we could with the basic model alone.