
How-To Tutorials - Data

1210 Articles
Machine Learning Models

Packt
16 Aug 2017
8 min read
In this article by Pratap Dangeti, the author of the book Statistics for Machine Learning, we will take a look at ridge regression and lasso regression in machine learning. (For more resources related to this topic, see here.)

Ridge regression and lasso regression

In linear regression, only the residual sum of squares (RSS) is minimized, whereas in ridge and lasso regression a penalty (also known as a shrinkage penalty) is applied to the coefficient values in order to regularize the coefficients with the tuning parameter λ. When λ = 0 the penalty has no impact and ridge/lasso produces the same result as linear regression, whereas λ → ∞ shrinks the coefficients towards zero. Before we go deeper into ridge and lasso, it is worth understanding some concepts about Lagrangian multipliers. The penalized objective can also be written in an equivalent constrained form, where the objective is just the RSS subject to a cost constraint (a budget s). For every value of λ there is some s that gives an equivalent solution. The two forms of the overall objective are:

Ridge: minimize RSS + λ Σj βj², or equivalently minimize RSS subject to Σj βj² ≤ s
Lasso: minimize RSS + λ Σj |βj|, or equivalently minimize RSS subject to Σj |βj| ≤ s

Ridge regression works well in situations where the least squares estimates have high variance. Ridge regression also has computational advantages over best subset selection, which requires fitting 2^p models; in contrast, for any fixed value of λ, ridge regression fits only a single model, and the model-fitting procedure can be performed very quickly. One disadvantage of ridge regression is that it includes all the predictors and shrinks their weights according to their importance, but it never sets coefficient values exactly to zero to eliminate unnecessary predictors from the model; this issue is overcome in lasso regression. When the number of predictors is significantly large, ridge may provide good accuracy, but it keeps all the variables, which is not desirable in a compact representation of the model; this issue does not arise in lasso, as it sets the weights of unnecessary variables to zero. Models generated by lasso are very much like those from subset selection and hence much easier to interpret than those produced by ridge regression.
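To make the difference concrete, here is a small illustration that is not part of the original article: it fits ridge and lasso on synthetic data (generated with scikit-learn's make_regression, so the numbers are arbitrary) and prints the coefficients as the penalty grows. Ridge shrinks every coefficient smoothly, while lasso drives some of them exactly to zero:

>>> import numpy as np
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import Ridge, Lasso
>>> X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
>>> for lam in [0.01, 1.0, 10.0, 100.0]:
...     ridge_coef = Ridge(alpha=lam).fit(X, y).coef_   # all coefficients shrink, none become exactly zero
...     lasso_coef = Lasso(alpha=lam).fit(X, y).coef_   # some coefficients are set exactly to zero
...     print(lam, np.round(ridge_coef, 2), np.round(lasso_coef, 2))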
Example of ridge regression machine learning model

Ridge regression is a machine learning model in which we do not perform any statistical diagnostics on the independent variables; we just utilize the model to fit on the test data and check the accuracy of the fit. Here we have used the scikit-learn package:

>>> import pandas as pd
>>> from sklearn.linear_model import Ridge
>>> from sklearn.model_selection import train_test_split
>>> wine_quality = pd.read_csv("winequality-red.csv", sep=';')
>>> wine_quality.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
>>> all_colnms = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol']
>>> pdx = wine_quality[all_colnms]
>>> pdy = wine_quality["quality"]
>>> x_train,x_test,y_train,y_test = train_test_split(pdx,pdy,train_size = 0.7,random_state=42)

A simple version of grid search from scratch is described as follows, in which various values of alpha are tried in order to test the model fitness:

>>> alphas = [1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0]

The initial value of R-squared is set to zero in order to keep track of its updated value and to print whenever the new value is greater than the existing value:

>>> initrsq = 0
>>> print ("\nRidge Regression: Best Parameters\n")
>>> for alph in alphas:
...     ridge_reg = Ridge(alpha=alph)
...     ridge_reg.fit(x_train,y_train)
...     tr_rsqrd = ridge_reg.score(x_train,y_train)
...     ts_rsqrd = ridge_reg.score(x_test,y_test)

The following code always keeps track of the test R-squared value and prints if the new value is greater than the existing best value:

...     if ts_rsqrd > initrsq:
...         print ("Lambda: ",alph,"Train R-Squared value:",round(tr_rsqrd,5),"Test R-squared value:",round(ts_rsqrd,5))
...         initrsq = ts_rsqrd

By looking at the test R-squared value (0.3513) in the resulting output, we can conclude that there is no significant relationship between the independent and dependent variables. Also, note that the test R-squared value generated from ridge regression is similar to the value obtained from multiple linear regression (0.3519), but with no stress on the diagnostics of variables, and so on. Hence machine learning models are relatively compact and can be utilized for learning automatically without manual intervention to retrain the model; this is one of the biggest advantages of using ML models for deployment purposes.

The R code for ridge regression on the wine quality data is shown as follows:

# Ridge regression
library(glmnet)
wine_quality = read.csv("winequality-red.csv",header=TRUE,sep = ";",check.names = FALSE)
names(wine_quality) <- gsub(" ", "_", names(wine_quality))
set.seed(123)
numrow = nrow(wine_quality)
trnind = sample(1:numrow,size = as.integer(0.7*numrow))
train_data = wine_quality[trnind,]; test_data = wine_quality[-trnind,]
xvars = c("fixed_acidity","volatile_acidity","citric_acid","residual_sugar","chlorides","free_sulfur_dioxide","total_sulfur_dioxide","density","pH","sulphates","alcohol")
yvar = "quality"
x_train = as.matrix(train_data[,xvars]); y_train = as.double(as.matrix(train_data[,yvar]))
x_test = as.matrix(test_data[,xvars])
print(paste("Ridge Regression"))
lambdas = c(1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0)
initrsq = 0
for (lmbd in lambdas){
  ridge_fit = glmnet(x_train,y_train,alpha = 0,lambda = lmbd)
  pred_y = predict(ridge_fit,x_test)
  R2 <- 1 - (sum((test_data[,yvar]-pred_y)^2)/sum((test_data[,yvar]-mean(test_data[,yvar]))^2))
  if (R2 > initrsq){
    print(paste("Lambda:",lmbd,"Test Adjusted R-squared :",round(R2,4)))
    initrsq = R2
  }
}
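As a side note that is not part of the original excerpt, scikit-learn also ships cross-validated estimators that search over a list of penalty values for you. Assuming the same x_train and y_train as above, a minimal sketch would be:

>>> from sklearn.linear_model import RidgeCV, LassoCV
>>> cv_alphas = [1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0]
>>> ridge_cv = RidgeCV(alphas=cv_alphas).fit(x_train, y_train)
>>> lasso_cv = LassoCV(alphas=cv_alphas, cv=5).fit(x_train, y_train)
>>> print("Best ridge alpha:", ridge_cv.alpha_, "Best lasso alpha:", lasso_cv.alpha_)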
Example of lasso regression model

Lasso regression is a close cousin of ridge regression, in which the absolute values of the coefficients are penalized rather than their squares. By doing so, we eliminate some insignificant variables and obtain a much more compact representation, similar to subset-selection (OLS) methods. The following implementation is almost identical to ridge regression, apart from the penalty being applied to the modulus/absolute value of the coefficients:

>>> from sklearn.linear_model import Lasso
>>> alphas = [1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0]
>>> initrsq = 0
>>> print ("\nLasso Regression: Best Parameters\n")
>>> for alph in alphas:
...     lasso_reg = Lasso(alpha=alph)
...     lasso_reg.fit(x_train,y_train)
...     tr_rsqrd = lasso_reg.score(x_train,y_train)
...     ts_rsqrd = lasso_reg.score(x_test,y_test)
...     if ts_rsqrd > initrsq:
...         print ("Lambda: ",alph,"Train R-Squared value:",round(tr_rsqrd,5),"Test R-squared value:",round(ts_rsqrd,5))
...         initrsq = ts_rsqrd

Lasso regression produces results that are almost similar to ridge, but if we check the test R-squared values a bit carefully, lasso produces slightly lower values. The reason could be its robustness in reducing coefficients to zero and eliminating them from the analysis:

>>> ridge_reg = Ridge(alpha=0.001)
>>> ridge_reg.fit(x_train,y_train)
>>> print ("\nRidge Regression coefficient values of Alpha = 0.001\n")
>>> for i in range(11):
...     print (all_colnms[i],": ",ridge_reg.coef_[i])
>>> lasso_reg = Lasso(alpha=0.001)
>>> lasso_reg.fit(x_train,y_train)
>>> print ("\nLasso Regression coefficient values of Alpha = 0.001\n")
>>> for i in range(11):
...     print (all_colnms[i],": ",lasso_reg.coef_[i])

The results show the coefficient values of both methods: the coefficient of density has been set to 0 in lasso regression, whereas its value is -5.5672 in ridge regression; also, none of the coefficients in ridge regression are zero.

R Code – Lasso Regression on Wine Quality Data

# The data processing steps above are the same as for ridge regression;
# only the section of code below changes.
# Lasso Regression
print(paste("Lasso Regression"))
lambdas = c(1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0)
initrsq = 0
for (lmbd in lambdas){
  lasso_fit = glmnet(x_train,y_train,alpha = 1,lambda = lmbd)
  pred_y = predict(lasso_fit,x_test)
  R2 <- 1 - (sum((test_data[,yvar]-pred_y)^2)/sum((test_data[,yvar]-mean(test_data[,yvar]))^2))
  if (R2 > initrsq){
    print(paste("Lambda:",lmbd,"Test Adjusted R-squared :",round(R2,4)))
    initrsq = R2
  }
}

Regularization parameters in linear regression and ridge/lasso regression

The adjusted R-squared in linear regression always penalizes the addition of extra, less significant variables; this is one way of regularizing in linear regression, but it yields a single, unique fit of the model. In machine learning, by contrast, many parameters can be adjusted to regularize against the overfitting problem; in the lasso/ridge example, the penalty parameter (λ) used for regularization can take infinitely many values, so the model can be regularized in infinitely many ways. Overall, there are many similarities between the statistical and machine learning ways of predicting a pattern.

Summary

We have seen ridge regression and lasso regression with examples, and we have also looked at their regularization parameters.

Resources for Article:

Further resources on this subject:
Machine Learning Review [article]
Getting Started with Python and Machine Learning [article]
Machine learning in practice [article]

EU approves labour protection laws for ‘Whistleblowers’ and ‘Gig economy’ workers with implications for tech companies

Savia Lobo
17 Apr 2019
5 min read
The European Union approved two new labour protection laws recently. This time they cover two less-hyped groups: whistleblowers, and those earning their income via the ‘gig economy’. Under the new rules, whistleblowers receive increased protection through landmark legislation aimed at encouraging reports of wrongdoing. For those working ‘on-demand’ jobs, that is, in the gig economy, the law sets minimum rights and demands increased transparency for such workers. Let’s have a brief look at each of the laws newly approved by the EU.

Whistleblowers’ shield against retaliation

On Tuesday, the EU parliament approved a new law for whistleblowers, safeguarding them from any retaliation within an organization. The law protects whistleblowers against dismissal, demotion and other forms of punishment. “The law now needs to be approved by EU ministers. Member states will then have two years to comply with the rules”, the EU proposal states. Transparency International calls this “pathbreaking legislation”, which will also give employees "greater legal certainty around their rights and obligations".

The new law creates a safe channel which allows whistleblowers to report a breach of EU law both within an organization and to public authorities. “It is the first time whistleblowers have been given EU-wide protection. The law was approved by 591 votes, with 29 votes against and 33 abstentions”, the BBC reports. In cases where no appropriate action is taken by the organization’s authorities even after reporting, whistleblowers are allowed to make a public disclosure of the wrongdoing by communicating with the media.

European Commission Vice President Frans Timmermans says, “potential whistleblowers are often discouraged from reporting their concerns or suspicions for fear of retaliation. We should protect whistleblowers from being punished, sacked, demoted or sued in court for doing the right thing for society.” He further added, “This will help tackle fraud, corruption, corporate tax avoidance and damage to people's health and the environment.”

“The European Commission says just 10 members - France, Hungary, Ireland, Italy, Lithuania, Malta, the Netherlands, Slovakia, Sweden, and the UK - had a "comprehensive law" protecting whistleblowers”, the BBC reports. “Attempts by some states to water down the reform earlier this year were blocked at an early stage of the talks, with Luxembourg, Ireland, and Hungary seeking to have tax matters excluded. However, a coalition of EU states, including Germany, France, and Italy, eventually prevailed in keeping tax revelations within the proposal”, Reuters reports. “If member states fail to properly implement the law, the European Commission can take formal disciplinary steps against the country and could ultimately refer the case to the European Court of Justice”, the BBC reports.

To know more about this new law for whistleblowers, read the official proposal.

EU grants protection to workers in the gig economy (casual or short-term employment)

In a vote on Tuesday, Members of the European Parliament (MEPs) approved minimum rights for workers with on-demand, voucher-based or platform jobs, such as Uber or Deliveroo. However, genuinely self-employed workers would be excluded from the new rules. “The law states that every person who has an employment contract or employment relationship as defined by law, collective agreements or practice in force in each member state should be covered by these new rights”, BBC reports.
“This would mean that workers in casual or short-term employment, on-demand workers, intermittent workers, voucher-based workers, platform workers, as well as paid trainees and apprentices, deserve a set of minimum rights, as long as they meet these criteria and pass the threshold of working 3 hours per week and 12 hours per 4 weeks on average”, according to the EU’s official website. For this, all workers need to be informed of their conditions from day one as a general principle, but no later than seven days where justified.

The specific set of rights covering new forms of employment includes the following:

Workers with on-demand contracts or similar forms of employment should benefit from a minimum level of predictability, such as predetermined reference hours and reference days. They should also be able to refuse, without consequences, an assignment outside predetermined hours, or be compensated if the assignment is not cancelled in time.

Member states shall adopt measures to prevent abusive practices, such as limits on the use and duration of such contracts.

The employer should not prohibit, penalize or hinder workers from taking jobs with other companies if this falls outside the work schedule established with that employer.

Enrique Calvet Chambon, the MEP responsible for seeing the law through, said, “This directive is the first big step towards the implementation of the European Pillar of Social Rights, affecting all EU workers. All workers who have been in limbo will now be granted minimum rights thanks to this directive, and the European Court of Justice rulings; from now on no employer will be able to abuse the flexibility in the labour market.”

To know more about this new law on the gig economy, visit the EU’s official website.

19 nations including the UK and Germany give thumbs-up to EU’s Copyright Directive
Facebook discussions with the EU resulted in changes of its terms and services for users
The EU commission introduces guidelines for achieving a ‘Trustworthy AI’

Learning How to Classify with Real-world Examples

Packt
22 Aug 2013
24 min read
(For more resources related to this topic, see here.)

The Iris dataset

The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The setting is that of Iris flowers, of which there are multiple species that can be identified by their morphology. Today, the species would be defined by their genomic signatures, but in the 1930s, DNA had not even been identified as the carrier of genetic information. The following four attributes of each plant were measured:

Sepal length
Sepal width
Petal length
Petal width

In general, we will call any measurements from our data features. Additionally, for each plant, the species was recorded. The question now is: if we saw a new flower out in the field, could we make a good prediction about its species from its measurements? This is the supervised learning or classification problem; given labeled examples, we can design a rule that will eventually be applied to other examples. This is the same setting that is used for spam classification; given the examples of spam and ham (non-spam e-mail) that the user gave the system, can we determine whether a new, incoming message is spam or not? For the moment, the Iris dataset serves our purposes well. It is small (150 examples, 4 features each) and can easily be visualized and manipulated.

The first step is visualization

Because this dataset is so small, we can easily plot all of the points and all two-dimensional projections on a page. We will thus build intuitions that can then be extended to datasets with many more dimensions and datapoints. Each subplot in the following screenshot shows all the points projected into two of the dimensions. The outlying group (triangles) are the Iris Setosa plants, while Iris Versicolor plants are in the center (circle) and Iris Virginica are indicated with "x" marks. We can see that there are two large groups: one is of Iris Setosa and another is a mixture of Iris Versicolor and Iris Virginica. We are using Matplotlib; it is the most well-known plotting package for Python. We present the code to generate the top-left plot. The code for the other plots is similar to the following code:

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np

# We load the data with load_iris from sklearn
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']

for t,marker,c in zip(xrange(3),">ox","rgb"):
    # We plot each class on its own to get different colored markers
    plt.scatter(features[target == t,0],
                features[target == t,1],
                marker=marker,
                c=c)

Building our first classification model

If the goal is to separate the three types of flower, we can immediately make a few suggestions. For example, the petal length seems to be able to separate Iris Setosa from the other two flower species on its own. We can write a little bit of code to discover where the cutoff is as follows:

plength = features[:, 2]
# use numpy operations to get setosa features
# (labels holds the species name of each example as a string)
is_setosa = (labels == 'setosa')
# This is the important step:
max_setosa = plength[is_setosa].max()
min_non_setosa = plength[~is_setosa].min()
print('Maximum of setosa: {0}.'.format(max_setosa))
print('Minimum of others: {0}.'.format(min_non_setosa))

This prints 1.9 and 3.0. Therefore, we can build a simple model: if the petal length is smaller than two, this is an Iris Setosa flower; otherwise, it is either Iris Virginica or Iris Versicolor.
if features[:,2] < 2:
    print 'Iris Setosa'
else:
    print 'Iris Virginica or Iris Versicolour'

This is our first model, and it works very well in that it separates the Iris Setosa flowers from the other two species without making any mistakes. What we had here was a simple structure; a simple threshold on one of the dimensions. Then we searched for the best dimension threshold. We performed this visually and with some calculation; machine learning happens when we write code to perform this for us. The example where we distinguished Iris Setosa from the other two species was very easy. However, we cannot immediately see what the best threshold is for distinguishing Iris Virginica from Iris Versicolor. We can even see that we will never achieve perfect separation. We can, however, try to do it the best possible way. For this, we will perform a little computation. We first select only the non-Setosa features and labels:

features = features[~is_setosa]
labels = labels[~is_setosa]
virginica = (labels == 'virginica')

Here we are heavily using NumPy operations on the arrays. is_setosa is a Boolean array, and we use it to select a subset of the other two arrays, features and labels. Finally, we build a new Boolean array, virginica, using an equality comparison on labels. Now, we run a loop over all possible features and thresholds to see which one results in better accuracy. Accuracy is simply the fraction of examples that the model classifies correctly:

best_acc = -1.0
for fi in xrange(features.shape[1]):
    # We are going to generate all possible thresholds for this feature
    thresh = features[:,fi].copy()
    thresh.sort()
    # Now test all thresholds:
    for t in thresh:
        pred = (features[:,fi] > t)
        acc = (pred == virginica).mean()
        if acc > best_acc:
            best_acc = acc
            best_fi = fi
            best_t = t

The last few lines select the best model. First we compare the predictions, pred, with the actual labels, virginica. The little trick of computing the mean of the comparisons gives us the fraction of correct results, the accuracy. At the end of the for loop, all possible thresholds for all possible features have been tested, and the best_fi and best_t variables hold our model. To apply it to a new example, we perform the following:

if example[best_fi] > best_t:
    print 'virginica'
else:
    print 'versicolor'

What does this model look like? If we run it on the whole data, the best model that we get is split on the petal length. We can visualize the decision boundary. In the following screenshot, we see two regions: one is white and the other is shaded in grey. Anything that falls in the white region will be called Iris Virginica and anything that falls on the shaded side will be classified as Iris Versicolor. In a threshold model, the decision boundary will always be a line that is parallel to one of the axes. The plot in the preceding screenshot shows the decision boundary and the two regions where the points are classified as either white or grey. It also shows (as a dashed line) an alternative threshold that will achieve exactly the same accuracy. Our method chose the first threshold, but that was an arbitrary choice.

Evaluation – holding out data and cross-validation

The model discussed in the preceding section is a simple model; it achieves 94 percent accuracy on its training data. However, this evaluation may be overly optimistic. We used the data to define what the threshold would be, and then we used the same data to evaluate the model. Of course, the model will perform better than anything else we have tried on this dataset.
The logic is circular. What we really want to do is estimate the ability of the model to generalize to new instances. We should measure its performance in instances that the algorithm has not seen at training. Therefore, we are going to do a more rigorous evaluation and use held-out data. For this, we are going to break up the data into two blocks: on one block, we'll train the model, and on the other—the one we held out of training—we'll test it. The output is as follows:

Training accuracy was 96.0%.
Testing accuracy was 90.0% (N = 50).

The accuracy on the testing data is lower than that on the training data. This may surprise an inexperienced machine learner, but it is expected and typical. To see why, look back at the plot that showed the decision boundary. See if some of the examples close to the boundary were not there or if one of the ones in between the two lines was missing. It is easy to imagine that the boundary would then move a little bit to the right or to the left so as to put them on the "wrong" side of the border. The error on the training data is called a training error and is always an overly optimistic estimate of how well your algorithm is doing. We should always measure and report the testing error; the error on a collection of examples that were not used for training. These concepts will become more and more important as the models become more complex. In this example, the difference between the two errors is not very large. When using a complex model, it is possible to get 100 percent accuracy in training and do no better than random guessing on testing!

One possible problem with what we did previously, which was to hold off data from training, is that we only used part of the data (in this case, we used half of it) for training. On the other hand, if we use too little data for testing, the error estimation is performed on a very small number of examples. Ideally, we would like to use all of the data for training and all of the data for testing as well. We can achieve something quite similar by cross-validation. One extreme (but sometimes useful) form of cross-validation is leave-one-out. We will take an example out of the training data, learn a model without this example, and then see if the model classifies this example correctly:

error = 0.0
for ei in range(len(features)):
    # select all but the one at position 'ei':
    training = np.ones(len(features), bool)
    training[ei] = False
    testing = ~training
    model = learn_model(features[training], virginica[training])
    predictions = apply_model(features[testing], virginica[testing], model)
    error += np.sum(predictions != virginica[testing])

At the end of this loop, we will have tested a series of models on all the examples. However, there is no circularity problem because each example was tested on a model that was built without taking that example into account. Therefore, the overall estimate is a reliable estimate of how well the models would generalize. The major problem with leave-one-out cross-validation is that we are now being forced to perform 100 times more work. In fact, we must learn a whole new model for each and every example, and this will grow as our dataset grows. We can get most of the benefits of leave-one-out at a fraction of the cost by using x-fold cross-validation; here, "x" stands for a small number, say, five. In order to perform five-fold cross-validation, we break up the data into five groups, that is, five folds. Then we learn five models, leaving one fold out of each.
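The book implements this with its own learn_model and apply_model helpers, which are not shown in this excerpt. Purely as an illustration (my own sketch, not the author's code), the same five-fold idea can be written compactly for the Virginica-versus-Versicolor threshold model, using scikit-learn's StratifiedKFold to generate balanced folds:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

data = load_iris()
features, target = data['data'], data['target']
features, target = features[target != 0], target[target != 0]   # drop Setosa, keep the harder problem
virginica = (target == 2)

def learn_threshold(feats, labels):
    # exhaustive search over (feature, threshold) pairs, as in the threshold model above
    best = (-1.0, 0, 0.0)                     # (accuracy, feature index, threshold)
    for fi in range(feats.shape[1]):
        for t in np.sort(feats[:, fi]):
            acc = ((feats[:, fi] > t) == labels).mean()
            best = max(best, (acc, fi, t))
    return best[1], best[2]

accs = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(features, virginica):
    fi, t = learn_threshold(features[train_idx], virginica[train_idx])
    pred = features[test_idx, fi] > t
    accs.append((pred == virginica[test_idx]).mean())
print('Mean five-fold accuracy: {:.1%}'.format(np.mean(accs)))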
The resulting code will be similar to the code given earlier in this section, but here we leave 20 percent of the data out instead of just one element. We test each of these models on the left-out fold and average the results. The preceding figure illustrates this process for five blocks; the dataset is split into five pieces. Then for each fold, you hold out one of the blocks for testing and train on the other four. You can use any number of folds you wish. Five or ten folds is typical; it corresponds to training with 80 or 90 percent of your data and should already be close to what you would get from using all the data. In an extreme case, if you have as many folds as datapoints, you can simply perform leave-one-out cross-validation. When generating the folds, you need to be careful to keep them balanced. For example, if all of the examples in one fold come from the same class, the results will not be representative. We will not go into the details of how to do this because the machine learning packages will handle it for you.

We have now generated several models instead of just one. So, what final model do we return and use for the new data? The simplest solution is now to use a single overall model on all your training data. The cross-validation loop gives you an estimate of how well this model should generalize. A cross-validation schedule allows you to use all your data to estimate if your methods are doing well. At the end of the cross-validation loop, you can use all your data to train a final model. Although it was not properly recognized when machine learning was starting out, nowadays it is seen as a very bad sign to even discuss the training error of a classification system. This is because the results can be very misleading. We always want to measure and compare either the error on a held-out dataset or the error estimated using a cross-validation schedule.

Building more complex classifiers

In the previous section, we used a very simple model: a threshold on one of the dimensions. Throughout this article, you will see many other types of models, and we're not even going to cover everything that is out there. What makes up a classification model? We can break it up into three parts:

The structure of the model: In this, we use a threshold on a single feature.
The search procedure: In this, we try every possible combination of feature and threshold.
The loss function: Using the loss function, we decide which of the possibilities is less bad (because we can rarely talk about the perfect solution). We can use the training error or just define this point the other way around and say that we want the best accuracy. Traditionally, people want the loss function to be minimum.

We can play around with these parts to get different results. For example, we can attempt to build a threshold that achieves minimal training error, but we will only test three values for each feature: the mean value of the features, the mean plus one standard deviation, and the mean minus one standard deviation. This could make sense if testing each value was very costly in terms of computer time (or if we had millions and millions of datapoints). Then the exhaustive search we used would be infeasible, and we would have to perform an approximation like this. Alternatively, we might have different loss functions. It might be that one type of error is much more costly than another. In a medical setting, false negatives and false positives are not equivalent.
A false negative (when the result of a test comes back negative, but that is false) might lead to the patient not receiving treatment for a serious disease. A false positive (when the test comes back positive even though the patient does not actually have that disease) might lead to additional tests for confirmation purposes or unnecessary treatment (which can still have costs, including side effects from the treatment). Therefore, depending on the exact setting, different trade-offs can make sense. At one extreme, if the disease is fatal and treatment is cheap with very few negative side effects, you want to minimize the false negatives as much as you can. With spam filtering, we may face the same problem; incorrectly deleting a non-spam e-mail can be very dangerous for the user, while letting a spam e-mail through is just a minor annoyance. What the cost function should be is always dependent on the exact problem you are working on. When we present a general-purpose algorithm, we often focus on minimizing the number of mistakes (achieving the highest accuracy). However, if some mistakes are more costly than others, it might be better to accept a lower overall accuracy to minimize overall costs. Finally, we can also have other classification structures. A simple threshold rule is very limiting and will only work in the very simplest cases, such as with the Iris dataset.

A more complex dataset and a more complex classifier

We will now look at a slightly more complex dataset. This will motivate the introduction of a new classification algorithm and a few other ideas.

Learning about the Seeds dataset

We will now look at another agricultural dataset; it is still small, but now too big to comfortably plot exhaustively as we did with Iris. This is a dataset of the measurements of wheat seeds. Seven features are present, as follows:

Area (A)
Perimeter (P)
Compactness (C = 4πA/P²)
Length of kernel
Width of kernel
Asymmetry coefficient
Length of kernel groove

There are three classes that correspond to three wheat varieties: Canadian, Kama, and Rosa. As before, the goal is to be able to classify the species based on these morphological measurements. Unlike the Iris dataset, which was collected in the 1930s, this is a very recent dataset, and its features were automatically computed from digital images. This is how image pattern recognition can be implemented: you can take images in digital form, compute a few relevant features from them, and use a generic classification system. Later, we will work through the computer vision side of this problem and compute features in images. For the moment, we will work with the features that are given to us.

UCI Machine Learning Dataset Repository

The University of California at Irvine (UCI) maintains an online repository of machine learning datasets (at the time of writing, they are listing 233 datasets). Both the Iris and Seeds datasets used in this article were taken from there. The repository is available online: http://archive.ics.uci.edu/ml/

Features and feature engineering

One interesting aspect of these features is that the compactness feature is not actually a new measurement, but a function of the previous two features, area and perimeter. It is often very useful to derive new combined features. This is a general area normally termed feature engineering; it is sometimes seen as less glamorous than algorithms, but it may matter more for performance (a simple algorithm on well-chosen features will perform better than a fancy algorithm on not-so-good features).
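As a tiny illustration of such a derived feature (the numbers below are made up, not taken from the Seeds data), compactness can be computed directly from area and perimeter:

import numpy as np

area = np.array([12.5, 14.1, 18.9])        # hypothetical kernel areas
perimeter = np.array([13.0, 14.2, 16.5])   # hypothetical kernel perimeters
# Standard roundness measure: close to 1 for round kernels, smaller for elongated ones
compactness = 4 * np.pi * area / perimeter ** 2
print(compactness)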
In this case, the original researchers computed the "compactness", which is a typical feature for shapes (also called "roundness"). This feature will have the same value for two kernels, one of which is twice as big as the other one, but with the same shape. However, it will have different values for kernels that are very round (when the feature is close to one) as compared to kernels that are elongated (when the feature is close to zero). The goals of a good feature are to simultaneously vary with what matters and be invariant with what does not. For example, compactness does not vary with size but varies with the shape. In practice, it might be hard to achieve both objectives perfectly, but we want to approximate this ideal. You will need to use background knowledge to intuit which will be good features. Fortunately, for many problem domains, there is already a vast literature of possible features and feature types that you can build upon. For images, all of the previously mentioned features are typical, and computer vision libraries will compute them for you. In text-based problems too, there are standard solutions that you can mix and match. Often though, you can use your knowledge of the specific problem to design a specific feature. Even before you have data, you must decide which data is worthwhile to collect. Then, you need to hand all your features to the machine to evaluate and compute the best classifier.

A natural question is whether or not we can select good features automatically. This problem is known as feature selection. There are many methods that have been proposed for this problem, but in practice, very simple ideas work best. It does not make sense to use feature selection in these small problems, but if you had thousands of features, throwing out most of them might make the rest of the process much faster.

Nearest neighbor classification

With this dataset, even if we just try to separate two classes using the previous method, we do not get very good results. Let me introduce, therefore, a new classifier: the nearest neighbor classifier. If we consider that each example is represented by its features (in mathematical terms, as a point in N-dimensional space), we can compute the distance between examples. We can choose different ways of computing the distance, for example:

def distance(p0, p1):
    'Computes squared euclidean distance'
    return np.sum( (p0-p1)**2)

Now when classifying, we adopt a simple rule: given a new example, we look at the dataset for the point that is closest to it (its nearest neighbor) and look at its label:

def nn_classify(training_set, training_labels, new_example):
    dists = np.array([distance(t, new_example) for t in training_set])
    nearest = dists.argmin()
    return training_labels[nearest]

In this case, our model involves saving all of the training data and labels and computing everything at classification time. A better implementation would be to actually index these at learning time to speed up classification, but that indexing requires a more complex algorithm. Now, note that this model performs perfectly on its training data! For each point, its closest neighbor is itself, and so its label matches perfectly (unless two examples have exactly the same features but different labels, which can happen). Therefore, it is essential to test using a cross-validation protocol. Using ten folds for cross-validation for this dataset with this algorithm, we obtain 88 percent accuracy.
As we discussed in the earlier section, the cross-validation accuracy is lower than the training accuracy, but this is a more credible estimate of the performance of the model. We will now examine the decision boundary. For this, we will be forced to simplify and look at only two dimensions (just so that we can plot it on paper). In the preceding screenshot, the Canadian examples are shown as diamonds, Kama seeds as circles, and Rosa seeds as triangles. Their respective areas are shown as white, black, and grey. You might be wondering why the regions are so horizontal, almost weirdly so. The problem is that the x axis (area) ranges from 10 to 22 while the y axis (compactness) ranges from 0.75 to 1.0. This means that a small change in x is actually much larger than a small change in y. So, when we compute the distance according to the preceding function, we are, for the most part, only taking the x axis into account. If you have a physics background, you might have already noticed that we had been summing up lengths, areas, and dimensionless quantities, mixing up our units (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to Z-scores. The Z-score of a value is how far away from the mean it is in terms of units of standard deviation. It comes down to this simple pair of operations:

# subtract the mean for each feature:
features -= features.mean(axis=0)
# divide each feature by its standard deviation
features /= features.std(axis=0)

Independent of what the original values were, after Z-scoring, a value of zero is the mean, positive values are above the mean, and negative values are below it. Now every feature is in the same unit (technically, every feature is now dimensionless; it has no units) and we can mix dimensions more confidently. In fact, if we now run our nearest neighbor classifier, we obtain 94 percent accuracy! Look at the decision space again in two dimensions; it looks as shown in the following screenshot: the boundaries are now much more complex and there is interaction between the two dimensions. In the full dataset, everything is happening in a seven-dimensional space that is very hard to visualize, but the same principle applies: where before a few dimensions were dominant, now they are all given the same importance. The nearest neighbor classifier is simple, but sometimes good enough. We can generalize it to a k-nearest neighbor classifier by considering not just the closest point but the k closest points. All k neighbors vote to select the label. k is typically a small number, such as 5, but can be larger, particularly if the dataset is very large.
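Purely as an illustration of that voting idea (my own sketch, not code from the book), the nn_classify function above could be extended to k neighbours like this:

from collections import Counter
import numpy as np

def knn_classify(training_set, training_labels, new_example, k=5):
    dists = np.array([np.sum((t - new_example) ** 2) for t in training_set])
    nearest = dists.argsort()[:k]                 # indices of the k closest training points
    votes = Counter(training_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]             # the label with the most votes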
Binary and multiclass classification

The first classifier we saw, the threshold classifier, was a simple binary classifier (the result is either one class or the other as a point is either above the threshold or it is not). The second classifier we used, the nearest neighbor classifier, was a naturally multiclass classifier (the output can be one of several classes). It is often simpler to define a simple binary method than one that works on multiclass problems. However, we can reduce the multiclass problem to a series of binary decisions. This is what we did earlier in the Iris dataset in a haphazard way; we observed that it was easy to separate one of the initial classes and focused on the other two, reducing the problem to two binary decisions:

Is it an Iris Setosa (yes or no)?
If no, check whether it is an Iris Virginica (yes or no).

Of course, we want to leave this sort of reasoning to the computer. As usual, there are several solutions to this multiclass reduction. The simplest is to use a series of "one classifier versus the rest of the classifiers". For each possible label l, we build a classifier of the type "is this l or something else?". When applying the rule, exactly one of the classifiers would say "yes" and we would have our solution. Unfortunately, this does not always happen, so we have to decide how to deal with either multiple positive answers or no positive answers. Alternatively, we can build a classification tree. Split the possible labels in two and build a classifier that asks "should this example go to the left or the right bin?" We can perform this splitting recursively until we obtain a single label. The preceding diagram depicts the tree of reasoning for the Iris dataset. Each diamond is a single binary classifier. It is easy to imagine we could make this tree larger and encompass more decisions. This means that any classifier that can be used for binary classification can also be adapted to handle any number of classes in a simple way. There are many other possible ways of turning a binary method into a multiclass one. There is no single method that is clearly better in all cases. However, which one you use normally does not make much of a difference to the final result. Most classifiers are binary systems while many real-life problems are naturally multiclass. Several simple protocols reduce a multiclass problem to a series of binary decisions and allow us to apply the binary models to our multiclass problem.

Summary

In a sense, this was a very theoretical article, as we introduced generic concepts with simple examples. We went over a few operations with a classic dataset. This, by now, is considered a very small problem. However, it has the advantage that we were able to plot it out and see what we were doing in detail. This is something that will be lost when we move on to problems with many dimensions and many thousands of examples. The intuitions we gained here will all still be valid. Classification means generalizing from examples to build a model (that is, a rule that can automatically be applied to new, unclassified objects). It is one of the fundamental tools in machine learning. We also learned that the training error is a misleading, over-optimistic estimate of how well the model does. We must, instead, evaluate it on testing data that was not used for training. In order to not waste too many examples in testing, a cross-validation schedule can get us the best of both worlds (at the cost of more computation). We also had a look at the problem of feature engineering. Features are not something that is predefined for you; choosing and designing features is an integral part of designing a machine-learning pipeline. In fact, it is often the area where you can get the most improvements in accuracy, as better data beats fancier methods. In this article, we wrote all of our own code (except when we used NumPy, of course). We needed to build up intuitions on simple cases to illustrate the basic concepts.

Resources for Article:

Further resources on this subject:
Python Testing: Installing the Robot Framework [Article]
Getting Started with Spring Python [Article]
Creating Skeleton Apps with Coily in Spring Python [Article]

Changing the Appearance

Packt
13 Feb 2014
14 min read
(For more resources related to this topic, see here.)

Controlling appearance

An ADF Faces application is a modern web application, so the technology used for controlling the look of the application is Cascading Style Sheets (CSS). The idea behind CSS is that the web page (in HTML) should contain only the structure and not information about appearance. All of the visual definitions must be kept in the style sheet, and the HTML file must refer to the style sheet. This means that the same web page can be made to look completely different by applying a different style sheet to it.

The Cascading Style Sheets basics

In order to change the appearance of your application, you need to understand some CSS basics. If you have never worked with CSS before, you should start by reading one of the many CSS tutorials available on the Internet. To start with, let's repeat some of the basics of CSS. The CSS layout instructions are written in the form of rules. Each rule is in the following form:

selector { property: value; }

The selector identifies which part of the web page the rule applies to, and the property/value pairs define the styling to be applied to the selected parts. For example, the following rule defines that all <h1> elements should be shown in red font:

h1 { color: red; }

One rule can include multiple selectors separated by commas, and multiple property values separated by semicolons. Therefore, it is also valid CSS to write the following line of code to get all the <h1>, <h2>, and <h3> tags shown in large, red font:

h1, h2, h3 { color: red; font-size: x-large; }

If you want to apply a style with more precision than just every level 1 header, you define a style class, which is just a selector starting with a period, as shown in the following line of code:

.important { color: red; font-weight: bold }

To use this selector in your HTML code, you use the keyword class inside an HTML tag. There are three ways of using a style class. They are as follows:

Inside an existing tag: <h1>
Inside the special <span> tag to style the text within a paragraph
Inside a <div> tag to style a whole paragraph of text

Here are examples of all the three ways:

<h1 class="important">Important topic</h1>
You <span class="important">must</span> remember this.
<div class="important">Important tip</div>

In theory, you can place your styling information directly in your HTML document using the <style> tag. In practice, however, you usually place your CSS instructions in a separate .css file and refer to it from your HTML file with a <link> tag, as shown in the following line of code:

<link href="mystyle.css" rel="stylesheet" type="text/css">

Styling individual components

The preceding examples can be applied to HTML elements, but styling can also be applied to JSF components. A plain JSF component could look like the following code with inline styling:

<h:outputFormat value="hello" style="color:red;"/>

It can also look like the following line of code using a style class:

<h:outputFormat value="hello" styleClass="important"/>

ADF components use the inlineStyle attribute instead of just style, as shown in the following line of code:

<af:outputFormat value="hello" inlineStyle="color:red;"/>

The styleClass attribute is the same, as shown in the following line of code:

<af:outputFormat value="hello" styleClass="important"/>

Of course, you normally won't be setting these attributes in the source code, but will be using the StyleClass and InlineStyle properties in the Property Inspector instead.
In both HTML and JSF, you should only use StyleClass so that multiple components can refer to the same style class and will reflect any change made to the style. InlineStyle is rarely used in real-life ADF applications; it adds to the page size (the same styling is sent for every styled element), and it is almost impossible to ensure that every occurrence is changed when the styling requirements change—as they will.

Building a style

While you are working out the styles you need in your application, you can use the Style section in the JDeveloper Properties window to define the look of your page, as shown in the following screenshot. This section shows six small subtabs with icons for font, background, border/outline, layout, table/list, and media/animation. If you enter or select a value on any of these tabs, this value will be placed into the InlineStyle field as correctly formatted CSS. When your items look the way you want, copy the value from the InlineStyle field to a style class in your CSS file and set the StyleClass property to point to that class. If the style discussed earlier is the styling you want for a highlighted label, create a section in your CSS file, as shown in the following code:

.highlight {background-color:blue;}

Then, clear the InlineStyle property and set the StyleClass property to highlight. Once you have placed a style class into your CSS file, you can use it to style the other components in exactly the same way by simply setting the StyleClass property. We'll be building the actual CSS file where you define these style classes.

InlineStyle and ContentStyle

Some JSF components (for example, outputText) are easy to style—if you set the font color, you'll see it take effect in the JDeveloper design view and in your application, as shown in the following screenshot. Other elements (for example, inputText) are harder to style. For example, if you want to change the background color of the input field, you might try setting the background color, as shown in the following screenshot. You will notice that this did not work the way you reasonably expected—the background behind both the label and the actual input field changes. The reason for this is that an inputText component actually consists of several HTML elements, and an inline style applies to the outermost element. In this case, the outermost element is an HTML <tr> (table row) tag, so the green background color applies to the entire row. To help mitigate this problem, ADF offers another styling option for some components: ContentStyle. If you set this property, ADF tries to apply the style to the content of a component—in the case of an inputText, ContentStyle applies to the actual input field, as shown in the following screenshot. In a similar manner, you can apply styling to the label for an element by setting the LabelStyle property.
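For instance (a hypothetical snippet, not taken from the book), an input text whose field and label are styled independently could look like this in the page source, using the contentStyle and labelStyle attributes:

<af:inputText label="Name" value="hello"
              contentStyle="background-color:green;"
              labelStyle="color:#0000FF;"/>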
Unravelling the mysteries of CSS styling

As you saw in the Input Text example, ADF components can be quite complex, and it's not always easy to figure out which element to style to achieve the desired result. To be able to see into the complex HTML that ADF builds for you, you need a support tool such as Firebug. Firebug is a Firefox extension that you can download by navigating to Tools | Add-ons from within Firefox, or you can go to http://getfirebug.com. When you have installed Firebug, you see a little Firebug icon to the far right of your Firefox window, as shown in the following screenshot. When you click on the icon to start Firebug, you'll see it take up the lower half of your Firefox browser window.

Only run Firebug when you need it: Firebug's detailed analysis of every page costs processing power and slows your browser down. Run Firebug only when you need it. Remember to deactivate Firebug, not just hide it.

If you click on the Inspect button (with a little blue arrow, second from the left in the Firebug toolbar), you place Firebug in inspect mode. You can now point to any element on a page and see both the HTML element and the style applied to this element, as shown in the following screenshot. In the preceding example, the pointer is placed on the label for an input text, and the Firebug panels show that this element is styled with color: #0000FF. If you scroll down in the right-hand side Style panel, you can see other attributes such as font-family: Tahoma, font-size: 11px, and so on.

In order to keep the size of the HTML page smaller so that it loads faster, ADF abbreviates all the style class names to cryptic short names such as .x10. While you are styling your application, you don't want this abbreviation to happen. To turn it off, you need to open the web.xml file (in your View project under Web Content | WEB-INF). Change to the Overview tab if it is not already shown, and select the Application subtab, as shown in the following screenshot. Under Context Initialization Parameters, add a new parameter, as shown:

Name: org.apache.myfaces.trinidad.DISABLE_CONTENT_COMPRESSION
Value: true

When you do this, you'll see the full human-readable style names in Firebug, as shown in the following screenshot. You notice that you now get more readable names such as .af_outputLabel. You might need this information when developing your custom skin.
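In the web.xml source view, that setting corresponds to a standard servlet context parameter; a sketch of the entry is shown below (the parameter name is the one given above, and the surrounding XML is the usual context-param syntax):

<context-param>
  <param-name>org.apache.myfaces.trinidad.DISABLE_CONTENT_COMPRESSION</param-name>
  <param-value>true</param-value>
</context-param>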
Conditional formatting

Similar to many other properties, the style properties do not have to be set to a fixed value—you can also set them to any valid expression written in Expression Language (EL). This can be used to create conditional formatting. In the simplest form, you can use an Expression Language ternary operator, which has the form <boolean expression> ? <value if true> : <value if false>. For example, you could set StyleClass to the following line of code:

#{bindings.Job.inputValue eq 'MANAGER' ? 'managerStyle' : 'nonManagerStyle'}

The preceding expression means that if the value of the Job attribute is equal to MANAGER, use the managerStyle style class; if not, use the nonManagerStyle style class. Of course, this only works if these two styles exist in your CSS.

Skinning overview

An ADF skin is a collection of files that together define the look and feel of the application. To a hunter, skinning is the process of removing the skin from an animal, but to an ADF developer it's the process of putting a skin onto an application. All applications have a skin—if you don't change it, an application built with JDeveloper 12c uses some variation of the skyros skin. When you define a custom skin, you must also choose a parent skin among the skins JDeveloper offers. This parent skin will define the look for all the components not explicitly defined in your skin.

Skinning capabilities

There are a few options to change the style of individual components through their properties. However, with your own custom ADF skin, you can globally change almost every visual aspect of every instance of a specific component. To see skinning in action, you can go to http://jdevadf.oracle.com/adf-richclient-demo. This site is a demonstration of lots of ADF features and components, and if you choose the Skinning header in the accordion to the right, you are presented with a tree of skinnable components, as shown in the following screenshot. You can click on each component to see a page where you can experiment with various ways of skinning the component. For example, you can select the very common InputText component to see a page with various representations of the input text components. On the left-hand side, you see a number of Style Selectors that are relevant for that component. For each selector, you can check the checkbox to see an example of what the component looks like if you change that selector. In the following example, the af|inputText:disabled::content selector is checked, thus setting its style to color: #00C0C0, as shown in the following screenshot. As you might be able to deduce from the af|inputText:disabled::content style selector, this controls what the content field of the input text component looks like when it is set to disabled—in the demo application, it is set to a bluish color with the color code #00C0C0. The example application shows various values for the selectors but doesn't really explain them. The full documentation of all the selectors can be found online at http://jdevadf.oracle.com/adf-richclient-demo/docs/skin-selectors.html. If it's not there, search for ADF skinning selectors. On the menu in the demo application, you will also find a Skin menu that you can use to select and test all the built-in skins. This application can also be downloaded and run on your own server. It can be found on the ADF download page at http://www.oracle.com/technetwork/developer-tools/adf/downloads/index.html, as shown in the following screenshot.

Skinning recommendations

If your graphics designer has produced sample screens showing what the application must look like, you need to find out which components you will use to implement the required look and define the look of these components in your skin. If you don't have a detailed guideline from a graphics designer, look for some guidelines in your organization; you probably have web design guidelines for your public-facing website and/or intranet. If you don't have any graphics guidelines, create a skin, as described later in this section, and choose to inherit from the latest skyros skin provided by Oracle. However, don't change anything—leave the CSS file empty. If you are a programmer, you are unlikely to be able to improve on the look that the professional graphics designers at Oracle in Redwood Shores have created.

The skinning process

The skinning process in ADF consists of the following steps:

Create a skin CSS file.
Optionally, provide images for your skin.
Optionally, create a resource bundle for your skin.
Package the skin in an ADF library.
Import and use the skin in the application.

In JDeveloper 11g Release 1 (11.1.1.x) and earlier versions, this was very much a manual process. Fortunately, from 11g Release 2, JDeveloper has a built-in skinning editor.

Stand-alone skinning

If you are running JDeveloper 11g Release 1, don't despair. Oracle is making a stand-alone skinning editor available, containing the same functionality that is built into the later versions of JDeveloper.
You can give this tool to your graphic designers and let them build the skin ADF library without having to give them the complete JDeveloper product. ADF skinning is a huge topic, and Oracle delivers a whole manual that describes skinning in complete detail. This document is called Creating Skins with Oracle ADF Skin Editor and can be found at http://docs.oracle.com/middleware/1212/skineditor/ADFSG. Creating a skin project You should place your skin in a common workspace. If you are using the modular architecture, you create your skin in the application common workspace. If you are using the enterprise architecture, you create your enterprise common skin in the application common workspace and then possibly create application-specific adaptations of that skin in the application common workspaces. The skin should be placed in its own project in the common workspace. Logically, it could be placed in the common UI workspace, but because the skin will often receive many small changes during certain phases of the project, it makes sense to keep it separate. Remember that changing the skin only affects the visual aspects of the application, but changing the page templates could conceivably change the functionality. By keeping the skin in a separate ADF library, you can be sure that you do not need to perform regression testing on the application functionality after deploying a new skin. To create your skin, open the common workspace and navigate to File | New | Project. Choose the ADF ViewController project, give the project the name CommonSkin, and set the package to your application's package prefix followed by .skin (for example, com.dmcsol.xdm.skin).

Introducing Bayesian Inference

Packt
14 Sep 2015
29 min read
In this article by Dr. Hari M. Kudovely, the author of Learning Bayesian Models with R, we will look at Bayesian inference in depth. The Bayes theorem is the basis for updating beliefs or model parameter values in Bayesian inference, given the observations. In this article, a more formal treatment of Bayesian inference will be given. To begin with, let us try to understand how uncertainties in a real-world problem are treated in the Bayesian approach. (For more resources related to this topic, see here.) Bayesian view of uncertainty The classical or frequentist statistics typically take the view that any physical process generating data containing noise can be modeled by a stochastic model with fixed values of parameters. The parameter values are learned from the observed data through procedures such as maximum likelihood estimation. The essential idea is to search the parameter space to find the parameter values that maximize the probability of observing the data seen so far. Neither the uncertainty in the estimation of model parameters from data, nor the uncertainty in the model itself that explains the phenomena under study, is dealt with in a formal way. The Bayesian approach, on the other hand, treats all sources of uncertainty using probabilities. Therefore, neither the model to explain an observed dataset nor its parameters are fixed; they are treated as uncertain variables. Bayesian inference provides a framework to learn the entire distribution of model parameters, not just the values that maximize the probability of observing the given data. The learning can come from both the evidence provided by observed data and domain knowledge from experts. There is also a framework to select the best model among the family of models suited to explain a given dataset. Once we have the distribution of model parameters, we can eliminate the effect of the uncertainty of parameter estimation on the future values of a random variable predicted using the learned model. This is done by averaging over the model parameter values through marginalization of the joint probability distribution. Consider the joint probability distribution of N random variables again, now written as P(x_1, \ldots, x_N \mid \theta, m). This time, we have added one more term, m, to the argument of the probability distribution, in order to indicate explicitly that the parameters \theta are generated by the model m. Then, according to Bayes theorem, the probability distribution of the model parameters conditioned on the observed data X and model m is given by:

P(\theta \mid X, m) = \frac{P(X \mid \theta, m)\, P(\theta \mid m)}{P(X \mid m)}

Formally, the term on the LHS of the equation, P(\theta \mid X, m), is called the posterior probability distribution. The second term appearing in the numerator of the RHS, P(\theta \mid m), is called the prior probability distribution. It represents the prior belief about the model parameters, before observing any data, say, from the domain knowledge. Prior distributions can also have parameters, and these are called hyperparameters. The term P(X \mid \theta, m) is the likelihood of model m explaining the observed data. Since P(X \mid m) does not depend on \theta, it can be considered as a normalization constant. The preceding equation can be rewritten in an iterative form as follows:

P(\theta \mid X_1, \ldots, X_n, m) \propto P(X_n \mid \theta, m)\, P(\theta \mid X_1, \ldots, X_{n-1}, m)

Here, X_n represents the values of the observations that are obtained at time step n, P(\theta \mid X_1, \ldots, X_{n-1}, m) is the marginal parameter distribution updated until time step n - 1, and P(\theta \mid X_1, \ldots, X_n, m) is the model parameter distribution updated after seeing the observations X_n at time step n. 
Casting Bayes theorem in this iterative form is useful for online learning, and it suggests the following: model parameters can be learned iteratively as more and more data or evidence is obtained; the posterior distribution estimated using the data seen so far can be treated as the prior when the next set of observations is obtained; and even if no data is available, one could make predictions based on a prior distribution created using the domain knowledge alone. To make these points clear, let's take a simple illustrative example. Consider the case where one is trying to estimate the distribution of the height of males in a given region. The data used for this example is the height measurement in centimeters obtained from M volunteers sampled randomly from the population. We assume that the heights are distributed according to a normal distribution with mean μ and variance σ². As mentioned earlier, in classical statistics, one tries to estimate the values of μ and σ² from the observed data. Apart from the best estimate value for each parameter, one could also determine an error term of the estimate. In the Bayesian approach, on the other hand, μ and σ² are also treated as random variables. Let's, for simplicity, assume σ² is a known constant. Also, let's assume that the prior distribution for μ is a normal distribution with (hyper)parameters μ0 and σ0². In this case, the posterior distribution of μ is proportional to the product of the normal likelihood of the M observations and this normal prior. It is a simple exercise to expand the terms in the product and complete the squares in the exponential. Though the resulting expression looks complex, it has a very simple interpretation: the posterior distribution of μ is also a normal distribution, with the following mean (writing \bar{x} for the sample mean):

\mu_{post} = \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2 / M}\,\bar{x} + \frac{\sigma^2 / M}{\sigma_0^2 + \sigma^2 / M}\,\mu_0

The variance is as follows:

\sigma_{post}^2 = \frac{\sigma_0^2\, (\sigma^2 / M)}{\sigma_0^2 + \sigma^2 / M}

The posterior mean is a weighted sum of the prior mean μ0 and the sample mean. As the sample size M increases, the weight of the sample mean increases and that of the prior decreases. Similarly, the posterior precision (the inverse of the variance) is the sum of the prior precision 1/σ0² and the precision of the sample mean M/σ²:

\frac{1}{\sigma_{post}^2} = \frac{1}{\sigma_0^2} + \frac{M}{\sigma^2}

As M increases, the contribution of precision from the observations (evidence) outweighs that from the prior knowledge. Let's take a concrete example where the population mean is 5.5 and the population standard deviation is 0.5. We draw 10,000 samples from this population by using the following R script:

>set.seed(100)
>age_samples <- rnorm(10000,mean = 5.5,sd=0.5)

We can calculate the posterior mean for increasing sample sizes using the following R function:

>age_mean <- function(n){
   mu0 <- 5
   sd0 <- 1
   mus <- mean(age_samples[1:n])
   sds <- sd(age_samples[1:n])
   mu_n <- (sd0^2/(sd0^2 + sds^2/n)) * mus + (sds^2/n/(sd0^2 + sds^2/n)) * mu0
   mu_n
 }
>samp <- c(25,50,100,200,400,500,1000,2000,5000,10000)
>mu <- sapply(samp,age_mean,simplify = "array")
>plot(samp,mu,type="b",col="blue",ylim=c(5.3,5.7),xlab="no of samples",ylab="estimate of mean")
>abline(5.5,0)

One can see that as the number of samples increases, the estimated mean asymptotically approaches the population mean. The initial low value is due to the influence of the prior, whose mean is, in this case, 5.0. This simple and intuitive picture of how the prior knowledge and the evidence from observations contribute to the overall model parameter estimate holds in any Bayesian inference; only the precise mathematical expression for how they combine differs from model to model. 
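To see the online-updating property in action, here is a small, self-contained R check (my own illustration, not taken from the original text) for the same normal model with known variance: updating on all the data at once, and updating sequentially in two batches with the first posterior reused as the prior for the second batch, give the same answer.

update_normal <- function(mu0, sd0, x, sigma) {
  # conjugate update for a normal mean with known noise sd 'sigma'
  prec <- 1/sd0^2 + length(x)/sigma^2          # posterior precision
  mu_n <- (mu0/sd0^2 + sum(x)/sigma^2) / prec  # posterior mean
  c(mean = mu_n, sd = sqrt(1/prec))
}

set.seed(1)
sigma <- 0.5
x <- rnorm(200, mean = 5.5, sd = sigma)

batch <- update_normal(5, 1, x, sigma)                 # all data at once
step1 <- update_normal(5, 1, x[1:100], sigma)          # first 100 points
step2 <- update_normal(step1[["mean"]], step1[["sd"]], x[101:200], sigma)

batch
step2   # same as 'batch': yesterday's posterior acts as today's prior

The values printed for batch and step2 agree up to floating-point rounding, which is exactly the iterative form of Bayes theorem stated above.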
Therefore, one could start using a model for prediction with just prior information, either from the domain knowledge or the data collected in the past. Also, as new observations arrive, the model can be updated using the Bayesian scheme. Choosing the right prior distribution In the preceding simple example, we saw that if the likelihood function has the form of a normal distribution, and when the prior distribution is chosen as normal, the posterior also turns out to be a normal distribution. Also, we could get a closed-form analytical expression for the posterior mean. Since the posterior is obtained by multiplying the prior and likelihood functions and normalizing by integration over the parameter variables, the form of the prior distribution has a significant influence on the posterior. This section gives some more details about the different types of prior distributions and guidelines as to which ones to use in a given context. There are different ways of classifying prior distributions in a formal way. One of the approaches is based on how much information a prior provides. In this scheme, the prior distributions are classified as Informative, Weakly Informative, Least Informative, and Non-informative. Here, we take more of a practitioner's approach and illustrate some of the important classes of the prior distributions commonly used in practice. Non-informative priors Let's start with the case where we do not have any prior knowledge about the model parameters. In this case, we want to express complete ignorance about model parameters through a mathematical expression. This is achieved through what are called non-informative priors. For example, in the case of a single random variable x that can take any value between  and , the non-informative prior for its mean   would be the following: Here, the complete ignorance of the parameter value is captured through a uniform distribution function in the parameter space. Note that a uniform distribution is not a proper distribution function since its integral over the domain is not equal to 1; therefore, it is not normalizable. However, one can use an improper distribution function for the prior as long as it is multiplied by the likelihood function; the resulting posterior can be normalized. If the parameter of interest is variance , then by definition it can only take non-negative values. In this case, we transform the variable so that the transformed variable has a uniform probability in the range from  to : It is easy to show, using simple differential calculus, that the corresponding non-informative distribution function in the original variable  would be as follows: Another well-known non-informative prior used in practical applications is the Jeffreys prior, which is named after the British statistician Harold Jeffreys. This prior is invariant under reparametrization of  and is defined as proportional to the square root of the determinant of the Fisher information matrix: Here, it is worth discussing the Fisher information matrix a little bit. If X is a random variable distributed according to , we may like to know how much information observations of X carry about the unknown parameter . This is what the Fisher Information Matrix provides. It is defined as the second moment of the score (first derivative of the logarithm of the likelihood function): Let's take a simple two-dimensional problem to understand the Fisher information matrix and Jeffreys prior. This example is given by Prof. D. Wittman of the University of California. 
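Before working through that example, the two definitions just mentioned can be written out explicitly (the notation below is mine). The Jeffreys prior is

p(\theta) \propto \sqrt{\det \mathcal{I}(\theta)}

where \mathcal{I}(\theta) is the Fisher information matrix, the second moment of the score:

\mathcal{I}(\theta)_{ij} = \mathbb{E}\left[ \frac{\partial \log f(X \mid \theta)}{\partial \theta_i} \cdot \frac{\partial \log f(X \mid \theta)}{\partial \theta_j} \right]

The reparametrization invariance follows because, under a change of variables, the determinant of the Fisher information picks up the squared Jacobian, so its square root transforms exactly like a probability density.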
Let's consider two types of food item: buns and hot dogs. Let's assume that generally they are produced in pairs (a hot dog and bun pair), but occasionally hot dogs are also produced independently in a separate process. There are two observables such as the number of hot dogs () and the number of buns (), and two model parameters such as the production rate of pairs () and the production rate of hot dogs alone (). We assume that the uncertainty in the measurements of the counts of these two food products is distributed according to the normal distribution, with variance  and , respectively. In this case, the Fisher Information matrix for this problem would be as follows: In this case, the inverse of the Fisher information matrix would correspond to the covariance matrix: Subjective priors One of the key strengths of Bayesian statistics compared to classical (frequentist) statistics is that the framework allows one to capture subjective beliefs about any random variables. Usually, people will have intuitive feelings about minimum, maximum, mean, and most probable or peak values of a random variable. For example, if one is interested in the distribution of hourly temperatures in winter in a tropical country, then the people who are familiar with tropical climates or climatology experts will have a belief that, in winter, the temperature can go as low as 15°C and as high as 27°C with the most probable temperature value being 23°C. This can be captured as a prior distribution through the Triangle distribution as shown here. The Triangle distribution has three parameters corresponding to a minimum value (a), the most probable value (b), and a maximum value (c). The mean and variance of this distribution are given by:   One can also use a PERT distribution to represent a subjective belief about the minimum, maximum, and most probable value of a random variable. The PERT distribution is a reparametrized Beta distribution, as follows:   Here:     The PERT distribution is commonly used for project completion time analysis, and the name originates from project evaluation and review techniques. Another area where Triangle and PERT distributions are commonly used is in risk modeling. Often, people also have a belief about the relative probabilities of values of a random variable. For example, when studying the distribution of ages in a population such as Japan or some European countries, where there are more old people than young, an expert could give relative weights for the probability of different ages in the populations. This can be captured through a relative distribution containing the following details: Here, min and max represent the minimum and maximum values, {values} represents the set of possible observed values, and {weights} represents their relative weights. For example, in the population age distribution problem, these could be the following: The weights need not have a sum of 1. Conjugate priors If both the prior and posterior distributions are in the same family of distributions, then they are called conjugate distributions and the corresponding prior is called a conjugate prior for the likelihood function. Conjugate priors are very helpful for getting get analytical closed-form expressions for the posterior distribution. In the simple example we considered, we saw that when the noise is distributed according to the normal distribution, choosing a normal prior for the mean resulted in a normal posterior. 
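As a concrete illustration of conjugacy (a standard result, added here for clarity): if the likelihood is Binomial with success probability p and the prior on p is Beta(α, β), then after observing k successes in n trials the posterior is again a Beta distribution,

p \mid k, n \;\sim\; \mathrm{Beta}(\alpha + k,\; \beta + n - k)

so the hyperparameters α and β can be read as pseudo-counts of prior successes and failures.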
The following are some well-known conjugate pairs, listing the likelihood function, the model parameter concerned, and the corresponding conjugate prior with its hyperparameters:
Binomial likelihood; parameter: success probability; conjugate prior: Beta(α, β)
Poisson likelihood; parameter: rate; conjugate prior: Gamma(α, β)
Categorical likelihood; parameters: category probabilities (for a fixed number of categories); conjugate prior: Dirichlet(α1, ..., αK)
Univariate normal likelihood (known variance); parameter: mean; conjugate prior: Normal(μ0, σ0²)
Univariate normal likelihood (known mean); parameter: variance; conjugate prior: Inverse Gamma(α, β)
Hierarchical priors Sometimes, it is useful to define prior distributions for the hyperparameters themselves. This is consistent with the Bayesian view that all parameters should be treated as uncertain by using probabilities. These distributions are called hyper-prior distributions. In theory, one can continue this into many levels as a hierarchical model. This is one way of eliciting the optimal prior distributions. For example, p(θ | α) is the prior distribution with a hyperparameter α. We could define a prior distribution for α through a second equation, p(α | β), which is the hyper-prior distribution for the hyperparameter α, parametrized by the hyper-hyper-parameter β. One can define a prior distribution for β in the same way and continue the process indefinitely. The practical reason for formalizing such models is that, at some level of the hierarchy, one can define a uniform prior for the hyperparameters, reflecting complete ignorance about their distribution, and effectively truncate the hierarchy. In practical situations, this is typically done at the second level, which, in the preceding example, corresponds to using a uniform distribution for β. I want to conclude this section by stressing one important point. Though the prior distribution has a significant role in Bayesian inference, one need not worry about it too much, as long as the prior chosen is reasonable and consistent with the domain knowledge and the evidence seen so far. The reasons are that, first of all, as we gather more evidence, the significance of the prior gets washed out; secondly, when we use Bayesian models for prediction, we average over the uncertainty in the estimation of the parameters using the posterior distribution. This averaging is the key ingredient of Bayesian inference and it removes many of the ambiguities in the selection of the right prior. Estimation of posterior distribution So far, we discussed the essential concepts behind Bayesian inference and how to choose a prior distribution. Since one needs to compute the posterior distribution of model parameters before one can use the models for prediction, we discuss this task in this section. Though Bayes' rule has a very simple-looking form, computing the posterior distribution in a practically usable way is often very challenging. This is primarily because the computation of the normalization constant P(X | m) involves N-dimensional integrals when there are N parameters. Even when one uses a conjugate prior, this computation can be very difficult to carry out analytically or numerically. This was one of the main reasons why Bayesian inference was not used for multivariate modeling until recent decades. In this section, we will look at various approximate ways of computing posterior distributions that are used in practice. Maximum a posteriori estimation Maximum a posteriori (MAP) estimation is a point estimation that corresponds to taking the maximum value or mode of the posterior distribution. 
Though taking a point estimation does not capture the variability in the parameter estimation, it does take into account the effect of prior distribution to some extent when compared to maximum likelihood estimation. MAP estimation is also called poor man's Bayesian inference. From the Bayes rule, we have: Here, for convenience, we have used the notation X for the N-dimensional vector . The last relation follows because the denominator of RHS of Bayes rule is independent of . Compare this with the following maximum likelihood estimate: The difference between the MAP and ML estimate is that, whereas ML finds the mode of the likelihood function, MAP finds the mode of the product of the likelihood function and prior. Laplace approximation We saw that the MAP estimate just finds the maximum value of the posterior distribution. Laplace approximation goes one step further and also computes the local curvature around the maximum up to quadratic terms. This is equivalent to assuming that the posterior distribution is approximately Gaussian (normal) around the maximum. This would be the case if the amount of data were large compared to the number of parameters: M >> N. Here, A is an N x N Hessian matrix obtained by taking the derivative of the log of the posterior distribution: It is straightforward to evaluate the previous expressions at , using the following definition of conditional probability: We can get an expression for P(X|m) from Laplace approximation that looks like the following: In the limit of a large number of samples, one can show that this expression simplifies to the following: The term  is called Bayesian information criterion (BIC) and can be used for model selections or model comparison. This is one of the goodness of fit terms for a statistical model. Another similar criterion that is commonly used is Akaike information criterion (AIC), which is defined by . Now we will discuss how BIC can be used to compare different models for model selection. In the Bayesian framework, two models such as  and  are compared using the Bayes factor. The definition of the Bayes factor  is the ratio of posterior odds to prior odds that is given by: Here, posterior odds is the ratio of posterior probabilities of the two models of the given data and prior odds is the ratio of prior probabilities of the two models, as given in the preceding equation. If , model  is preferred by the data and if , model  is preferred by the data. In reality, it is difficult to compute the Bayes factor because it is difficult to get the precise prior probabilities. It can be shown that, in the large N limit,  can be viewed as a rough approximation to . Monte Carlo simulations The two approximations that we have discussed so far, the MAP and Laplace approximations, are useful when the posterior is a very sharply peaked function about the maximum value. Often, in real-life situations, the posterior will have long tails. This is, for example, the case in e-commerce where the probability of the purchasing of a product by a user has a long tail in the space of all products. So, in many practical situations, both MAP and Laplace approximations fail to give good results. Another approach is to directly sample from the posterior distribution. Monte Carlo simulation is a technique used for sampling from the posterior distribution and is one of the workhorses of Bayesian inference in practical applications. 
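To make these quantities concrete, the standard definitions can be written out as follows (the notation is mine, with N parameters, M data points, and \hat{L} the maximized likelihood; sign conventions for the information criteria vary between texts, and the common "lower is better" convention is used here):

\hat{\theta}_{MAP} = \arg\max_{\theta} P(X \mid \theta)\, P(\theta) \qquad \hat{\theta}_{ML} = \arg\max_{\theta} P(X \mid \theta)

BIC = N \ln M - 2 \ln \hat{L} \qquad AIC = 2N - 2 \ln \hat{L}

The two point estimates differ only in the presence of the prior, and BIC penalizes extra parameters more heavily than AIC as soon as \ln M exceeds 2, that is, for more than about seven data points.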
In this section, we will introduce the reader to Markov Chain Monte Carlo (MCMC) simulations and also discuss two common MCMC methods used in practice. As discussed earlier, let  be the set of parameters that we are interested in estimating from the data through posterior distribution. Consider the case of the parameters being discrete, where each parameter has K possible values, that is, . Set up a Markov process with states  and transition probability matrix . The essential idea behind MCMC simulations is that one can choose the transition probabilities in such a way that the steady state distribution of the Markov chain would correspond to the posterior distribution we are interested in. Once this is done, sampling from the Markov chain output, after it has reached a steady state, will give samples of distributed according to the posterior distribution. Now, the question is how to set up the Markov process in such a way that its steady state distribution corresponds to the posterior of interest. There are two well-known methods for this. One is the Metropolis-Hastings algorithm and the second is Gibbs sampling. We will discuss both in some detail here. The Metropolis-Hasting algorithm The Metropolis-Hasting algorithm was one of the first major algorithms proposed for MCMC. It has a very simple concept—something similar to a hill-climbing algorithm in optimization: Let  be the state of the system at time step t. To move the system to another state at time step t + 1, generate a candidate state  by sampling from a proposal distribution . The proposal distribution is chosen in such a way that it is easy to sample from it. Accept the proposal move with the following probability: If it is accepted, = ; if not, . Continue the process until the distribution converges to the steady state. Here,  is the posterior distribution that we want to simulate. Under certain conditions, the preceding update rule will guarantee that, in the large time limit, the Markov process will approach a steady state distributed according to . The intuition behind the Metropolis-Hasting algorithm is simple. The proposal distribution  gives the conditional probability of proposing state  to make a transition in the next time step from the current state . Therefore,  is the probability that the system is currently in state  and would make a transition to state  in the next time step. Similarly,  is the probability that the system is currently in state  and would make a transition to state  in the next time step. If the ratio of these two probabilities is more than 1, accept the move. Alternatively, accept the move only with the probability given by the ratio. Therefore, the Metropolis-Hasting algorithm is like a hill-climbing algorithm where one accepts all the moves that are in the upward direction and accepts moves in the downward direction once in a while with a smaller probability. The downward moves help the system not to get stuck in local minima. Let's revisit the example of estimating the posterior distribution of the mean and variance of the height of people in a population discussed in the introductory section. This time we will estimate the posterior distribution by using the Metropolis-Hasting algorithm. 
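The acceptance probability referred to in the algorithm above has the following standard form (writing P(\theta \mid X) for the target posterior and q(\theta^{*} \mid \theta) for the proposal distribution):

\alpha(\theta^{*}, \theta_t) = \min\left(1,\; \frac{P(\theta^{*} \mid X)\, q(\theta_t \mid \theta^{*})}{P(\theta_t \mid X)\, q(\theta^{*} \mid \theta_t)}\right)

For a symmetric proposal, such as the Gaussian random walk used in the listing below, the q terms cancel and the acceptance probability reduces to the ratio of posterior densities; this ratio is exactly the quantity eps computed in the R code that follows.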
The following lines of R code do this job:

>set.seed(100)
>mu_t <- 5.5
>sd_t <- 0.5
>age_samples <- rnorm(10000,mean = mu_t,sd = sd_t)
>#function to compute log likelihood
>loglikelihood <- function(x,mu,sigma){
   singlell <- dnorm(x,mean = mu,sd = sigma,log = T)
   sumll <- sum(singlell)
   sumll
 }
>#function to compute prior distribution for mean on log scale
>d_prior_mu <- function(mu){
   dnorm(mu,0,10,log=T)
 }
>#function to compute prior distribution for std dev on log scale
>d_prior_sigma <- function(sigma){
   dunif(sigma,0,5,log=T)
 }
>#function to compute posterior distribution on log scale
>d_posterior <- function(x,mu,sigma){
   loglikelihood(x,mu,sigma) + d_prior_mu(mu) + d_prior_sigma(sigma)
 }
>#function to make transition moves
>tran_move <- function(x,dist = .1){
   x + rnorm(1,0,dist)
 }
>num_iter <- 10000
>posterior <- array(dim = c(2,num_iter))
>accepted <- array(dim=num_iter - 1)
>theta_posterior <- array(dim=c(2,num_iter))
>values_initial <- list(mu = runif(1,4,8),sigma = runif(1,1,5))
>theta_posterior[1,1] <- values_initial$mu
>theta_posterior[2,1] <- values_initial$sigma
>for (t in 2:num_iter){
   #proposed next values for parameters
   theta_proposed <- c(tran_move(theta_posterior[1,t-1]),
                       tran_move(theta_posterior[2,t-1]))
   p_proposed <- d_posterior(age_samples,mu = theta_proposed[1],
                             sigma = theta_proposed[2])
   p_prev <- d_posterior(age_samples,mu = theta_posterior[1,t-1],
                         sigma = theta_posterior[2,t-1])
   eps <- exp(p_proposed - p_prev)
   # proposal is accepted if posterior density is higher w/ theta_proposed
   # if posterior density is not higher, it is accepted with probability eps
   accept <- rbinom(1,1,prob = min(eps,1))
   accepted[t - 1] <- accept
   if (accept == 1){
     theta_posterior[,t] <- theta_proposed
   } else {
     theta_posterior[,t] <- theta_posterior[,t-1]
   }
 }

To plot the resulting posterior distribution, we use the sm package in R:

>library(sm)
>x <- cbind(c(theta_posterior[1,1:num_iter]),c(theta_posterior[2,1:num_iter]))
>xlim <- c(min(x[,1]),max(x[,1]))
>ylim <- c(min(x[,2]),max(x[,2]))
>zlim <- c(0,max(1))
>sm.density(x, xlab = "mu",ylab="sigma", zlab = " ",zlim = zlim, xlim = xlim,ylim = ylim,col="white")
>title("Posterior density")

The resulting posterior distribution will look like the following figure. Though the Metropolis-Hasting algorithm is simple to implement for any Bayesian inference problem, in practice it may not be very efficient in many cases. The main reason for this is that, unless one carefully chooses a proposal distribution, there will be too many rejections and it will take a large number of updates to reach the steady state. This is particularly the case when the number of parameters is high. There are various modifications of the basic Metropolis-Hasting algorithm that try to overcome these difficulties. We will briefly describe these when we discuss various R packages for the Metropolis-Hasting algorithm in the following section. R packages for the Metropolis-Hasting algorithm There are several contributed packages in R for MCMC simulation using the Metropolis-Hasting algorithm, and here we describe some popular ones. The mcmc package contributed by Charles J. Geyer and Leif T. Johnson is one of the popular packages in R for MCMC simulations. It has the metrop function for running the basic Metropolis-Hasting algorithm. The metrop function uses a multivariate normal distribution as the proposal distribution. Sometimes, it is useful to make a variable transformation to improve the speed of convergence in MCMC. The mcmc package has a function named morph for doing this. 
Combining these two, the function morph.metrop first transforms the variable, does a Metropolis on the transformed density, and converts the results back to the original variable. Apart from the mcmc package, two other useful packages in R are MHadaptive contributed by Corey Chivers and the Evolutionary Monte Carlo (EMC) algorithm package by Gopi Goswami. Gibbs sampling As mentioned before, the Metropolis-Hasting algorithm suffers from the drawback of poor convergence, due to too many rejections, if one does not choose a good proposal distribution. To avoid this problem, two physicists Stuart Geman and Donald Geman proposed a new algorithm. This algorithm is called Gibbs sampling and it is named after the famous physicist J W Gibbs. Currently, Gibbs sampling is the workhorse of MCMC for Bayesian inference. Let  be the set of parameters of the model that we wish to estimate: Start with an initial state . At each time step, update the components one by one, by drawing from a distribution conditional on the most recent value of rest of the components:         After N steps, all components of the parameter will be updated. Continue with step 2 until the Markov process converges to a steady state.  Gibbs sampling is a very efficient algorithm since there are no rejections. However, to be able to use Gibbs sampling, the form of the conditional distributions of the posterior distribution should be known. R packages for Gibbs sampling Unfortunately, there are not many contributed general purpose Gibbs sampling packages in R. The gibbs.met package provides two generic functions for performing MCMC in a Naïve way for user-defined target distribution. The first function is gibbs_met. This performs Gibbs sampling with each 1-dimensional distribution sampled by using the Metropolis algorithm, with normal distribution as the proposal distribution. The second function, met_gaussian, updates the whole state with independent normal distribution centered around the previous state. The gibbs.met package is useful for general purpose MCMC on moderate dimensional problems. Apart from the general purpose MCMC packages, there are several packages in R designed to solve a particular type of machine-learning problems. The GibbsACOV package can be used for one-way mixed-effects ANOVA and ANCOVA models. The lda package performs collapsed Gibbs sampling methods for topic (LDA) models. The stocc package fits a spatial occupancy model via Gibbs sampling. The binomlogit package implements an efficient MCMC for Binomial Logit models. Bmk is a package for doing diagnostics of MCMC output. Bayesian Output Analysis Program (BOA) is another similar package. RBugs is an interface of the well-known OpenBUGS MCMC package. The ggmcmc package is a graphical tool for analyzing MCMC simulation. MCMCglm is a package for generalized linear mixed models and BoomSpikeSlab is a package for doing MCMC for Spike and Slab regression. Finally, SamplerCompare is a package (more of a framework) for comparing the performance of various MCMC packages. Variational approximation In the variational approximation scheme, one assumes that the posterior distribution  can be approximated to a factorized form: Note that the factorized form is also a conditional distribution, so each  can have dependence on other s through the conditioned variable X. In other words, this is not a trivial factorization making each parameter independent. 
The advantage of this factorization is that one can choose more analytically tractable forms of distribution functions . In fact, one can vary the functions  in such a way that it is as close to the true posterior  as possible. This is mathematically formulated as a variational calculus problem, as explained here. Let's use some measures to compute the distance between the two probability distributions, such as  and , where . One of the standard measures of distance between probability distributions is the Kullback-Leibler divergence, or KL-divergence for short. It is defined as follows: The reason why it is called a divergence and not distance is that  is not symmetric with respect to Q and P. One can use the relation  and rewrite the preceding expression as an equation for log P(X): Here: Note that, in the equation for ln P(X), there is no dependence on Q on the LHS. Therefore, maximizing  with respect to Q will minimize , since their sum is a term independent of Q. By choosing analytically tractable functions for Q, one can do this maximization in practice. It will result in both an approximation for the posterior and a lower bound for ln P(X) that is the logarithm of evidence or marginal likelihood, since . Therefore, variational approximation gives us two quantities in one shot. A posterior distribution can be used to make predictions about future observations (as explained in the next section) and a lower bound for evidence can be used for model selection. How does one implement this minimization of KL-divergence in practice? Without going into mathematical details, here we write a final expression for the solution: Here,  implies that the expectation of the logarithm of the joint distribution  is taken over all the parameters  except for . Therefore, the minimization of KL-divergence leads to a set of coupled equations; one for each  needs to be solved self-consistently to obtain the final solution. Though the variational approximation looks very complex mathematically, it has a very simple, intuitive explanation. The posterior distribution of each parameter  is obtained by averaging the log of the joint distribution over all the other variables. This is analogous to the Mean Field theory in physics where, if there are N interacting charged particles, the system can be approximated by saying that each particle is in a constant external field, which is the average of fields produced by all the other particles. We will end this section by mentioning a few R packages for variational approximation. The VBmix package can be used for variational approximation in Bayesian mixture models. A similar package is vbdm used for Bayesian discrete mixture models. The package vbsr is used for variational inference in Spike Regression Regularized Linear Models. Prediction of future observations Once we have the posterior distribution inferred from data using some of the methods described already, it can be used to predict future observations. The probability of observing a value Y, given observed data X, and posterior distribution of parameters  is given by: Note that, in this expression, the likelihood function  is averaged by using the distribution of the parameter given by the posterior . This is, in fact, the core strength of the Bayesian inference. This Bayesian averaging eliminates the uncertainty in estimating the parameter values and makes the prediction more robust. Summary In this article, we covered the basic principles of Bayesian inference. 
Starting with how uncertainty is treated differently in Bayesian statistics compared to classical statistics, we discussed the various components of Bayes' rule in depth. First, we looked at the different types of prior distributions and how to choose the right one for your problem. Then we covered the estimation of the posterior distribution using techniques such as MAP estimation, Laplace approximation, and MCMC simulations. Resources for Article: Further resources on this subject: Bayesian Network Fundamentals [article] Learning Data Analytics with R and Hadoop [article] First steps with R [article]

Remote Job Agent in Oracle 11g Database with Oracle Scheduler

Packt
09 Oct 2009
7 min read
Oracle Scheduler in Oracle 10g is a very powerful tool. However, Oracle 11g has many added advantages that give you more power. In this article by Ronald Rood, we will get our hands on the most important addition to the Scheduler—the remote job agent . This is a whole new kind of process, which allows us to run jobs on machines that do not have a running database. However, they must have Oracle Scheduler Agent installed, as this agent is responsible for executing the remote job. This gives us a lot of extra power and also solves the process owner's problem that exists in classical local external jobs. In classical local external jobs, the process owner is by default nobody and is controlled by $ORACLE_HOME/rdbms/admin/externaljob.ora. This creates problems in installation, where the software is shared between multiple databases because it is not possible to separate the processes. In this article, we will start by installing the software, and then see how we can make good use of it. After this, you will want to get rid of the classical local external jobs as soon as possible because you will want to embrace all the improvements in the remote job agent over the old job type. Security Anything that runs on our database server can cause havoc to our databases. No matter what happens, we want to be sure that our databases cannot be harmed. As we have no control over the contents of scripts that can be called from the database, it seems logical not to have these scripts run by the same operating system user who also owns the Oracle database files and processes. This is why, by default, Oracle chose the user nobody as the default user to run the classical local external jobs. This can be adjusted by editing the contents of $ORACLE_HOME/rdbms/admin/externaljob.ora. On systems where more databases are using the same $ORACLE_HOME directory, this automatically means that all the databases run their external jobs using the same operating system account. This is not very flexible. Luckily for us, Oracle has changed this in the 11g release where remote external jobs are introduced. In this release, Oracle decoupled the job runner process and the database processes. The job runner process, that is the job agent, now runs as a remote process and is contacted using a host:port combination over TCP/IP. The complete name for the agent is remote job agent, but this does not mean the job agent can be installed only remotely. It can be installed on the same machine where the database runs, and where it can easily replace the old-fashioned remote jobs. As the communication is done by TCP/IP, this job agent process can be run using any account on the machine. Oracle has no recommendations for the account, but this could very well be nobody. The operating system user who runs the job agent does need some privileges in the $ORACLE_HOME directory of the remote job agent, namely, an execution privilege on $ORACLE_HOME/bin/* as well as read privileges on $ORACLE_HOME/lib/*. At the end of the day, the user has to be able to use the software. The remote job agent should also have the ability to write its administration (log) in a location that (by default) is in $ORACLE_HOME/data, but it can be configured to a different location by setting the EXECUTION_AGENT_DATA environment variable. In 11g, Oracle also introduced a new object type called CREDENTIAL. We can create credentials using dbms_scheduler.create_credential. This allows us to administrate which operating system user is going to run our jobs in the database. 
This also allows us to have control over who can use this credential. To see which credentials are defined, we can use the *_SCHEDULER_CREDENTIAL views. We can grant access to a credential by granting execute privilege on the credential. This adds lots more control than we ever had in Oracle 10gR2. Currently, the Scheduler Agent can only use a username-password combination to authenticate against the operating system. The jobs scheduled on the remote job agent will run using the account specified in the credential that we use in the job definition. Check the Creating job section to see how this works. This does introduce a small problem in maintenance. On many systems, customers are forced to use security policies such as password aging. When combining with credentials, this might cause a credential to become invalid. Any change in the password of a job runtime account needs to be reflected in the credential definition that uses the account. As we get much more control over who executes a job, it is strongly recommend to use the new remote job agent in favor of the classical local external jobs, even locally. The classical external job type will soon become history. A quick glimpse with a wireshark, a network sniffer, does not reveal the credentials in the clear text, so it looks like it's secure by default. However, the job results do pass in clear text. The agent and the database communicate using SSL and because of this, a certificate is installed in the ${EXECUTION_AGENT_DATA}/agent.key.  You can check this certificate using Firefox. Just point your browser to the host:port where the Scheduler Agent is running and use Firefox to examine the certificate. There is a bug in 11.1.0.6 that generates a certificate with an expiration date of 90 days past the agent's registration date. In such a case, you will start receiving certificate validation errors when trying to launch a job. Stopping the agent can solve this. Just remove the agent.key and re-register the agent with the database. The registration will be explained in this article shortly. Installation on Windows We need to get the software before the installation can take place. The Scheduler Agent can be found on the Transparent Gateways disk, which can be downloaded from Oracle technet at http://www.oracle.com/technology/software/products/database/index.html. There's no direct link to this software, so find a platform of your choice and click on See All to get the complete list of database software products for that platform. Then download the Oracle Database Gateways CD. Unzip the installation CD, and then navigate to the setup program found in the top level folder and start it. The following screenshot shows the download directory where you run the setup file: After running the setup, the following Welcome screen will appear. The installation process is simple. Click on the Next button to continue to the product selection screen. Select Oracle Scheduler Agent 11.1.0.6.0 and click on the Next button to continue. Enter Name and Path for ORACLE_HOME (we can keep the default values). Now click on Next to reach the screen where we can choose a port on which the database can contact the agent. I chose 15021. On Unix systems, pick a port above 1023 because the lower ports require root privileges to open. The port should be unused and easily memorizable, and should not be used by the database's listener process. If possible, keep all the remote job agents registered to the same database and the same port. 
Also, don't forget to open the firewall for that port. Hitting the Next button brings us to the following Summary screen: We click on the Install button to complete the installation. If everything goes as expected, the End of Installation screen pops up as follows: Click on the Exit button and confirm the exit. We can find Oracle Execution Agent in the services control panel. Make sure it is running when you want to use the agent to run jobs.

Transactions and Operators

Packt
13 Oct 2015
14 min read
In this article by Emilien Kenler and Federico Razzoli, authors of the book MariaDB Essentials, transactions and operators are explained in brief. (For more resources related to this topic, see here.) Understanding transactions A transaction is a sequence of SQL statements that are grouped into a single logical operation. Its purpose is to guarantee the integrity of the data: if a transaction fails, no change is applied to the database; if it succeeds, all of its statements take effect. Take a look at the following example:

START TRANSACTION;
SELECT quantity FROM product WHERE id = 42;
UPDATE product SET quantity = quantity - 10 WHERE id = 42;
UPDATE customer SET money = money - 10 * (SELECT price FROM product WHERE id = 42) WHERE id = 512;
INSERT INTO product_order (product_id, quantity, customer_id) VALUES (42, 10, 512);
COMMIT;

We haven't yet discussed some of the statements used in this example; however, they are not important for understanding transactions. This sequence of statements occurs when a customer (whose id is 512) orders a product (whose id is 42). As a consequence, we need to execute the following suboperations in our database: check whether the desired quantity of the product is available (if not, we should not proceed); decrease the available quantity of the product being bought; decrease the amount of money in the customer's online account; and register the order so that the product is delivered to the customer. These suboperations form a more complex operation, and while a session is executing this operation, we do not want other connections to interfere. Consider the following scenario: connection A checks how many items of product 42 are available; only one is available, but it is enough. Immediately after, connection B checks the availability of the same product and also finds that one item is available. Connection A decreases the quantity of the product; now it is 0. Connection B does the same; now it is -1. Both connections create an order. Two persons will pay for the same product; however, only one is available. This is something we definitely want to avoid. There is another situation that we want to avoid as well: imagine that the server crashes immediately after the customer's money is deducted. The order will not be written to the database, so the customer will end up paying for something he will not receive. Fortunately, transactions prevent both these situations. They protect our database writes in two ways. First, during a transaction, the relevant data is locked or copied; in both cases, two connections will not be able to modify the same rows at the same time. Second, the writes will not be made effective until the COMMIT command is issued, which means that if the server crashes during the transaction, all the suboperations will be rolled back and we will not have inconsistent data (such as a payment for a product that will not be delivered). In this example, the transaction starts when we issue the START TRANSACTION command. Then, any number of operations can be performed. The COMMIT command makes the changes effective. This does not mean that if a statement fails with an error, the transaction is always aborted: in many cases, the application will receive an error and is free to decide whether the transaction should be aborted or not. To abort the current transaction, an application can execute the ROLLBACK command. A transaction can consist of only one statement. 
This makes perfect sense, because the server could crash in the middle of the statement's execution. The autocommit mode In many cases, we don't want to group multiple statements in a transaction. When a transaction consists of only one statement, sending the START TRANSACTION and COMMIT statements can be annoying. For this reason, MariaDB has the autocommit mode. By default, the autocommit mode is ON. Unless a START TRANSACTION command is explicitly used, the autocommit mode causes an implicit commit after each statement. Thus, every statement is executed in a separate transaction by default. When the autocommit mode is OFF, a new transaction implicitly starts after each commit, and the COMMIT command needs to be issued explicitly. To turn autocommit ON or OFF, we can use the @@autocommit server variable as follows:

MariaDB [mwa]> SET @@autocommit = OFF;
Query OK, 0 rows affected (0.00 sec)
MariaDB [mwa]> SELECT @@autocommit;
+--------------+
| @@autocommit |
+--------------+
|            0 |
+--------------+
1 row in set (0.00 sec)

Transaction's limitations in MariaDB Transaction handling is not implemented in the core of MariaDB; instead, it is left to the storage engines. Many storage engines, such as MyISAM or MEMORY, do not implement it at all. Some of the transactional storage engines are InnoDB, XtraDB, and TokuDB. In a sense, Aria tables are partially transactional: although Aria ignores commands such as START TRANSACTION, COMMIT, and ROLLBACK, each statement is somewhat of a transaction. In fact, if a statement writes, modifies, or deletes multiple rows, the operation either completely succeeds or completely fails, which is similar to a transaction. Only statements that modify data can be used in a transaction. Statements that modify a table structure (such as ALTER TABLE) implicitly commit the current transaction. Sometimes, we may not be sure whether a transaction is active or not. Usually, this happens because we are not sure whether autocommit is ON, or because we are not sure whether the latest statement implicitly committed a transaction. In these cases, the @@in_transaction variable can help us. Its value is 1 if a transaction is active and 0 if it is not. Here is an example:

MariaDB [mwa]> START TRANSACTION;
Query OK, 0 rows affected (0.00 sec)
MariaDB [mwa]> SELECT @@in_transaction;
+------------------+
| @@in_transaction |
+------------------+
|                1 |
+------------------+
1 row in set (0.00 sec)
MariaDB [mwa]> DROP TABLE IF EXISTS t;
Query OK, 0 rows affected, 1 warning (0.00 sec)
MariaDB [mwa]> SELECT @@in_transaction;
+------------------+
| @@in_transaction |
+------------------+
|                0 |
+------------------+
1 row in set (0.00 sec)

InnoDB is optimized to execute a huge number of short transactions. If our databases are busy and performance is important to us, we should try to avoid big transactions, both in terms of the number of statements and of execution time. This is particularly true if we have several concurrent connections that read the same tables. Working with operators In our examples, we have used several operators, such as equals (=) and less-than and greater-than (<, >). Now, it is time to discuss operators in general and to list the most important ones. In general, an operator is a sign that takes one or more operands and returns a result. Several groups of operators exist in MariaDB. In this article, we will discuss the main types: comparison operators, string operators, logical operators, and arithmetic operators. 
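Before moving on to the operators, it is worth sketching how an application typically drives the commit-or-rollback decision just described. The following is a minimal sketch in R, an illustration of the pattern rather than code from the book, assuming the DBI and RMariaDB packages and the product/customer/product_order tables from the ordering example (the connection details are placeholders):

library(DBI)

# con <- dbConnect(RMariaDB::MariaDB(), dbname = "shop",
#                  user = "app", password = "...")

place_order <- function(con, product_id, customer_id, qty) {
  dbBegin(con)                                 # same effect as START TRANSACTION
  ok <- tryCatch({
    dbExecute(con,
      "UPDATE product SET quantity = quantity - ? WHERE id = ?",
      params = list(qty, product_id))
    dbExecute(con,
      "UPDATE customer
         SET money = money - ? * (SELECT price FROM product WHERE id = ?)
       WHERE id = ?",
      params = list(qty, product_id, customer_id))
    dbExecute(con,
      "INSERT INTO product_order (product_id, quantity, customer_id)
       VALUES (?, ?, ?)",
      params = list(product_id, qty, customer_id))
    TRUE
  }, error = function(e) FALSE)
  if (ok) dbCommit(con) else dbRollback(con)   # the application decides
  ok
}

If any of the three statements raises an error, the tryCatch handler returns FALSE and the whole transaction is rolled back; otherwise it is committed. This is exactly the application-level choice mentioned above, where ROLLBACK undoes every suboperation of the failed order.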
Comparison operators A comparison operator checks whether there is a certain relation between its operands. If the relationship exists, the operator returns 1; otherwise, it returns 0. For example, let's take the equality operator that is probably used the most: 1 = 1 -- returns 1: the equality relationship exists 1 = 0 -- returns 0: no equality relationship here In MariaDB, 1 and 0 are used in many contexts to indicate whether something is true or false. In fact, MariaDB does not have a Boolean data type, so TRUE and FALSE are merely used as aliases for 1 and 0: TRUE = 1 -- returns 1 FALSE = 0 -- returns 1 TRUE = FALSE -- returns 0 In a WHERE clause, a result of 0 or NULL prevents a row to be shown. All the numeric results other than 0, including negative numbers, are regarded as true in this context. Non-numeric values other than NULL need to be converted to numbers in order to be evaluated by the WHERE clause. Non-numeric strings are converted to 0, whereas numeric strings are treated as numbers. Dates are converted to nonzero numbers.Consider the following example: WHERE 1 -- is redundant; it shows all the rows WHERE 0 -- prevents all the rows from being shown Now, let's take a look at the following MariaDB comparison operators: Operator Description Example = This specifies equality A = B != This indicates inequality A != B <>  This is the synonym for != A <> B <  This denotes less than A < B >  This indicates greater than A > B <= This refers to less than or equals to A <= B >= This specifies greater than or equals to A >= B IS NULL This indicates that the operand is NULL A IS NULL IS NOT NULL The operand is not NULL A IS NOT NULL <=> This denotes that the operands are equal, or both are NULL A <=> B BETWEEN ... AND This specifies that the left operand is within a range of values A BETWEEN B AND C NOT BETWEEN ... AND This indicates that the left operand is outside the specified range A NOT BETWEEN B AND C IN This denotes that the left operand is one of the items in a given list A IN (B, C, D) NOT IN This indicates that the left operand is not in the given list A NOT IN (B, C, D) Here are a couple of examples: SELECT id FROM product WHERE price BETWEEN 100 AND 200; DELETE FROM product WHERE id IN (100, 101, 102); Special attention should be paid to NULL values. Almost all the preceding operators return NULL if any of their operands is NULL. The reason is quite clear, that is, as NULL represents an unknown value, any operation involving a NULL operand returns an unknown result. However, there are some operators specifically designed to work with NULL values. IS NULL and IS NOT NULL checks whether the operand is NULL. The <=> operator is a shortcut for the following code: a = b OR (a IS NULL AND b IS NULL) String operators MariaDB supports certain comparison operators that are specifically designed to work with string values. This does not mean that other operators does not work well with strings. For example, A = B perfectly works if A and B are strings. However, some particular comparisons only make sense with text values. Let's take a look at them. The LIKE operator and its variants This operator is often used to check whether a string starts with a given sequence of characters, if it ends with that sequence, or if it contains the sequence. More generally, LIKE checks whether a string follows a given pattern. 
Its syntax is: <string_value> LIKE <pattern> The pattern is a string that can contain the following wildcard characters: _ (underscore) means: This specifies any character %: This denotes any sequence of 0 or more characters There is also a way to include these characters without their special meaning: the _ and % sequences represent the a_ and a% characters respectively. For example, take a look at the following expressions: my_text LIKE 'h_' my_text LIKE 'h%' The first expression returns 1 for 'hi', 'ha', or 'ho', but not for 'hey'. The second expression returns 1 for all these strings, including 'hey'. By default, LIKE is case insensitive, meaning that 'abc' LIKE 'ABC' returns 1. Thus, it can be used to perform a case insensitive equality check. To make LIKE case sensitive, the following BINARY keyword can be used: my_text LIKE BINARY your_text The complement of LIKE is NOT LIKE, as shown in the following code: <string_value> NOT LIKE <pattern> Here are the most common uses for LIKE: my_text LIKE 'my%' -- does my_text start with 'my'? my_text LIKE '%my' -- does my_text end with 'my'? my_text LIKE '%my%' -- does my_text contain 'my'? More complex uses are possible for LIKE. For example, the following expression can be used to check whether mail is a valid e-mail address: mail LIKE '_%@_%.__%' The preceding code snippet checks whether mail contains at least one character, a '@' character, at least one character, a dot, at least two characters in this order. In most cases, an invalid e-mail address will not pass this test. Using regular expressions with the REGEXP operator and its variants Regular expressions are string patterns that contain a meta character with special meanings in order to perform match operations and determine whether a given string matches the given pattern or not. The REGEXP operator is somewhat similar to LIKE. It checks whether a string matches a given pattern. However, REGEXP uses regular expressions with the syntax defined by the POSIX standard. Basically, this means that: Many developers, but not all, already know their syntax REGEXP uses a very expressive syntax, so the patterns can be much more complex and detailed REGEXP is much slower than LIKE; this should be preferred when possible The regular expressions syntax is a complex topic, and it cannot be covered in this article. Developers can learn about regular expressions at www.regular-expressions.info. The complement of REGEXP is NOT REGEXP. Logical operators Logical operators can be used to combine truth expressions that form a compound expression that can be true, false, or NULL. Depending on the truth values of its operands, a logical operator can return 1 or 0. MariaDB supports the following logical operators: NOT; AND; OR; XOR The NOT operator NOT is the only logical operator that takes one operand. It inverts its truth value. If the operand is true, NOT returns 0, and if the operand is false, NOT returns 1. If the operand is NULL, NOT returns NULL. Here is an example: NOT 1 -- returns 0 NOT 0 -- returns 1 NOT 1 = 1 -- returns 0 NOT 1 = NULL -- returns NULL NOT 1 <=> NULL -- returns 0 The AND operator AND returns 1 if both its operands are true and 0 in all other cases. Here is an example: 1 AND 1 -- returns 1 0 AND 1 -- returns 0 0 AND 0 -- returns 0 The OR operator OR returns 1 if at least one of its operators is true or 0 if both the operators are false. Here is an example: 1 OR 1 -- returns 1 0 OR 1 -- returns 1 0 OR 0 -- returns 0 The XOR operator XOR stands for eXclusive OR. 
The XOR operator

XOR stands for eXclusive OR. It is the least used logical operator. It returns 1 if exactly one of its operands is true, and 0 if both operands are true or both are false. Take a look at the following example:

1 XOR 1 -- returns 0
1 XOR 0 -- returns 1
0 XOR 1 -- returns 1
0 XOR 0 -- returns 0

A XOR B is the equivalent of the following expression:

(A OR B) AND NOT (A AND B)

Or:

(NOT A AND B) OR (A AND NOT B)

Arithmetic operators

MariaDB supports the operators that are necessary to execute all the basic arithmetic operations. The supported arithmetic operators are:

+ for addition
- for subtraction
* for multiplication
/ for division

Remember that, depending on the MariaDB configuration, a division by 0 either raises an error or returns NULL. In addition, two more operators are useful for divisions:

DIV: This returns the integer part of a division, without any decimal part or remainder
MOD or %: This returns the remainder of a division

Here is an example:

MariaDB [(none)]> SELECT 20 DIV 3 AS int_part, 20 MOD 3 AS modulus;
+----------+---------+
| int_part | modulus |
+----------+---------+
|        6 |       2 |
+----------+---------+
1 row in set (0.00 sec)

Operators precedence

MariaDB does not blindly evaluate an expression from left to right. Every operator has a given precedence. An operator that is evaluated before another one is said to have a higher precedence. In general, arithmetic and string operators have a higher priority than logical operators. The precedence of arithmetic operators reflects their precedence in common mathematical expressions. It is very important to remember the precedence of the logical operators (from the highest to the lowest):

NOT
AND
XOR
OR

MariaDB supports many operators, and we did not discuss all of them. Also, the exact precedence can vary slightly depending on the MariaDB configuration. The complete precedence can be found in the MariaDB KnowledgeBase, at https://mariadb.com/kb/en/mariadb/documentation/functions-and-operators/operator-precedence/. Parentheses can be used to force MariaDB to follow a certain order. They are also useful when we do not remember the exact precedence of the operators that we are using, as shown in the following code:

(NOT (a AND b)) OR c OR d

Summary

In this article you learned about the basic transactions and operators.

Resources for Article:

Further resources on this subject:

Set Up MariaDB [Article]
Installing MariaDB on Windows and Mac OS X [Article]
Building a Web Application with PHP and MariaDB – Introduction to caching [Article]

Developing Location-based Services with Neo4j

Packt
02 Jun 2015
22 min read
In this article by Ankur Goel, author of the book, Neo4j Cookbook, we will cover the following recipes: Installing the Neo4j Spatial extension Importing the Esri shapefiles Importing the OpenStreetMap files Importing data using the REST API Creating a point layer using the REST API Finding geometries within the bounding box Finding geometries within a distance Finding geometries within a distance using Cypher By definition, any database that is optimized to store and query the data that represents objects defined in a geometric space is called a spatial database. Although Neo4j is primarily a graph database, due to the importance of geospatial data in today's world, the spatial extension has been introduced in Neo4j as an unmanaged extension. It gives you most of the facilities, which are provided by common geospatial databases along with the power of connectivity through edges, which Neo4j, as a graph database, provides. In this article, we will take a look at some of the widely used use cases of Neo4j as a spatial database, and you will learn how typical geospatial operations can be performed on it. Before proceeding further, you need to install Neo4j on your system using one of the recipies that follow here. The installation will depend on your system type. (For more resources related to this topic, see here.) Single node installation of Neo4j over Linux Neo4j is a highly scalable graph database that runs over all the common platforms; it can be used as is or can be embedded inside applications as well. The following recipe will show you how to set up a single instance of Neo4j over the Linux operating system. Getting ready Perform the following steps to get started with this recipe: Download the community edition of Neo4j from http://www.neo4j.org/download for the Linux platform: $ wget http://dist.neo4j.org/neo4j-community-2.2.0-M02-unix.tar.gz Check whether JDK/JRE is installed for your operating system or not by typing this in the shell prompt: $ echo $JAVA_HOME If this command throws no output, install Java for your Linux distribution and also set the JAVA_HOME path How to do it... Now, let's install Neo4j over the Linux operating system, which is simple, as shown in the following steps: Extract the TAR file using the following command: $ tar –zxvf neo4j-community-2.2.0-M02-unix.tar.gz $ ls Go to the bin directory under the root folder: $ cd <neo4j-community-2.2.0-M02>/bin/ Start the Neo4j graph database server: $ ./neo4j start Check whether Neo4j is running or not using the following command: $ ./neo4j status Neo4j can also be monitored using the web console. Open http://<ip>:7474/webadmin, as shown in the following screenshot: The preceding diagram is a screenshot of the web console of Neo4j through which the server can be monitored and different Cypher queries can be run over the graph database. How it works... Neo4j comes with prebuilt binaries over the Linux operating system, which can be extracted and run over. Neo4j comes with both web-based and terminal-based consoles, over which the Neo4j graph database can be explored. See also During installation, you may face several kind of issues, such as max open files and so on. For more information, check out http://neo4j.com/docs/stable/server-installation.html#linux-install. Single node installation of Neo4j over Windows Neo4j is a highly scalable graph database that runs over all the common platforms; it can be used as is or can be embedded inside applications. 
The following recipe will show you how to set up a single instance of Neo4j over the Windows operating system. Getting ready Perform the following steps to get started with this recipe: Download the Windows installer from http://www.neo4j.org/download. This has both 32- and 64-bit prebuilt binaries Check whether Java is installed for the operating system or not by typing this in the cmd prompt: echo %JAVA_HOME% If this command throws no output, install Java for your Windows distribution and also set the JAVA_HOME path How to do it... Now, let's install Neo4j over the Windows operating system, which is simple, as shown here: Run the installer by clicking on the downloaded file: The preceding screenshot shows the Windows installer running. After the installation is complete, when you run the software, it will ask for the database location. Carefully choose the location as the entire graph database will be stored in this folder: The preceding screenshot shows the Windows installer asking for the graph database's location. The Neo4j browser can be opened by entering http://localhost:7474/ in the browser. The following screenshot depicts Neo4j started over the Windows platform: How it works... Neo4j comes with prebuilt binaries over the Windows operating system, which can be extracted and run over. Neo4j comes with both web-based and terminal-based consoles, over which the Neo4j graph database can be explored. See also During installation, you might face several kinds of issues such as max open files and so on. For more information, check out http://neo4j.com/docs/stable/server-installation.html#windows-install. Single node installation of Neo4j over Mac OS X Neo4j is a highly scalable graph database that runs over all the common platforms; it can be used as in a mode and can also be embedded inside applications. The following recipe will show you how to set up a single instance of Neo4j over the OS X operating system. Getting ready Perform the following steps to get started with this recipe: Download the binary version of Neo4j from http://www.neo4j.org/download for the Mac OS X platform and the community edition as shown in the following command: $ wget http://dist.neo4j.org/neo4j-community-2.2.0-M02-unix.tar.gz Check whether Java is installed for the operating system or not by typing this over the cmd prompt: $ echo $JAVA_HOME If this command throws no output, install Java for your Mac OS X distribution and also set the JAVA_HOME path How to do it... Now, let's install Neo4j over the OS X operating system, which is very simple, as shown in the following steps: Extract the TAR file using the following command: $ tar –zxvf neo4j-community-2.2.0-M02-unix.tar.gz $ ls Go to the bin directory under the root folder: $ cd <neo4j-community-2.2.0-M02>/bin/ Start the Neo4j graph database server: $ ./neo4j start Check whether Neo4j is running or not using the following command: $ ./neo4j status How it works... Neo4j comes with prebuilt binaries over the OS X operating system, which can be extracted and run over. Neo4j comes with both web-based and terminal-based consoles, over which the Neo4j graph database can be explored. There's more… Neo4j over Mac OS X can also be installed using brew, which has been explained here. Run the following commands over the shell: $ brew update $ brew install neo4j After this, Neo4j can be started using the start option with the Neo4j command: $ neo4j start This will start the Neo4j server, which can be accessed from the default URL (http://localhost:7474). 
The installation can be reached using the following commands:

$ cd /usr/local/Cellar/neo4j/
$ cd {NEO4J_VERSION}/libexec/

You can learn more about OS X installation from http://neo4j.com/docs/stable/server-installation.html#osx-install. Due to the limited space available in this article, we assume you already know how to perform basic operations with Neo4j, such as creating a graph, importing data from different formats into Neo4j, and the common configurations used for Neo4j.

Installing the Neo4j Spatial extension

Neo4j Spatial is a library of utilities for Neo4j that enables spatial operations on your data. Geospatial indexes can be added even to existing data, and many geospatial operations can then be performed on it. In this recipe, you will learn how to install the Neo4j Spatial extension.

Getting ready

The following steps will get you started with this recipe:

Install Neo4j using the earlier recipes in this article.
Install the dependencies listed in the pom.xml file for this project from https://github.com/neo4j-contrib/spatial/blob/master/pom.xml.
Install Maven using the following command for your operating system:
For Debian systems: apt-get install maven
For Redhat/Centos systems: yum install apache-maven
To install on a Windows-based system, please refer to https://maven.apache.org/guides/getting-started/windows-prerequisites.html.

How to do it...

Now, let's install the Neo4j Spatial plugin, which is very simple to do, by following these steps:

Clone the GitHub repository for the spatial extension:
git clone git://github.com/neo4j/spatial spatial
Move into the spatial directory:
cd spatial
Build the code using Maven. This will download all the dependencies, compile the library, run the tests, and install the artifacts in the local repository:
mvn clean install
Move the built artifact to the Neo4j plugin directory:
unzip target/neo4j/neo4j-spatial-0.11-SNAPSHOT-server-plugin.zip -d $NEO4J_ROOT_DIR/plugins/
Restart the Neo4j graph database server:
$NEO4J_ROOT_DIR/bin/neo4j restart
Check whether the Neo4j Spatial plugin is properly installed or not:
curl -L http://<neo4j_server_ip>:<port>/db/data
If you are using Neo4j 2.2 or higher, then use the following command:
curl --user neo4j:<password> http://localhost:7474/db/data/
The output will look like what is shown in the following screenshot, which shows the Neo4j Spatial plugin installed:

How it works...

Neo4j Spatial is a library of utilities that helps perform geospatial operations on the dataset present in the Neo4j graph database. You can add geospatial indexes on the existing data and perform operations such as finding data within a specified region or within some distance of a point of interest. Neo4j Spatial comes as an unmanaged extension, which can be easily installed as well as removed from Neo4j. The extension does not interfere with any of the core functionality.

There's more…

To read more about the Neo4j Spatial extension, we encourage you to visit the GitHub repository at https://github.com/neo4j-contrib/spatial. It is also good to read about Neo4j unmanaged extensions in general (http://neo4j.com/docs/stable/server-unmanaged-extensions.html).

Importing the Esri shapefiles

The shapefile format is a popular geospatial vector data format for Geographic Information System (GIS) software. It is developed and regulated by Esri as an open specification for data interoperability among Esri and other GIS software products.
It is very popular among GIS products, and many times, the data format is in Esri shapefiles. The main file is the .shp file, which contains the geometry data. The binary data file consists of a single, fixed-length header followed by variable-length data records. In this recipe, you will learn how to import the Esri shapefiles in the Neo4j graph database. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server using the following command: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... Since the Esri shapefile format is, by default, supported by the Neo4j Spatial extension, it is very easy to import data using the Java API from it using the following steps: Download a sample .shp file from http://www.statsilk.com/maps/download-free-shapefile-maps. Execute the following commands: wget http://biogeo.ucdavis.edu/data/diva/adm/AFG_adm.zip unzip AFG_adm.zip mv AFG_adm1.* /data The ShapefileImporter method lets you import data from the Esri shapefile using the following code: GraphDatabaseService esri_database = new GraphDatabaseFactory().newEmbeddedDatabase(storeDir); try {    ShapefileImporter importer = new ShapefileImporter(esri_database); importer.importFile("/data/AFG_adm1.shp", "layer_afganistan");        } finally {            esri_database.shutdown(); } Using similar code, we can import multiple SHP files into the same layer or different layers, as shown in the following code snippet: File dir = new File("/data");      FilenameFilter filter = new FilenameFilter() {          public boolean accept(File dir, String name) {      return name.endsWith(".shp"); }};   File[] listOfFiles = dir.listFiles(filter); for (final File fileEntry : listOfFiles) {      System.out.println("FileEntry Directory "+fileEntry);    try {    importer.importFile(fileEntry.toString(), "layer_afganistan"); } catch(Exception e){    esri_database.shutdown(); }} How it works... The Neo4j Spatial extension natively supports the import of data in the shapefile format. Using the ShapefileImporter method, any SHP file can be easily imported into Neo4j. The ShapefileImporter method takes two arguments: the first argument is the path to the SHP files and the second is the layer in which it should be imported. There's more… We will encourage you to read more about shapefiles and layers in general; for this, please visit the following URLs for more information: http://en.wikipedia.org/wiki/Shapefile http://wiki.openstreetmap.org/wiki/Shapefiles http://www.gdal.org/drv_shapefile.html Importing the OpenStreetMap files OpenStreetMap is a powerhouse of data when it comes to geospatial data. It is a collaborative project to create a free, editable map of the world. OpenStreetMap provides geospatial data in the .osm file format. To read more about .osm files in general, check out http://wiki.openstreetmap.org/wiki/.osm. In this recipe, you will learn how to import the .osm files in the Neo4j graph database. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... 
Since the OSM file format is, by default, supported by the Neo4j Spatial extension, it is very easy to import data from it using the following steps: Download one sample .osm file from http://wiki.openstreetmap.org/wiki/Planet.osm#Downloading. Execute the following commands: wget http://download.geofabrik.de/africa-latest.osm.bz2 bunzip2 africa-latest.osm.bz2 mv africa-latest.osm /data The importfile method lets you import data from the .osm file, as shown in the following code snippet: OSMImporter importer = new OSMImporter("africa"); try {    importer.importFile(osm_database, "/data/botswana-latest.osm", false, 5000, true); } catch(Exception e){    osm_database.shutdown(); } importer.reIndex(osm_database,10000); Using similar code, we can import multiple OSM files into the same layer or different layers, as shown here: File dir = new File("/data");      FilenameFilter filter = new FilenameFilter() {          public boolean accept(File dir, String name) {      return name.endsWith(".osm"); }}; File[] listOfFiles = dir.listFiles(filter); for (final File fileEntry : listOfFiles) {      System.out.println("FileEntry Directory "+fileEntry);    try {importer.importFile(osm_database, fileEntry.toString(), false, 5000, true); importer.reIndex(osm_database,10000); } catch(Exception e){    osm_database.shutdown(); } How it works... This is slightly more complex as it requires two phases: the first phase requires a batch inserter performing insertions into the database, and the second phase requires reindexing of the database with the spatial indexes. There's more… We will encourage you to read more about the OSM file and the batch inserter in general; for this, visit the following URLs: http://en.wikipedia.org/wiki/OpenStreetMap http://wiki.openstreetmap.org/wiki/OSM_file_formats http://neo4j.com/api_docs/2.0.2/org/neo4j/unsafe/batchinsert/BatchInserter.html Importing data using the REST API The recipes that you have learned until now consist of Java code, which is used to import spatial data into Neo4j. However, by using any other programming language, such as Python or Ruby, spatial data can be easily imported into Neo4j using the REST interface. In this recipe, you will learn how to import geospatial data using the REST interface. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... Using the REST interface is a very simple three-stage process to import the geospatial data into the Neo4j graph database server. 
For the sake of simplicity, Python code has been used to explain this recipe, although you can also use curl:

Create the spatial index, as shown in the following code:

# Create geom index
import json
import urllib2

url = "http://<neo4j_server_ip>:<port>/db/data/index/node/"
payload = {
    "name" : "geom",
    "config" : {
        "provider" : "spatial",
        "geometry_type" : "point",
        "lat" : "lat",
        "lon" : "lon"
    }
}
req = urllib2.Request(url)
req.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(req, json.dumps(payload))

Create nodes with lat/lng and data as properties, as shown in the following code:

url = "http://<neo4j_server_ip>:<port>/db/data/node"
payload = {'lon': 38.6, 'lat': 67.88, 'name': 'abc'}
req = urllib2.Request(url)
req.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(req, json.dumps(payload))
node = json.loads(response.read())['self']

Add the node created in the preceding code to the geospatial index, as shown in the following code snippet:

# add node to geom index
url = "http://<neo4j_server_ip>:<port>/db/data/index/node/geom"
payload = {'value': 'dummy', 'key': 'dummy', 'uri': node}
req = urllib2.Request(url)
req.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(req, json.dumps(payload))
print response.read()

The data will look like what is shown in the following screenshot after the addition of a few more nodes; this screenshot depicts the Neo4j Spatial data that has been imported:

The following screenshot depicts the properties of a single node that has been imported into Neo4j:

How it works...

Adding geospatial data using the REST API is a three-step process, listed as follows:

Create a geospatial index using an endpoint, by following this URL as a template: http://<neo4j_server_ip>:<port>/db/data/index/node/
Add a node to the Neo4j graph database using an endpoint, by following this URL as a template: http://<neo4j_server_ip>:<port>/db/data/node
Add the created node to the geospatial index using an endpoint, by following this URL as a template: http://<neo4j_server_ip>:<port>/db/data/index/node/geom

There's more…

We encourage you to read more about the spatial REST interfaces in general (http://neo4j-contrib.github.io/spatial/).
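If you need to load many points, the three calls above can be wrapped in a small helper. The following sketch is only an illustration: it assumes the geom index has already been created as shown in step 1, reuses the same property names, and uses the requests library that appears later in this article:

import json
import requests

BASE = "http://<neo4j_server_ip>:<port>/db/data"
HEADERS = {'Content-Type': 'application/json'}

def add_point(lat, lon, name):
    # Step 2: create the node carrying the point's properties
    node_url = requests.post(BASE + "/node", headers=HEADERS,
                             data=json.dumps({'lat': lat, 'lon': lon, 'name': name})).json()['self']
    # Step 3: register the new node in the 'geom' spatial index
    payload = {'value': 'dummy', 'key': 'dummy', 'uri': node_url}
    return requests.post(BASE + "/index/node/geom", headers=HEADERS,
                         data=json.dumps(payload)).json()

add_point(67.88, 38.6, 'abc')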
There's more… We will encourage you to read more about spatial REST interfaces in general; to do this, visit http://neo4j-contrib.github.io/spatial/. Finding geometries within the bounding box In this recipe, you will learn how to find all the geometries within the bounding box using the spatial REST interface. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server using the following command: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... In this recipe, we will use the following endpoint to find all the geometries within the bounding box:http://<neo4j_server_ip>:<port>/db/data/ext/SpatialPlugin/graphdb/findGeometriesInBBox. Let's find the all the geometries, using the following information: "minx" : 0.0, "maxx" : 100.0, "miny" : 0.0, "maxy" : 100.0 url = "http://<neo4j_server_ip>:<port>//db/data/ext/SpatialPlugin/graphdb payload= { "layer" : "geom", "minx" : 0.0, "maxx" : 100.3, "miny" : 0.0, "maxy" : 100.0 } r = requests.post(url, data=json.dumps(payload), headers=headers) The data will look like what is shown in the following screenshot; this screenshot shows the output of the bounding box query: How it works... Finding geometries in the bounding box is based on the REST interface, which the Neo4j Spatial plugin provides. The output of the REST call contains an array of the nodes, containing the node's id, lat/lng, and its incoming/outgoing relationships. In the preceding output, you can see node id54 returned as the output. There's more… We will encourage you to read more about spatial REST interfaces in general; to do this, visit http://neo4j-contrib.github.io/spatial/. Finding geometries within a distance In this recipe, you will learn how to find all the geometries within a distance using the spatial REST interface. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server using the following command: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... In this recipe, we will use the following endpoint to find all the geometries within a certain distance: http://<neo4j_server_ip>:<port>/db/data/ext/SpatialPlugin/graphdb/findGeometriesWithinDistance. Let's find all the geometries between the specified distance using the following information: "pointX" : -116.67, "pointY" : 46.89, "distanceinKm" : 500, url = "http://<neo4j_server_ip>:<port>//db/data/ext/SpatialPlugin/graphdb/findGeometriesWithinDistance payload= { "layer" : "geom", "pointY" : 46.8625, "pointX" : -114.0117, "distanceInKm" : 500, } r = requests.post(url, data=json.dumps(payload), headers=headers) The data will look like what is shown in the following screenshot; this screenshot shows the output of a withinDistance query: How it works... Finding geometries within a distance is based on the REST interface that the Neo4j Spatial plugin provides. The output of the REST call contains an array of the nodes, containing the node's id, lat/lng, and its incoming/outgoing relationships. In the preceding output, we can see node id71 returned as the output. There's more… We encourage you to read more about the spatial REST interfaces in general (http://neo4j-contrib.github.io/spatial/). 
Finding geometries within a distance using Cypher

In this recipe, you will learn how to find all the geometries within a distance using a Cypher query.

Getting ready

Perform the following steps to get started with this recipe:

Install Neo4j using the earlier recipes in this article.
Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article.
Restart the Neo4j graph database server:
$NEO4J_ROOT_DIR/bin/neo4j restart

How to do it...

In this recipe, we will use the following endpoint to find all the geometries within a certain distance: http://<neo4j_server_ip>:<port>/db/data/cypher. Let's find all the geometries within a distance using a Cypher query:

url = "http://<neo4j_server_ip>:<port>/db/data/cypher"
payload = {
    "query" : "START n=node:geom('withinDistance:[46.9163, -114.0905, 500.0]') RETURN n"
}
r = requests.post(url, data=json.dumps(payload), headers=headers)

The data will look like what is shown in the following screenshot; this screenshot shows the output of the withinDistance query that uses Cypher:

The following is the Cypher output in the Neo4j console:

How it works...

Cypher comes with a withinDistance index query, which takes three parameters: lat, lon, and the search distance.

There's more…

We encourage you to read more about the spatial REST interfaces in general (http://neo4j-contrib.github.io/spatial/).

Summary

Developing Location-based Services with Neo4j teaches you how to deal with the most important aspect of today's data, location, in Neo4j. You have learned how to import geospatial data into Neo4j and run queries such as proximity searches, bounding boxes, and so on.

Resources for Article:

Further resources on this subject:

Recommender systems dissected Components [article]
Working with a Neo4j Embedded Database [article]
Differences in style between Java and Scala code [article]

Working with Time

Packt
03 Sep 2013
22 min read
(For more resources related to this topic, see here.) Time handling features are an important part of every BI system. Programming languages, database systems, they all incorporate various time-related functions and Microsoft SQL Server Analysis Services (SSAS) is no exception there. In fact, that's one of its main strengths. The MDX language has various time-related functions designed to work with a special type of dimension called the Time and its typed attributes. While it's true that some of those functions work with any type of dimension, their usefulness is most obvious when applied to time-type dimensions. An additional prerequisite is the existence of multi-level hierarchies, also known as user hierarchies, in which types of levels must be set correctly or some of the time-related functions will either give false results or will not work at all. In this article we're dealing with typical operations, such as year-to-date calculations, running totals, and jumping from one period to another. We go into detail with each operation, explaining known and less known variants and pitfalls. We will discuss why some time calculations can create unnecessary data for the periods that should not have data at all, and why we should prevent it from happening. We will then show you how to prevent time calculations from having values after a certain point in time. In most BI projects, there are always reporting requirements to show measures for today, yesterday, month-to-date, quarter-to-date, year-to-date, and so on. We have three recipes to explore various ways to calculate today's date, and how to turn it into a set and use MDX's powerful set operations to calculate other related periods. Calculating date and time spans is also a common reporting requirement. Calculating the YTD (Year-To-Date) value In this recipe we will look at how to calculate the Year-To-Date value of a measure, that is, the accumulated value of all dates in a year up to the current member on the date dimension. An MDX function YTD() can be used to calculate the Year-To-Date value, but not without its constraints. In this recipe, we will discuss the constraints when using the YTD() function and also the alternative solutions. Getting ready Start SQL Server Management Studio and connect to your SSAS 2012 instance. Click on the New Query button and check that the target database is Adventure Works DW 2012. In order for this type of calculation to work, we need a dimension marked as Time in the Type property, in the Dimension Structure tab of SSDT. That should not be a problem because almost every database contains at least one such dimension and Adventure Works is no exception here. In this example, we're going to use the Date dimension. We can verify in SSDT that the Date dimension's Type property is set to Time. See the following screenshot from SSDT: Here's the query we'll start from: SELECT{ [Measures].[Reseller Sales Amount] } ON 0,{ [Date].[Calendar Weeks].[Calendar Week].MEMBERS } ON 1FROM[Adventure Works] Once executed, the preceding query returns reseller sales values for every week in the database. How to do it... We are going to use the YTD() function, which takes only one member expression, and returns all dates in the year up to the specified member. Then we will use the aggregation function SUM() to sum up the Reseller Sales Amount. Follow these steps to create a calculated measure with YTD calculation: Add the WITH block of the query. 
Create a new calculated measure within the WITH block and name it Reseller Sales YTD. The new measure should return the sum of the measure Reseller Sales Amount using the YTD() function and the current date member of the hierarchy of interest. Add the new measure on axis 0 and execute the complete query: WITH MEMBER [Measures].[Reseller Sales YTD] AS Sum( YTD( [Date].[Calendar Weeks].CurrentMember ), [Measures].[Reseller Sales Amount] ) SELECT { [Measures].[Reseller Sales Amount], [Measures].[Reseller Sales YTD] } ON 0, { [Date].[Calendar Weeks].[Calendar Week].MEMBERS } ON 1 FROM [Adventure Works] The result will include the second column, the one with the YTD values. Notice how the values in the second column increase over time: How it works... The YTD() function returns the set of members from the specified date hierarchy, starting from the first date of the year and ending with the specified member. The first date of the year is calculated according to the level [Calendar Year] marked as Years type in the hierarchy [Calendar Weeks]. In our example, the YTD() value for the member Week 9 CY 2008 is a set of members starting from Week 1 CY 2008 and going up to that member because the upper level containing years is of the Years type. The set is then summed up using the SUM() function and the Reseller Sales Amount measure. If we scroll down, we'll see that the cumulative sum resets every year, which means that YTD() works as expected. In this example we used the most common aggregation function, SUM(), in order to aggregate the values of the measure throughout the calculated set. SUM() was used because the aggregation type of the Reseller Sales Amount measure is Sum. Alternatively, we could have used the Aggregate() function instead. More information about that function can be found later in this recipe. There's more... Sometimes it is necessary to create a single calculation that will work for any user hierarchy of the date dimension. In that case, the solution is to prepare several YTD() functions, each using a different hierarchy, cross join them, and then aggregate that set using a proper aggregation function (Sum, Aggregate, and so on). However, bear in mind that this will only work if all user hierarchies used in the expression share the same year level. In other words, that there is no offset in years among them (such as exists between the fiscal and calendar hierarchies in Adventure Works cube in 2008 R2). Why does it have to be so? Because the cross join produces the set intersection of members on those hierarchies. Sets are generated relative to the position when the year starts. If there is offset in years, it is possible that sets won't have an intersection. In that case, the result will be an empty space. Now let's continue with a couple of working examples. Here's an example that works for both monthly and weekly hierarchies: WITHMEMBER [Measures].[Reseller Sales YTD] ASSum( YTD( [Date].[Calendar Weeks].CurrentMember ) *YTD( [Date].[Calendar].CurrentMember ),[Measures].[Reseller Sales Amount] )SELECT{ [Measures].[Reseller Sales Amount],[Measures].[Reseller Sales YTD] } ON 0,{ [Date].[Calendar Weeks].[Calendar Week].MEMBERS } ON 1FROM[Adventure Works] If we replace [Date].[Calendar Weeks].[Calendar Week].MEMBERS with [Date].[Calendar].[Month].MEMBERS, the calculation will continue to work. Without the cross join part, that wouldn't be the case. Try it in order to see for yourself! 
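If you would like to try the swap mentioned above without retyping the statement, the month-level variant of the query looks like this (only the set on rows changes):

WITH
MEMBER [Measures].[Reseller Sales YTD] AS
  Sum( YTD( [Date].[Calendar Weeks].CurrentMember ) *
       YTD( [Date].[Calendar].CurrentMember ),
       [Measures].[Reseller Sales Amount] )
SELECT
  { [Measures].[Reseller Sales Amount],
    [Measures].[Reseller Sales YTD] } ON 0,
  { [Date].[Calendar].[Month].MEMBERS } ON 1
FROM
  [Adventure Works]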
Just be aware that if you slice by additional attribute hierarchies, the calculation might become wrong. In short, there are many obstacles to getting the time-based calculation right. It partially depends on the design of the time dimension (which attributes exist, which are hidden, how the relations are defined, and so on), and partially on the complexity of the calculations provided and their ability to handle various scenarios. A better place to define time-based calculation is the MDX script. There, we can define scoped assignments, but that's a separate topic which will be covered later in the recipe, Using utility dimension to implement time-based calculations. In the meantime, here are some articles related to that topic: http://tinyurl.com/MoshaDateCalcs http://tinyurl.com/DateToolDim Inception-To-Date calculation A similar calculation is the Inception-To-Date calculation in which we're calculating the sum of all dates up to the current member, that is, we do not perform a reset at the beginning of every year. In that case, the YTD() part of the expression should be replaced with this: Null : [Date].[Calendar Weeks].CurrentMember Using the argument in the YTD() function The argument of the YTD() function is optional. When not specified, the first dimension of the Time type in the measure group is used. More precisely, the current member of the first user hierarchy with a level of type Years. This is quite convenient in the case of a simple Date dimension; a dimension with a single user hierarchy. In the case of multiple hierarchies or a role-playing dimension, the YTD() function might not work, if we forget to specify the hierarchy for which we expect it to work. This can be easily verified. Omit the [Date].[Calendar Weeks].CurrentMember part in the initial query and see that both columns return the same values. The YTD() function is not working anymore. Therefore, it is best to always use the argument in the YTD() function. Common problems and how to avoid them In our example we used the [Date].[Calendar Weeks] user hierarchy. That hierarchy has the level Calendar Year created from the same attribute. The type of attribute is Years, which can be verified in the Properties pane of SSDT: However, the Date dimension in the Adventure Works cube has fiscal attributes and user hierarchies built from them as well. The fiscal hierarchy equivalent to [Date].[Calendar Weeks] hierarchy is the [Date].[Fiscal Weeks] hierarchy. There, the top level is named Fiscal Year, created from the same attribute. This time, the type of the attribute is FiscalYear, not Year. If we exchange those two hierarchies in our example query, the YTD() function will not work on the new hierarchy. It will return an error: The name of the solution is the PeriodsToDate() function. YTD() is in fact a short version of the PeriodsToDate() function, which works only if the Year type level is specified in a user hierarchy. When it is not so (that is, some BI developers tend to forget to set it up correctly or in the case that the level is defined as, let's say, FiscalYear like in this test), we can use the PeriodsToDate() function as follows: MEMBER [Measures].[Reseller Sales YTD] ASSum( PeriodsToDate( [Date].[Fiscal Weeks].[Fiscal Year],[Date].[Fiscal Weeks].CurrentMember ),[Measures].[Reseller Sales Amount] ) PeriodsToDate() might therefore be used as a safer variant of the YTD() function. 
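To run the PeriodsToDate() variant above as a standalone statement, it only needs the usual query around it. Here is a sketch against the fiscal weeks hierarchy; the week-level name [Fiscal Week] is assumed to mirror the calendar hierarchy, so adjust it if your cube names that level differently:

WITH
MEMBER [Measures].[Reseller Sales YTD] AS
  Sum( PeriodsToDate( [Date].[Fiscal Weeks].[Fiscal Year],
                      [Date].[Fiscal Weeks].CurrentMember ),
       [Measures].[Reseller Sales Amount] )
SELECT
  { [Measures].[Reseller Sales Amount],
    [Measures].[Reseller Sales YTD] } ON 0,
  { [Date].[Fiscal Weeks].[Fiscal Week].MEMBERS } ON 1
FROM
  [Adventure Works]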
YTD() and future dates It's worth noting that the value returned by a SUM-YTD combination is never empty once a value is encountered in a particular year. Only the years with no values at all will remain completely blank for all their descendants. In our example with the [Calendar Weeks] hierarchy, scrolling down to the Week 23 CY 2008, you will see that this is the last week that has reseller sales. However, the Year-To-Date value is not empty for the rest of the weeks for year 2008, as shown in the following screenshot: This can cause problems for the descendants of the member that represents the current year (and future years as well). The NON EMPTY keyword will not be able to remove empty rows, meaning we'll get YTD values in the future. We might be tempted to use the NON_EMPTY_BEHAVIOR operator to solve this problem but it wouldn't help. Moreover, it would be completely wrong to use it, because it is only a hint to the engine which may or may not be used. It is not a mechanism for removing empty values. In short, we need to set some rows to null, those positioned after the member representing today's date. We'll cover the proper approach to this challenge in the recipe, Finding the last date with data. Calculating the YoY (Year-over-Year) growth (parallel periods) This recipe explains how to calculate the value in a parallel period, the value for the same period in a previous year, previous quarter, or some other level in the date dimension. We're going to cover the most common scenario – calculating the value for the same period in the previous year, because most businesses have yearly cycles. A ParallelPeriod() is a function that is closely related to time series. It returns a member from a prior period in the same relative position as a specified member. For example, if we specify June 2008 as the member, Year as the level, and 1 as the lag, the ParallelPeriod() function will return June 2007. Once we have the measure from the prior parallel period, we can calculate how much the measure in the current period has increased or decreased with respect to the parallel period's value. Getting ready Start SQL Server Management Studio and connect to your SSAS 2012 instance. Click on the New Query button, and check that the target database is Adventure Works DW 2012. In this example we're going to use the Date dimension. Here's the query we'll start from: SELECT{ [Measures].[Reseller Sales Amount] } ON 0,{ [Date].[Fiscal].[Month].MEMBERS } ON 1FROM[Adventure Works] Once executed, the previous query returns the value of Reseller Sales Amount for all fiscal months. How to do it... Follow these steps to create a calculated measure with YoY calculation: Add the WITH block of the query. Create a new calculated measure there and name it Reseller Sales PP. The new measure should return the value of the measure Reseller Sales Amount measure using the ParallelPeriod() function. In other words, the definition of the new measure should be as follows: MEMBER [Measures].[Reseller Sales PP] As( [Measures].[Reseller Sales Amount],ParallelPeriod( [Date].[Fiscal].[Fiscal Year], 1,[Date].[Fiscal].CurrentMember ) ) Specify the format string property of the new measure to match the format of the original measure. In this case that should be the currency format. Create the second calculated measure and name it Reseller Sales YoY %. The definition of that measure should be the ratio of the current member's value against the parallel period member's value. 
Be sure to handle potential division by zero errors (see the recipe Handling division by zero errors). Include both calculated measures on axis 0 and execute the query, which should look like: WITHMEMBER [Measures].[Reseller Sales PP] As( [Measures].[Reseller Sales Amount],ParallelPeriod( [Date].[Fiscal].[Fiscal Year], 1,[Date].[Fiscal].CurrentMember ) ), FORMAT_STRING = 'Currency'MEMBER [Measures].[Reseller Sales YoY %] Asiif( [Measures].[Reseller Sales PP] = 0, null,( [Measures].[Reseller Sales Amount] /[Measures].[Reseller Sales PP] ) ), FORMAT_STRING = 'Percent'SELECT{ [Measures].[Reseller Sales Amount],[Measures].[Reseller Sales PP],[Measures].[Reseller Sales YoY %] } ON 0,{ [Date].[Fiscal].[Month].MEMBERS } ON 1FROM[Adventure Works] The result will include two additional columns, one with the PP values and the other with the YoY change. Notice how the values in the second column repeat over time and that YoY % ratio shows the growth over time: How it works... The ParallelPeriod() function takes three arguments, a level expression, an index, and a member expression, and all three arguments are optional. The first argument indicates the level on which to look for that member's ancestor, typically the year level like in this example. The second argument indicates how many members to go back on the ancestor's level, typically one, as in this example. The last argument indicates the member for which the function is to be applied. Given the right combination of arguments, the function returns a member that is in the same relative position as a specified member, under a new ancestor. The value for the parallel period's member is obtained using a tuple which is formed with a measure and the new member. In our example, this represents the definition of the PP measure. The growth is calculated as the ratio of the current member's value over the parallel period member's value, in other words, as a ratio of two measures. In our example, that was YoY % measure. In our example we've also taken care of a small detail, setting the FORMAT_STRING to Percent. There's more... The ParallelPeriod() function is very closely related to time series, and typically used on date dimensions. However, it can be used on any type of dimension. For example, this query is perfectly valid: SELECT{ [Measures].[Reseller Sales Amount] } ON 0,{ ParallelPeriod( [Geography].[Geography].[Country],2,[Geography].[Geography].[State-Province].&[CA]&[US] ) } ON 1FROM[Adventure Works] The query returns Hamburg on rows, which is the third state-province in the alphabetical list of states-provinces under Germany. Germany is two countries back from the USA, whose member California, used in this query, is the third state-province underneath that country in the Geography.Geography user hierarchy. We can verify this by browsing the Geography user hierarchy in the Geography dimension in SQL Server Management Studio, as shown in the following screenshot. The UK one member back from the USA, has only one state-province: England. If we change the second argument to 1 instead, we'll get nothing on rows because there's no third state-province under the UK. Feel free to try it: All arguments of the ParallelPeriod() function are optional. When not specified, the first dimension of type Time in the measure group is used, more precisely, the previous member of the current member's parent. This can lead to unexpected results as discussed in the previous recipe. 
Therefore, it is recommended that you use all the arguments of the ParallelPeriod() function. ParallelPeriod is not a time-aware function The ParallelPeriod() function simply looks for the member from the prior period based on its relative position to its ancestor. For example, if your hierarchy is missing the first six months in the year 2005, for member January 2006, the function will find July 2005 as its parallel period (lagging by one year) because July is indeed the first month in the year 2005. This is exactly the case in Adventure Works DW SSAS prior to 2012. You can test the following scenario in Adventure Works DW SSAS 2008 R2. In our example we used the [Date].[Fiscal] user hierarchy. That hierarchy has all 12 months in every year which is not the case with the [Date].[Calendar] user hierarchy where there's only six months in the first year. This can lead to strange results. For example, if you search-replace the word "Fiscal" with the word "Calendar" in the query we used in this recipe, you'll get this as the result: Notice how the values are incorrect for the year 2006. That's because the ParallelPeriod() function is not a time-aware function, it merely does what it's designed for taking the member that is in the same relative position. Gaps in your time dimension are another potential problem. Therefore, always make the complete date dimensions, with all 12 months in every year and all dates in them, not just working days or similar shortcuts. Remember, Analysis Services isn't doing the date math. It's just navigating using the member's relative position. Therefore, make sure you have laid a good foundation for that. However, that's not always possible. There's an offset of six months between fiscal and calendar years, meaning if you want both of them as date hierarchies, you have a problem; one of them will not have all of the months in the first year. The solution is to test the current member in the calculation and to provide a special logic for the first year, fiscal or calendar; the one that doesn't have all months in it. This is most efficiently done with a scope statement in the MDX script. Another problem in calculating the YoY value is leap years. Calculating moving averages The moving average, also known as the rolling average, is a statistical technique often used in events with unpredictable short-term fluctuations in order to smooth their curve and to visualize the pattern of behavior. The key to get the moving average is to know how to construct a set of members up to and including a specified member, and to get the average value over the number of members in the set. In this recipe, we're going to look at two different ways to calculate moving averages in MDX. Getting ready Start SQL Server Management Studio and connect to your SSAS 2012 instance. Click on the New Query button and check that the target database is Adventure Works DW 2012. In this example we're going to use the Date hierarchy of the Date dimension. Here's the query we'll start from: SELECT{ [Measures].[Internet Order Count] } ON 0,{ [Date].[Date].[Date].MEMBERS} ON 1FROM[Adventure Works] Execute it. The result shows the count of Internet orders for each date in the Date.Date attribute hierarchy. Our task is to calculate the simple moving average (SMA) for dates in the year 2008 based on the count of orders in the previous 30 days. How to do it... 
We are going to use the LastPeriods() function with a 30 day moving window, and a member expression, [Date].[Date].CurrentMember, as two parameters, and also the AVG() function, to calculate the moving average of Internet order count in the last 30 days. Follow these steps to calculate moving averages: Add the WHERE part of the query and put the year 2006 inside using any available hierarchy. Add the WITH part and define a new calculated measure. Name it SMA 30. Define that measure using the AVG() and LastPeriods() functions. Test to see if you get a managed query similar to this. If so, execute it: WITHMEMBER [Measures].[SMA 30] ASAvg( LastPeriods( 30, [Date].[Date].CurrentMember ),[Measures].[Internet Order Count] )SELECT{ [Measures].[Internet Order Count],[Measures].[SMA 30] } ON 0,{ [Date].[Date].[Date].MEMBERS } ON 1FROM[Adventure Works]WHERE( [Date].[Calendar Year].&[2008] ) The second column in the result set will represent the simple moving average based on the last 30 days. Our final result will look like the following screenshot: How it works... The moving average is a calculation that uses the moving window of N items for which it calculates the statistical mean, that is, the average value. The window starts with the first item and then progressively shifts to the next one until the whole set of items is passed. The function that acts as the moving window is the LastPeriods() function. It returns N items, in this example, 30 dates. That set is then used to calculate the average orders using the AVG() function. Note that the number of members returned by the LastPeriods() function is equal to the span, 30, starting with the member that lags 30 - 1 from the specified member expression, and ending with the specified member. There's more... Another way of specifying what the LastPeriods() function does is to use a range of members with a range-based shortcut. The last member of the range is usually the current member of the hierarchy on an axis. The first member is the N-1th member moving backwards on the same level in that hierarchy, which can be constructed using the Lag(N-1) function. The following expression employing the Lag() function and a range-based shortcut is equivalent to the LastPeriods() in the preceding example: [Date].[Date].CurrentMember.Lag(29) : [Date].[Date].CurrentMember Note that the members returned from the range-based shortcut are inclusive of both the starting member and the ending member. We can easily modify the moving window scope to fit different requirements. For example, in case we need to calculate a 30-day moving average up to the previous member, we can use this syntax: [Date].[Date].CurrentMember.Lag(30) : [Date].[Date].PrevMember The LastPeriods() function is not on the list of optimized functions on this web page: http://tinyurl.com/Improved2008R2. However, tests show no difference in duration with respect to its range alternative. Still, if you come across a situation where the LastPeriods() function performs slowly, try its range alternative. Finally, in case we want to parameterize the expression (for example, to be used in SQL Server Reporting Services), these would be generic forms of the previous expressions: [Date].[Date].CurrentMember.Lag( @span - @offset ) :[Date].[Date].CurrentMember.Lag( @offset ) And LastPeriods( @span, [Date].[Date].CurrentMember.Lag( @offset ) ) The @span parameter is a positive value which determines the size of the window. 
The @offset parameter determines how much the right side of the window is moved from the current member's position. This shift can be either a positive or negative value. The value of zero means there is no shift at all, the most common scenario. Other ways to calculate the moving averages The simple moving average is just one of many variants of calculating the moving averages. A good overview of a possible variant can be found in Wikipedia: http://tinyurl.com/WikiMovingAvg MDX examples of other variants of moving averages can be found in Mosha Pasumansky's blog article: http://tinyurl.com/MoshaMovingAvg Moving averages and the future dates It's worth noting that the value returned by the moving average calculation is not empty for dates in future because the window is looking backwards, so that there will always be values for future dates. This can be easily verified by scrolling down in our example using the LastPeriods() function, as shown in the following screenshot: In this case the NON EMPTY keyword will not be able to remove empty rows. We might be tempted to use NON_EMPTY_BEHAVIOR to solve this problem but it wouldn't help. Moreover, it would be completely wrong. We don't want to set all the empty rows to null, but only those positioned after the member representing today's date. We'll cover the proper approach to this challenge in the following recipes. Summary This article presents various time-related functions in MDX language that are designed to work with a special type of dimension called the Time and its typed attributes. Resources for Article: Further resources on this subject: What are SSAS 2012 dimensions and cube? [Article] Creating an Analysis Services Cube with Visual Studio 2008 - Part 1 [Article] Terms and Concepts Related to MDX [Article]

Text Mining with R: Part 1

Robi Sen
16 Mar 2015
7 min read
R is rapidly becoming the platform of choice for programmers, scientists, and others who need to perform statistical analysis and data mining. In part this is because R is incredibly easy to learn, and with just a few commands you can perform data mining and analysis tasks that would be very hard in more general purpose languages like Ruby, .NET, Java, or C++. To demonstrate R's ease, flexibility, and power, we will look at how to use R to examine a collection of tweets from the 2014 Super Bowl, clean up the data, turn it into a document matrix so we can analyze it, and then create a "word cloud" so we can visualize our analysis and look for interesting words.

Getting Started

To get started you need to download both R and RStudio. R can be found here and RStudio can be found here. R and RStudio are available for most major operating systems, and you should follow the up-to-date installation guides on their respective websites. For this example we are going to be using a data set from Techtunk, which is rather large. For this article I have taken a small excerpt of Techtunk's SuperBowl 2014 collection, over 8 million tweets, and cleaned it up for the article. You can download it from the original data source here. Finally, you will need to install the R text mining package (tm) and word cloud package (wordcloud). You can use the standard install.packages() method to install them or just use RStudio to install the packages.

Preparing our Data

As already stated, you can find the full SuperBowl 2014 dataset. That being said, it's very large and broken up into many pipe-delimited files which have the .csv file extension but are not really .csv, which can be somewhat awkward to work with. This, though, is a common problem when working with large data sets. Luckily the data set is already broken up into fragments: when you are working with large datasets you usually do not want to start developing against the whole data set, but rather against a small, workable subset that lets you quickly develop your scripts without being so large and unwieldy that it delays development. Indeed, you will find that working with the large files provided by Techtunk can take tens of minutes of processing as is. In cases like this it is good to look at the data, figure out what data you want, take a sample set of the data, massage it as needed, and then work from there until you have your code working exactly how you want. In our case I took a subset of 4,600 tweets from one of the pipe-delimited files, converted the file to comma-separated value (.csv) format, and saved it as a sample file to work from. You can do the same thing with a sample of your own, preferably under 5,000 records, or use the file created for this post here.

Visualizing our Data

For this post all we want to do is get a general sense of the most common words being tweeted during the Super Bowl. A common way to visualize this is with a word cloud, which shows the frequency of a term by drawing it larger than other words in proportion to how many times it is mentioned in the body of tweets being analyzed. To do this we need to do a few things with our data first. We need to read in our file and turn our collection of tweets into a corpus. In general, a corpus is a large body of text documents. In R's text mining package, tm, it's an object that will be used to hold our tweets in memory.
So, to load our tweets as a corpus into R you can do the following:

# change this file location to suit your machine
file_loc <- "yourfilelocation/largerset11.csv"
# change TRUE to FALSE if you have no column headings in the CSV
myfile <- read.csv(file_loc, header = TRUE, stringsAsFactors = FALSE)
require(tm)
# build a corpus from the username and tweet columns
mycorpus <- Corpus(DataframeSource(myfile[c("username", "tweet")]))

You can now simply print your corpus to get a sense of it:

> print(mycorpus)
<<VCorpus (documents: 4688, metadata (corpus/indexed): 0/0)>>

In this case, VCorpus is the automatic assignment, meaning that the corpus is a volatile object stored in memory. If you want, you can make the corpus permanent using PCorpus. You might do this if you were doing analysis on actual documents such as PDFs or even databases; in that case R keeps pointers to the documents instead of full document structures in memory.

Another method you can use to look at your corpus is inspect(), which provides a variety of ways to look at the documents in your corpus. For example:

inspect(mycorpus[1:2])

This might give you a result like:

> inspect(mycorpus[1:2])
<<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
sstanley84 wow rt thestanchion news fleury just made fraudulent investment karlsson httptco5oi6iwashg
[[2]]
<<PlainTextDocument (metadata: 7)>>
judemgreen 2 hour wait train superbowl time traffic problemsnice job chris christie

As such, inspect() can be very useful for quickly getting a sense of the data in your corpus without having to print the whole thing.

Now that we have our corpus in memory, let's clean it up a little before we do our analysis. Usually you want to do this on documents you are analyzing to remove words that are not relevant to your analysis, such as "stopwords" (words such as and, like, but, if, the, and the like) which you don't care about. To do this with the text mining package you use transforms. Transforms essentially apply various functions to all documents in a corpus and take the form tm_map(your corpus, some function). For example, we can use tm_map like this:

mycorpus <- tm_map(mycorpus, removePunctuation)

This removes all the punctuation marks from our tweets. We can do some other transforms to clean up our data by converting all the text to lower case, removing stop words, stripping extra whitespace, and the like:

mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, stripWhitespace)
mycorpus <- tm_map(mycorpus, removeWords, c(stopwords("english"), "news"))

Note the last line. There we are using the stopwords() method but also adding our own word to the list: news. You could append your own list of stopwords in this manner.

Summary

In this post we have looked at the basics of doing text mining in R by selecting data, preparing it, cleaning it, and then performing various operations on it to visualize that data. In the next post, Part 2, we look at a simple use case showing how we can derive real meaning and value from a visualization, and how a simple word cloud can help you understand the impact of an advertisement.

About the author

Robi Sen, CSO at Department 13, is an experienced inventor, serial entrepreneur, and futurist whose dynamic twenty-plus year career in technology, engineering, and research has led him to work on cutting edge projects for DARPA, TSWG, SOCOM, RRTO, NASA, DOE, and the DOD.
Robi also has extensive experience in the commercial space, including the co-creation of several successful start-up companies. He has worked with companies such as UnderArmour, Sony, CISCO, IBM, and many others to help build out new products and services. Robi specializes in bringing his unique vision and thought process to difficult and complex problems, allowing companies and organizations to find innovative solutions that they can rapidly operationalize or go to market with.
Excel 2010 Financials: Adding Animations to Excel Graphs

Packt
28 Jun 2011
3 min read
Excel 2010 Financials Cookbook: Powerful techniques for financial organization, analysis, and presentation in Microsoft Excel.

Getting ready

We will start this recipe with a single column graph demonstrating profitability. The dataset for this graph is cell A1. Cell A1 has the formula =A2/100. Cell A2 contains the number 1000.

How to do it...

Press Alt + F11 on the keyboard to open the Excel Visual Basic Editor (VBE). Once in the VBE, choose Insert | Module from the file menu. Enter the following code:

Sub Animate()
    Dim x As Integer
    x = 0
    Range("A2").Value = x
    While x < 1000
        x = x + 1
        Range("A2").Value = Range("A2").Value + 1
        For y = 1 To 500000
        Next
        DoEvents
    Wend
End Sub

The code should be formatted within the VBE code window as shown above. Save your work and close the VBE. Once back at the worksheet, from the Ribbon, choose View | Macros | View Macros. Select the Animate macro and choose Run. The graph for profitability will now drop to 0 and slowly rise to account for the full profitability.

How it works...

The graph was set to use the data within cell A1 as the dataset to graph. As long as there is a value within the cell, the graph shows the column and its corresponding value. If you were to manually change the value of cell A1 and press Enter, the graph would also change to the new value. It is this update event that we harness within the VBA macro that was added.

Sub Animate()
    Dim x As Integer
    x = 0

Here we first declare and set any variables that we will need. The value of x will hold the overall profit.

    Range("A2").Value = x
    While x < 1000
        x = x + 1
        Range("A2").Value = Range("A2").Value + 1
        For y = 1 To 500000
        Next
        DoEvents
    Wend

Next, we choose cell A2 from the worksheet and set its value to x (which begins as 0). Since cell A1 is set to be A2/100, A1 (which is the graph dataset) is now zero. Using a While loop, we add 1 to the value in A2, pause briefly with the empty For loop, and call DoEvents so that Excel can redraw the graph, then repeat. This continues until x is equal to 1000, which, when divided by 100 as in the formula for A1, becomes 10. So the profitability we end with in our animated graph is 10.

End Sub

There's more...

We can change the height of the profitability graph by simply changing the 1000 in the While x < 1000 line. By setting this number to 900, the graph will only grow to a profit level of nine.

Further resources on this subject: Excel 2010 Financials: Using Graphs for Analysis [Article] Load Testing Using Visual Studio 2008 [Article] How to Manage Content in a List in Microsoft Sharepoint [Article] Sage ACT! 2011: Creating a Quick Report [Article] Tips and Tricks: Report Page in IBM Cognos 8 Report Studio [Article]

Parallel Computing

Packt
30 Sep 2016
9 min read
In this article written by Jalem Raj Rohit, author of the book Julia Cookbook, we cover the following recipes:

Basic concepts of parallel computing
Data movement
Parallel map and loop operations
Channels

(For more resources related to this topic, see here.)

Introduction

In this article, you will learn about performing parallel computing and using it to handle big data. Concepts like data movement, sharded arrays, and the map-reduce framework are important to know in order to handle large amounts of data by computing on it using parallelized CPUs. All the concepts discussed in this article will help you build good parallel computing and multiprocessing basics, including efficient data handling and code optimization.

Basic concepts of parallel computing

Parallel computing is a way of dealing with data in a parallel way. This can be done by connecting multiple computers as a cluster and using their CPUs for carrying out the computations. This style of computation is used when handling large amounts of data and also while running complex algorithms over significantly large data. The computations are executed faster due to the availability of multiple CPUs running them in parallel, as well as the direct availability of RAM to each of them.

Getting ready

Julia has in-built support for parallel computing and multiprocessing, so these computations rarely require any external libraries.

How to do it…

Julia can be started on your local computer using multiple cores of your CPU, so we will have multiple workers for the process. This is how you can fire up Julia in multiprocessing mode in your terminal; it creates two worker processes on the machine, which means it uses two CPU cores:

julia -p 2

The output looks something like this. It might differ for different operating systems and different machines:

Now, we will look at the remotecall() function. It takes in multiple arguments, the first one being the process which we want to assign the task to. The next argument is the function which we want to execute, and the subsequent arguments are the parameters or arguments of that function. In this example, we will create a 2 x 2 random matrix and assign the task to process number 2. This can be done as follows:

task = remotecall(2, rand, 2, 2)

The preceding command gives the following output:

Now that the remotecall() function for remote referencing has been executed, we will fetch the results of the function through the fetch() function. This can be done as follows:

fetch(task)

The preceding command gives the following output:

Now, to perform some mathematical operations on the generated matrix, we can use the @spawnat macro, which takes in the process number and the mathematical operation on the fetch() result. The @spawnat macro actually wraps the expression 5 .+ fetch(task) into an anonymous function and runs it on the second machine. This can be done as follows:

task2 = @spawnat 2 5 .+ fetch(task)

There is also a function that eliminates the need for using two different functions, remotecall() and fetch(). The remotecall_fetch() function takes in multiple arguments: the first is the process the task is being assigned to, the next is the function to execute, and the subsequent arguments are the arguments or parameters of that function. Now, we will use the remotecall_fetch() function to fetch an element of the task matrix at a particular index.
This can be done as follows:

remotecall_fetch(2, getindex, task2, 1, 1)

How it works…

Julia can be started in multiprocessing mode by specifying the number of processes needed while starting up the REPL. In this example, we started Julia in two-process mode. The maximum number of processes depends on the number of cores available in the CPU. The remotecall() function helps in selecting a particular process from the running processes in order to run a function or, in fact, any computation for us. The fetch() function is used to fetch the results of the remotecall() function from a common data resource (or the process) for all the running processes. The details of the data source are covered in the later sections. The results of the fetch() function can also be used for further computations, which can be carried out with the @spawnat macro along with the results of fetch(); this assigns a process for the computation. The remotecall_fetch() function further eliminates the need for the fetch() function in the case of a direct execution; it has both the remotecall() and fetch() operations built into it, so it acts as a combination of the second and third points in this section.

Data movement

In parallel computing, data movements are quite common and are also something to be minimized, due to the time and network overhead they incur. In this recipe, we will see how this can be optimized to avoid latency as much as we can.

Getting ready

To get ready for this recipe, you need to have the Julia REPL started in multiprocessing mode. This is explained in the Getting ready section of the preceding recipe.

How to do it…

Firstly, we will see how to do a matrix computation using the @spawn macro, which helps in data movement. We construct a matrix of shape 200 x 200 and then try to square it using the @spawn macro. This can be done as follows:

mat = rand(200, 200)
exec_mat = @spawn mat^2
fetch(exec_mat)

The preceding command gives the following output:

Now, we will look at another way to achieve the same thing. This time, we will use the @spawn macro directly instead of the initialization step. We will discuss the advantages and drawbacks of each method in the How it works… section. This can be done as follows:

mat = @spawn rand(200, 200)^2
fetch(mat)

The preceding command gives the following output:

How it works…

In this example, we construct a 200 x 200 matrix and then use the @spawn macro to spawn a process in the CPU to execute the computation for us. The @spawn macro uses one of the two running processes for the computation. In the second example, you learned how to use the @spawn macro directly without an extra initialization step. The fetch() function helps us fetch the results from a common data resource of the processes. More on this will be covered in the following recipes.

Parallel maps and loop operations

In this recipe, you will learn a bit about the famous Map-Reduce framework and why it is one of the most important ideas in the domains of big data and parallel computing. You will learn how to parallelize loops and use reducing functions on them across several CPUs and machines, building on the concepts of parallel computing you learned about in the previous recipes.

Getting ready

Just like the previous sections, Julia just needs to be running in multiprocessing mode to follow along with the examples. This can be done through the instructions given in the first section.
How to do it…

Firstly, we will write a function that takes and adds n random bits. The writing of this function has nothing to do with multiprocessing, so it uses simple Julia functions and loops: count_heads(n) simply loops n times, adding a random Boolean value on each iteration, and returns the total.

Now, we will use the @spawn macro, which we learned about previously, to run the count_heads() function as separate processes. The count_heads() function needs to be in the same directory for this to work. This can be done as follows:

require("count_heads")
a = @spawn count_heads(100)
b = @spawn count_heads(100)
fetch(a) + fetch(b)

However, we can use the concept of multiprocessing to parallelize the loop directly and take the sum. The parallelizing part is called mapping, and the addition of the parallelized bits is called reduction; together, the process constitutes the famous Map-Reduce framework. This can be done using the @parallel macro, as follows:

nheads = @parallel (+) for i = 1:200
    Int(rand(Bool))
end

How it works…

The first function is a simple Julia function that adds random bits with every loop iteration. It was created just for the demonstration of Map-Reduce operations. In the second step, we spawn two separate processes for executing the function and then fetch the results of both and add them up. However, that is not really a neat way to carry out parallel computation of functions and loops. Instead, the @parallel macro provides a better way to do it, which allows the user to parallelize the loop and then reduce the computations through an operator; together, this is called a Map-Reduce operation.

Channels

Channels are like the background plumbing for parallel computing in Julia. They are the reservoirs from which the individual processes access their data.

Getting ready

The prerequisites are similar to those of the previous sections. This is mostly a theoretical section, so you just need to run the experiments on your own. For that, you need to run your Julia REPL in multiprocessing mode.

How to do it…

Channels are shared queues with a fixed length. They are common data reservoirs for the processes that are running, and multiple readers or workers can access them. Workers can read data through the fetch() function, which we already discussed in the previous sections, and can also write to the channel through the put!() function. This means that the workers can add more data to the resource, which can then be accessed by all the workers running a particular computation. Closing a channel after use is good practice to avoid data corruption and unnecessary memory usage. It can be done using the close() function.

Summary

In this article we covered the basic concepts of parallel computing and the data movement that takes place in the network. We also learned about parallel maps and loop operations, along with the famous Map-Reduce framework. At the end we got a brief understanding of channels and how individual processes access their data from them.

Resources for Article: Further resources on this subject: More about Julia [article] Basics of Programming in Julia [article] Simplifying Parallelism Complexity in C# [article]

Article: Movie Recommendation

Packt
16 Jun 2017
14 min read
This article is taken from Learning Data Mining with Python - Second Edition by Robert Layton, which improves upon the first edition with updated examples, more in-depth discussion, and exercises for your future development with data analytics. In this snippet from the book, we look at movie recommendation with a technique known as Affinity Analysis.

(For more resources related to this topic, see here.)

Affinity Analysis

Affinity Analysis is the task of determining when objects are used in similar ways; this is in contrast to earlier chapters, where we focused on whether the objects themselves are similar. The data for Affinity Analysis is often described in the form of a transaction. Intuitively, this comes from a transaction at a store: determining when objects are purchased together is a way to recommend products to users that they might also purchase. Other use cases for Affinity Analysis include:

Fraud detection
Customer segmentation
Software optimization
Product recommendations

Affinity Analysis is usually much more exploratory than classification. At the very least, we often simply rank the results and choose the top 5 recommendations (or some other number), rather than expect the algorithm to give us a specific answer.

Algorithms for Affinity Analysis

A brute force solution, testing all possible combinations, is not efficient enough for real-world use. We could expect even a small store to have hundreds of items for sale, while many online stores would have thousands (or millions!). As we add more items, the time it takes to compute all rules increases significantly faster. Specifically, the total possible number of rules is 2^n - 1. Even the drastic increase in computing power couldn't possibly keep up with the increase in the number of items stored online. Therefore, we need algorithms that work smarter, as opposed to computers that work harder.

The Apriori algorithm addresses the exponential problem of creating sets of items that occur frequently within a database, called frequent itemsets. Once these frequent itemsets are discovered, creating association rules is straightforward. The intuition behind Apriori is both simple and clever. First, we ensure that a rule has sufficient support within the dataset; defining a minimum support level is the key parameter for Apriori. To build a frequent itemset, for an itemset (A, B) to have a support of at least 30, both A and B must occur at least 30 times in the database. This property extends to larger sets as well: for an itemset (A, B, C, D) to be considered frequent, the set (A, B, C) must also be frequent (as must D). Apriori discovers larger frequent itemsets by building off smaller frequent itemsets. The picture below outlines the full process.

The Movie Recommendation Problem

Product recommendation is big business. Online stores use it to up-sell to customers by recommending other products that they could buy. Making better recommendations leads to better sales. When online shopping is selling to millions of customers every year, there is a lot of potential money to be made by selling more items to these customers.

GroupLens, a research group at the University of Minnesota, has released several datasets that are often used for testing algorithms in this area. They have released several versions of a movie rating dataset with different sizes: there is a version with 100,000 reviews, one with 1 million reviews, and one with 10 million reviews.
The datasets are available from http://grouplens.org/datasets/movielens/ and the dataset we are going to use in this article is the MovieLens 100K dataset (with 100,000 reviews). Download this dataset and unzip it in your data folder. Start a new Jupyter Notebook and type the following code:

import os
import pandas as pd
data_folder = os.path.join(os.path.expanduser("~"), "Data", "ml-100k")
ratings_filename = os.path.join(data_folder, "u.data")

Ensure that ratings_filename points to the u.data file in the unzipped folder.

Loading with pandas

The MovieLens dataset is in good shape; however, there are some changes from the default options of pandas.read_csv that we need to make. When loading the file, we set the delimiter parameter to the tab character, tell pandas not to read the first row as the header (with header=None), and set the column names with given values. Let's look at the following code:

all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None,
                          names=["UserID", "MovieID", "Rating", "Datetime"])

While we won't use it in this article, you can properly parse the date timestamp using the following line. Dates of reviews can be an important feature in recommendation prediction, as movies that are rated together often have more similar rankings than movies rated separately. Accounting for this can improve models significantly.

all_ratings["Datetime"] = pd.to_datetime(all_ratings['Datetime'], unit='s')

Understanding the Apriori algorithm and its implementation

The goal of this article is to produce rules of the following form: if a person recommends this set of movies, they will also recommend this movie. We will also discuss extensions where a person who recommends a set of movies is likely to recommend another particular movie.

To do this, we first need to determine whether a person recommends a movie. We can do this by creating a new feature, Favorable, which is True if the person gave a favorable review to a movie:

all_ratings["Favorable"] = all_ratings["Rating"] > 3

We will sample our dataset to form training data. This also helps reduce the size of the dataset that will be searched, making the Apriori algorithm run faster. We obtain all reviews from the first 200 users:

ratings = all_ratings[all_ratings['UserID'].isin(range(200))]

Next, we can create a dataset of only the favorable reviews in our sample:

favorable_ratings = ratings[ratings["Favorable"]]

We will be searching the users' favorable reviews for our itemsets. So, the next thing we need is the movies each user has given a favorable rating to. We can compute this by grouping the dataset by UserID and iterating over the movies in each group:

favorable_reviews_by_users = dict((k, frozenset(v.values))
                                  for k, v in favorable_ratings.groupby("UserID")["MovieID"])

In the preceding code, we stored the values as a frozenset, allowing us to quickly check whether a movie has been rated by a user. Sets are much faster than lists for this type of operation, and we will use them in later code. Finally, we can create a DataFrame that tells us how frequently each movie has been given a favorable review:

num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()

We can see the top five movies by running the following code:

num_favorable_by_movie.sort_values(by="Favorable", ascending=False).head()

Implementing the Apriori algorithm

On the first iteration of Apriori, the newly discovered itemsets will have a length of 2, as they will be supersets of the initial itemsets created in the first step.
On the second iteration (after applying the fourth step and going back to step 2), the newly discovered itemsets will have a length of 3. This allows us to quickly identify the newly discovered itemsets, as needed in the second step.

We can store our discovered frequent itemsets in a dictionary, where the key is the length of the itemsets. This allows us to quickly access the itemsets of a given length, and therefore the most recently discovered frequent itemsets, with the help of the following code:

frequent_itemsets = {}

We also need to define the minimum support needed for an itemset to be considered frequent. This value is chosen based on the dataset, but try different values to see how they affect the result. I recommend only changing it by 10 percent at a time though, as the time the algorithm takes to run will be significantly different! Let's set a minimum support value:

min_support = 50

To implement the first step of the Apriori algorithm, we create an itemset with each movie individually and test whether the itemset is frequent. We use frozenset, as frozensets allow us to perform faster set-based operations later on, and they can also be used as keys in our counting dictionary (normal sets cannot). Let's look at the following frozenset code:

frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"])
                            for movie_id, row in num_favorable_by_movie.iterrows()
                            if row["Favorable"] > min_support)

We implement the second and third steps together for efficiency by creating a function that takes the newly discovered frequent itemsets, creates the supersets, and then tests whether they are frequent. First, we set up the function to perform these steps:

from collections import defaultdict

def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    return dict([(itemset, frequency) for itemset, frequency in counts.items()
                 if frequency >= min_support])

In keeping with our rule of thumb of reading through the data as little as possible, we iterate over the dataset once per call to this function. While this doesn't matter too much in this implementation (our dataset is relatively small compared to the average computer), single-pass is a good practice to get into for larger applications.

Let's have a look at the core of this function in detail. We iterate through each user, and each of the previously discovered itemsets, and then check whether it is a subset of the current set of reviews, which are stored in k_1_itemsets (note that here, k_1 means k-1). If it is, this means that the user has reviewed each movie in the itemset. This is done by the itemset.issubset(reviews) line. We can then go through each individual movie that the user has reviewed (that is not already in the itemset), create a superset by combining the itemset with the new movie, and record that we saw this superset in our counting dictionary. These are the candidate frequent itemsets for this value of k. We end our function by testing which of the candidate itemsets have enough support to be considered frequent, returning only those whose support is at least our min_support value.
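Before wiring this helper into the full loop, it can help to see it behave on a tiny hand-made dataset. The following is a self-contained sketch; the toy user IDs, movie IDs, and counts are invented for illustration and are not part of the MovieLens data. Note that this implementation generates a candidate pair once from each of its member itemsets, so the counts it returns for pairs are not raw user counts.

from collections import defaultdict

def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    return dict([(itemset, frequency) for itemset, frequency in counts.items()
                 if frequency >= min_support])

# Three toy users and the movie IDs they rated favorably
toy_reviews = {
    1: frozenset({10, 20, 30}),
    2: frozenset({10, 20}),
    3: frozenset({10, 30}),
}
# Length-1 frequent itemsets: each movie with the number of users who liked it
k1 = {frozenset({10}): 3, frozenset({20}): 2, frozenset({30}): 2}

# The pairs {10, 20} and {10, 30} are each counted four times (twice per user
# who likes both movies), while {20, 30} is counted only twice, so with a
# threshold of 3 only the first two pairs are returned.
print(find_frequent_itemsets(toy_reviews, k1, min_support=3))

With the real dataset, this same function is driven by the loop shown next, using the favorable reviews of the sampled users and the min_support value of 50 chosen above.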
This function forms the heart of our Apriori implementation, and we now create a loop that iterates over the steps of the larger algorithm, storing the new itemsets as we increase k from 1 to a maximum value. In this loop, k represents the length of the soon-to-be-discovered frequent itemsets, allowing us to access the most recently discovered ones by looking in our frequent_itemsets dictionary using the key k - 1. We create the frequent itemsets and store them in our dictionary by their length. Let's look at the code (note that sys needs to be imported for the sys.stdout.flush() calls):

import sys

for k in range(2, 20):
    # Generate candidates of length k, using the frequent itemsets of length k-1
    # Only store the frequent itemsets
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users,
                                                   frequent_itemsets[k-1], min_support)
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets

Extracting association rules

After the Apriori algorithm has completed, we have a list of frequent itemsets. These aren't exactly association rules, but they can easily be converted into rules. For each itemset, we can generate a number of association rules by setting each movie to be the conclusion and the remaining movies as the premise.

candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))

In these rules, the first part is the list of movies in the premise, while the number after it is the conclusion. In the first case, if a reviewer recommends movie 79, they are also likely to recommend movie 258.

The process of computing confidence starts by creating dictionaries to store how many times we see the premise leading to the conclusion (a correct example of the rule) and how many times it doesn't (an incorrect example). We then iterate over all reviews and rules, working out whether the premise of the rule applies and, if it does, whether the conclusion is accurate.

correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1

We then compute the confidence for each rule by dividing the correct count by the total number of times the rule was seen:

rule_confidence = {candidate_rule:
                   (correct_counts[candidate_rule] /
                    float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule]))
                   for candidate_rule in candidate_rules}

Now we can print the top five rules by sorting this confidence dictionary and printing the results:

from operator import itemgetter
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise, conclusion))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")

The resulting printout shows only the movie IDs, which isn't very helpful without the names of the movies.
The dataset came with a file called u.item, which stores the movie names and their corresponding MovieIDs (as well as other information, such as the genre). We can load the titles from this file using pandas. Additional information about the file and categories is available in the README file that came with the dataset. The data in the file is in CSV format, but with the data separated by the | symbol; it has no header, and the encoding is important to set. The column names were found in the README file.

movie_name_filename = os.path.join(data_folder, "u.item")
movie_name_data = pd.read_csv(movie_name_filename, delimiter="|", header=None, encoding="mac-roman")
movie_name_data.columns = ["MovieID", "Title", "Release Date", "Video Release", "IMDB", "<UNK>",
                           "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime",
                           "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
                           "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]

Let's also create a helper function for finding the name of a movie by its ID:

def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"]
    title = title_object.values[0]
    return title

We can now adjust our previous code for printing out the top rules to also include the titles:

for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")

The results give recommendations for movies based on previous movies that a person liked. Give it a shot and see if it matches your expectations!

Learning Data Mining with Python

In this short section of Learning Data Mining with Python, Second Edition, we performed Affinity Analysis in order to recommend movies based on a large set of reviewers. We did this in two stages. First, we found frequent itemsets in the data using the Apriori algorithm. Then, we created association rules from those itemsets. We performed training on a subset of our data in order to find the association rules, and then tested those rules on the rest of the data, a testing set. We could extend this concept to use cross-fold validation to better evaluate the rules, which would lead to a more robust evaluation of the quality of each rule.

The book covers topics such as classification, clustering, text analysis, image recognition, TensorFlow, and Big Data. Each section comes with a practical real-world example, steps through the code in detail, and provides suggestions for you to continue your (machine) learning.

Summary

In this snippet from the book, we looked at movie recommendation with a technique known as Affinity Analysis, with updated examples and more in-depth discussion for your future development with data analytics.

Resources for Article: Further resources on this subject: Expanding Your Data Mining Toolbox [article] Data mining [article] Big Data Analysis [article]
Consistency Conflicts

Packt
10 Aug 2016
11 min read
In this article by Robert Strickland, author of the book Cassandra 3.x High Availability - Second Edition, we will discuss how, for any given call, it is possible to achieve either strong consistency or eventual consistency. In the former case, we can know for certain that the copy of the data that Cassandra returns will be the latest. In the case of eventual consistency, the data returned may or may not be the latest, or there may be no data returned at all if the node is unaware of newly inserted data. Under eventual consistency, it is also possible to see deleted data if the node you're reading from has not yet received the delete request.

(For more resources related to this topic, see here.)

Depending on the read_repair_chance setting and the consistency level chosen for the read operation, Cassandra might block the client and resolve the conflict immediately, or this might occur asynchronously. If data in conflict is never requested, the system will resolve the conflict the next time nodetool repair is run. How does Cassandra know there is a conflict? Every column has three parts: key, value, and timestamp. Cassandra follows last-write-wins semantics, which means that the column with the latest timestamp always takes precedence. Now, let's discuss one of the most important knobs a developer can turn to determine the consistency characteristics of their reads and writes.

Consistency levels

On every read and write operation, the caller must specify a consistency level, which lets Cassandra know what level of consistency to guarantee for that one call. The following list details the various consistency levels and their effects on both read and write operations.

ANY
Reads: not supported.
Writes: data must be written to at least one node, but permits writes via hinted handoff. Effectively allows a write to any node, even if all nodes containing the replica are down. A subsequent read might be impossible if all replica nodes are down.

ONE
Reads: the replica from the closest node will be returned.
Writes: data must be written to at least one replica node (both commit log and memtable). Unlike ANY, hinted handoff writes are not sufficient.

TWO
Reads: the replicas from the two closest nodes will be returned.
Writes: the same as ONE, except two replicas must be written.

THREE
Reads: the replicas from the three closest nodes will be returned.
Writes: the same as ONE, except three replicas must be written.

QUORUM
Reads: replicas from a quorum of nodes will be compared, and the replica with the latest timestamp will be returned.
Writes: data must be written to a quorum of replica nodes (both commit log and memtable) in the entire cluster, including all data centers.

SERIAL
Reads: permits reading uncommitted data as long as it represents the current state. Any uncommitted transactions will be committed as part of the read.
Writes: similar to QUORUM, except that writes are conditional based on the support for lightweight transactions.

LOCAL_ONE
Reads: similar to ONE, except that the read will be returned by the closest replica in the local data center.
Writes: similar to ONE, except that the write must be acknowledged by at least one node in the local data center.

LOCAL_QUORUM
Reads: similar to QUORUM, except that only replicas in the local data center are compared.
Writes: similar to QUORUM, except the quorum must only be met using the local data center.

LOCAL_SERIAL
Reads: similar to SERIAL, except only local replicas are used.
Writes: similar to SERIAL, except only writes to local replicas must be acknowledged.
EACH_QUORUM
Reads: the opposite of LOCAL_QUORUM; requires each data center to produce a quorum of replicas, then returns the replica with the latest timestamp.
Writes: the opposite of LOCAL_QUORUM; requires a quorum of replicas to be written in each data center.

ALL
Reads: replicas from all nodes in the entire cluster (including all data centers) will be compared, and the replica with the latest timestamp will be returned.
Writes: data must be written to all replica nodes (both commit log and memtable) in the entire cluster, including all data centers.

As you can see, there are numerous combinations of read and write consistency levels, all with different ultimate consistency guarantees. To illustrate this point, let's assume that you would like to guarantee absolute consistency for all read operations. On the surface, it might seem as if you would have to read with a consistency level of ALL, thus sacrificing availability in the case of node failure. But there are alternatives depending on your use case. There are actually two additional ways to achieve strong read consistency:

Write with consistency level of ALL: This has the advantage of allowing the read operation to be performed using ONE, which lowers the latency for that operation. On the other hand, it means the write operation will result in UnavailableException if one of the replica nodes goes offline.

Read and write with QUORUM or LOCAL_QUORUM: Since QUORUM and LOCAL_QUORUM both require a majority of nodes, using this level for both the write and the read will result in a full consistency guarantee (in the same data center when using LOCAL_QUORUM), while still maintaining availability during a node failure.

You should carefully consider each use case to determine what guarantees you actually require. For example, there might be cases where a lost write is acceptable, or occasions where a read need not be absolutely current. At times, it might be sufficient to write with a level of QUORUM, then read with ONE to achieve maximum read performance, knowing you might occasionally and temporarily return stale data. Cassandra gives you this flexibility, but it's up to you to determine how to best employ it for your specific data requirements.

A good rule of thumb to attain strong consistency is that the read consistency level plus the write consistency level should be greater than the replication factor.

If you are unsure about which consistency levels to use for your specific use case, it's typically safe to start with LOCAL_QUORUM (or QUORUM for a single data center) reads and writes. This configuration offers strong consistency guarantees and good performance while allowing for the inevitable replica failure. It is important to understand that even if you choose levels that provide less stringent consistency guarantees, Cassandra will still perform anti-entropy operations asynchronously in an attempt to keep replicas up to date.

Repairing data

Cassandra employs a multifaceted anti-entropy mechanism that keeps replicas in sync. Data repair operations generally fall into three categories:

Synchronous read repair: When a read operation requires comparing multiple replicas, Cassandra will initially request a checksum from the other nodes. If the checksum doesn't match, the full replica is sent and compared with the local version. The replica with the latest timestamp will be returned and the old replica will be updated. This means that in normal operations, old data is repaired when it is requested.
Asynchronous read repair: Each table in Cassandra has a setting called read_repair_chance (as well as its related setting, dclocal_read_repair_chance), which determines how the system treats replicas that are not compared during a read. The default setting of 0.1 means that 10 percent of the time, Cassandra will also repair the remaining replicas during read operations.

Manually running repair: A full repair (using nodetool repair) should be run regularly to clean up any data that has been missed as part of the previous two operations. At a minimum, it should be run once every gc_grace_seconds, which is set in the table schema and defaults to 10 days.

One might ask what the consequence would be of failing to run a repair operation within the window specified by gc_grace_seconds. The answer relates to Cassandra's mechanism to handle deletes. As you might be aware, all modifications (or mutations) are immutable, so a delete is really just a marker telling the system not to return that record to any clients. This marker is called a tombstone. Cassandra performs garbage collection on data marked by a tombstone each time a compaction occurs. If you don't run the repair, you risk deleted data reappearing unexpectedly.

In general, deletes should be avoided when possible as the unfettered buildup of tombstones can cause significant issues.

In the course of normal operations, Cassandra will repair old replicas when their records are requested. Thus, it can be said that read repair operations are lazy, such that they only occur when required. With all these options for replication and consistency, it can seem daunting to choose the right combination for a given use case. Let's take a closer look at this balance to help bring some additional clarity to the topic.

Balancing the replication factor with consistency

There are many considerations when choosing a replication factor, including availability, performance, and consistency. Since our topic is high availability, let's presume your desire is to maintain data availability in the case of node failure. It's important to understand exactly what your failure tolerance is, and this will likely be different depending on the nature of the data. The definition of failure is probably going to vary among use cases as well, as one case might consider data loss a failure, whereas another accepts data loss as long as all queries return. Achieving the desired availability, consistency, and performance targets requires coordinating your replication factor with your application's consistency level configurations.
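In application code, these decisions show up as the consistency level you attach to each statement (or set as a session default). As a minimal sketch using the DataStax Python driver, assuming a locally running cluster with a replication factor of 3 and a hypothetical keyspace and table (the names below are invented for illustration):

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("store")  # hypothetical keyspace

# LOCAL_QUORUM writes plus LOCAL_QUORUM reads satisfy the rule of thumb
# (2 + 2 > RF of 3), so the read is guaranteed to see the write.
insert = SimpleStatement(
    "INSERT INTO orders (order_id, total) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
session.execute(insert, (42, 99.95))

select = SimpleStatement(
    "SELECT total FROM orders WHERE order_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
row = session.execute(select, (42,)).one()

cluster.shutdown()

Dropping the read side to ConsistencyLevel.ONE in this sketch would trade that guarantee for lower read latency, which is exactly the kind of trade-off summarized in the combinations listed next.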
In order to assist you in your efforts to achieve this balance, let's consider a single data center cluster of 10 nodes and examine the impact of various configuration combinations, where RF corresponds to the replication factor and Write CL and Read CL are the write and read consistency levels:

RF 1, Write CL ONE/QUORUM/ALL, Read CL ONE/QUORUM/ALL: Consistent. Doesn't tolerate any replica loss. Use cases: data can be lost and availability is not critical, such as analysis clusters.

RF 2, Write CL ONE, Read CL ONE: Eventual. Tolerates loss of one replica. Use cases: maximum read performance and low write latencies are required, and sometimes returning stale data is acceptable.

RF 2, Write CL QUORUM/ALL, Read CL ONE: Consistent. Tolerates loss of one replica on reads, but none on writes. Use cases: read-heavy workloads where some downtime for data ingest is acceptable (improves read latencies).

RF 2, Write CL ONE, Read CL QUORUM/ALL: Consistent. Tolerates loss of one replica on writes, but none on reads. Use cases: write-heavy workloads where read consistency is more important than availability.

RF 3, Write CL ONE, Read CL ONE: Eventual. Tolerates loss of two replicas. Use cases: maximum read and write performance are required, and sometimes returning stale data is acceptable.

RF 3, Write CL QUORUM, Read CL ONE: Eventual. Tolerates loss of one replica on writes and two on reads. Use cases: read throughput and availability are paramount, write performance is less important, and sometimes returning stale data is acceptable.

RF 3, Write CL ONE, Read CL QUORUM: Eventual. Tolerates loss of two replicas on writes and one on reads. Use cases: low write latencies and availability are paramount, read performance is less important, and sometimes returning stale data is acceptable.

RF 3, Write CL QUORUM, Read CL QUORUM: Consistent. Tolerates loss of one replica. Use cases: consistency is paramount, while striking a balance between availability and read/write performance.

RF 3, Write CL ALL, Read CL ONE: Consistent. Tolerates loss of two replicas on reads, but none on writes. Use cases: additional fault tolerance and consistency on reads is paramount, at the expense of write performance and availability.

RF 3, Write CL ONE, Read CL ALL: Consistent. Tolerates loss of two replicas on writes, but none on reads. Use cases: low write latencies and availability are paramount, but read consistency must be guaranteed at the expense of performance and availability.

RF 3, Write CL ANY, Read CL ONE: Eventual. Tolerates loss of all replicas on writes and two on reads. Use cases: maximum write and read performance and availability are paramount, and often returning stale data is acceptable (note that hinted writes are less reliable than the guarantees offered at CL ONE).

RF 3, Write CL ANY, Read CL QUORUM: Eventual. Tolerates loss of all replicas on writes and one on reads. Use cases: maximum write performance and availability are paramount, and sometimes returning stale data is acceptable.

RF 3, Write CL ANY, Read CL ALL: Consistent. Tolerates loss of all replicas on writes, but none on reads. Use cases: write throughput and availability are paramount, and clients must all see the same data, even though they might not see all writes immediately.

There are also two additional consistency levels, SERIAL and LOCAL_SERIAL, which can be used to read the latest value, even if it is part of an uncommitted transaction. Otherwise, they follow the semantics of QUORUM and LOCAL_QUORUM, respectively. As you can see, there are numerous possibilities to consider when choosing these values, especially in a scenario involving multiple data centers. This discussion will give you greater confidence as you design your applications to achieve the desired balance.

Summary

In this article, we introduced the foundational concept of consistency.
In our discussion, we outlined the importance of the relationship between replication factor and consistency level, and their impact on performance, data consistency, and availability. Resources for Article: Further resources on this subject: Cassandra Design Patterns [Article] Cassandra Architecture [Article] About Cassandra [Article]

Author Podcast - Aleksander Seovic Talks About Oracle Coherence 3.5

Packt
17 Feb 2010
1 min read
Aleksander Seovic is the author of Oracle Coherence 3.5, which will help you to design and build scalable, reliable, high-performance applications using software of the same name. The book is due out in March, but you can get a flavour of it in his interview with Cameron Purdy, below. For more information on Aleksander's book, visit: http://www.packtpub.com/oracle-coherence-3-5/book. Listen Here      