Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials

7019 Articles
article-image-supervised-learning-classification-and-regression
Packt
06 Apr 2017
30 min read
Save for later

Supervised Learning: Classification and Regression

Packt
06 Apr 2017
30 min read
In this article by Alexey Grigorev, author of the book Mastering Java for Data Science, we will look at how to do pre-processing of data in Java and how to do Exploratory Data Analysis inside and outside Java. Now, when we covered the foundation, we are ready to start creating Machine Learning models. First, we start with Supervised Learning. In the supervised settings we have some information attached to each observation – called labels – and we want to learn from it, and predict it for observations without labels. There are two types of labels: the first are discrete and finite, such as True/False or Buy/Sell, and second are continuous, such as salary or temperature. These types correspond to two types of Supervised Learning: Classification and Regression. We will talk about them in this article. This article covers: Classification problems Regression Problems Evaluation metrics for each type Overview of available implementations in Java (For more resources related to this topic, see here.) Classification In Machine Learning, the classification problem deals with discrete targets with finite set of possible values. The Binary Classification is the most common type of classification problems: the target variable can have only two possible values, such as True/False, Relevant/Not Relevant, Duplicate/Not Duplicate, Cat/Dog and so on. Sometimes the target variable can have more than two outcomes, for example: colors, category of an item, model of a car, and so on, and we call it Multi-Class Classification. Typically each observation can only have one label, but in some settings an observation can be assigned several values. Typically, Multi-Class Classification can be converted to a set of Binary Classification problems, which is why we will mostly concentrate on Binary Classification. Binary Classification Models There are many models for solving the binary classification problem and it is not possible to cover all of them. We will briefly cover the ones that are most often used in practice. They include: Logistic Regression Support Vector Machines Decision Trees Neural Networks We assume that you are already familiar with these methods. Deep familiarity is not required, but for more information you can check the following book: An Introduction to Statistical Learning by G. James, D. Witten, T. Hastie and R. Tibshirani Python Machine Learning by S. Raschka When it comes to libraries, we will cover the following: Smile, JSAT, LIBSVM and LIBLINEAR and Encog. Smile Smile (Statistical Machine Intelligence and Learning Engine) is a library with a large set of classification and other machine learning algorithms. You can see the full list here: https://github.com/haifengl/smile. The library is available on Maven Central and the latest version at the moment of writing is 1.1.0. To include it to your project, add the following dependency: <dependency> <groupId>com.github.haifengl</groupId> <artifactId>smile-core</artifactId> <version>1.1.0</version> </dependency> It is being actively developed; new features and bug fixes are added quite often, but not released as frequently. We recommend to use the latest available version of Smile, and to get it, you will need to build it from the sources. To do it: Install sbt – a tool for building scala projects. You can follow this instruction: http://www.scala-sbt.org/release/docs/Manual-Installation.html Use git to clone the project from https://github.com/haifengl/smile To build and publish the library to local maven repository, run the following command: sbt core/publishM2 The Smile library consists of several sub-modules, such as smile-core, smile-nlp, and smile-plot and so on. So, after building it, add the following dependency to your pom: <dependency> <groupId>com.github.haifengl</groupId> <artifactId>smile-core</artifactId> <version>1.2.0</version> </dependency> The models from Smile expect the data to be in a form of two-dimensional arrays of doubles, and the label information is one dimensional array of integers. For binary models, the values should be 0 or 1. Some models in Smile can handle Multi-Class Classification problems, so it is possible to have more labels. The models are built using the Builder pattern: you create a special class, set some parameters and at the end it returns the object it builds. In Smile this builder class is typically called Trainer, and all models should have a trainer for them. For example, consider training a Random Forest model: double[] X = ...// training data int[] y = ...// 0 and 1 labels RandomForest model = new RandomForest.Trainer() .setNumTrees(100) .setNodeSize(4) .setSamplingRates(0.7) .setSplitRule(SplitRule.ENTROPY) .setNumRandomFeatures(3) .train(X, y); The RandomForest.Trainer class takes in a set of parameters and the training data, and at the end produces the trained Random Forest model. The implementation of Random Forest from Smile has the following parameters: numTrees: number of trees to train in the model. nodeSize: the minimum number of items in the leaf nodes. samplingRate: the ratio of training data used to grow each tree. splitRule: the impurity measure used for selecting the best split. numRandomFeatures: the number of features the model randomly chooses for selecting the best split. Similarly, a logistic regression is trained as follows: LogisticRegression lr = new LogisticRegression.Trainer() .setRegularizationFactor(lambda) .train(X, y); Once we have a model, we can use it for predicting the label of previously unseen items. For that we use the predict method: double[] row = // data int prediction = model.predict(row); This code outputs the most probable class for the given item. However, often we are more interested not in the label itself, but in the probability of having the label. If a model implements the SoftClassifier interface, then it is possible to get these probabilities like this: double[] probs = new double[2]; model.predict(row, probs); After running this code, the probs array will contain the probabilities. JSAT JSAT (Java Statistical Analysis Tool) is another Java library which contains a lot of implementations of common Machine Learning algorithms. You can check the full list of implemented models at https://github.com/EdwardRaff/JSAT/wiki/Algorithms. To include JSAT to a Java project, add this to pom: <dependency> <groupId>com.edwardraff</groupId> <artifactId>JSAT</artifactId> <version>0.0.5</version> </dependency> Unlike Smile, which just takes arrays of doubles, JSAT requires a special wrapper class for data instances. If we have an array, it is converted to the JSAT representation like this: double[][] X = ... // data int[] y = ... // labels // change to more classes for more classes for multi-classification CategoricalData binary = new CategoricalData(2); List<DataPointPair<Integer>> data = new ArrayList<>(X.length); for (int i = 0; i < X.length; i++) { int target = y[i]; DataPoint row = new DataPoint(new DenseVector(X[i])); data.add(new DataPointPair<Integer>(row, target)); } ClassificationDataSet dataset = new ClassificationDataSet(data, binary); Once we have prepared the dataset, we can train a model. Let us consider the Random Forest classifier again: RandomForest model = new RandomForest(); model.setFeatureSamples(4); model.setMaxForestSize(150); model.trainC(dataset); First, we set some parameters for the model, and then, at we end, we call the trainC method (which means “train a classifier”). In the JSAT implementation, Random Forest has fewer options for tuning: only the number of features to select and the number of trees to grow. There are several implementations of Logistic Regression. The usual Logistic Regression model does not have any parameters, and it is trained like this: LogisticRegression model = new LogisticRegression(); model.trainC(dataset); If we want to have a regularized model, then we need to use the LogisticRegressionDCD class (DCD stands for “Dual Coordinate Descent” - this is the optimization method used to train the logistic regression). We train it like this: LogisticRegressionDCD model = new LogisticRegressionDCD(); model.setMaxIterations(maxIterations); model.setC(C); model.trainC(fold.toJsatDataset()); In this code, C is the regularization parameter, and the smaller values of C correspond to stronger regularization effect. Finally, for outputting the probabilities, we can do the following: double[] row = // data DenseVector vector = new DenseVector(row); DataPoint point = new DataPoint(vector); CategoricalResults out = model.classify(point); double probability = out.getProb(1); The class CategoricalResults contains a lot of information, including probabilities for each class and the most likely label. LIBSVM and LIBLINEAR Next we consider two similar libraries: LIBSVM and LIBLINEAR. LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) is a library with implementation of Support Vector Machine models, which include support vector classifiers. LIBLINEAR (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) is a library for fast linear classification algorithms such as Liner SVM and Logistic Regression. Both these libraries come from the same research group and have very similar interfaces. LIBSVM is implemented in C++ and has an officially supported java version. It is available on Maven Central: <dependency> <groupId>tw.edu.ntu.csie</groupId> <artifactId>libsvm</artifactId> <version>3.17</version> </dependency> Note that the Java version of LIBSVM is updated not as often as the C++ version. Nevertheless, the version above is stable and should not contain bugs, but it might be slower than its C++ version. To use SVM models from LIBSVM, you first need to specify the parameters. For this, you create a svm_parameter class. Inside, you can specify many parameters, including: the kernel type (RBF, POLY or LINEAR), the regularization parameter C, probability, which you can set to 1 to be able to get probabilities, the svm_type should be set to C_SVC: this tells that the model should be a classifier. Here is an example of how you can configure an SVM classifier with the Linear kernel which can output probabilities: svm_parameter param = new svm_parameter(); param.svm_type = svm_parameter.C_SVC; param.kernel_type = svm_parameter.LINEAR; param.probability = 1; param.C = C; // default parameters param.cache_size = 100; param.eps = 1e-3; param.p = 0.1; param.shrinking = 1; The polynomial kernel is specified by the following formula: It has three additional parameters: gamma, coeff0 and degree; and also C – the regularization parameter. You can specify it like this: svm_parameter param = new svm_parameter(); param.svm_type = svm_parameter.C_SVC; param.kernel_type = svm_parameter.POLY; param.C = C; param.degree = degree; param.gamma = 1; param.coef0 = 1; param.probability = 1; // plus defaults from the above Finally, the Gaussian kernel (or RBF) has the following formula:' So there is one parameter gamma, which controls the width of the Gaussians. We can specify the model with the RBF kernel like this: svm_parameter param = new svm_parameter(); param.svm_type = svm_parameter.C_SVC; param.kernel_type = svm_parameter.RBF; param.C = C; param.gamma = gamma; param.probability = 1; // plus defaults from the above Once we created the configuration object, we need to convert the data in the right format. The LIBSVM command line application reads files in the SVMLight format, so the library also expects sparse data representation. For a single array, the conversion is following: double[] dataRow = // single row vector svm_node[] svmRow = new svm_node[dataRow.length]; for (int j = 0; j < dataRow.length; j++) { svm_node node = new svm_node(); node.index = j; node.value = dataRow[j]; svmRow[j] = node; } For a matrix, we do this for every row: double[][] X = ... // data int n = X.length; svm_node[][] nodes = new svm_node[n][]; for (int i = 0; i < n; i++) { nodes[i] = wrapAsSvmNode(X[i]); } Where wrapAsSvmNode is a function, that wraps a vector into an array of svm_node objects. Now we can put the data and the labels together into svm_problem object: double[] y = ... // labels svm_problem prob = new svm_problem(); prob.l = n; prob.x = nodes; prob.y = y; Now using the data and the parameters we can train the SVM model: svm_model model = svm.svm_train(prob, param);  Once the model is trained, we can use it for classifying unseen data. Getting probabilities is done this way: double[][] X = // test data int n = X.length; double[] results = new double[n]; double[] probs = new double[2]; for (int i = 0; i < n; i++) { svm_node[] row = wrapAsSvmNode(X[i]); svm.svm_predict_probability(model, row, probs); results[i] = probs[1]; } Since we used the param.probability = 1, we can use svm.svm_predict_probability method to predict probabilities. Like Smile, the method takes an array of doubles and writes the results there. Then we can get the probabilities there. Finally, while training, LIBSVM outputs a lot of things on the console. If we are not interested in this output, we can disable it with the following code snippet: svm.svm_set_print_string_function(s -> {}); Just add this in the beginning of your code and you will not see this anymore. The next library is LIBLINEAR, which provides very fast and high performing linear classifiers such as SVM with Linear Kernel and Logistic Regression. It can easily scale to tens and hundreds of millions of data points. Unlike LIBSVM, there is no official Java version of LIBLINEAR, but there is unofficial Java port available at http://liblinear.bwaldvogel.de/. To use it, include the following: <dependency> <groupId>de.bwaldvogel</groupId> <artifactId>liblinear</artifactId> <version>1.95</version> </dependency> The interface is very similar to LIBSVM. First, you define the parameters: SolverType solverType = SolverType.L1R_LR; double C = 0.001; double eps = 0.0001; Parameter param = new Parameter(solverType, C, eps); We have to specify three parameters here: solverType: defines the model which will be used; C: the amount regularization, the smaller C the stronger the regularization; epsilon: tolerance for stopping the training process. A reasonable default is 0.0001; For classification there are the following solvers: Logistic Regression: L1R_LR or L2R_LR SVM: L1R_L2LOSS_SVC or L2R_L2LOSS_SVC According to the official FAQ (which can be found here: https://www.csie.ntu.edu.tw/~cjlin/liblinear/FAQ.html) you should: Prefer SVM to Logistic Regression as it trains faster and usually gives higher accuracy. Try L2 regularization first unless you need a sparse solution – in this case use L1 The default solver is L2-regularized support linear classifier for the dual problem. If it is slow, try solving the primal problem instead. Then you define the dataset. Like previously, let's first see how to wrap a row: double[] row = // data int m = row.length; Feature[] result = new Feature[m]; for (int i = 0; i < m; i++) { result[i] = new FeatureNode(i + 1, row[i]); } Please note that we add 1 to the index – the 0 is the bias term, so the actual features should start from 1. We can put this into a function wrapRow and then wrap the entire dataset as following: double[][] X = // data int n = X.length; Feature[][] matrix = new Feature[n][]; for (int i = 0; i < n; i++) { matrix[i] = wrapRow(X[i]); } Now we can create the Problem class with the data and labels: double[] y = // labels Problem problem = new Problem(); problem.x = wrapMatrix(X); problem.y = y; problem.n = X[0].length + 1; problem.l = X.length; Note that here we also need to provide the dimensionality of the data, and it's the number of features plus one. We need to add one because it includes the bias term. Now we are ready to train the model: Model model = LibLinear.train(fold, param); When the model is trained, we can use it to classify unseen data. In the following example we will output probabilities: double[] dataRow = // data Feature[] row = wrapRow(dataRow); Linear.predictProbability(model, row, probs); double result = probs[1]; The code above works fine for the Logistic Regression model, but it will not work for SVM: SVM cannot output probabilities, so the code above will throw an error for solvers like L1R_L2LOSS_SVC. What you can do instead is get the raw output: double[] values = new double[1]; Feature[] row = wrapRow(dataRow); Linear.predictValues(model, row, values); double result = values[0]; In this case the results will not contain probability, but some real value. When this value is greater than zero, the model predicts that the class is positive. If we would like to map this value to the [0, 1] range, we can use the sigmoid function for that: public static double[] sigmoid(double[] scores) { double[] result = new double[scores.length]; for (int i = 0; i < result.length; i++) { result[i] = 1 / (1 + Math.exp(-scores[i])); } return result; } Finally, like LIBSVM, LIBLINEAR also outputs a lot of things to standard output. If you do not wish to see it, you can mute it with the following code: PrintStream devNull = new PrintStream(new NullOutputStream()); Linear.setDebugOutput(devNull); Here, we use NullOutputStream from Apache IO which does nothing, so the screen stays clean. When to use LIBSVM and when to use LIBLINEAR? For large datasets often it is not possible to use any kernel methods. In this case you should prefer LIBLINEAR. Additionally, LIBLINEAR is especially good for Text Processing purposes such as Document Classification. Encog Finally, we consider a library for training Neural Networks: Encog. It is available on Maven Central and can be added with the following snippet: <dependency> <groupId>org.encog</groupId> <artifactId>encog-core</artifactId> <version>3.3.0</version> </dependency> With this library you first need to specify the network architecture: BasicNetwork network = new BasicNetwork(); network.addLayer(new BasicLayer(new ActivationSigmoid(), true, noInputNeurons)); network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 30)); network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 1)); network.getStructure().finalizeStructure(); network.reset(); Here we create a network with one input layer, one inner layer with 30 neurons and one output layer with 1 neuron. In each layer we use sigmoid as the activation function and add the bias input (the true parameter). The last line randomly initializes the weights in the network. For both input and output Encog expects two-dimensional double arrays. In case of binary classification we typically have a one dimensional array, so we need to convert it: double[][] X = // data double[] y = // labels double[][] y2d = new double[y.length][]; for (int i = 0; i < y.length; i++) { y2d[i] = new double[] { y[i] }; } Once the data is converted, we wrap it into special wrapper class: MLDataSet dataset = new BasicMLDataSet(X, y2d); Then this dataset can be used for training: MLTrain trainer = new ResilientPropagation(network, dataset); double lambda = 0.01; trainer.addStrategy(new RegularizationStrategy(lambda)); int noEpochs = 101; for (int i = 0; i < noEpochs; i++) { trainer.iteration(); } There are a lot of other Machine Learning libraries available in Java. For example Weka, H2O, JavaML and others. It is not possible to cover all of them, but you can also try them and see if you like them more than the ones we have covered. Evaluation We have covered many Machine Learning libraries, and many of them implement the same algorithms like Random Forest or Logistic Regression. Also, each individual model can have many different parameters: a Logistic Regression has the regularization coefficient, an SVM is configured by setting the kernel and its parameters. How do we select the best single model out of so many possible variants? For that we first define some evaluation metric and then select the model which achieves the best possible performance with respect to this metric. For binary classification we can select one of the following metrics: Accuracy and Error Precision, Recall and F1 AUC (AU ROC) Result Evaluation K-Fold Cross Validation Training, Validation and Testing. Accuracy Accuracy tells us for how many examples the model predicted the correct label. Calculating it is trivial: int n = actual.length; double[] proba = // predictions; double[] prediction = Arrays.stream(proba).map(p -> p > threshold ? 1.0 : 0.0).toArray(); int correct = 0; for (int i = 0; i < n; i++) { if (actual[i] == prediction[i]) { correct++; } } double accuracy = 1.0 * correct / n; Accuracy is the simplest evaluation metric and everybody understands it. Precision, Recall and F1 In some cases accuracy is not the best measure of model performance. For example, suppose we have an unbalanced dataset: there are only 1% of examples that are positive. Then a model which always predict negative is right in 99% cases, and hence will have accuracy of 0.99. But this model is not useful. There are alternatives to accuracy that can overcome this problem. Precision and Recall are among these metrics: they both look at the fraction of positive items that the model correctly recognized. They can be calculated using the Confusion Matrix: a table which summarizes the performance of a binary classifier: Precision is the fraction of correctly predicted positive items among all items the model predicted positive. In terms of the confusion matrix, Precision is TP / (TP + FP). Recall is the fraction of correctly predicted positive items among items that are actually positive. With values from the confusion matrix, Recall is TP / (TP + FN). It is often hard to decide whether one should optimize Precision or Recall. But there is another metric which combines both Precision and Recall into one number, and it is called F1 score. For calculating Precision and Recall, we first need to calculate the values for the cells of the confusion matrix: int tp = 0, tn = 0, fp = 0, fn = 0; for (int i = 0; i < actual.length; i++) { if (actual[i] == 1.0 && proba[i] > threshold) { tp++; } else if (actual[i] == 0.0 && proba[i] <= threshold) { tn++; } else if (actual[i] == 0.0 && proba[i] > threshold) { fp++; } else if (actual[i] == 1.0 && proba[i] <= threshold) { fn++; } } Then we can use the values to calculate Precision and Recall: double precision = 1.0 * tp / (tp + fp); double recall = 1.0 * tp / (tp + fn); Finally, F1 can be calculated using the following formula: double f1 = 2 * precision * recall / (precision + recall); ROC and AU ROC (AUC) The metrics above are good for binary classifiers which produce hard output: they only tell if the class should be assigned a positive label or negative. If instead our model outputs some score such that the higher the values of the score the more likely the item is to be positive, then the binary classifier is called a ranking classifier. Most of the models can output probabilities of belonging to a certain class, and we can use it to rank examples such that the positive are likely to come first. The ROC Curve visually tells us how good a ranking classifier separates positive examples from negative ones. The way a ROC curve is build is the following: We sort the observations by their score and then starting from the origin we go up if the observation is positive and right if it is negative. This way, in the ideal case, we first always go up, and then always go right – and this will result in the best possible ROC curve. In this case we can say that the separation between positive and negative examples is perfect. If the separation is not perfect, but still OK, the curve will go up for positive examples, but sometimes will turn right when a misclassification occurred. Finally, a bad classifier will not be able to tell positive and negative examples apart and the curve would alternate between going up and right. The diagonal line on the plot represents the baseline – the performance that a random classifier would achieve. The further away the curve from the baseline, the better. Unfortunately, there is no available easy-to-use implementation of ROC curves in Java. So the algorithm for drawing a ROC curve is the following: Let POS be number of positive labels, and NEG be the number of negative labels Order data by the score, decreasing Start from (0, 0) For each example in the sorted order, if the example is positive, move 1 / POS up in the graph, otherwise, move 1 / NEG right in the graph. This is a simplified algorithm and assumes that the scores are distinct. If the scores aren't distinct, and there are different actual labels for the same score, some adjustment needs to be made. It is implemented in the class RocCurve which you will find in the source code. You can use it as following: RocCurve.plot(actual, prediction); Calling it will create a plot similar to this one: The area under the curve says how good the separation is. If the separation is very good, then the area will be close to one. But if the classifier cannot distinguish between positive and negative examples, the curve will go around the random baseline curve, and the area will be close to 0.5. Area Under the Curve is often abbreviated as AUC, or, sometimes, AU ROC – to emphasize that the Curve is a ROC Curve. AUC has a very nice interpretation: the value of AUC corresponds to probability that a randomly selected positive example is scored higher than a randomly selected negative example. Naturally, if this probability is high, our classifier does a good job separating positive and negative examples. This makes AUC a to-go evaluation metric for many cases, especially when the dataset is unbalanced – in the sense that there are a lot more examples of one class than another. Luckily, there are implementations of AUC in Java. For example, it is implemented in Smile. You can use it like this: double[] predicted = ... // int[] truth = ... // double auc = AUC.measure(truth, predicted); Result Validation When learning from data there is always the danger of overfitting. Overfitting occurs when the model starts learning the noise in the data instead of detecting useful patterns. It is always important to check if a model overfits – otherwise it will not be useful when applied to unseen data. The typical and most practical way of checking whether a model overfits or not is to emulate “unseen data” – that is, take a part of the available labeled data and do not use it for training. This technique is called “hold out”: we hold out a part of the data and use it only for evaluation. Often we shuffle the original data set before splitting. In many cases we make a simplifying assumption that the order of data is not important – that is, one observation has no influence on another. In this case shuffling the data prior to splitting will remove effects that the order of items might have. On the other hand, if the data is a Time Series data, then shuffling it is not a good idea, because there is some dependence between observations. So, let us implement the hold out split. We assume that the data that we have is already represented by X – a two-dimensional array of doubles with features and y – a one-dimensional array of labels. First, let us create a helper class for holding the data: public class Dataset { private final double[][] X; private final double[] y; // constructor and getters are omitted } Splitting our dataset should produce two datasets, so let us create a class for that as well: public class Split { private final Dataset train; private final Dataset test; // constructor and getters are omitted } Now suppose we want to split the data into two parts: train and test. We also want to specify the size of the train set, we will do it using a testRatio parameter: the percentage of items that should go to the test set. So the first thing we do is generating an array with indexes and then splitting it according to testRatio: int[] indexes = IntStream.range(0, dataset.length()).toArray(); int trainSize = (int) (indexes.length * (1 - testRatio)); int[] trainIndex = Arrays.copyOfRange(indexes, 0, trainSize); int[] testIndex = Arrays.copyOfRange(indexes, trainSize, indexes.length); We can also shuffle the indexes if we want: Random rnd = new Random(seed); for (int i = indexes.length - 1; i > 0; i--) { int index = rnd.nextInt(i + 1); int tmp = indexes[index]; indexes[index] = indexes[i]; indexes[i] = tmp; } Then we can select instances for the training set as follows: int trainSize = trainIndex.length; double[][] trainX = new double[trainSize][]; double[] trainY = new double[trainSize]; for (int i = 0; i < trainSize; i++) { int idx = trainIndex[i]; trainX[i] = X[idx]; trainY[i] = y[idx]; } And then finally wrap it into our Dataset class: Dataset train = new Dataset(trainX, trainY); If we repeat the same for the test set, we can put both train and test sets into a Split object: Split split = new Split(train, test); And now we can use train fold for training and test fold for testing the models. If we put all the code above into a function of the Dataset class, for example, trainTestSplit, we can use it as follows: Split split = dataset.trainTestSplit(0.2); Dataset train = split.getTrain(); // train the model using train.getX() and train.getY() Dataset test = split.getTest(); // test the model using test.getX(); test.getY(); K-Fold Cross Validation Holding out only one part of the data may not always be the best option. What we can do instead is splitting it into K parts and then testing the models only on 1/Kth of the data. This is called K-Fold Cross-Validation: it not only gives the performance estimation, but also the possible spread of the error. Typically we are interested in models which give good and consistent performance. K-Fold Cross-Validation helps us to select such models. Then we prepare the data for K-Fold Cross-Validation is the following: First, split the data into K parts Then for each of these parts Take one part as the validation set Take the remaining K-1 parts as the training set If we translate this into Java, the first step will look like this: int[] indexes = IntStream.range(0, dataset.length()).toArray(); int[][] foldIndexes = new int[k][]; int step = indexes.length / k; int beginIndex = 0; for (int i = 0; i < k - 1; i++) { foldIndexes[i] = Arrays.copyOfRange(indexes, beginIndex, beginIndex + step); beginIndex = beginIndex + step; } foldIndexes[k - 1] = Arrays.copyOfRange(indexes, beginIndex, indexes.length); This creates an array of indexes for each fold. You can also shuffle the indexes array as previously. Now we can create splits from each fold: List<Split> result = new ArrayList<>(); for (int i = 0; i < k; i++) { int[] testIdx = folds[i]; int[] trainIdx = combineTrainFolds(folds, indexes.length, i); result.add(Split.fromIndexes(dataset, trainIdx, testIdx)); } In the code above we have two additional methods: combineTrainFolds K-1 arrays with indexes and combines them into one Split.fromIndexes creates a split gives train and test indexes. We have already covered the second function when we created a simple hold-out test set. And the first function, combineTrainFolds, is implemented like this: private static int[] combineTrainFolds(int[][] folds, int totalSize, int excludeIndex) { int size = totalSize - folds[excludeIndex].length; int result[] = new int[size]; int start = 0; for (int i = 0; i < folds.length; i++) { if (i == excludeIndex) { continue; } int[] fold = folds[i]; System.arraycopy(fold, 0, result, start, fold.length); start = start + fold.length; } return result; } Again, we can put the code above into a function of the Dataset class and call it like follows: List<Split> folds = train.kfold(3); Now when we have a list of Split objects, we can create a special function for performing Cross-Validation: public static DescriptiveStatistics crossValidate(List<Split> folds, Function<Dataset, Model> trainer) { double[] aucs = folds.parallelStream().mapToDouble(fold -> { Dataset foldTrain = fold.getTrain(); Dataset foldValidation = fold.getTest(); Model model = trainer.apply(foldTrain); return auc(model, foldValidation); }).toArray(); return new DescriptiveStatistics(aucs); } What this function does takes a list of folds and a callback which inside creates a model. Then, after the model is trained, we calculate AUC for it. Additionally, we take advantage of Java's ability to parallelize loops and train models on each fold at the same time. Finally, we put the AUCs calculated on each fold into a DescriptiveStatistics object, which can later on be used to return the mean and the standard deviation of the AUCs. As you probably remember, the DescriptiveStatistics class comes from the Apache Commons Math library. Let us consider an example. Suppose we want to use Logistic Regression from LIBLINEAR and select the best value for the regularization parameter C. We can use the function above this way: double[] Cs = { 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0 }; for (double C : Cs) { DescriptiveStatistics summary = crossValidate(folds, fold -> { Parameter param = new Parameter(SolverType.L1R_LR, C, 0.0001); return LibLinear.train(fold, param); }); double mean = summary.getMean(); double std = summary.getStandardDeviation(); System.out.printf("L1 logreg C=%7.3f, auc=%.4f ± %.4f%n", C, mean, std); } Here, LibLinear.train is a helper method which takes a Dataset object and a Parameter object and then trains a LIBLINEAR model. This will print AUC for all provided values of C, so you can see which one is the best, and pick the one with highest mean AUC. Training, Validation and Testing When doing the Cross-Validation there’s still a danger of overfitting. Since we try a lot of different experiments on the same validation set, we might accidentally pick the model which just happened to do well on the validation set – but it may later on fail to generalize to unseen data. The solution to this problem is to hold out a test set at the very beginning and do not touch it at all until we select what we think is the best model. And we use it only for evaluating the final model on it. So how do we select the best model? What we can do is to do Cross-Validation on the remaining train data. It can be either hold out or K-Fold Cross-Validation. In general you should prefer doing K-Fold Cross-Validation because it also gives you the spread of performance, and you may use it in for model selection as well. The typical workflow should be the following: (0) Select some metric for validation, e.g. accuracy or AUC. (1) Split all the data into train and test sets (2) Split the training data further and hold out a validation dataset or split it into K folds (3) Use the validation data for model selection and parameter optimization (4) Select the best model according to the validation set and evaluate it against the hold out test set It is important to avoid looking at the test set too often. It should be used only occasionally for final evaluation to make sure the selected model does not overfit. If the validation scheme is set up properly, the validation score should correspond to the final test score. If this happens, we can be sure that the model does not overfit and is able to generalize to unseen data. Using the classes and the code we created previously, it translates to the following Java code: Dataset data = new Dataset(X, y); Dataset train = split.getTrain(); List<Split> folds = train.kfold(3); // now use crossValidate(folds, ...) to select the best model Dataset test = split.getTest(); // do final evaluation of the best model on test With this information we are ready to do a project on Binary Classification. Summary In this article we spoke about supervised machine learning and about two common supervised problems: Classification and Regression. We also covered the libraries which are commonly used algorithms , implemented and how to evaluate the performance of these algorithms. There is another family of Machine Learning algorithms that do not require the label information: these methods are called Unsupervised Learning. Resources for Article: Further resources on this subject: Supervised Machine Learning [article] Clustering and Other Unsupervised Learning Methods [article] Data Clustering [article]
Read more
  • 0
  • 0
  • 5294

article-image-installing-and-configuring-joomla-local-and-remote-servers
Packt
27 May 2010
8 min read
Save for later

Installing and Configuring Joomla! on Local and Remote Servers

Packt
27 May 2010
8 min read
Like every other endeavor in life, there are two ways of installing Joomla!—the easy way and the difficult way. In order to do it the difficult way, you will need to set up your server by yourself before you proceed with the installation. You have the choice of environment to use for your new installation: you may install directly to a live server or you can set up a test environment on your local computer. Minimum system requirements A fully-operational web server (preferably Apache) is required to successfully install and use Joomla!. You also need a database (preferably MySQL) and the server-side scripting language PHP, together with specific modules for MySQL, XML, and Zlib functionality, which are activated within PHP amongst others. Following are the minimum versions of these server components that are required: Software Minimum requirement Recommended Latest options Website PHP 4.3.10 4.4.7 5.x.series http://php.net MySQL 3.23.x or above   5.x series http://dev.mysql.com/downloads/mysql/5.0.html Apache 1.3 or above   2.2 series http://httpd.apache.org Installation on a local computer There are a number of ways to set up a test environment on your own local computer. Most time-pressed developers will install and configure directly onto a live server, but there are good reasons for running your application first on a local development server: Developing your application locally allows you to work on it even if you are not connected to the Internet. The experience that you gain from getting your local server running on a simple installation like WampServer for a machine running Windows, or MAMP for a Mac, will make it easier for you to understand server processes and databases. This knowledge will certainly pay off as you get deeper into Joomla!. The Web is constantly scanned and archived by various search engines, and so will everything that you put on the Web. This information will be around for quite a bit of time and you certainly don't want Google to display your inevitable learning mistakes for the world to see. Installation on WampServer WampServer2 enables you to run Apache, MySQL, and PHP on Windows, and is available for free. You can download it from http://www.wampserver.com. The WampServer2 package comes with the following components already configured to work together: Apache 2.2.11 MySQL 5.1.30 PHP 5.2.8 There are similar packages that already include Joomla! (such as DeveloperSide and WDE). However, note that these may not always have the latest secure version of Joomla!. Therefore, it is better to use WampServer2 and load it with the Joomla! version of your choice. WampServer2 is self-installing. Just double-click on the icon after you've unpacked your zipped download and follow the installation instructions. This gets your development environment ready in just a few minutes. Now let the fun begin! Installing Joomla! 1.5 on localhost Installing Joomla! on localhost can be remarkably straightforward: Download the latest stable release of Joomla!, and unzip the discernible Joomla! 1.5 folder. You will need to use a tool such as WinZip for Windows (http://www.winzip.com) to perform this task. Another free alternative to WinZip is IZArc (http://www.izarc.org/). Locate the directory in which WampServer2 is installed on your computer. This will usually be the root of your computer's main directory (usually C:) and in a folder named wamp. Open it, locate the www folder, and copy your Joomla! 1.5 folder into it. Name the Joomla! folder as per your preference. In the unlikely scenario that you will be installing only one Joomla! test site on your local machine, you may—just for the sake of living to a ripe old age—name it Joomla. Navigate to your computer desktop taskbar tray, click on the WampServer icon, and select the Start All Services option (if this has not yet been selected). Then open your browser and navigate to http://localhost. This will display the main WampServer2 interface and you will see your project, Joomla, listed under Your Projects. However, because Joomla! needs a database to run, you must create the database for our project. Click on the phpMyAdmin link. This will display another interface, this time for phpMyAdmin. Enter the name of your new database in the Create new database field. For the simple reason that we shouldn't live a complicated life, we have given this database the same name (Joomla) as our project and our installation folder. We are now ready to install Joomla!. Joomla! has an automated installation script that automatically populates database tables. The installation script will set the base URL, connect Joomla! to the database, and create tables in the database. Navigate to http://localhost/Joomla and start your installation. You will be presented with a step-by-step guide to the installation in the later pages. Page 1: Choose Language This page asks what language you want to use for the installation. We select English as the language of choice. Click on the Next button to continue with your installation. Page 2: Pre-installation Check Checks are made by the installation script on server configuration. If any of these items are not supported (marked as No), then your system does not meet the minimum requirements for the installation of Joomla!, and you should take the appropriate actions in order to correct the errors. Failing to do so could lead to your Joomla! installation not functioning properly. You are not likely to face this problem on your local installation, and if you do, it can be quickly fixed. Page 3: License The License page is just a page of legalese that the developers of Joomla! really think you must read. It tells you the conditions for use of the free software and also spells out liabilities. Primarily, if the Joomla! installation makes your computer to explode or you start acting funny after you are done with the installation, the suppliers of Joomla! will accept no liability. After correcting any host-specific errors and reading the license, you will be directed to the Database Configuration page. Page 4: Database Configuration On the Database Configuration page, you specify the database where Joomla! will store and retrieve data. Select the type of database from the drop-down list. This will generally be mysql. Also, enter the hostname of the database server that Joomla! will be installed on. This may not necessarily be the same as your web server, so check with your hosting provider if you are not sure. In the present case, it will be localhost. Now enter the MySQL username, password, and database name that you wish to use with Joomla!. These must already exist for the database that you are going to use. We have already determined this when we created the data base. Here are the values for our present project: Host Name: localhost Username: root (unless you have changed it) Password: Leave blank (unless you added a password when you set up your database) Database Name: Joomla (or whatever you have called yours) Table Prefix: If you are installing more than one instance of Joomla! in a single database, then give one of them a prefix (such as jos2_), or else the second instance will not install. Otherwise, you may safely ignore this. Click on the Next button to continue. Page 5: FTP Configuration You can safely omit this step and go on to the next page. But if you do decide to carry out this step, do take notice of the following requirements: As explained on this Joomla! installation page, due to file system permission restrictions on Linux and other Unix systems (and PHP Safe Mode restrictions), an FTP layer is used to handle fi le system manipulation and to enable Joomla! installers. You may enter an FTP username and password with access to the Joomla! root directory. This will be the FTP account that handles all of the file system operations when Joomla! requires FTP access to complete a task. However, for security reasons, if the option is available, you are advised to create a separate FTP user account with access to the Joomla! installation only and not to the entire web server. Page 6: Main Configuration This is where you define your site's public identity. The descriptions of the fields are as follows: Site Name: Enter the name of your Joomla! site. Your E-mail: This will be the e-mail address of the website Super Administrator. Admin Password: Enter a new password and then confirm it, in the appropriate fields. Along with the username admin, this will be the password that you will use to login to the Administrator Control Panel at the end of the installation. Install Sample Data: It is strongly recommended that new Joomla! users install the default sample data. To do this, select this option and click on the button, before proceeding to the next stage. Page 7: Finish That's all! Your installation success screen will be displayed. Congratulations, you are now on your way to becoming a Joomla! Guru. Once the installation script has successfully been completed and you have removed the installation folder, you will be directed to the Joomla! Administration login page—if you want to immediately perform administrative functions on your new site. Otherwise, clicking on the Site icon will direct you to the front page of your site showing all of the sample data—if you have loaded it. That is the most of what you need to know about installing Joomla!.
Read more
  • 0
  • 0
  • 5294

article-image-article-instant-u-torrent
Packt
09 Oct 2013
2 min read
Save for later

Features of uTorrent

Packt
09 Oct 2013
2 min read
Creating a torrent In order to share data, we first need to create and share a torrent. Go to File | Create New Torrent. We use this interface to create new torrents. Here we can select the source files we wish to share, the trackers to use, and configure other sharing behaviors. Click on the Add file / Add directory button. If you plan on sharing a file or sharing a directory, press the relevant button and find the data to share. If you wish to skip files in the directory you have chosen, add which files to skip in the Skip Files textbox. This textbox works by matching the filenames within a directory with a string, which can use wildcards (*) and pipes (|). Note If you want to match a specific file, enter the full filename (for example, filename.jpg). If you want to skip all .jpg files, enter *.jpg. If you want to skip all .jpg files and all .txt files, enter *.jpg|*.txt without spaces. Enter the trackers that this torrent will use. By default, uTorrent will enter trackers www.openbittorrent.com and www.publicbt.com, though you can add other trackers to this list if required. By adding more trackers to this list, the torrent will get more exposure to the potential participants. If you have the torrent contents accessible on a web server and want to list it as a backup location, enter the URL in the Web Seeds textbox. This option ensures the files inside the torrent are accessible even if no other participants are present to transfer data. Use the Comment textbox to enter any extra information that the users may require. Keep the Piece size as (auto detect) . This can be altered if desired, but it is better to leave it as it is, since the Piece size relates to how the files will be divided. If you manually choose 16 KB, the data will be divided into 16 KB chunks. Choosing the wrong piece size for the data size can make torrents inefficient, so it is better to let uTorrent decide. Enable the Start seeding checkbox. This option will make the torrent available to upload as soon as the torrent creation is complete. If you want this torrent to be private, enable the Private torrent checkbox. This will disable the DHT and PEX ( Peer Exchange ) protocols, making tracker the only method of interacting with this torrent.
Read more
  • 0
  • 0
  • 5286

article-image-binding-ms-chart-control-linq-data-source-control-2
Packt
03 Sep 2009
4 min read
Save for later

Binding MS Chart Control to LINQ Data Source Control

Packt
03 Sep 2009
4 min read
Introduction LINQ, short for Language Integrated Query, provides an object oriented approach to not only querying relational databases but also any kind of source such as XML, Collection of objects, etc. LINQ to SQL provides (O/R) object-relational mapping and Visual Studio 2008 IDE provides a (O/R) designer. Visual Studio 2008 also has a web server control called LinqDataSource control. This control requires a DataContext which is provided by the LINQ-to-SQL classes, a class generator that maps SQL objects to the model. Without this control one may have to generate the classes from scratch or by using the SQLMetal.exe utility which generates tables and columns. The readers may benefit by reading my  previous article on Displaying SQL Server Data using a Linq data source. We will continue to use the PrincetonTemp table used in the previous article on MySQL Data Transfer using Sql Server Integration Services. We will map this table to LINQ via LinqDataSource Control and then use it as the source of data for the chart. Create a Framework 3.5 Web Site project Open Visual Studio 2008 from its shortcut on the desktop. Click File | New | Web Site...(or Shift+Alt+N) to open the New Web Site window. Change the default name of the site to a name of your choice (herein PieChartWithLinq) on your local web server as shown. Make sure you are creating a .NET Framework 3.5 web site as shown here. This is very similar to creating a web site that can reside on the file system. Add a LinqDataControl and provide the data context Drag and drop a LinqDataSource control from Toolbox | Data shown in the next figure on to the Default.aspx This creates an instance of the control LinqDataSource1 as shown. The source page Default.aspx will display the  control as shown in the next figure. When you try to configure a data source for this control using the smart tasks (see Figure 3) the program would take you to a window which does not allow you create a new data context. But if there are existing items the program allows you to choose. Create a data context for the LinqDataSource control We will add a new data context. Right click the web site in Solution Explorer and pick Add New Item... from the drop-down menu. This opens the Add New Item window as shown. Highlight the item Linq to SQL Classes in the Add New Item window. Replace the default name DataClasses.dbml with one of your own. Herein it is replaced with Princeton.dbml. When you click OK after renaming you will get a Microsoft Visual Studio warning as shown. Read the message on this window. Click Yes to add Princeton.dbml consisting of two items to the App_Code folder on the web site project as shown. One is a layout designer and the other a code page. The (O/R) designer containing two panes is shown in the next figure. In the left pane you drag and drop table (s) from the server explorer assuming you a made a connection with a database. To the right you can add a stored procedure from the server explorer. Read the instructions on this page. Click (View | Server Explorer) to display the connections in the server explorer as shown. In the next figure you can see the expanded node of TestNorthwind displaying the table PrincetonTemp in the SQL Server 2008 Hodentek2Mysorian. The database and the server names are the ones used for this article and you may have different names in your machine. The connection used is already present but you can create a new connection in the server explorer window if you choose to do so. Drag and drop the PrincetonTemp table to the left pane of the (O/R) designer as shown. Build the project. This will make the objects available for the LinqDataSource1 on the default page.
Read more
  • 0
  • 0
  • 5286

article-image-oracle-siebel-crm-8-configuring-navigation
Packt
26 Apr 2011
6 min read
Save for later

Oracle Siebel CRM 8: Configuring Navigation

Packt
26 Apr 2011
6 min read
Oracle Siebel CRM 8 Developer's Handbook Understanding drilldown objects In Siebel CRM, a drilldown is the activity of clicking on a hyperlink, which typically leads to a more detailed view of the record where the hyperlink originated. The standard Siebel CRM applications provide many examples for drilldown objects, which can mainly be found on list applets such as in the following screenshot that shows the Opportunity List Applet: The Opportunity List Applet allows the end user to click on the opportunity name or the account name. Clicking on the Opportunity Name navigates to the Opportunity Detail - Contacts View in the same screen while clicking on the Account name navigates to the Account Detail - Contacts View on the Accounts screen. Siebel CRM supports both static and dynamic drilldown destinations. The Opportunity List Applet (in Siebel Industry Applications) defines dynamic drilldown destinations for the opportunity name column depending on the name of the product line associated with the opportunity. We can investigate this behavior by creating a test opportunity record and setting its Product Line field (in the More Info view) to Equity. When we now drill down on the Opportunity Name, we observe that the FINCORP Deal Equity View is the new navigation target, allowing the end user to provide detailed equity information for the opportunity. To test this behavior, we must use the Siebel Sample Database for Siebel Industry Applications (SIA) and log in as SADMIN. We can now inspect the Opportunity List Applet in Siebel Tools. Every applet that provides drilldown functionality has at least one definition for the Drilldown Object child type. To view the Drilldown Object definitions for the Opportunity List Applet we can follow the following procedure: Navigate to the Opportunity List Applet. In the Object Explorer, expand the Applet type and select the Drilldown Objects type. Inspect the list of Drilldown Object Definitions. The following screenshot shows the drilldown object definitions for the Opportunity List Applet: We can observe that a drilldown object defines a Hyperlink Field and a (target) View. These and other properties of drilldown objects are described in more detail later in this section. There are various instances of drilldown objects visible in the previous screenshot that reference the Name field. One instance—named Line of Business defines dynamic drilldown destinations that can be verified by expanding the Drilldown Object type in the Object Explorer and selecting the Dynamic Drilldown Destination type (with the Line of Business drilldown object selected). The following screenshot shows the dynamic drilldown destination child object definitions for the Line of Business drilldown object: The child list has been filtered to show only active records and the list is sorted by the Sequence property. Dynamic Drilldown Destinations define a Field of the applet's underlying business component and a Value. The Siebel application verifies the Field and Value for the current record and—if a matching dynamic drilldown destination record is found—uses the Destination Drilldown Object to determine the target view for the navigation. When no match is found, the view in the parent drilldown object is used for navigation. When we investigate the drilldown object named Primary Account, we learn that it defines a Source Field and a target business component, which is a necessity when the drilldown's target View uses a different business object than the View in which the applet is situated. In order to enable the Siebel application to retrieve the record in the target View, a source field that carries the ROW_ID of the target record and the business component to query must be specified. The following table describes the most important properties of the Drilldown Object type: The following table describes the most important properties for the Dynamic Drilldown Destination type: Creating static drilldowns In the following section, we will learn how to create static drilldowns from list and form applets. Case study example: static drilldown from list applet The AHA Customer Documents List Applet (Download code - Ch:9), which provides a unified view for all quotes, orders, opportunities, and so on, associated with an account. The applet should provide drilldown capability to the documents and the employee details of the responsible person. In the following procedure, we describe how to create a static drilldown from the AHA Customer Documents List Applet to the Relationship Hierarchy View (Employee), which displays the reporting hierarchy and employee details: Navigate to the AHA Customer Documents List Applet. Check out or lock the applet if necessary. In the Object Explorer, expand the Applet type, and select the Drilldown Object type. In the Drilldown Objects list, create a new record and provide the following property values: Name: Responsible Employee Hyperlink Field: Responsible User Login Name View: Relationship Hierarchy View (Employee) Source Field: Responsible User Id Business Component: Employee Destination Field: Id Visibility Type: All Compile the AHA Customer Documents List Applet. We will continue to work on the AHA Customer Documents List Applet later in this article. Creating drilldown hyperlinks on form applets Sometimes it is necessary to provide a drilldown hyperlink on a form applet. The following procedure describes how to accomplish this using the SIS Account Entry Applet as an example. The applet will provide a hyperlink that allows quick navigation to the Account Detail - Activities View: Navigate to the Account business component. Check out or lock the business component if necessary. Add a new field with the following properties: Name: AHA Drilldown Field 1 Calculated: TRUE Calculated Value: "Drilldown 1" (include the parentheses) Compile the Account business component. Did you know? We should create a dummy field like in the previous example to avoid interference with standard fields when creating drilldowns on form applets. This field will be referenced in the drilldown object and control. Navigate to the SIS Account Entry Applet. Check out or lock the applet if necessary. In the Object Explorer, expand the Applet type and select the Drilldown Object type. Create a new entry in the Drilldown Objects list with the following properties: Name: AHA Activity Drilldown Hyperlink Field: AHA Drilldown Field 1 View: Account Detail - Activities View In the Object Explorer, select the Control type. In the Controls list, create a new record with the following properties: Name: AHA Activity Drilldown Caption: Go to Activities Field: AHA Drilldown Field 1 HTML Type: Link Method Invoked: Drilldown Right-click the SIS Account Entry Applet in the top list and select Edit Web Layout to open the layout editor. Drag the AHA Activities Drilldown control from the Controls | Columns window to the grid layout and drop it below the Zip Code text box. Save the changes and close the web layout editor. Compile the SIS Account Entry Applet. Log in to the Siebel client and navigate to the Account List view. Click on the Go to Activities link in the form applet and verify that the activities list is displayed for the selected account. The following screenshot shows the result of the previous configuration procedure in the Siebel Web Client: Clicking the Go to Activities hyperlink on the form applet will navigate the user to the activities list view for the current account.
Read more
  • 0
  • 0
  • 5282

article-image-aspnet-repeater-control
Packt
23 Oct 2009
6 min read
Save for later

The ASP.NET Repeater Control

Packt
23 Oct 2009
6 min read
The Repeater control in ASP.NET is a data-bound container control that can be used to automate the display of a collection of repeated list items. These items can be bound to either of the following data sources: Database Table XML File In a Repeater control, the data is rendered as DataItems that are defined using one or more templates. You can even use HTML tags such as <li>, <ul>, or <div> if required. Similar to the DataGrid, DataList, or GridView controls, the Repeater control has a DataSource property that is used to set the DataSource of this control to any ICollection, IEnumerable, or IListSource instance. Once this is set, the data from one of these types of data sources can be easily bound to the Repeater control using itsDataBind() method. However, the Repeater control by itself does not support paging or editing of data. The Repeater control is light weight and does not contain so many features as the DataGrid contains. However, it enables you to place HTML code in its templates. It is great in situations where you need to display the data quickly and format the data to be displayed easily. Using the Repeater Control The Repeater control is a data-bound control that uses templates to display data. It does not have any built-in support for paging, editing, or sorting of the data that is rendered through one or more of its templates. The Repeater control works by looping through the records in your data source and then repeating the rendering of one of its templates called the ItemTemplate, one that contains the records that the control needs to render. To use this control, drag and drop the control in the design view of the web form onto a web form from the toolbox. Refer to the following screenshot: You can also drag and drop the Repeater control from the toolbox onto the source view directly. This is shown in the following screenshot: For customizing the behavior of this control, you have to use the built-in templates that this control comes with. These templates are actually blocks of HTML code. The Repeater control contains the following five templates: HeaderTemplate ItemTemplate AlternatingItemTemplate SeparatorTemplate FooterTemplate The following screenshot shows how a Repeater control looks when populated with data. Note that the templates (Header, Item, Footer, Alternate and Separator) have all been used. The following code snippet is an example of the order in which the templates of the Repeater control are used. <asp:Repeater id="repEmployee" runat="server"><HeaderTemplate>...</HeaderTemplate><ItemTemplate></ItemTemplate><FooterTemplate>...</FooterTemplate></asp:Repeater> When the Repeater control is bound to a data source, the data from the data source is displayed using the ItemTemplate element and any other optional elements, if used. Note that the contents of the HeaderTemplate and the FooterTemplate are rendered once for each Repeater control. The contents of the ItemTemplate are rendered for each record in the control. You can also use the additional AlternatingItemTemplate element after the ItemTemplate element for specifying the appearance of each alternate record. You can also use the SeparatorTemplate element between each record for specifying the separators for the records. Displaying Data Using the Repeater Control This section discusses how we can display data using the Repeater control. As discussed earlier, the Repeater control uses templates for formatting the data that it displays. The following code snippet displays the code in an .aspx file that contains a Repeater control. Note that we would be making use of templates and that the data would be bound to the control from the code-behind file using the DataManager class. The Repeater control is populated with data in the Page_Load event by reusing the DataManager(). Note how the SeparatorTemplate and the AlternatingItemTemplate have been used in the previous code example. Further, the DataBinder.Eval() method has been used to display the values of the corresponding fields from the data container, (in our case, the DataSet instance) in the Repeater control. The FooterTemplate uses the Total Records variable and substitutes its value to display the total number of records displayed by the control. The following is the output on execution. The Header and the Footer templates of the Repeater control are still rendered even if the data source does not contain any data. If you want to suppress their display, you can use the Visible property of the Repeater control and use it to suppress the display of these templates with a simple logic. Here is how you specify the Visible property of this control in your .aspx file to achieve this_Visible="<%# Repeater1.Items.Count > 0 %>"When you specify the Visible property as shown here, the Repeater is made visible only if there are records in your data source.   Displaying Checkboxes in a Repeater Control Let us now understand how we can display checkboxes in a Repeater Control and retrieve the number of checked items. We will use a Button control and a Label control in our page. When you click on the Button control, the number of checked items in the Repeater Control will be displayed in the Label control. The output on execution is similar to what is shown in the following screenshot: Here is the code that we will use in the .aspx file to display checkboxes in a Repeater control. The data is bound to the Repeater control in the Page_Load event as follows: Note that we have used the Page.IsPostBack to check whether the page has posted back in the Page_Load method. If you don't bind data by checking whether the page has posted back, the Repeater control will be rebound to data once again after a postback and all the checkboxes in your web page will be reset to the unchecked state. The source code for the click event of the Button control that we have used is as follows: When you execute the application, the Repeater control is displayed with records from the employee table. Now you check one or more of the checkboxes and then click on the Button control just beneath the Repeater control as follows: Note that the number of checked records is displayed in the Label Control. Summary This article discussed the Repeater control and how we can use it in ourASP.NET applications. Though this control does not support all the functionalities of other data controls, like DataGrid and GridView, it is still a good choice if you want faster rendering of data as it is light weight, and is very flexible.
Read more
  • 0
  • 0
  • 5278
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-understanding-mapreduce
Packt
07 Aug 2013
18 min read
Save for later

Understanding MapReduce

Packt
07 Aug 2013
18 min read
(For more resources related to this topic, see here.) Key/value pairs Here we will explain why some operations process and provide the output in terms of key/value pair. What it mean Firstly, we will clarify just what we mean by key/value pairs by highlighting similar concepts in the Java standard library. The java.util.Map interface is the parent of commonly used classes such as HashMap and (through some library backward reengineering) even the original Hashtable. For any Java Map object, its contents are a set of mappings from a given key of a specified type to a related value of a potentially different type. A HashMap object could, for example, contain mappings from a person's name (String) to his or her birthday (Date). In the context of Hadoop, we are referring to data that also comprises keys that relate to associated values. This data is stored in such a way that the various values in the data set can be sorted and rearranged across a set of keys. If we are using key/value data, it will make sense to ask questions such as the following: Does a given key have a mapping in the data set? What are the values associated with a given key? What is the complete set of keys? We will go into Wordcount in detail shortly, but the output of the program is clearly a set of key/value relationships; for each word (the key), there is a count (the value) of its number of occurrences. Think about this simple example and some important features of key/value data will become apparent, as follows: Keys must be unique but values need not be Each value must be associated with a key, but a key could have no values (though not in this particular example) Careful definition of the key is important; deciding on whether or not the counts are applied with case sensitivity will give different results Note that we need to define carefully what we mean by keys being unique here. This does not mean the key occurs only once; in our data set we may see a key occur numerous times and, as we shall see, the MapReduce model has a stage where all values associated with each key are collected together. The uniqueness of keys guarantees that if we collect together every value seen for any given key, the result will be an association from a single instance of the key to every value mapped in such a way, and none will be omitted. Why key/value data? Using key/value data as the foundation of MapReduce operations allows for a powerful programming model that is surprisingly widely applicable, as can be seen by the adoption of Hadoop and MapReduce across a wide variety of industries and problem scenarios. Much data is either intrinsically key/value in nature or can be represented in such a way. It is a simple model with broad applicability and semantics straightforward enough that programs defined in terms of it can be applied by a framework like Hadoop. Of course, the data model itself is not the only thing that makes Hadoop useful; its real power lies in how it uses the techniques of parallel execution, and divide and conquer. We can have a large number of hosts on which we can store and execute data and even use a framework that manages the division of the larger task into smaller chunks, and the combination of partial results into the overall answer. But we need this framework to provide us with a way of expressing our problems that doesn't require us to be an expert in the execution mechanics; we want to express the transformations required on our data and then let the framework do the rest. MapReduce, with its key/value interface, provides such a level of abstraction, whereby the programmer only has to specify these transformations and Hadoop handles the complex process of applying this to arbitrarily large data sets. Some real-world examples To become less abstract, let's think of some real-world data that is key/value pair: An address book relates a name (key) to contact information (value) A bank account uses an account number (key) to associate with the account details (value) The index of a book relates a word (key) to the pages on which it occurs (value) On a computer filesystem, filenames (keys) allow access to any sort of data, such as text, images, and sound (values) These examples are intentionally broad in scope, to help and encourage you to think that key/value data is not some very constrained model used only in high-end data mining but a very common model that is all around us. We would not be having this discussion if this was not important to Hadoop. The bottom line is that if the data can be expressed as key/value pairs, it can be processed by MapReduce. MapReduce as a series of key/value transformations You may have come across MapReduce described in terms of key/value transformations, in particular the intimidating one looking like this: {K1,V1} -> {K2, List<V2>} -> {K3,V3} We are now in a position to understand what this means: The input to the map method of a MapReduce job is a series of key/value pairs that we'll call K1 and V1. The output of the map method (and hence input to the reduce method) is a series of keys and an associated list of values that are called K2 and V2. Note that each mapper simply outputs a series of individual key/value outputs; these are combined into a key and list of values in the shuffle method. The final output of the MapReduce job is another series of key/value pairs, called K3 and V3 These sets of key/value pairs don't have to be different; it would be quite possible to input, say, names and contact details and output the same, with perhaps some intermediary format used in collating the information. Keep this three-stage model in mind as we explore the Java API for MapReduce next. We will first walk through the main parts of the API you will need and then do a systematic examination of the execution of a MapReduce job. The Hadoop Java API for MapReduce Hadoop underwent a major API change in its 0.20 release, which is the primary interface in the 1.0 version. Though the prior API was certainly functional, the community felt it was unwieldy and unnecessarily complex in some regards. The new API, sometimes generally referred to as context objects, for reasons we'll see later, is the future of Java's MapReduce development. Note that caveat: there are parts of the pre-0.20 MapReduce libraries that have not been ported to the new API, so we will use the old interfaces when we need to examine any of these. The 0.20 MapReduce Java API The 0.20 and above versions of MapReduce API have most of the key classes and interfaces either in the org.apache.hadoop.mapreduce package or its subpackages. In most cases, the implementation of a MapReduce job will provide job-specific subclasses of the Mapper and Reducer base classes found in this package. We'll stick to the commonly used K1/K2/K3/ and so on terminology, though more recently the Hadoop API has, in places, used terms such as KEYIN/VALUEIN and KEYOUT/VALUEOUT instead. For now, we will stick with K1/K2/K3 as it helps understand the end-to-end data flow. The Mapper class This is a cut-down view of the base Mapper class provided by Hadoop. For our own mapper implementations, we will subclass this base class and override the specified method as follows: class Mapper<K1, V1, K2, V2> { void map(K1 key, V1 value Mapper.Context context) throws IOException, InterruptedException {..} } Although the use of Java generics can make this look a little opaque at first, there is actually not that much going on. The class is defined in terms of the key/value input and output types, and then the map method takes an input key/value pair in its parameters. The other parameter is an instance of the Context class that provides various mechanisms to communicate with the Hadoop framework, one of which is to output the results of a map or reduce method. Notice that the map method only refers to a single instance of K1 and V1 key/ value pairs. This is a critical aspect of the MapReduce paradigm in which you write classes that process single records and the framework is responsible for all the work required to turn an enormous data set into a stream of key/ value pairs. You will never have to write map or reduce classes that try to deal with the full data set. Hadoop also provides mechanisms through its InputFormat and OutputFormat classes that provide implementations of common file formats and likewise remove the need of having to write file parsers for any but custom file types. There are three additional methods that sometimes may be required to be overridden. protected void setup( Mapper.Context context) throws IOException, Interrupted Exception This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing. protected void cleanup( Mapper.Context context) throws IOException, Interrupted Exception This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing. protected void run( Mapper.Context context) throws IOException, Interrupted Exception This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once before repeatedly calling the map method for each key/value pair in the split, and then finally calls the cleanup method . The Reducer class The Reducer base class works very similarly to the Mapper class, and usually requires only subclasses to override a single reduce method . Here is the cut-down class definition: public class Reducer<K2, V2, K3, V3> { void reduce(K1 key, Iterable<V2> values, Reducer.Context context) throws IOException, InterruptedException {..} } Again, notice the class definition in terms of the broader data flow (the reduce method accepts K2/V2 as input and provides K3/V3 as output) while the actual reduce method takes only a single key and its associated list of values. The Context object is again the mechanism to output the result of the method. This class also has the setup, run, and cleanup methods with similar default implementations as with the Mapper class that can optionally be overridden: protected void setup( Reduce.Context context) throws IOException, InterruptedException This method is called once before any key/lists of values are presented to the reduce method. The default implementation does nothing. protected void cleanup( Reducer.Context context) throws IOException, InterruptedException This method is called once after all key/lists of values have been presented to the reduce method. The default implementation does nothing. protected void run( Reducer.Context context) throws IOException, InterruptedException This method controls the overall flow of processing the task within JVM. The default implementation calls the setup method before repeatedly calling the reduce method for as many key/values provided to the Reducer class, and then finally calls the cleanup method. The Driver class Although our mapper and reducer implementations are all we need to perform the MapReduce job, there is one more piece of code required: the driver that communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it. There is no default parent Driver class as a subclass; the driver logic usually exists in the main method of the class written to encapsulate a MapReduce job. Take a look at the following code snippet as an example driver. Don't worry about how each line works, though you should be able to work out generally what each is doing: public class ExampleDriver { ... public static void main(String[] args) throws Exception { // Create a Configuration object that is used to set other options Configuration conf = new Configuration() ; // Create the object representing the job Job job = new Job(conf, "ExampleJob") ; // Set the name of the main class in the job jarfile job.setJarByClass(ExampleDriver.class) ; // Set the mapper class job.setMapperClass(ExampleMapper.class) ; // Set the reducer class job.setReducerClass(ExampleReducer.class) ; // Set the types for the final output key and value job.setOutputKeyClass(Text.class) ; job.setOutputValueClass(IntWritable.class) ; // Set input and output file paths FileInputFormat.addInputPath(job, new Path(args[0])) ; FileOutputFormat.setOutputPath(job, new Path(args[1])) // Execute the job and wait for it to complete System.exit(job.waitForCompletion(true) ? 0 : 1); } }} Given our previous talk of jobs, it is not surprising that much of the setup involves operations on a Job object. This includes setting the job name and specifying which classes are to be used for the mapper and reducer implementations. Certain input/output configurations are set and, finally, the arguments passed to the main method are used to specify the input and output locations for the job. This is a very common model that you will see often. There are a number of default values for configuration options, and we are implicitly using some of them in the preceding class. Most notably, we don't say anything about the file format of the input files or how the output files are to be written. These are defined through the InputFormat and OutputFormat classes mentioned earlier; we will explore them in detail later. The default input and output formats are text files that suit our WordCount example. There are multiple ways of expressing the format within text files in addition to particularly optimized binary formats. A common model for less complex MapReduce jobs is to have the Mapper and Reducer classes as inner classes within the driver. This allows everything to be kept in a single file, which simplifies the code distribution. Writing MapReduce programs We have been using and talking about WordCount for quite some time now; let's actually write an implementation, compile, and run it, and then explore some modifications. Time for action – setting up the classpath To compile any Hadoop-related code, we will need to refer to the standard Hadoop-bundled classes. Add the Hadoop-1.0.4.core.jar file from the distribution to the Java classpath as follows: $ export CLASSPATH=.:${HADOOP_HOME}/Hadoop-1.0.4.core.jar:${CLASSPATH} What just happened? This adds the Hadoop-1.0.4.core.jar file explicitly to the classpath alongside the current directory and the previous contents of the CLASSPATH environment variable. Once again, it would be good to put this in your shell startup file or a standalone file to be sourced. We will later need to also have many of the supplied third-party libraries that come with Hadoop on our classpath, and there is a shortcut to do this. For now, the explicit addition of the core JAR file will suffice. Time for action – implementing WordCount We will explore our own Java implementation by performing the following steps: Enter the following code into the WordCount1.java file: Import java.io.* ; import org.apache.hadoop.conf.Configuration ; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class WordCount1 { public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { String[] words = value.toString().split(" ") ; for (String str: words) { word.set(str); context.write(word, one); } } } public static class WordCountReducer extends Reducer<Text,IntWritable,Text,IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int total = 0; for (IntWritable val : values) { total++ ; } context.write(key, new IntWritable(total)); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "word count"); job.setJarByClass(WordCount1.class); job.setMapperClass(WordCountMapper.class); job.setReducerClass(WordCountReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } Now compile it by executing the following command: $ javac WordCount1.java What just happened? This is our first complete MapReduce job. Look at the structure and you should recognize the elements we have previously discussed: the overall Job class with the driver configuration in its main method and the Mapper and Reducer implementations defined as inner classes. We'll do a more detailed walkthrough of the mechanics of MapReduce in the next section, but for now let's look at the preceding code and think of how it realizes the key/value transformations we talked about earlier. The input to the Mapper class is arguably the hardest to understand, as the key is not actually used. The job specifies TextInputFormat as the format of the input data and, by default, this delivers to the mapper data where the key is the line number in the file and the value is the text of that line. In reality, you may never actually see a mapper that uses that line number key, but it is provided. The mapper is executed once for each line of text in the input source and every time it takes the line and breaks it into words. It then uses the Context object to output (more commonly known as emitting) each new key/value of the form <word, 1 >. These are our K2/V2 values. We said before that the input to the reducer is a key and a corresponding list of values, and there is some magic that happens between the map and reduce methods to collect together the values for each key that facilitates this, which we'll not describe right now. Hadoop executes the reducer once for each key and the preceding reducer implementation simply counts the numbers in the Iterable object and gives output for each word in the form of <word, count>. This is our K3/V3 values. Take a look at the signatures of our mapper and reducer classes: the WordCountMapper class gives IntWritable and Text as input and gives Text and IntWritable as output. The WordCountReducer class gives Text and IntWritableboth as input and output. This is again quite a common pattern, where the map method performs an inversion on the key and values, and instead emits a series of data pairs on which the reducer performs aggregation. The driver is more meaningful here, as we have real values for the parameters. We use arguments passed to the class to specify the input and output locations. Time for action – building a JAR file Before we run our job in Hadoop, we must collect the required class files into a single JAR file that we will submit to the system. Create a JAR file from the generated class files. $ jar cvf wc1.jar WordCount1*class What just happened? We must always package our class files into a JAR file before submitting to Hadoop, be it local or on Elastic MapReduce. Be careful with the JAR command and file paths. If you include in a JAR file class the files from a subdirectory, the class may not be stored with the path you expect. This is especially common when using a catch-all classes directory where all source data gets compiled. It may be useful to write a script to change into the directory, convert the required files into JAR files, and move the JAR files to the required location. Time for action – running WordCount on a local Hadoop cluster Now we have generated the class files and collected them into a JAR file, we can run the application by performing the following steps: Submit the new JAR file to Hadoop for execution. $ hadoop jar wc1.jar WordCount1 test.txt output Check the output file; it should be as follows: $ Hadoop fs –cat output/part-r-00000 This 1 yes 1 a 1 is 2 test 1 this 1 What just happened? This is the first time we have used the Hadoop JAR command with our own code. There are four arguments: The name of the JAR file. The name of the driver class within the JAR file. The location, on HDFS, of the input file (a relative reference to the /user/Hadoop home folder, in this case). The desired location of the output folder (again, a relative path). The name of the driver class is only required if a main class has not (as in this case) been specified within the JAR file manifest.
Read more
  • 0
  • 0
  • 5275

Packt
14 Feb 2014
6 min read
Save for later

CreateJS – Performing Animation and Transforming Function

Packt
14 Feb 2014
6 min read
(For more resources related to this topic, see here.) Creating animations with CreateJS As you may already know, creating animations in web browsers during web development is a difficult job because you have to write code that has to work in all browsers; this is called browser compatibility. The good news is that CreateJS provides modules to write and develop animations in web browsers without thinking about browser compatibility. CreateJS modules can do this job very well and all you need to do is work with CreateJS API. Understanding TweenJS TweenJS is one of the modules of CreateJS that helps you develop animations in web browsers. We will now introduce TweenJS. The TweenJS JavaScript library provides a simple but powerful tweening interface. It supports tweening of both numeric object properties and CSS style properties, and allows you to chain tweens and actions together to create complex sequences.—TweenJS API Documentation What is tweening? Let us understand precisely what tweening means: Inbetweening or tweening is the process of generating intermediate frames between two images to give the appearance that the first image evolves smoothly into the second image.—Wikipedia The same as other CreateJS subsets, TweenJS contains many functions and methods; however, we are going to work with and create examples for specific basic methods, based on which you can read the rest of the documentation of TweenJS to create more complex animations. Understanding API and methods of TweenJS In order to create animations in TweenJS, you don't have to work with a lot of methods. There are a few functions that help you to create animations. Following are all the methods with a brief description: get: It returns a new tween instance. to: It queues a tween from the current values to the target properties. set: It queues an action to set the specified properties on the specified target. wait: It queues a wait (essentially an empty tween). call: It queues an action to call the specified function. play: It queues an action to play (un-pause) the specified tween. pause: It queues an action to pause the specified tween. The following is an example of using the Tweening API: var tween = createjs.Tween.get(myTarget).to({x:300},400). set({label:"hello!"}).wait(500).to({alpha:0,visible:false},1000). call(onComplete); The previous example will create a tween, which: Tweens the target to an x value of 300 with duration 400ms and sets its label to hello!. Waits 500ms. Tweens the target's alpha property to 0with duration 1s and sets the visible property to false. Finally, calls the onComplete function. Creating a simple animation Now, it's time to create our simplest animation with TweenJS. It is a simple but powerful API, which gives you the ability to develop animations with method chaining. Scenario The animation has a red ball that comes from the top of the Canvas element and then drops down. In the preceding screenshot, you can see all the steps of our simple animation; consequently, you can predict what we need to do to prepare this animation. In our animation,we are going to use two methods: get and to. The following is the complete source code for our animation: var canvas = document.getElementById("canvas"); var stage = new createjs.Stage(canvas); var ball = new createjs.Shape(); ball.graphics.beginFill("#FF0000").drawCircle(0, 0, 50); ball.x = 200; ball.y = -50; var tween = createjs.Tween.get(ball) to({ y: 300 }, 1500, createjs.Ease.bounceOut); stage.addChild(ball); createjs.Ticker.addEventListener("tick", stage); In the second and third line of the JavaScript code snippet, two variables are declared, namely; the canvas and stage objects. In the next line, the ball variable is declared, which contains our shape object. In the following line, we drew a red circle with the drawCircle method. Then, in order to set the coordinates of our shape object outside the viewport, we set the x axis to -50 px. After this, we created a tween variable, which holds the Tween object; then, using the TweenJS method chaining, the to method is called with duration of 1500 ms and the y property set to 300 px. The third parameter of the to method represents the ease function of tween, which we set to bounceOut in this example. In the following lines, the ball variable is added to Stage and the tick event is added to the Ticker class to keep Stage updated while the animation is playing. You can also find the Canvas element in line 30, using which all animations and shapes are rendered in this element. Transforming shapes CreateJS provides some functions to transform shapes easily on Stage. Each DisplayObject has a setTransform method that allows the transforming of a DisplayObject (like a circle). The following shortcut method is used to quickly set the transform properties on the display object. All its parameters are optional. Omitted parameters will have the default value set. setTransform([x=0] [y=0] [scaleX=1] [scaleY=1] [rotation=0] [skewX=0] [skewY=0] [regX=0] [regY=0]) Furthermore, you can change all the properties via DisplayObject directly (like scaleY and scaleX) as shown in the following example: displayObject.setTransform(100, 100, 2, 2); An example of Transforming function As an instance of using the shape transforming feature with CreateJS, we are going to extend our previous example: var angle = 0; window.ball; var canvas = document.getElementById("canvas"); var stage = new createjs.Stage(canvas); ball = new createjs.Shape(); ball.graphics.beginFill("#FF0000").drawCircle(0, 0, 50); ball.x = 200; ball.y = 300; stage.addChild(ball); function tick(event) { angle += 0.025; var scale = Math.cos(angle); ball.setTransform(ball.x, ball.y, scale, scale); stage.update(event); } createjs.Ticker.addEventListener("tick", tick); In this example, we have a red circle, similar to the previous example of tweening. We set the coordinates of the circle to 200 and 300 and added the circle to the stage object. In the next line, we have a tick function that transforms the shape of the circle. Inside this function, we have an angle variable that increases with each call. We then set the ballX and ballY coordinate to the cosine of the angle variable. The transforming done is similar to the following screenshot: This is a basic example of transforming shapes in CreateJS, but obviously, you can develop and create better transforming by playing with a shape's properties and values. Summary In this article, we covered how to use animation and transform objects on the page using CreateJS. Resources for Article: Further resources on this subject: Introducing a feature of IntroJs [Article] So, what is Node.js? [Article] So, what is Ext JS? [Article]
Read more
  • 0
  • 0
  • 5275

article-image-migrating-wordpress-blog-middleman-and-deploying-to-amazon-part3
Mike Ball
09 Feb 2015
9 min read
Save for later

Part 3: Migrating a WordPress Blog to Middleman and Deploying to Amazon S3

Mike Ball
09 Feb 2015
9 min read
Part 3: Migrating WordPress blog content and deploying to production In parts 1 and 2 of this series, we created middleman-demo, a basic Middleman-based blog, imported content from WordPress, and deployed middleman-demo to Amazon S3. Now that middleman-demo has been deployed to production, let’s design a continuous integration workflow that automates builds and deployments. In part 3, we’ll cover the following: Testing middleman-demo with RSpec and Capybara Integrating with GitHub and Travis CI Configuring automated builds and deployments from Travis CI If you didn’t follow parts 1 and 2, or you no longer have your original middleman-demo code, you can clone mine and check out the part3 branch:  $ git clone http://github.com/mdb/middleman-demo && cd middleman-demo && git checkout part3 Create some automated tests In software development, the practice of continuous delivery serves to frequently deploy iterative software bug fixes and enhancements, such that users enjoy an ever-improving product. Automated processes, such as tests, assist in rapidly validating quality with each change. middleman-demo is a relatively simple codebase, though much of its build and release workflow can still be automated via continuous delivery. Let’s write some automated tests for middleman-demo using RSpec and Capybara. These tests can assert that the site continues to work as expected with each change. Add the gems to the middleman-demo Gemfile:  gem 'rspec'gem 'capybara' Install the gems: $ bundle install Create a spec directory to house tests: $ mkdir spec As is the convention in RSpec, create a spec/spec_helper.rb file to house the RSpec configuration: $ touch spec/spec_helper.rb Add the following configuration to spec/spec_helper.rb to run middleman-demo during test execution: require "middleman" require "middleman-blog" require 'rspec' require 'capybara/rspec' Capybara.app = Middleman::Application.server.inst do set :root, File.expand_path(File.join(File.dirname(__FILE__), '..')) set :environment, :development end Create a spec/features directory to house the middleman-demo RSpec test files: $ mkdir spec/features Create an RSpec spec file for the homepage: $ touch spec/features/index_spec.rb Let’s create a basic test confirming that the Middleman Demo heading is present on the homepage. Add the following to spec/features/index_spec.rb: require "spec_helper" describe 'index', type: :feature do before do visit '/' end it 'displays the correct heading' do expect(page).to have_selector('h1', text: 'Middleman Demo') end end Run the test and confirm that it passes: $ rspec You should see output like the following: Finished in 0.03857 seconds (files took 6 seconds to load) 1 example, 0 failures Next, add a test asserting that the first blog post is listed on the homepage; confirm it passes by running the rspec command: it 'displays the "New Blog" blog post' do expect(page).to have_selector('ul li a[href="/blog/2014/08/20/new-blog/"]', text: 'New Blog') end As an example, let’s add one more basic test, this time asserting that the New Blog text properly links to the corresponding blog post. Add the following to spec/features/index_spec.rb and confirm that the test passes: it 'properly links to the "New Blog" blog post' do click_link 'New Blog' expect(page).to have_selector('h2', text: 'New Blog') end middleman-demo can be further tested in this fashion. The extent to which the specs test every element of the site’s functionality is up to you. At what point can it be confidently asserted that the site looks good, works as expected, and can be publicly deployed to users? Push to GitHub Next, push your middleman-demo code to GitHub. If you forked my original github.com/mdb/middleman-demo repository, skip this section. 1. Create a GitHub repositoryIf you don’t already have a GitHub account, create one. Create a repository through GitHub’s web UI called middleman-demo. 2. What should you do if your version of middleman-demo is not a git repository?If your middleman-demo is already a git repository, skip to step 3. If you started from scratch and your code isn’t already in a git repository, let’s initialize one now. I’m assuming you have git installed and have some basic familiarity with it. Make a middleman-demo git repository: $ git init && git add . && git commit -m 'initial commit' Declare your git origin, where <your_git_url_from_step_1> is your GitHub middleman-demo repository URL: $ git remote add origin <your_git_url_from_step_1> Push to your GitHub repository: $ git push origin master You’re done; skip step 3 and move on to Integrate with Travis CI. 3. If you cloned my mdb/middleman-demo repository…If you cloned my middleman-demo git repository, you’ll need to add your newly created middleman-demo GitHub repository as an additional remote: $ git remote add my_origin <your_git_url_from_step_1> If you are working in a branch, merge all your changes to master. Then push to your GitHub repository: $ git push -u my_origin master Integrate with Travis CI Travis CI is a distributed continuous integration service that integrates with GitHub. It’s free for open source projects. Let’s configure Travis CI to run the middleman-demo tests when we push to the GitHub repository. Log in to Travis CI First, sign in to Travis CI using your GitHub credentials. Visit your profile. Find your middleman-demo repository in the “Repositories” list. Activate Travis CI for middleman-demo; click the toggle button “ON.” Create a .travis.ymlfile Travis CI looks for a .travis.yml YAML file in the root of a repository. YAML is a simple, human-readable markup language; it’s a popular option in authoring configuration files. The .travis.yml file informs Travis how to execute the project’s build. Create a .travis.yml file in the root of middleman-demo: $ touch .travis.yml Configure Travis CI to use Ruby 2.1 when building middleman-demo. Add the following YAML to the .travis.yml file: language: rubyrvm: 2.1 Next, declare how Travis CI can install the necessary gem dependencies to build middleman-demo; add the following: install: bundle install Let’s also add before_script, which runs the middleman-demo tests to ensure all tests pass in advance of a build: before_script: bundle exec rspec Finally, add a script that instructs Travis CI how to build middleman-demo: script: bundle exec middleman build At this point, the .travis.yml file should look like the following: language: ruby rvm: 2.1 install: bundle install before_script: bundle exec rspec script: bundle exec middleman build Commit the .travis.yml file: $ git add .travis.yml && git commit -m "added basic .travis.yml file" Now, after pushing to GitHub, Travis CI will attempt to install middleman-demo dependencies using Ruby 2.1, run its tests, and build the site. Travis CI’s command build output can be seen here: https://travis-ci.org/<your_github_username>/middleman-demo Add a build status badge Assuming the build passes, you should see a green build passing badge near the top right corner of the Travis CI UI on your Travis CI middleman-demo page. Let’s add this badge to the README.md file in middleman-demo, such that a build status badge reflecting the status of the most recent Travis CI build displays on the GitHub repository’s README. If one does not already exist, create a README.md file: $ touch README.md Add the following markdown, which renders the Travis CI build status badge: [![Build Status](https://travis-ci.org/<your_github_username>/middleman-demo.svg?branch=master)](https://travis-ci.org/<your_github_username>/middleman-demo) Configure continuous deployments Through continuous deployments, code is shipped to users as soon as a quality-validated change is committed. Travis CI can be configured to deploy a middleman-demo build with each successful build. Let’s configure Travis CI to continuously deploy middleman-demo to the S3 bucket created in part 2 of this tutorial series. First, install the travis command-line tools: $ gem install travis Use the travis command-line tools to set S3 deployments. Enter the following; you’ll be prompted for your S3 details (see the example below if you’re unsure how to answer): $ travis setup s3 An example response is: Access key ID: <your_aws_access_key_id> Secret access key: <your_aws_secret_access_key_id> Bucket: <your_aws_bucket> Local project directory to upload (Optional): build S3 upload directory (Optional): S3 ACL Settings (private, public_read, public_read_write, authenticated_read, bucket_owner_read, bucket_owner_full_control): |private| public_read Encrypt secret access key? |yes| yes Push only from <your_github_username>/middleman-demo? |yes| yes This automatically edits the .travis.yml file to include the following deploy information: deploy: provider: s3 access_key_id: <your_aws_access_key_id> secret_access_key: secure: <your_encrypted_aws_secret_access_key_id> bucket: <your_s3_bucket> local-dir: build acl: !ruby/string:HighLine::String public_read on: repo: <your_github_username>/middleman-demo Add one additional option, informing Travis to preserve the build directory for use during the deploy process: skip_cleanup: true The final .travis.yml file should look like the following: language: ruby rvm: 2.1 install: bundle install before_script: bundle exec rspec script: bundle exec middleman build deploy: provider: s3 access_key_id: <your_aws_access_key> secret_access_key: secure: <your_encrypted_aws_secret_access_key> bucket: <your_aws_bucket> local-dir: build skip_cleanup: true acl: !ruby/string:HighLine::String public_read on: repo: <your_github_username>/middleman-demo Confirm that your continuous integration works Commit your changes: $ git add .travis.yml && git commit -m "added travis deploy configuration" Push to GitHub and watch the build output on Travis CI: https://travis-ci.org/<your_github_username>/middleman-demo If all works as expected, Travis CI will run the middleman-demo tests, build the site, and deploy to the proper S3 bucket. Recap Throughout this series, we’ve examined the benefits of static site generators and covered some basics regarding Middleman blogging. We’ve learned how to use the wp2middleman gem to migrate content from a WordPress blog, and we’ve learned how to deploy Middleman to Amazon’s cloud-based Simple Storage Service (S3). We’ve configured Travis CI to run automated tests, produce a build, and automate deployments. Beyond what’s been covered within this series, there’s an extensive Middleman ecosystem worth exploring, as well as numerous additional features. Middleman’s custom extensions seek to extend basic Middleman functionality through third-party gems. Read more about Middleman at Middlemanapp.com. About this author Mike Ball is a Philadelphia-based software developer specializing in Ruby on Rails and JavaScript. He works for Comcast Interactive Media where he helps build web-based TV and video consumption applications. 
Read more
  • 0
  • 0
  • 5275

article-image-making-your-code-better
Packt
14 Feb 2014
8 min read
Save for later

Making Your Code Better

Packt
14 Feb 2014
8 min read
(For more resources related to this topic, see here.) Code quality analysis The fact that you can compile your code does not mean your code is good. It does not even mean it will work. There are many things that can easily break your code. A good example is an unhandled NullReferenceException. You will be able to compile your code, you will be able to run your application, but there will be a problem. ReSharper v8 comes with more than 1400 code analysis rules and more than 700 quick fixes, which allow you to fix detected problems. What is really cool is that ReSharper provides you with code inspection rules for all supported languages. This means that ReSharper not only improves your C# or VB.NET code, but also HTML, JavaScript, CSS, XAML, XML, ASP.NET, ASP.NET MVC, and TypeScript. Apart from finding possible errors, code quality analysis rules can also improve the readability of your code. ReSharper can detect code, which is unused and mark it as grayed, prompts you that maybe you should use auto properties or objects and collection initializers, or use the var keyword instead of an explicit type name. ReSharper provides you with five severity levels for rules and allows you to configure them according to your preference. Code inspection rules can be configured in the ReSharper's Options window. A sample view of code inspection rules with the list of available severity levels is shown in the following screenshot: Background analysis One of the best features in terms of code quality in ReSharper is Background analysis. This means that all the rules are checked as you are writing your code. You do not need to compile your project to see the results of the analysis. ReSharper will display appropriate messages in real time. Solution-wide inspections By default, the described rules are checked locally, which means that they should be checked in the current class. Because of this, ReSharper can mark some code as unused if it is used only locally; for example, there can be any unused private method or some part of code inside your method. These two cases are shown in the following screenshot: Additionally, for local analysis, ReSharper can check some rules in your entire project. To do this, you need to enable Solution-wide inspections. The easiest way to enable Solution-wide inspections is to double-click the circle icon in the bottom-right corner of Visual Studio, as seen in the following screenshot: With enabled Solution-wide inspections, ReSharper can mark the public methods or returned values that are unused. Please note that running Solution-wide inspections can hit Visual Studio’s performance in big projects. In such cases, it is better to disable this feature. Disabling code inspections With ReSharper v8, you can easily mark some part of your code as code that should not be checked by ReSharper. You can do this by adding the following comments: // ReSharper disable all // [your code] // ReSharper restore all All code between these two comments will be skipped by ReSharper in code inspections. Of course, instead of the all word, you can use the name of any ReSharper rule such as UseObjectOrCollectionInitializer. You can also disable ReSharper analysis for a single line with the following comment: // ReSharper disable once UseObjectOrCollectionInitializer ReSharper can generate these comments for you. If ReSharper highlights some issue, then just press Alt + Enter and select Options for “YOUR_RULE“ inspection, as shown in the following screenshot: Code Issues You can also an ad-hoc run code analysis. An ad-hoc analysis can be run on the solution or project level. To run ad-hoc analysis, just navigate to RESHARPER | Inspect | Code Issues in Solution or RESHARPER | Inspect | Code Issues in Current Project from the Visual Studio toolbar. This will display a dialog box that shows us the progress of analysis and will finally display the results in the Inspection Results window. You can filter and group the displayed issues as and when you need to. You can also quickly go to a place where the issue occurs just by double-clicking on it. A sample report is shown in the following screenshot: Eliminating errors and code smells We think you will agree that the code analysis provided by ReSharper is really cool and helps create better code. What is even cooler is that ReSharper provides you with features that can fix some issues automatically. Quick fixes Most errors and issues found by ReSharper can be fixed just by pressing Alt + Enter. This will display a list of the available solutions and let you select the best one for you. Fix in scope Quick fixes described above allow you to fix the issues in one particular place. However, sometimes there are issues that you would like to fix in every file in your project or solution. A great example is removing unused using statements or the this keyword. With ReSharper v8, you do not need to fix such issues manually. Instead, you can use a new feature called Fix in scope. You start as usual by pressing Alt + Enter but instead of just selecting some solution, you can select more options by clicking the small arrow on the right from the available options. A sample usage of the Fix in scope feature is shown in the following screenshot: This will allow you to fix the selected issue with just one click! Structural Search and Replace Even though ReSharper contains a lot of built-in analysis, it also allows you to create your own analyses. You can create your own patterns that will be used to search some structures in your code. This feature is called Structural Search and Replace (SSR). To open the Search with Pattern window, navigate to RESHARPER | Find | Search with Pattern…. A sample window is shown in the following screenshot: You can see two things here: On the left, there is place to write you pattern On the right, there is place to define placeholders In the preceding example, we were looking for if statements to compare them with some false expression. You can now simply click on the Find button and ReSharper will display every code that matches this pattern. Of course, you can also save your patterns. You can create new search patterns from the code editor. Just select some code, click on the right mouse button and select Find Similar Code….This will automatically generate the pattern for this code, which you can easily adjust to your needs. SSR allows you not only to find code based on defined patterns, but also replace it with different code. Click on the Replace button available on the top in the preceding screenshot. This will display a new section on the left called Replace pattern. There, you can write code that will be placed instead of code that matches the defined pattern. For the pattern shown, you can write the following code: if (false = $value$) { $statement$ } This will simply change the order of expressions inside the if statement. The saved patterns can also be presented as Quick fixes. Simply navigate to RESHARPER | Options | Code Inspection | Custom Patterns and set proper severity for your pattern, as shown in the following screenshot: This will allow you to define patterns in the code editor, which is shown in the following screenshot: Code Cleanup ReSharper also allows you to fix more than one issue in one run. Navigate to RESHARPER | Tools | Cleanup Code… from the Visual Studio toolbar or just press Ctrl + E, Ctrl + C. This will display the Code Cleanup window, which is shown in the following screenshot: By clicking on the Run button, ReSharper will fix all issues configured in the selected profile. By default, there are two patterns: Full Cleanup Reformat Code You can add your own pattern by clicking on the Edit Profiles button. Summary Code quality analysis is a very powerful feature in ReSharper. As we have described in this article, ReSharper not only prompts you when something is wrong or can be written better, but also allows you to quickly fix these issues. If you do not agree with all rules provided by ReSharper, you can easily configure them to meet your needs. There are many rules that will open your eyes and show you that you can write better code. With ReSharper, writing better, cleaner code is as easy as just pressing Alt + Enter. Resources for Article: Further resources on this subject: Ensuring Quality for Unit Testing with Microsoft Visual Studio 2010 [Article] Getting Started with Code::Blocks [Article] Core .NET Recipes [Article]
Read more
  • 0
  • 0
  • 5275
article-image-jquery-mobile-organizing-information-list-views
Packt
23 Jun 2011
8 min read
Save for later

jQuery Mobile: Organizing Information with List Views

Packt
23 Jun 2011
8 min read
jQuery Mobile First Look Discover the endless possibilities offered by jQuery Mobile for rapid Mobile Web Development You might have noticed that the vast majority of websites built using jQuery Mobile have their content laid out in very similar ways; sure, they differ in the design, colors, and overall feel, but they all have a list-based layout. There is a way in which we can organize our information and take advantage of each and every space in the browser: information is displayed vertically, one piece under another. There are no sidebars of any kind and links are organized in lists – for a cleaner and tidy look. But list views are also used to actually be a list of information. Some examples may be lists of albums, names, tasks, and so on: after all, our purpose is to build a mobile web application and the majority of services and pages can be organized in a way which closely resembles a list. Basics and conventions for list views Due to the particular nature of lists, list views are coded exactly the same way a standard HTML unordered list would. After all, the purpose of list views is to organize our information in a tidy way, presenting a series of links which are placed one under another; the easiest way to grasp their usefulness is, in my opinion, imagining a music player application. A music player would need a clean enough interface, listing the artists, albums, and songs by name. In order to play a song, the user would need to select an artist, and then choose the album in which the song he wishes to play has been released. To create our first view (artists), we would use the following code. Make sure you add the data-role="listview" attribute to the unordered list tag: <ul data-role="listview"> <li><a href="astra.html">Astra</a></li> <li><a href="zappa.html">Frank Zappa</a></li> <li><a href="tull.html">Jethro Tull</a></li> <li><a href="radiohead.html">Radiohead</a></li> <li><a href="who.html">The Who</a></li> </ul> The jQuery Mobile framework automatically styles the list elements accordingly, and adds a right arrow icon. List elements fill the full width of the browser window: Whenever an item is selected (click/tap event), jQuery Mobile will parse the code inside the list element and issue an AJAX request for the first URL found. The page (obtained via AJAX) is then inserted into the existing DOM and a page transition event is triggered. The default page transition is a slide-left animation; clicking the back button on the newly displayed page will result in a slide-right animation. Choosing the list type as per your requirements A somewhat large variety of lists are available for us to choose from in order to make use of the type of list view that is best suited to our needs. Below are listed (sorry, no pun intended!) the different types of list views along with a brief description of how to use them and what part of code we need to change in order to obtain a certain list view. Nested lists Bearing in mind that list views elements are based on the standard HTML unordered list element, we might be wondering what would happen if we try and create a second list inside a list view. By nesting a ul element inside list items, jQuery Mobile will adopt a different kind of behavior to our list items. Our first step toward the creation of a nested list is removing any link present in the list item, as a click event will show the nested list instead of redirecting to another page. The child list will be put into a new "page" with the title of the parent in the header. We're now implementing nested list elements into our sample music player interface by changing our markup to the following. This way, we are able to browse artists and albums. Please note that we have removed any links to external pages: <ul data-role="listview"> <li>Astra <ul> <li><a href="astra_weirding.html">The Weirding</a></li> </ul> </li> <li>Frank Zappa <ul> <li><a href="zappa_hotrats.html">Hot Rats</a></li> <li><a href="zappa_yellowshark.html">Yellow Shark</a></li> </ul> </li> <li>Jethro Tull <ul> <li><a href="tull_aqualung.html">Aqualung</a></li> <li><a href="tull_thick.html">Thick as a Brick</a></li> </ul> </li> <li>Radiohead <ul> <li><a href="radiohead_ok.html">OK Computer</a></li> <li><a href="radiohead_rainbows.html">In Rainbows</a></li> <li><a href="radiohead_kol.html">The King of Limbs</a></li> </ul> </li> <li>The Who <ul> <li><a href="who_next.html">Who's Next</a></li> <li><a href="who_q.html">Quadrophenia</a></li> <li><a href="who_tommy.html">Tommy</a></li> </ul> </li> </ul> If we clicked on the Radiohead element, we would then be able to see the following page: By default, child list will be given a Swatch B theme to indicate they are at a secondary level of navigation; we can select a different color swatch by specifying a data-theme attribute on the child list element. We can see the header turned blue, and the artist name is used as the header. We have a choice to go back to the previous page (artists) or click again onto a list item (album) to view more. Numbered lists Our music player interface has reached the point in which we need to list the tracks contained in an album. Of course, tracks have a sequence, and we want to give the user the possibility to see what track number is without having to count them all – and without writing numbers manually, that would be terrible! In a very similar fashion, we can use ordered list elements (ol) to obtain numbering: jQuery Mobile will try to use CSS to display numbers or, if not supported, JavaScript. The following code lists all of the tracks for an album: Note there is no limit to the number of lists you can nest. <ul> <!-- ... --> <li>Radiohead <ul> <li><a href="radiohead_ok.html">OK Computer</a></li> <li><a href="radiohead_rainbows.html">In Rainbows</a></li> <li>The King of Limbs <ol> <li><a href="play.html">Bloom</a></li> <li><a href="play.html">Morning Mr. Magpie</a></li> <li><a href="play.html">Little by Little</a></li> <li><a href="play.html">Feral</a></li> <li><a href="play.html">Lotus Flower</a></li> <li><a href="play.html">Codex</a></li> <li><a href="play.html">Give Up the Ghost</a></li> <li><a href="play.html">Separator</a></li> </ol> </li> </ul> </li> <!-- ... --> </ul>
Read more
  • 0
  • 0
  • 5269

article-image-introduction-docker
Packt
09 Feb 2016
4 min read
Save for later

Introduction to Docker

Packt
09 Feb 2016
4 min read
In this article by Rajdeep Dua, the author of the book Learning Docker Networking, we will look at an introduction of Docker networking and its components. (For more resources related to this topic, see here.) Docker is a lightweight containerization technology that has gathered enormous interest in recent years. It neatly bundles various Linux kernel features and services, such as namespaces, cgroups, SELinux, and AppArmor profiles, over union filesystems such as AUFS and BTRFS in order to make modular images. These images provide a highly configurable virtualized environment for applications and follow a write once, run anywhere workflow. Applications can be as simple as running a process to a highly scalable and distributed one. Therefore, there is a need for powerful networking elements that can support various complex use cases. Networking and Docker Each Docker container has its own network stack, and this is due to the Linux kernel's net namespace, where a new net namespace for each container is instantiated and cannot be seen from outside the container or from other containers. Docker networking is powered by the following network components and services. Linux bridges These are L2/MAC learning switches built into the kernel and are to be used for forwarding. Open vSwitch This is an advanced bridge that is programmable and supports tunneling. NAT Network address translators are immediate entities that translate IP addresses and ports (SNAT, DNAT, and so on). IPtables This is a policy engine in the kernel used for managing packet forwarding, firewall, and NAT features. AppArmor/SELinux Firewall policies for each application can be defined with these. Various networking components can be used to work with Docker, providing new ways to access and use Docker-based services. As a result, we see a lot of libraries that follow a different approach to networking. Some of the prominent ones are Docker Compose, Weave, Kubernetes, Pipework, and Libnetwork. The following figure depicts the root ideas of Docker networking: Docker networking modes What's new in Docker networking? Docker networking is at a very nascent stage, and there are many interesting contributions from the developer community, such as Pipework, Weave, Clocker, and Kubernetes. Each of them reflects a different aspect of Docker networking. We will learn about them in later chapters. Docker, Inc. has also established a new project, where networking will be standardized. It is called libnetwork. Libnetwork implements the Container Network Model (CNM), which formalizes the steps required to provide networking for containers while providing an abstraction that can be used to support multiple network drivers. The CNM is built on three main components—sandbox, endpoint, and network. Sandbox A sandbox contains the configuration of a container's network stack. This includes management of the container's interfaces, routing table, and DNS settings. An implementation of a sandbox could be a Linux network namespace, a FreeBSD jail, or other similar concept. A sandbox may contain many endpoints from multiple networks. Endpoint An endpoint connects a sandbox to a network. An implementation of an endpoint could be a veth pair, an Open vSwitch internal port, or something similar. An endpoint can belong to only one network but may only belong to one Sandbox. Network A network is a group of endpoints that are able to communicate with each other directly. An implementation of a network could be a Linux bridge, a VLAN, and so on. Networks consist of many endpoints, as shown in the following diagram: The Docker CNM model The Docker CNM model The CNM provides the following contract between networks and containers: All containers on the same network can communicate freely with each other Multiple networks are the way to segment traffic between containers and should be supported by all drivers Multiple endpoints per container are the way to join a container to multiple networks An endpoint is added to a network sandbox to provide it with network connectivity Summary In this article, we learned about the essential components of Docker networking, which have evolved from coupling simple Docker abstractions and powerful network components such as Linux bridges and Open vSwitch. We also talked about the next generation of Docker networking, which is called libnetwork. Resources for Article: Further resources on this subject: Advanced Container Resource Analysis [article] Docker in Production [article] Elucidating the Game-changing Phenomenon of the Docker-inspired Containerization Paradigm [article]
Read more
  • 0
  • 0
  • 5268

article-image-tips-and-tricks-away3d-36
Packt
18 Mar 2011
7 min read
Save for later

Tips and Tricks on Away3D 3.6

Packt
18 Mar 2011
7 min read
  Away3D 3.6 Essentials Take Flash to the next dimension by creating detailed, animated, and interactive 3D worlds with Away3D Determining the current frame rate When we talk about the performance of an Away3D application, almost always we are referring to the number of frames per second (FPS) that are being rendered. This is also referred to as the frame rate. Higher frame rates result in a more fluid and visually-appealing experience for the end user. Although it is possible to visually determine if an application has an acceptable frame rate, it can also be useful to get a more objective measurement. Fortunately, Away3D has this functionality built in. By default, when it is constructed, the View3D class will create an instance of the Stats class, which is in the away3d.core.stats package. This Stats object can be accessed via the statsPanel property from the View3D class. You can display the output of the Stats object on the screen using the Away3D project stats option in the context (or right-click) menu of an Away3D application. To see the Away3D Project stats option in the context menu you will need to click on a visible 3D object. If you click on the empty space around the 3D objects in the scene, you will see the standard Flash context menu. This will display a window similar to the following screenshot: This window provides a number of useful measurements: FPS, which measures the current frames per second AFPS, which measures the average number of frames per second Max, which measures the maximum peak value of the frames per second MS, which measures the time it took to render the last frame in milliseconds RAM, which measures how much memory the application is using MESHES, which measures the number of 3D objects in the scene SWF FR, which measures the maximum frame rate of the Flash application T ELEMENTS, which measures the total number of individual elements that make up the 3D objects in the scene R ELEMENTS, which measures the number of individual elements that make up the 3D objects that are being rendered to the screen These values come in very handy when trying to quantify the performance of an Away3D application. Setting the maximum frame rate Recent versions of Flash default to a maximum frame rate of 24 frames per second. This is usually fine for animations, but changing the maximum frame rate for a game may allow you to achieve a more fluid end result. The easiest way to do this is to use the SWF frameRate meta tag, which is a line of code added before the Away3DTemplate class. [SWF(frameRate=100)] public class Away3DTemplate extends Sprite { // class definition goes here } The SWF FR measurement displayed by the Away3D Stats object reflects the maximum frame rate defined by the frameRate meta tag. Note that setting the maximum frame rate using the frameRate meta tag does not mean that your application will always run at a higher frame rate, just that it can run at a higher frame rate. A slow PC will still run an Away3D application at a low frame rate even if the maximum frame rate has been set to a high value. You also need to be aware that any calculations performed in the onEnterFrame() function, such as transforming a 3D object, can be dependent on the frame rate of the application. In the following code, we rotate a 3D object by 1 degree around the X-axis every frame. override protected function onEnterFrame(event:Event):void { super.onEnterFrame(event); shipModel.rotationX += 1; } If the frame rate is 30 FPS, the 3D object will rotate around the X-axis by 30 degrees every second. If the frame rate is 90 FPS, the 3D object will rotate around the X-axis by 90 degrees every second. If your application requires these kinds of transformations to be performed consistently regardless of the frame rate, you can use a tweening library. Setting Flash quality to low You may have noticed that Flash offers a number of quality settings in its context menu. This quality setting can be set to one of the four options, which are defined in the StageQuality class from the flash.display package. As described by the Flash API documentation, these settings are: StageQuality.LOW: Low rendering quality. Graphics are not anti-aliased, and bitmaps are not smoothed, but runtime still use mip-mapping. StageQuality.MEDIUM: Medium rendering quality. Graphics are anti-aliased using a 2 x 2 pixel grid, bitmap smoothing is dependent on the Bitmap.smoothing setting. Runtimes use mip-mapping. This setting is suitable for movies that do not contain text. StageQuality.HIGH: High rendering quality. Graphics are anti-aliased using a 4 x 4 pixel grid, and bitmap smoothing is dependent on the Bitmap.smoothing setting. Runtimes use mip-mapping. This is the default rendering quality setting that Flash Player uses. StageQuality.BEST: Very high rendering quality. Graphics are anti-aliased using a 4 x 4 pixel grid. If Bitmap.smoothing is true the runtime uses a high-quality downscale algorithm that produces fewer artifacts. Mip-mapping refers to the use of mip-maps, which are precomputed smaller versions of an original bitmap. They are used instead of the original bitmap when the original is scaled down by more than 50 %. This bitmap scaling may occur when a 3D object with a bitmap material is itself scaled down, or off in the distance within the scene. The quality setting is defined by assigning one of these values to the quality property on the stage object: stage.quality = StageQuality.LOW; A number of demos that are supplied with Away3D set the stage quality by using the SWF quality metatag, like so: [SWF(quality="LOW")] The Flex compiler does not support setting the stage quality in this way. Although this code will not raise any errors during compilation, the stage quality will remain at the default value of StageQuality.HIGH. You can find more information on the metatags supported by the Flex compiler at http://livedocs.adobe.com/flex/3/html/help.html?content=metadata_3.html. Setting the stage quality to low will improve the performance of your Away3D application. The increase is felt most in applications that display a large number of 3D objects. The downside to setting the stage quality to low is that it affects all the objects on the stage, not just those drawn by Away3D. The low stage quality is particularly noticeable when rendering text, so the visual quality of controls like textfields and buttons can be significantly degraded. Using the medium-quality setting offers a good compromise between speed and visual quality. Reducing the size of the viewport The fewer pixels that are drawn to the screen, the faster the rendering process will be. The area that the view will draw into can be defined by assigning a ClippingRectangle object to the clipping property on the View3D class. To use the RectangleClipping class you first need to import it from the away3d.core.clip package. You can then define the area that Away3D will draw into by supplying the minx, maxX, minY, and maxY init object parameters to the RectangleClipping constructor like so: view.clipping = new RectangleClipping( { minX: -100, maxX: 100, minY: -100, maxY: 100 } ); The preceding code will limit the output of the view to an area 200 x 200 units in size. The ViewportClippingDemo application, which can be found on the Packt website as code bundle, allows you to modify the size of the clipping rectangle at runtime using the arrow up and arrow down keys. You can see the difference that the clipping rectangle makes in the following image. On the left, the clipping rectangle is set to the full area of the stage. On the right, the clipping rectangle has been reduced.
Read more
  • 0
  • 0
  • 5267
Packt
02 Sep 2013
4 min read
Save for later

Automating the Audio Parameters – How it Works

Packt
02 Sep 2013
4 min read
(For more resources related to this topic, see here.) We leverage the AudioParam automation support to implement ducking. The following is the overview of the ducking logic implemented in the AudioLayer class: We add a GainNode instance into the node graph as the duck controller. When a sound effect is played, we script the duck controller's gain audio parameter to reduce the audio output gain level for the duration of the sound effect. If ducking is reactivated while it is still active, we revise the scheduled ducking events so that they end at the appropriate time. The following is the node graph diagram produced by the code: Why use two GainNode instances instead of one? It's a good idea to split up the independent scripted audio gain behaviors into separate GainNode instances. This ensures that the scripted behaviors will interact properly. Now, let's take a look at AudioLayer.setDuck() which implements the ducking behavior: The AudioLayer.setDuck() method takes a duration (in seconds) indicating how long the duck behavior should be applied: AudioLayer.prototype.setDuck = function( duration ) { We cache the duck controller's gain audio parameter in duckGain: var TRANSITIONIN_SECS = 1; var TRANSITIONOUT_SECS = 2; var DUCK_VOLUME = 0.3; var duckGain = this.duckNode.gain; We cancel any existing leftover scheduled duck behaviors, thereby allowing us to start with a clean slate: var eventSecs = this.audioContext.currentTime; duckGain.cancelScheduledValues( eventSecs ); We employ the linearRampToValueAtTime() automation behavior to schedule the transition in—the audio parameter is scripted to linearly ramp from the existing volume to the duck volume, DUCK_VOLUME, over the time, TRANSITIONIN_SECS. Because there are no future events scheduled, the behavior starts at the current audio context time: duckGain.linearRampToValueAtTime( DUCK_VOLUME, eventSecs + TRANSITIONIN_SECS ); If the volume is already at DUCK_VOLUME, the transition has no effect, thereby creating the effect of extending the ducking behavior. We add an automation event to mark the start of the TRANSITIONOUT section. We do this by scheduling a setValueAtTime() automation behavior: duckGain.setValueAtTime( DUCK_VOLUME, eventSecs + duration ); Finally, we set up the TRANSITIONOUT section using a linearRampToValueAtTime() automation behavior. We arrange the transition to occur over TRANSITIONOUT_SECS by scheduling its end time to occur after the TRANSITIONOUT_SECS duration of the previous setValueAtTime() automation behavior: // Schedule the volume ramp up duckGain.linearRampToValueAtTime( 1, eventSecs + duration + TRANSITIONOUT_SECS ); }; The following is a graph illustrating the automation we've applied to duckGain, the duck controller's gain audio parameter: In order to have the sound effects activation duck the music volume, the sound effects and music have to be played on separate audio layers. That's why this recipe instantiates two AudioLayer instances—one for music playback and the other for sound effect playback. The dedicated music AudioLayer instance is cached in the WebAudioApp attribute, musicLayer, and the dedicated sound effects AudioLayer instance is cached in WebAudioApp attribute sfxLayer: WebAudioApp.prototype.start = function() { ... this.musicLayer = new AudioLayer( this.audioContext ); this.sfxLayer = new AudioLayer( this.audioContext ); ... }; Whenever a sound effects button is clicked, we play the sound and simultaneously activate the duck behavior on the music layer. This logic is implemented as part of the behavior of the sound effect's click event handler in WebAudioApp.initSfx(): jqButton.click(function( event ) { me.sfxLayer.playAudioBuffer( audioBuffer, 0 ); me.musicLayer.setDuck( audioBuffer.duration ); We activate ducking on webAudioApp.musicLayer, the music's AudioLayer instance. The ducking duration is set to the sound effects duration (we read the sound effects sample duration from its AudioBuffer instance). The ducking behavior is just one demonstration of the power of automation. The possibilities are endless given the breadth of automation-friendly audio parameters available in Web Audio. Other possible effects that are achievable through automation include fades, tempo matching, and cyclic panning effects. Please refer to the latest online W3C Web Audio documentation at http://www.w3.org/TR/webaudio/ for a complete list of available audio parameters. Summary In this article we looked at different rules in scheduling the automation events. we also looked at the overview of ducking logic implemented in the AudioLayer class and checked how we implement AudioLayer.setDuck() method which implements the ducking behaviour. Finally, we analyzed the ducking behaviour with the help of graph. Resources for Article : Further resources on this subject: Adding Sound, Music, and Video in 3D Game Development with Microsoft Silverlight 3: Part 2 [Article] Audio Enhancements with p4a.ploneaudio in Plone 3.3 [Article] Python Multimedia: Working with Audios [Article]
Read more
  • 0
  • 0
  • 5261

article-image-understanding-point-time-recovery
Packt
04 Sep 2013
28 min read
Save for later

Understanding Point-In-Time-Recovery

Packt
04 Sep 2013
28 min read
(For more resources related to this topic, see here.) Understanding the purpose of PITR PostgreSQL offers a tool called pg_dump to backup a database. Basically, pg_dump will connect to the database, read all the data in transaction isolation level "serializable" and return the data as text. As we are using "serializable", the dump is always consistent. So, if your pg_dump starts at midnight and finishes at 6 A.M, you will have created a backup, which contains all the data as of midnight but no further data. This kind of snapshot creation is highly convenient and perfectly feasible for small to medium amounts of data. A dump is always consistent. This means that all foreign keys are intact; new data added after starting the dump will be missing. It is most likely the most common way to perform standard backups. But, what if your data is so valuable and maybe so large in size that you want to backup it incrementally? Taking a snapshot from time to time might be enough for some applications; for highly critical data, it is clearly not. In addition to that, replaying 20 TB of data in textual form is not efficient either. Point-In-Time-Recovery has been designed to address this problem. How does it work? Based on a snapshot of the database, the XLOG will be replayed later on. This can happen indefinitely or up to a point chosen by you. This way, you can reach any point in time. This method opens the door to many different approaches and features: Restoring a database instance up to a given point in time Creating a standby database, which holds a copy of the original data Creating a history of all changes In this article, we will specifically feature on the incremental backup functionality and describe how you can make your data more secure by incrementally archiving changes to a medium of choice. Moving to the bigger picture The following picture provides an overview of the general architecture in use for Point-In-Time-Recovery: PostgreSQL produces 16 MB segments of transaction log. Every time one of those segments is filled up and ready, PostgreSQL will call the so called archive_command. The goal of archive_command is to transport the XLOG file from the database instance to an archive. In our image, the archive is represented as the pot on the bottom-right side of the image. The beauty of the design is that you can basically use an arbitrary shell script to archive the transaction log. Here are some ideas: Use some simple copy to transport data to an NFS share Run rsync to move a file Use a custom made script to checksum the XLOG file and move it to an FTP server Copy the XLOG file to a tape The possible options to manage XLOG are only limited by imagination. The restore_command is the exact counterpart of the archive_command. Its purpose is to fetch data from the archive and provide it to the instance, which is supposed to replay it (in our image, this is labeled as Restored Backup). As you have seen, replay might be used for replication or simply to restore a database to a given point in time as outlined in this article. Again, the restore_command is simply a shell script doing whatever you wish, file by file. It is important to mention that you, the almighty administrator, are in charge of the archive. You have to decide how much XLOG to keep and when to delete it. The importance of this task cannot be underestimated. Keep in mind, when then archive_command fails for some reason, PostgreSQL will keep the XLOG file and retry after a couple of seconds. If archiving fails constantly from a certain point on, it might happen that the master fills up. The sequence of XLOG files must not be interrupted; if a single file is missing, you cannot continue to replay XLOG. All XLOG files must be present because PostgreSQL needs an uninterrupted sequence of XLOG files; if a single file is missing, the recovery process will stop there at the very latest. Archiving the transaction log After taking a look at the big picture, we can take a look and see how things can be put to work. The first thing you have to do when it comes to Point-In-Time-Recovery is to archive the XLOG. PostgreSQL offers all the configuration options related to archiving through postgresql.conf. Let us see step by step what has to be done in postgresql.conf to start archiving: First of all, you should turn archive_mode on. In the second step, you should configure your archive command. The archive command is a simple shell command taking two parameters: %p: This is a placeholder representing the XLOG file that should be archived, including its full path (source). %f: This variable holds the name of the XLOG without the path pointing to it. Let us set up archiving now. To do so, we should create a place to put the XLOG. Ideally, the XLOG is not stored on the same hardware as the database instance you want to archive. For the sake of this example, we assume that we want to apply an archive to /archive. The following changes have to be made to postgresql.conf: wal_level = archive # minimal, archive, or hot_standby # (change requires restart) archive_mode = on # allows archiving to be done # (change requires restart) archive_command = 'cp %p /archive/%f' # command to use to archive a logfile segment # placeholders: %p = path of file to archive # %f = file name only Once those changes have been made, archiving is ready for action and you can simply restart the database to activate things. Before we restart the database instance, we want to focus your attention on wal_level. Currently three different wal_level settings are available: minimal archive hot_standby The amount of transaction log produced in the case of just a single node is by far not enough to synchronize an entire second instance. There are some optimizations in PostgreSQL, which allow XLOG-writing to be skipped in the case of single-node mode. The following instructions can benefit from wal_level being set to minimal: CREATE TABLE AS, CREATE INDEX, CLUSTER, and COPY (if the table was created or truncated within the same transaction). To replay the transaction log, at least archive is needed. The difference between archive and hot_standby is that archive does not have to know about currently running transactions. For streaming replication, however, this information is vital. Restarting can either be done through pg_ctl –D /data_directory –m fast restart directly or through a standard init script. The easiest way to check if our archiving works is to create some useless data inside the database. The following code snippets shows a million rows can be made easily: test=# CREATE TABLE t_test AS SELECT * FROM generate_series(1,1000000);SELECT 1000000test=# SELECT * FROM t_test LIMIT 3;generate_series----------------- 1 2 3(3 rows) We have simply created a list of numbers. The important thing is that 1 million rows will trigger a fair amount of XLOG traffic. You will see that a handful of files have made it to the archive: iMac:archivehs$ ls -ltotal 131072-rw------- 1 hs wheel 16777216 Mar 5 22:31000000010000000000000001-rw------- 1 hs wheel 16777216 Mar 5 22:31000000010000000000000002-rw------- 1 hs wheel 16777216 Mar 5 22:31000000010000000000000003-rw------- 1 hs wheel 16777216 Mar 5 22:31000000010000000000000004 Those files can be easily used for future replay operations. If you want to save storage, you can also compress those XLOG files. Just add gzip to your archive_command. Taking base backups In the previous section, you have seen that enabling archiving takes just a handful of lines and offers a great deal of flexibility. In this section, we will see how to create a so called base backup, which can be used to apply XLOG later on. A base backup is an initial copy of the data. Keep in mind that the XLOG itself is more or less worthless. It is only useful in combination with the initial base backup. In PostgreSQL, there are two main options to create an initial base backup: Using pg_basebackup Traditional copy/rsync based methods The following two sections will explain in detail how a base backup can be created: Using pg_basebackup The first and most common method to create a backup of an existing server is to run a command called pg_basebackup, which was introduced in PostgreSQL 9.1.0. Basically pg_basebackup is able to fetch a database base backup directly over a database connection. When executed on the slave, pg_basebackup will connect to the database server of your choice and copy all the data files in the data directory over to your machine. There is no need to log into the box anymore, and all it takes is one line of code to run it; pg_basebackup will do all the rest for you. In this example, we will assume that we want to take a base backup of a host called sample.postgresql-support.de. The following steps must be performed: Modify pg_hba.conf to allow replication Signal the master to take pg_hba.conf changes into account Call pg_basebackup Modifying pg_hba.conf To allow remote boxes to log into a PostgreSQL server and to stream XLOG, you have to explicitly allow replication. In PostgreSQL, there is a file called pg_hba.conf, which tells the server which boxes are allowed to connect using which type of credentials. Entire IP ranges can be allowed or simply discarded through pg_hba.conf. To enable replication, we have to add one line for each IP range we want to allow. The following listing contains an example of a valid configuration: # TYPE DATABASE USER ADDRESS METHODhost replication all 192.168.0.34/32 md5 In this case we allow replication connections from 192.168.0.34. The IP range is identified by 32 (which simply represents a single server in our case). We have decided to use MD5 as our authentication method. It means that the pg_basebackup has to supply a password to the server. If you are doing this in a non-security critical environment, using trust as authentication method might also be an option. What happens if you actually have a database called replication in your system? Basically, setting the database to replication will just configure your streaming behavior, if you want to put in rules dealing with the database called replication, you have to quote the database name as follows: "replication". However, we strongly advise not to do this kind of trickery to avoid confusion. Signaling the master server Once pg_hba.conf has been changed, we can tell PostgreSQL to reload the configuration. There is no need to restart the database completely. We have three options to make PostgreSQL reload pg_hba.conf: By running an SQL command: SELECT pg_reload_conf(); By sending a signal to the master: kill –HUP 4711 (with 4711 being the process ID of the master) By calling pg_ctl: pg_ctl –D $PGDATA reload (with $PGDATA being the home directory of your database instance) Once we have told the server acting as data source to accept streaming connections, we can move forward and run pg_basebackup. pg_basebackup – basic features pg_basebackup is a very simple-to-use command-line tool for PostgreSQL. It has to be called on the target system and will provide you with a ready-to-use base backup, which is ready to consume the transaction log for Point-In-Time-Recovery. The syntax of pg_basebackup is as follows: iMac:dbhs$ pg_basebackup --help pg_basebackup takes a base backup of a running PostgreSQL server. Usage: pg_basebackup [OPTION]... Options controlling the output: -D, --pgdata=DIRECTORY receive base backup into directory -F, --format=p|t output format (plain (default), tar) -x, --xlog include required WAL files in backup (fetch mode) -X, --xlog-method=fetch|stream include required WAL files with specified method -z, --gzip compress tar output -Z, --compress=0-9 compress tar output with given compression level General options: -c, --checkpoint=fast|spread set fast or spread checkpointing -l, --label=LABEL set backup label -P, --progress show progress information -v, --verbose output verbose messages -V, --version output version information, then exit -?, --help show this help, then exit Connection options: -h, --host=HOSTNAME database server host or socket directory -p, --port=PORT database server port number -s, --status-interval=INTERVAL time between status packets sent to server (in seconds) -U, --username=NAME connect as specified database user -w, --no-password never prompt for password -W, --password force password prompt (should happen automatically) A basic call to pg_basebackup would look like that: iMac:dbhs$ pg_basebackup -D /target_directory -h sample.postgresql-support.de In this example, we will fetch the base backup from sample.postgresql-support.de and put it into our local directory called /target_directory. It just takes this simple line to copy an entire database instance to the target system. When we create a base backup as shown in this section, pg_basebackup will connect to the server and wait for a checkpoint to happen before the actual copy process is started. In this mode, this is necessary because the replay process will start exactly at this point in the XLOG. The problem is that it might take a while until a checkpoint kicks in; pg_basebackup does not enforce a checkpoint on the source server straight away to make sure that normal operations are not disturbed. If you don't want to wait on a checkpoint, consider using --checkpoint=fast. It will enforce an instant checkpoint and pg_basebackup will start copying instantly. By default, a plain base backup will be created. It will consist of all the files in directories found on the source server. If the base backup should be stored on tape, we suggest to give –-format=t a try. It will automatically create a TAR archive (maybe on a tape). If you want to move data to a tape, you can save an intermediate step easily this way. When using TAR, it is usually quite beneficial to use it in combination with --gzip to reduce the size of the base backup on disk. There is also a way to see a progress bar while doing the base backup but we don't recommend to use this option (--progress) because it requires pg_basebackup to determine the size of the source instance first, which can be costly. pg_basebackup – self-sufficient backups Usually a base backup without XLOG is pretty worthless. This is because the base backup is taken while the master is fully operational. While the backup is taken, those storage files in the database instance might have been modified heavily. The purpose of the XLOG is to fix those potential issues in the data files reliably. But, what if we want to create a base backup, which can live without (explicitly archived) XLOG? In this case, we can use the --xlog-method=stream option. If this option has been chosen, pg_basebackup will not just copy the data as it is but it will also stream the XLOG being created during the base backup to our destination server. This will provide us with just enough XLOG to allow us to start a base backup made that way directly. It is self-sufficient and does not need additional XLOG files. This is not Point-In-Time-Recovery but it can come in handy in case of trouble. Having a base backup, which can be started right away, is usually a good thing and it comes at fairly low cost. Please note that --xlog-method=stream will require two database connections to the source server, not just one. You have to keep that in mind when adjusting max_wal_senders on the source server. If you are planning to use Point-In-Time-Recovery and if there is absolutely no need to start the backup as it is, you can safely skip the XLOG and save some space this way (default mode). Making use of traditional methods to create base backups These days pg_basebackup is the most common way to get an initial copy of a database server. This has not always been the case. Traditionally, a different method has been used which works as follows: Call SELECT pg_start_backup('some label'); Copy all data files to the remote box through rsync or any other means. Run SELECT pg_stop_backup(); The main advantage of this old method is that there is no need to open a database connection and no need to configure XLOG-streaming infrastructure on the source server. A main advantage is also that you can make use of features such as ZFS-snapshots or similar means, which can help to dramatically reduce the amount of I/O needed to create an initial backup. Once you have started pg_start_backup, there is no need to hurry. It is not necessary and not even especially desirable to leave the backup mode soon. Nothing will happen if you are in backup mode for days. PostgreSQL will archive the transaction log as usual and the user won't face any kind of downside. Of course, it is bad habit not to close backups soon and properly. However, the way PostgreSQL works internally does not change when a base backup is running. There is nothing filling up, no disk I/O delayed, or anything of this sort. Tablespace issues If you happen to use more than one tablespace, pg_basebackup will handle this just fine if the filesystem layout on the target box is identical to the filesystem layout on the master. However, if your target system does not use the same filesystem layout there is a bit more work to do. Using the traditional way of doing the base backup might be beneficial in this case. In case you are using --format=t (for TAR), you will be provided with one TAR file per tablespace. Keeping an eye on network bandwidth Let us imagine a simple scenario involving two servers. Each server might have just one disk (no SSDs). Our two boxes might be interconnected through a 1 Gbit link. What will happen to your applications if the second server starts to run a pg_basebackup? The second box will connect, start to stream data at full speed and easily kill your hard drive by using the full bandwidth of your network. An application running on the master might instantly face disk wait and offer bad response times. Therefore it is highly recommended to control the bandwidth used up by rsync to make sure that your business applications have enough spare capacity (mainly disk, CPU is usually not an issue). If you want to limit rsync to, say, 20 MB/sec, you can simply use rsync --bwlimit=20000. This will definitely make the creation of the base backup take longer but it will make sure that your client apps will not face problems. In general we recommend a dedicated network interconnect between master and slave to make sure that a base backup does not affect normal operations. Limiting bandwidth cannot be done with pg_basebackup onboard functionality.Of course, you can use any other tool to copy data and achieve similar results. If you are using gzip compression with –-gzip, it can work as an implicit speed brake. However, this is mainly a workaround. Replaying the transaction log Once we have created ourselves a shiny initial base backup, we can collect the XLOG files created by the database. When the time has come, we can take all those XLOG files and perform our desired recovery process. This works as described in this section. Performing a basic recovery In PostgreSQL, the entire recovery process is governed by a file named recovery.conf, which has to reside in the main directory of the base backup. It is read during startup and tells the database server where to find the XLOG archive, when to end replay, and so forth. To get you started, we have decided to include a simple recovery.conf sample file for performing a basic recovery process: restore_command = 'cp /archive/%f %p'recovery_target_time = '2013-10-10 13:43:12' The restore_command is essentially the exact counterpart of the archive_command you have seen before. While the archive_command is supposed to put data into the archive, the restore_command is supposed to provide the recovering instance with the data file by file. Again, it is a simple shell command or a simple shell script providing one chunk of XLOG after the other. The options you have here are only limited by imagination; all PostgreSQL does is to check for the return code of the code you have written, and replay the data provided by your script. Just like in postgresql.conf, we have used %p and %f as placeholders; the meaning of those two placeholders is exactly the same. To tell the system when to stop recovery, we can set the recovery_target_time. The variable is actually optional. If it has not been specified, PostgreSQL will recover until it runs out of XLOG. In many cases, simply consuming the entire XLOG is a highly desirable process; if something crashes, you want to restore as much data as possible. But, it is not always so. If you want to make PostgreSQL stop recovery at a specific point in time, you simply have to put the proper date in. The crucial part here is actually to know how far you want to replay XLOG; in a real work scenario this has proven to be the trickiest question to answer. If you happen to a recovery_target_time, which is in the future, don't worry, PostgreSQL will start at the very last transaction available in your XLOG and simply stop recovery. The database instance will still be consistent and ready for action. You cannot break PostgreSQL, but, you might break your applications in case data is lost because of missing XLOG. Before starting PostgreSQL, you have to run chmod 700 on the directory containing the base backup, otherwise, PostgreSQL will error out: iMac:target_directoryhs$ pg_ctl -D /target_directorystartserver startingFATAL: data directory "/target_directory" has group or world accessDETAIL: Permissions should be u=rwx (0700). This additional security check is supposed to make sure that your data directory cannot be read by some user accidentally. Therefore an explicit permission change is definitely an advantage from a security point of view (better safe than sorry). Now that we have all the pieces in place, we can start the replay process by starting PostgreSQL: iMac:target_directoryhs$ pg_ctl –D /target_directory startserver startingLOG: database system was interrupted; last known up at 2013-03-1018:04:29 CETLOG: creating missing WAL directory "pg_xlog/archive_status"LOG: starting point-in-time recovery to 2013-10-10 13:43:12+02LOG: restored log file "000000010000000000000006" from archiveLOG: redo starts at 0/6000020LOG: consistent recovery state reached at 0/60000B8LOG: restored log file "000000010000000000000007" from archiveLOG: restored log file "000000010000000000000008" from archiveLOG: restored log file "000000010000000000000009" from archiveLOG: restored log file "00000001000000000000000A" from archivecp: /tmp/archive/00000001000000000000000B: No such file ordirectoryLOG: could not open file "pg_xlog/00000001000000000000000B" (logfile 0, segment 11): No such file or directoryLOG: redo done at 0/AD5CE40LOG: last completed transaction was at log time 2013-03-1018:05:33.852992+01LOG: restored log file "00000001000000000000000A" from archivecp: /tmp/archive/00000002.history: No such file or directoryLOG: selected new timeline ID: 2cp: /tmp/archive/00000001.history: No such file or directoryLOG: archive recovery completeLOG: database system is ready to accept connectionsLOG: autovacuum launcher started The amount of log produced by the database tells us everything we need to know about the restore process and it is definitely worth investigating this information in detail. The first line indicates that PostgreSQL has found out that it has been interrupted and that it has to start recovery. From the database instance point of view, a base backup looks more or less like a crash needing some instant care by replaying XLOG; this is precisely what we want. The next couple of lines (restored log file ...) indicate that we are replaying one XLOG file after the other that have been created since the base backup. It is worth mentioning that the replay process starts at the sixth file. The base backup knows where to start, so PostgreSQL will automatically look for the right XLOG file. The message displayed after PostgreSQL reaches the sixth file (consistent recovery state reached at 0/60000B8) is of importance. PostgreSQL states that it has reached a consistent state. This is important. The reason is that the data files inside a base backup are actually broken by definition, but, the data files are not broken beyond repair. As long as we have enough XLOG to recover, we are very well off. If you cannot reach a consistent state, your database instance will not be usable and your recovery cannot work without providing additional XLOG. Practically speaking, not being able to reach a consistent state usually indicates a problem somewhere in your archiving process and your system setup. If everything up to now has been working properly, there is no reason not to reach a consistent state. Once we have reached a consistent state, one file after the other will be replayed successfully until the system finally looks for the 00000001000000000000000B file. The problem is that this file has not been created by the source database instance. Logically, an error pops up. Not finding the last file is absolutely normal; this type of error is expected if the recovery_target_time does not ask PostgreSQL to stop recovery before it reaches the end of the XLOG stream. Don't worry, your system is actually fine. You have successfully replayed everything to the file showing up exactly before the error message. As soon as all the XLOG has been consumed and the error message discussed earlier has been issued, PostgreSQL reports the last transaction it was able or supposed to replay, and starts up. You have a fully recovered database instance now and you can connect to the database instantly. As soon as the recovery has ended, recovery.conf will be renamed by PostgreSQL to recovery.done to make sure that it does not do any harm when the new instance is restarted later on at some point. More sophisticated positioning in the XLOG Up to now, we have recovered a database up to the very latest moment available in our 16 MB chunks of transaction log. We have also seen that you can define the desired recovery timestamp. But the question now is: How do you know which point in time to perform the recovery to? Just imagine somebody has deleted a table during the day. What if you cannot easily determine the recovery timestamp right away? What if you want to recover to a certain transaction? recovery.conf has all you need. If you want to replay until a certain transaction, you can refer to recovery_target_xid. Just specify the transaction you need and configure recovery_target_inclusive to include this very specific transaction or not. Using this setting is technically easy but as mentioned before, it is not easy by far to find the right position to replay to. In a typical setup, the best way to find a reasonable point to stop recovery is to use pause_at_recovery_target. If this is set to true, PostgreSQL will not automatically turn into a productive instance if the recovery point has been reached. Instead, it will wait for further instructions from the database administrator. This is especially useful if you don't know exactly how far to replay. You can replay, log in, see how far the database is, change to the next target time, and continue replaying in small steps. You have to set hot_standby = on in postgresql.conf to allow reading during recovery. Resuming recovery after PostgreSQL has paused can be done by calling a simple SQL statement: SELECT pg_xlog_replay_resume(). It will make the instance move to the next position you have set in recovery.conf. Once you have found the right place, you can set the pause_at_recovery_target back to false and call pg_xlog_replay_resume. Alternatively, you can simply utilize pg_ctl –D ... promote to stop recovery and make the instance operational. Was this explanation too complicated? Let us boil it down to a simple list: Add a restore_command to the recovery.conf file. Add recovery_target_time to the recovery.conf file. Set pause_at_recovery_target to true in the recovery.conf file. Set hot_standby to on in postgresql.conf file. Start the instance to be recovered. Connect to the instance once it has reached a consistent state and as soon as it stops recovering. Check if you are already where you want to be. If you are not: Change recovery_target_time. Run SELECT pg_xlog_replay_resume(). Check again and repeat this section if it is necessary. Keep in mind that once recovery has finished and once PostgreSQL has started up as a normal database instance, there is (as of 9.2) no way to replay XLOG later on. Instead of going through this process, you can of course always use filesystem snapshots. A filesystem snapshot will always work with PostgreSQL because when you restart a snapshotted database instance, it will simply believe that it had crashed before and recover normally. Cleaning up the XLOG on the way Once you have configured archiving, you have to store the XLOG being created by the source server. Logically, this cannot happen forever. At some point, you really have to get rid of this XLOG; it is essential to have a sane and sustainable cleanup policy for your files. Keep in mind, however, that you must keep enough XLOG so that you can always perform recovery from the latest base backup. But if you are certain that a specific base backup is not needed anymore, you can safely clean out all the XLOG that is older than the base backup you want to keep. How can an administrator figure out what to delete? The best method is to simply take a look at your archive directory: 000000010000000000000005000000010000000000000006000000010000000000000006.00000020.backup000000010000000000000007000000010000000000000008 Check out the filename in the middle of the listing. The .backup file has been created by the base backup. It contains some information about the way the base backup has been made and tells the system where to continue replaying the XLOG. If the backup file belongs to the oldest base backup you need to keep around, you can safely erase all the XLOG lower than file number 6; in this case, file number 5 could be safely deleted. In our case, 000000010000000000000006.00000020.backup contains the following information: START WAL LOCATION: 0/6000020 (file 000000010000000000000006)STOP WAL LOCATION: 0/60000E0 (file 000000010000000000000006)CHECKPOINT LOCATION: 0/6000058BACKUP METHOD: streamedBACKUP FROM: masterSTART TIME: 2013-03-10 18:04:29 CETLABEL: pg_basebackup base backupSTOP TIME: 2013-03-10 18:04:30 CET The .backup file will also provide you with relevant information such as the time the base backup has been made. It is plain there and so it should be easy for ordinary users to read this information. As an alternative to deleting all the XLOG files at one point, it is also possible to clean them up during replay. One way is to hide an rm command inside your restore_command. While this is technically possible, it is not necessarily wise to do so (what if you want to recover again?). Also, you can add the recovery_end_command command to your recovery.conf file. The goal of recovery_end_command is to allow you to automatically trigger some action as soon as the recovery ends. Again, PostgreSQL will call a script doing precisely what you want. You can easily abuse this setting to clean up the old XLOG when the database declares itself active. Switching the XLOG files If you are going for an XLOG file-based recovery, you have seen that one XLOG will be archived every 16 MB. What would happen if you never manage to create 16 MB of changes? What if you are a small supermarket, which just makes 50 sales a day? Your system will never manage to fill up 16 MB in time. However, if your system crashes, the potential data loss can be seen as the amount of data in your last unfinished XLOG file. Maybe this is not good enough for you. A postgresql.conf setting on the source database might help. The archive_timeout tells PostgreSQL to create a new XLOG file at least every x seconds. So, if you are this little supermarket, you can ask the database to create a new XLOG file every day shortly before you are heading for home. In this case, you can be sure that the data of the day will safely be on your backup device already. It is also possible to make PostgreSQL switch to the next XLOG file by hand. A procedure named pg_switch_xlog() is provided by the server to do the job: test=# SELECT pg_switch_xlog();pg_switch_xlog----------------0/17C0EF8(1 row) You might want to call this procedure when some important patch job has finished or if you want to make sure that a certain chunk of data is safely in your XLOG archive. Summary In this article, you have learned about Point-In-Time-Recovery, which is a safe and easy way to restore your PostgreSQL database to any desired point in time. PITR will help you to implement better backup policies and make your setups more robust. Resources for Article: Further resources on this subject: Introduction to PostgreSQL 9 [Article] PostgreSQL: Tips and Tricks [Article] PostgreSQL 9: Reliable Controller and Disk Setup [Article]
Read more
  • 0
  • 0
  • 5261
Modal Close icon
Modal Close icon