

Basic and Interactive Plots

Packt
06 Feb 2015
19 min read
In this article by Atmajitsinh Gohil, author of the book R Data Visualization Cookbook, we will cover the following topics:

A simple bar plot
A simple line plot
Line plot to tell an effective story
Merging histograms
Making an interactive bubble plot

(For more resources related to this topic, see here.)

The main motivation behind this article is to introduce the basics of plotting in R and an element of interactivity via the googleVis package. The basic plots are important as many packages developed in R use basic plot arguments, and hence understanding them creates a good foundation for new R users. We will start by exploring the scatter plots in R, which are the most basic plots for exploratory data analysis, and then delve into interactive plots. Every section will start with an introduction to basic R plots, and we will build interactive plots thereafter.

We will utilize the power of R analytics and implement it using the googleVis package to introduce the element of interactivity. The googleVis package is developed by Google and uses the Google Chart API to create interactive plots. There is a range of plots available with the googleVis package, which gives us the advantage of plotting the same data on various charts and selecting the one that delivers an effective message. The package undergoes regular updates and releases, and new charts are implemented with every release. Readers should note that there are other alternatives available to create interactive plots in R, but it is not possible to explore all of them, and hence I have selected googleVis to display interactive elements in a chart. I have selected it purely based on my experience with interactivity in plots. Another good interactive package is offered by GGobi.

A simple bar plot

A bar plot can often be confused with a histogram. Histograms are used to study the distribution of data, whereas bar plots are used to study categorical data. Both plots may look similar to the naked eye, but the main difference is that the width of a bar plot is not of significance, whereas in histograms the width of the bars signifies the frequency of data. In this recipe, I have made use of the infant mortality rate in India. The data is made available by the Government of India. The main objective is to study the basics of a bar plot in R as shown in the following screenshot:

How to do it…

We start the recipe by importing our data in R using the read.csv() function. R will search for the data under the current directory, and hence we use the setwd() function to set our working directory:

setwd("D:/book/scatter_Area/chapter2")
data = read.csv("infant.csv", header = TRUE)

Once we import the data, we would like to process it by ordering it. We order the data using the order() function in R. We would like R to order the column Total2011 in decreasing order:

data = data[order(data$Total2011, decreasing = TRUE),]

We use the ifelse() function to create a new column. We will utilize this new column to add different colors to the bars in our plot. We could also write a loop in R to do this task, but we will keep this for later; the ifelse() function is quick and easy. We instruct R to assign yes if values in the column Total2011 are more than 12.2 and no otherwise. The 12.2 value is not randomly chosen but is the average infant mortality rate of India:

new = ifelse(data$Total2011>12.2,"yes","no")

Next, we would like to join the vector of yes and no to our original dataset. In R, we can join columns using the cbind() function.
Rows can be combined using rbind():

data = cbind(data,new)

When we initially plot the bar plot, we observe that we need more space at the bottom of the plot. We adjust the margins of a plot in R by passing the mar argument within the par() function. The mar argument takes four values: bottom, left, top, and right spacing:

par(mar = c(10,5,5,5))

Next, we generate a bar plot in R using the barplot() function. The abline() function is used to add a horizontal line on the bar plot:

barplot(data$Total2011, las = 2, names.arg = data$India, width = 0.80,
  border = NA, ylim = c(0,20), col = "#e34a33",
  main = "Infant Mortality Rate of India in 2011")
abline(h = 12.2, lwd = 2, col = "white", lty = 2)

How it works…

The order() function uses permutation to rearrange (decreasing or increasing) the rows based on the variable. We would like to plot the bars from highest to lowest, and hence we need to arrange the data. The ifelse() function is used to generate a new column. We will use this column under the There's more… section of this recipe. The first argument of the ifelse() function is the logical test to be performed. The second argument is the value to be assigned if the test is true, and the third argument is the value to be assigned if the logical test fails.

The first argument in the barplot() function defines the height of the bars, and horiz = TRUE (not used in our code) instructs R to plot the bars horizontally. The default setting in R will plot the bars vertically. The names.arg argument is used to label the bars. We also specify border = NA to remove the borders, and las = 2 is specified to set the direction of our labels. Try replacing the las value with 1, 2, 3, or 4 and observe how the orientation of our labels changes. The first argument in the abline() function assigns the position where the line is drawn, that is, vertical or horizontal. The lwd, lty, and col arguments are used to define the width, line type, and color of the line.

There's more…

While plotting a bar plot, it's a good practice to order the data in ascending or descending order. An unordered bar plot does not convey the right message, and the plot is hard to read when there are more bars involved. When we observe a plot, we are interested in getting the most information out of it, and ordering the data is the first step toward achieving this objective.

We have not yet specified how we can use the ifelse() and cbind() functions in the plot. If we would like to color the bars differently to let the readers know which states have infant mortality above the country level, we can do this by using col = (data$new) in place of col = "#e34a33".

See also

New York Times has a very interesting implementation of an interactive bar chart, which can be accessed at http://www.nytimes.com/interactive/2007/09/28/business/20070930_SAFETY_GRAPHIC.html
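To tie the There's more… tip together, the following is a minimal end-to-end sketch of this recipe using a small made-up data frame in place of infant.csv; the column names (India, Total2011) mirror the ones used above, and the lighter hex color chosen for the below-average bars is an arbitrary choice.

# Sketch of the bar plot recipe with a made-up data frame standing in for infant.csv
data <- data.frame(India = c("State A", "State B", "State C", "State D"),
                   Total2011 = c(18.4, 15.1, 10.2, 7.6))
data <- data[order(data$Total2011, decreasing = TRUE), ]
new  <- ifelse(data$Total2011 > 12.2, "yes", "no")
data <- cbind(data, new)

par(mar = c(10, 5, 5, 5))
barplot(data$Total2011, las = 2, names.arg = data$India, border = NA,
        ylim = c(0, 20),
        col = ifelse(data$new == "yes", "#e34a33", "#fdbb84"),  # color by the ifelse() column
        main = "Infant Mortality Rate of India in 2011")
abline(h = 12.2, lwd = 2, col = "white", lty = 2)

Mapping the yes/no column to explicit hex colors with ifelse(), rather than passing the column directly, keeps the colors valid R color values while achieving the same effect described above.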
A simple line plot

Line plots are simply lines connecting all the x and y dots. They are very easy to interpret and are widely used to display an upward or downward trend in data. In this recipe, we will use the googleVis package and create an interactive R line plot. We will learn how we can emphasize certain variables in our data. The following line plot shows the fertility rate:

Getting ready

We will use the googleVis package to generate a line plot.

How to do it…

In order to construct a line chart, we will install and load the googleVis package in R. We will also import the fertility data using the read.csv() function:

install.packages("googleVis")
library(googleVis)
frt = read.csv("fertility.csv", header = TRUE, sep =",")

The fertility data is downloaded from the OECD website. We can construct our line object using the gvisLineChart() function:

line = gvisLineChart(frt, xvar = "Year",
  yvar = c("Australia","Austria","Belgium","Canada","Chile","OECD34"),
  options = list(
    width = 1100, height = 500,
    backgroundColor = "#FFFF99",
    title = "Fertility Rate in OECD countries",
    vAxis = "{title : 'Total Fertility Rate', gridlines:{color:'#DEDECE', count : 4}, ticks : [0,1,2,3,4]}",
    series = "{0:{color:'black', visibleInLegend :false},
               1:{color:'BDBD9D', visibleInLegend :false},
               2:{color:'BDBD9D', visibleInLegend :false},
               3:{color:'BDBD9D', visibleInLegend :false},
               4:{color:'BDBD9D', visibleInLegend :false},
               34:{color:'3333FF', visibleInLegend :true}}"))

We can construct the visualization using the plot() function in R:

plot(line)

How it works…

The first three arguments of the gvisLineChart() function are the data and the names of the columns to be plotted on the x-axis and y-axis. The options argument lists the Chart API options used to add and modify elements of a chart. For the purpose of this recipe, we will use part of the dataset. Hence, while we assign the series to be plotted under yvar = c(), we specify the column names that we would like to be plotted in our chart. Note that the series starts at 0, and hence Australia, which is the first column, is in fact series 0 and not 1.

For the purpose of this exercise, let's assume that we would like to demonstrate the mean fertility rate among all OECD economies to our audience. We can achieve this using series {} under options = list(). The series argument allows us to specify or customize a specific series in our dataset. Under the gvisLineChart() function, we instruct the Google Chart API to color the OECD series (series 34) and Australia (series 0) with a different color and also make the legend visible only for OECD and not for the entire set of series. It would be best to display all the legends, but we use this to show the flexibility that comes with the Google Chart API. Finally, we can use the plot() function to plot the chart in a browser.

The following screenshot displays a part of the data. The dim() function gives us a general idea about the dimensions of the fertility data:

New York Times visualizations often combine line plots with bar charts and pie charts. Readers should try constructing such a visualization. We can use the gvisMerge() function to merge plots. The function allows merging of just two plots, and hence readers would have to use multiple gvisMerge() calls to create a very similar visualization. The same can also be constructed in base R, but we would lose the interactive element.

See also

The OECD website provides economic data related to OECD member countries. The data can be freely downloaded from http://www.oecd.org/statistics/.
New York Times Visualization combines bar charts and line charts and can be accessed at http://www.nytimes.com/imagepages/2009/10/16/business/20091017_CHARTS_GRAPHIC.html.
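The gvisMerge() idea mentioned above can be sketched as follows. This is only a template: it assumes the frt data frame from this recipe (with a Year column and an OECD34 column) and pairs the line chart with a column chart of the same series before merging them side by side.

library(googleVis)

# Sketch: pair a line chart with a column chart of the same series and
# place them side by side, in the spirit of the combined NYT layouts.
ln  <- gvisLineChart(frt, xvar = "Year", yvar = "OECD34",
                     options = list(title = "OECD average fertility rate"))
cl  <- gvisColumnChart(frt, xvar = "Year", yvar = "OECD34",
                       options = list(title = "OECD average fertility rate"))

merged <- gvisMerge(ln, cl, horizontal = TRUE)
plot(merged)

Because gvisMerge() only accepts two charts at a time, a three-panel layout would nest a second call, for example gvisMerge(merged, another_chart).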
Line plot to tell an effective story

In the previous recipe, we learned how to plot a very basic line plot and use some of the options. In this recipe, we will go a step further and make use of specific visual cues such as color and line width for easy interpretation. Line charts are a great tool to visualize time series data. The fertility data is discrete, but connecting points over time provides our audience with a direction. The visualization shows the amazing progress countries such as Mexico and Turkey have achieved in reducing their fertility rate. The OECD defines the fertility rate as follows: "Refers to the number of children that would be born per woman, assuming no female mortality at child-bearing ages and the age-specific fertility rates of a specified country and reference period."

Line plots have been widely used by New York Times to create very interesting infographics. This recipe is inspired by one of the New York Times visualizations. It is very important to understand that many of the infographics created by professionals are built using D3.js or Processing. We will not go into the details of these tools, but it is good to know how they work and how they can be used to create visualizations.

Getting ready

We will need to install and load the googleVis package to construct a line chart.

How to do it…

To generate an interactive plot, we will load the fertility data in R using the read.csv() function. To generate a line chart that plots the entire dataset, we will use the gvisLineChart() function:

line = gvisLineChart(frt, xvar = "Year",
  yvar = c("Australia","Austria","Belgium","Canada","Chile","Czech.Republic",
    "Denmark","Estonia","Finland","France","Germany","Greece","Hungary",
    "Iceland","Ireland","Israel","Italy","Japan","Korea","Luxembourg","Mexico",
    "Netherlands","New.Zealand","Norway","Poland","Portugal","Slovakia","Slovenia",
    "Spain","Sweden","Switzerland","Turkey","United.Kingdom","United.States","OECD34"),
  options = list(
    width = 1200, backgroundColor = "#ADAD85",
    title = "Fertility Rate in OECD countries",
    vAxis = "{gridlines:{color:'#DEDECE', count : 3}, ticks : [0,1,2,3,4]}",
    series = "{0:{color:'BDBD9D', visibleInLegend :false},
               20:{color:'009933', visibleInLegend :true},
               31:{color:'996600', visibleInLegend :true},
               34:{color:'3333FF', visibleInLegend :true}}"))

To display our visualization in a new browser, we use the generic R plot() function:

plot(line)

How it works…

The arguments passed to the gvisLineChart() function are exactly the same as discussed under the simple line plot, with some minor changes. We would like to plot the entire data for this exercise, and hence we have to state all the column names in yvar = c(). Also, we would like to color all the series with the same color but highlight Mexico, Turkey, and the OECD average. We have achieved this in the previous code using series {}, and we further specify and customize colors and legend visibility for specific countries.

In this particular plot, we have made use of the same color for all the economies but have highlighted Mexico and Turkey to signify the development and growth that took place in the 5-year period. It would also be effective if our audience could compare the OECD average with Mexico and Turkey; this provides the audience with a benchmark they can compare against. If we plot all the legends, it may make the plot too crowded, and 34 legends may not make a very attractive plot. We avoid this by only making specific legends visible.

See also

D3 is a great tool to develop interactive visualizations and can be accessed at http://d3js.org/.
Processing is an open source software developed by MIT and can be downloaded from https://processing.org/.
A good resource to pick colors and use them in our plots is http://www.w3schools.com/tags/ref_colorpicker.asp.
I have used New York Times infographics as an inspiration for this plot. You can find a collection of visualizations put out by New York Times in 2011 at http://www.smallmeans.com/new-york-times-infographics/.
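Because the series option refers to columns by zero-based position, counting positions by hand for 30-plus countries is error-prone. The following sketch builds the series string programmatically for the highlighted countries; it assumes the frt data frame from this recipe, with Year as its first column and the remaining columns plotted in order.

# Sketch: build the series option string from country names instead of
# counting zero-based column positions by hand.
highlight   <- c(Mexico = "009933", Turkey = "996600", OECD34 = "3333FF")
series_cols <- setdiff(names(frt), "Year")            # the plotted series, in order
idx         <- match(names(highlight), series_cols) - 1  # zero-based series indices

series_str <- paste0("{",
  paste0(idx, ":{color:'", highlight, "', visibleInLegend :true}", collapse = ", "),
  "}")
series_str
# The result, e.g. "{20:{color:'009933', visibleInLegend :true}, ...}",
# can then be passed as series = series_str inside options = list().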
Merging histograms

Histograms help in studying the underlying distribution. They are even more useful when we are trying to compare more than one histogram on the same plot; this provides us with greater insight into the skewness and the overall distribution. In this recipe, we will study how to plot a histogram using the googleVis package and how to merge more than one histogram on the same page. We will only merge two plots, but we can merge more plots and adjust the width of each plot. This makes it easier to compare all the plots on the same page. The following plot shows two merged histograms:

How to do it…

In order to generate a histogram, we will install the googleVis package as well as load it in R:

install.packages("googleVis")
library(googleVis)

We have downloaded the prices of two different stocks and have calculated their daily returns over the entire period. We can load the data in R using the read.csv() function. Our main aim in this recipe is to plot two different histograms and place them side by side in a browser. Hence, we need to divide our data into three different data frames. For the purpose of this recipe, we will plot the aapl and msft data frames:

stk = read.csv("stock_cor.csv", header = TRUE, sep = ",")
aapl = data.frame(stk$AAPL)
msft = data.frame(stk$MSFT)
googl = data.frame(stk$GOOGL)

To generate the histograms, we implement the gvisHistogram() function:

al = gvisHistogram(aapl, options = list(
  histogram = "{bucketSize :1}",
  legend = "none",
  title = 'Distribution of AAPL Returns',
  width = 500,
  hAxis = "{showTextEvery: 5, title: 'Returns'}",
  vAxis = "{gridlines : {count:4}, title : 'Frequency'}"))

mft = gvisHistogram(msft, options = list(
  histogram = "{bucketSize :1}",
  legend = "none",
  title = 'Distribution of MSFT Returns',
  width = 500,
  hAxis = "{showTextEvery: 5, title: 'Returns'}",
  vAxis = "{gridlines : {count:4}, title : 'Frequency'}"))

We combine the two gvis objects in one browser using the gvisMerge() function:

mrg = gvisMerge(al, mft, horizontal = TRUE)
plot(mrg)

How it works…

The data.frame() function is used to construct a data frame in R. We require this step as we do not want to plot all three histograms on the same plot. Note the use of the $ notation in the data.frame() function. The first argument in the gvisHistogram() function is our data stored as a data frame. We can display the individual histograms using plot(al) and plot(mft), but in this recipe we will plot the final merged output.

We observe that most of the attributes of the histogram function are the same as discussed in previous recipes. The histogram functionality will use an algorithm to create buckets, but we can control this using bucketSize, as in histogram = "{bucketSize :1}". Try using different bucket sizes and observe how the buckets in the histograms change. More options related to histograms can be found under the Controlling Buckets section at https://developers.google.com/chart/interactive/docs/gallery/histogram#Buckets

We have utilized showTextEvery, which is also very specific to histograms. This option allows us to specify how many horizontal axis labels we would like to show. We have used 5 to make the histogram more compact. Our main objective is to observe the distribution, and the plot serves our purpose. Finally, we implement plot() to plot the chart in our favorite browser. We follow the same steps to plot the return distribution of Microsoft (MSFT). Now, we would like to place both plots side by side and view the differences in the distributions. We use the gvisMerge() function to generate the histograms side by side. In our recipe, we have two plots, for AAPL and MSFT. The default setting plots each chart vertically, but we can specify horizontal = TRUE to plot the charts horizontally.
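The recipe assumes that the daily returns have already been computed and saved in stock_cor.csv. If you are starting from raw closing prices, a percentage log-return series can be computed along the following lines; the prices here are made up, so substitute your own vectors.

# Sketch: percentage log returns from a vector of daily closing prices.
aapl_close <- c(108.2, 109.1, 107.5, 110.3, 111.0)   # made-up closing prices
aapl_ret   <- diff(log(aapl_close)) * 100             # daily % log return
aapl       <- data.frame(AAPL = aapl_ret)             # ready to pass to gvisHistogram()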
Making an interactive bubble plot

My first encounter with a bubble plot was while watching a TED video by Hans Rosling. The video led me to search for ways of creating bubble plots in R; a very good introduction to this is available on the Flowing Data website. The advantage of a bubble plot is that it allows us to visualize a third variable, which in our case is the size of the bubble. In this recipe, I have made use of the googleVis package to plot a bubble plot, but you can also implement this in base R. The advantage of the Google Chart API is the interactivity and the ease with which the charts can be attached to a web page. Also note that we could use squares instead of circles, but this is not implemented in the Google Chart API yet. In order to implement a bubble plot, I have downloaded the crime dataset by state. The details regarding the link and definition of the crime data are available in the crime.txt file and are shown in the following screenshot:

How to do it…

As with all the plots in this article, we will install and load the googleVis package. We will also import our data file in R using the read.csv() function:

crm = read.csv("crimeusa.csv", header = TRUE, sep =",")

We can construct our bubble chart using the gvisBubbleChart() function in R:

bub1 = gvisBubbleChart(crm, idvar = "States", xvar = "Robbery", yvar = "Burglary",
  sizevar = "Population", colorvar = "Year",
  options = list(legend = "none", width = 900, height = 600,
    title = "Crime per State in 2012",
    sizeAxis = "{maxSize : 40, minSize:0.5}",
    vAxis = "{title : 'Burglary'}",
    hAxis = "{title :'Robbery'}"))

bub2 = gvisBubbleChart(crm, idvar = "States", xvar = "Robbery", yvar = "Burglary",
  sizevar = "Population",
  options = list(legend = "none", width = 900, height = 600,
    title = "Crime per State in 2012",
    sizeAxis = "{maxSize : 40, minSize:0.5}",
    vAxis = "{title : 'Burglary'}",
    hAxis = "{title :'Robbery'}"))

How it works…

The gvisBubbleChart() function uses six attributes to create a bubble chart, which are as follows:

data: This is the data defined as a data frame, in our example, crm
idvar: This is the vector that is used to assign IDs to the bubbles, in our example, States
xvar: This is the column in the data to plot on the x-axis, in our example, Robbery
yvar: This is the column in the data to plot on the y-axis, in our example, Burglary
sizevar: This is the column used to define the size of the bubble
colorvar: This is the column used to define the color

We can define the minimum and maximum sizes of each bubble using minSize and maxSize, respectively, under options(). Note that we have used gvisMerge to portray the differences between the two bubble plots. In the plot on the right, we have not made use of colorvar, and hence all the bubbles are of the same color.

There's more…

The Google Chart API makes it easier for us to plot a bubble chart, but the same can be achieved using the basic R plot functions. We can make use of the symbols() function to create such a plot. The symbols need not be bubbles; they can be squares as well.
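As a concrete illustration of the base R alternative mentioned above, the Flowing Data approach uses the symbols() function, scaling the circle radius by the square root of the size variable so that the bubble area, rather than the radius, is proportional to it. The sketch below assumes the crm data frame from this recipe, with States, Robbery, Burglary, and Population columns.

# Sketch: a base R bubble plot with symbols(); circle areas scale with Population.
radius <- sqrt(crm$Population / pi)
symbols(crm$Robbery, crm$Burglary, circles = radius, inches = 0.35,
        fg = "white", bg = "#e34a33",
        xlab = "Robbery", ylab = "Burglary",
        main = "Crime per State in 2012")
text(crm$Robbery, crm$Burglary, crm$States, cex = 0.6)  # label each bubble with its state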
By this time, you should have watched Hans Rosling's TED lecture and would be wondering how you could create a motion chart with bubbles floating around. The Google Chart API has the ability to create motion charts, and readers can use the googleVis reference manual to learn how to build them.

See also

The TED video by Hans Rosling can be accessed at http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen
The Flowing Data website generates bubble charts using the basic R plot function and can be accessed at http://flowingdata.com/2010/11/23/how-to-make-bubble-charts/
The Animated Bubble Chart by New York Times can be accessed at http://2010games.nytimes.com/medals/map.html

Summary

This article introduces some of the basic R plots, such as line and bar charts. It also discusses the basic elements of interactive plots using the googleVis package in R. This article is a great resource for understanding the basic R plotting techniques.

Resources for Article:

Further resources on this subject:
Using R for Statistics, Research, and Graphics [article]
Data visualization [article]
Visualization as a Tool to Understand Data [article]


Structural Equation Modeling and Confirmatory Factor Analysis

Packt
06 Feb 2015
30 min read
In this article by Paul Gerrard and Radia M. Johnson, the authors of Mastering Scientific Computation with R, we'll discuss the fundamental ideas underlying structural equation modeling (SEM), which are often overlooked in other books discussing SEM in R, and then delve into how SEM is done in R. We will discuss two R packages, OpenMx and lavaan. We can directly apply our discussion of the linear algebra underlying SEM using OpenMx, and because of this, we will go over OpenMx first. We will then discuss lavaan, which is probably more user friendly because it sweeps the matrices and linear algebra representations under the rug so that they are invisible unless the user really goes looking for them. Both packages continue to be developed, and there will always be some features better supported in one of these packages than in the other.

(For more resources related to this topic, see here.)

SEM model fitting and estimation methods

To ultimately find a good solution, the software has to use trial and error to come up with an implied covariance matrix that matches the observed covariance matrix as well as possible. The question is: what does "as well as possible" mean? The answer is that the software must try to minimize some particular criterion, usually some sort of discrepancy function. Just what that criterion is depends on the estimation method used. The most commonly used estimation methods in SEM include:

Ordinary least squares (OLS), also called unweighted least squares
Generalized least squares (GLS)
Maximum likelihood (ML)

There are a number of other estimation methods as well, some of which can be done in R, but here we will stick to describing the most common ones. In general, OLS is the simplest and computationally cheapest estimation method, GLS is computationally more demanding, and ML is the most computationally intensive. We will see why this is as we discuss the details of these estimation methods.

Any SEM estimation method seeks to estimate model parameters that recreate the observed covariance matrix as well as possible. To evaluate how closely an implied covariance matrix matches an observed covariance matrix, we need a discrepancy function. If we assume multivariate normality of the observed variables, a least squares discrepancy function of the following form can be used:

F = (1/2) tr[((R - C)V)^2]

Here, R is the observed covariance matrix, C is the implied covariance matrix, and V is a weight matrix. The tr function refers to the trace function, which sums the elements of the main diagonal. The choice of V varies based on the SEM estimation method:

For OLS, V = I
For GLS, V = R^-1

In the case of ML estimation, we seek to minimize one of a number of similar criteria; a common form is:

F_ML = log|C| - log|R| + tr(R C^-1) - n

Here, n is the number of variables. There are a couple of points worth noting. GLS estimation inverts the observed covariance matrix, which is computationally demanding with large matrices, but which must only be done once. Alternatively, ML requires inversion of the implied covariance matrix, which changes with each iteration. Thus, each iteration requires the computationally demanding step of matrix inversion. With modern fast computers, this difference may not be noticeable, but with large SEM models, it might start to be quite time-consuming.
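A short R sketch can make these criteria concrete. Here, obs stands for the observed covariance matrix R and imp for the implied covariance matrix C, and the functions simply follow the standard forms given above.

# Sketch: the discrepancy functions described above, for an observed
# covariance matrix obs and an implied covariance matrix imp.
f_ls <- function(obs, imp, V) {
  A <- (obs - imp) %*% V
  0.5 * sum(diag(A %*% A))            # (1/2) tr[((R - C)V)^2]
}
f_ols <- function(obs, imp) f_ls(obs, imp, diag(nrow(obs)))   # V = I
f_gls <- function(obs, imp) f_ls(obs, imp, solve(obs))        # V = R^-1

f_ml <- function(obs, imp) {
  log(det(imp)) - log(det(obs)) + sum(diag(obs %*% solve(imp))) - nrow(obs)
}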
Assessing SEM model fit

The final question in an SEM model is how well the model explains the data. This is answered with the use of SEM measures of fit. Most of these measures are based on a chi-squared distribution. The fit criteria for GLS and ML (as well as a number of other estimation procedures, such as asymptotic distribution-free methods), multiplied by N - 1, are approximately chi-square distributed. Here, the capital N represents the number of observations in the dataset, as opposed to the lowercase n, which gives the number of variables. We compute the degrees of freedom as the difference between the number of known covariances (that is, the total number of values in one triangle of the observed covariance matrix) and the number of estimated parameters. This gives rise to the first test statistic for SEM models: a chi-squared significance test comparing our chi-square value to the minimum chi-square threshold needed to achieve statistical significance.

As with conventional chi-square testing, a chi-square value that is higher than this threshold will reject the null hypothesis. In most experimental science, such a rejection supports the hypothesis of the experiment. This is not the case in SEM, where the null hypothesis is that the model fits the data. Thus, a non-significant chi-square is an indicator of model fit, whereas a significant chi-square rejects model fit. A notable limitation of this is that a greater sample size, that is, a greater N, will increase the chi-square value and will therefore increase the power to reject model fit. Thus, using conventional chi-squared testing will tend to support models developed in small samples and reject models developed in large samples.

The choice and interpretation of fit measures is a contentious one in the SEM literature. However, as can be seen, chi-square has limitations. As such, other model fit criteria were developed that do not penalize models that fit in large samples (some may penalize models fit to small samples though). There are over a dozen indices, but the most common fit indices and their interpretation are as follows:

Comparative fit index: In this index, a higher value is better. Conventionally, a value of greater than 0.9 was considered an indicator of good model fit, but some might argue that a value of at least 0.95 is needed. This is relatively sample size insensitive.
Root mean square error of approximation: A value of under 0.08 (smaller is better) is often considered necessary to achieve model fit. However, this fit measure is quite sample size sensitive, penalizing small sample studies.
Tucker-Lewis index (non-normed fit index): This is interpreted in a similar manner as the comparative fit index. Also, this is not very sample size sensitive.
Standardized root mean square residual: In this index, a lower value is better. A value of 0.06 or less is considered needed for model fit. Also, this may penalize small samples.

In the next section, we will show you how to actually fit SEM models in R and how to evaluate fit using fit measures.
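Before moving on, the chi-square test described above can be sketched in R. Given the minimized fit value, the sample size N, the number of observed variables, and the number of free parameters, the statistic, degrees of freedom, and p-value follow directly; the numbers in the example call are made up.

# Sketch: model chi-square from a minimized ML/GLS fit value.
sem_chisq <- function(F_min, N, n_vars, n_free) {
  chisq <- (N - 1) * F_min
  df    <- n_vars * (n_vars + 1) / 2 - n_free  # known (co)variances minus free parameters
  p     <- pchisq(chisq, df, lower.tail = FALSE)
  c(chisq = chisq, df = df, p = p)
}
sem_chisq(F_min = 0.52, N = 301, n_vars = 9, n_free = 21)  # hypothetical values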
Using OpenMx and matrix specification of an SEM

We went through the basic principles of SEM and discussed the basic computational approach by which this can be achieved. SEM remains an active area of research (with an entire journal devoted to it, Structural Equation Modeling), so there are many additional peculiarities, but rather than delving into all of them, we will start by actually fitting an SEM model in R. OpenMx is not in the CRAN repository, but it is easily obtainable from the OpenMx website by typing the following in R:

source('http://openmx.psyc.virginia.edu/getOpenMx.R')

Summarizing the OpenMx approach

In this example, we will use OpenMx by specifying matrices as mentioned earlier. To fit an OpenMx model, we need to first specify the model and then tell the software to attempt to fit it. Model specification involves four components:

Specifying the model matrices; this has two parts: declaring starting values for the estimation, and declaring which values can be estimated and which are fixed
Telling OpenMx the algebraic relationship of the matrices that should produce an implied covariance matrix
Giving an instruction for the model fitting criterion
Providing a source of data

The R commands that correspond to each of these steps are:

mxMatrix
mxAlgebra
mxMLObjective
mxData

We will then pass the objects created with each of these commands to create an SEM model using mxModel.

Explaining an entire example

First, to make things simple, we will store the FALSE and TRUE logical values in single-letter variables, which will be convenient when we have matrices full of TRUE and FALSE values:

F <- FALSE
T <- TRUE

Specifying the model matrices

Specifying matrices is done with the mxMatrix function, which returns an MxMatrix object. (Note that the object starts with a capital "M" while the function starts with a lowercase "m.") Specifying an MxMatrix is much like specifying a regular R matrix, but an MxMatrix has some additional components. The most notable difference is that there are actually two different matrices used to create an MxMatrix. The first is a matrix of starting values, and the second is a matrix that tells which starting values are free to be estimated and which are not. If a starting value is not freely estimable, then it is a fixed constant. Since the actual starting values that we choose do not really matter too much in this case, we will just pick one as a starting value for all the parameters that we would like to be estimated.
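Before the full 14 x 14 specification that follows, a minimal sketch of the idea on a 2 x 2 matrix may help: the values matrix supplies the starting values, and the free matrix marks which of them are estimated.

library(OpenMx)

# Sketch: a 2 x 2 MxMatrix in which only the [1,2] cell is freely estimated;
# the remaining cells are fixed constants.
mx.small <- mxMatrix(type = "Full", nrow = 2, ncol = 2,
                     values = c(1, 0.5,
                                0, 1),
                     free   = c(FALSE, TRUE,
                                FALSE, FALSE),
                     byrow  = TRUE,
                     name   = "Small")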
Let's take a look at the following example: mx.A <- mxMatrix( type = "Full", nrow=14, ncol=14, #Provide the Starting Values values = c(    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0 ), #Tell R which values are free to be estimated    free = c(    F, F, F, F, F, F, F, F, F, F, F, F, F, F,    F, F, F, F, F, F, F, F, F, F, F, F, T, F,    F, F, F, F, F, F, F, F, F, F, F, F, T, F,    F, F, F, F, F, F, F, F, F, F, F, F, T, F,    F, F, F, F, F, F, F, F, F, F, F, F, F, F,    F, F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, F,    F, F, F, F, F, F, F, F, F, F, F, T, F, F,    F, F, F, F, F, F, F, F, F, F, F, T, F, F,    F, F, F, F, F, F, F, F, F, F, F, F, F, F,    F, F, F, F, F, F, F, F, F, F, F, T, F, F,    F, F, F, F, F, F, F, F, F, F, F, T, T, F ), byrow=TRUE,   #Provide a matrix name that will be used in model fitting name="A", ) We will now apply this same technique to the S matrix. Here, we will create two S matrices, S1 and S2. They differ simply in the starting values that they supply. We will later try to fit an SEM model using one matrix, and then the other to address problems with the first one. The difference is that S1 uses starting variances of 1 in the diagonal, and S2 uses starting variances of 5. Here, we will use the "symm" matrix type, which is a symmetric matrix. We could use the "full" matrix type, but by using "symm", we are saved from typing all of the symmetric values in the upper half of the matrix. 
Let's take a look at the following matrix: mx.S1 <- mxMatrix("Symm", nrow=14, ncol=14, values = c(    1,    0, 1,    0, 0, 1,    0, 1, 0, 1,    1, 0, 0, 0, 1,    0, 1, 0, 0, 0, 1,    0, 0, 1, 0, 0, 0, 1,    0, 0, 0, 1, 0, 1, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 ),      free = c(    T,    F, T,    F, F, T,    F, T, F, T,    T, F, F, F, T,    F, T, F, F, F, T,    F, F, T, F, F, F, T,    F, F, F, T, F, T, F, T,    F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, T ), byrow=TRUE, name="S" )   #The alternative, S2 matrix: mx.S2 <- mxMatrix("Symm", nrow=14, ncol=14, values = c(    5,    0, 5,    0, 0, 5,    0, 1, 0, 5,    1, 0, 0, 0, 5,    0, 1, 0, 0, 0, 5,    0, 0, 1, 0, 0, 0, 5,    0, 0, 0, 1, 0, 1, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5 ),         free = c(    T,    F, T,    F, F, T,    F, T, F, T,    T, F, F, F, T,    F, T, F, F, F, T,    F, F, T, F, F, F, T,    F, F, F, T, F, T, F, T,    F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, T ), byrow=TRUE, name="S" ) mx.Filter <- mxMatrix("Full", nrow=11, ncol=14, values= c(        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,      0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0    ),    free=FALSE,    name="Filter",    byrow = TRUE ) And finally, we will create our identity and filter matrices the same way, as follows: mx.I <- mxMatrix("Full", nrow=14, ncol=14,    values= c(        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1    ),    free=FALSE,    byrow = TRUE,    name="I" ) Fitting the model Now, it is time to declare the model that we would like to fit using the mxModel command. This part includes steps 2 through step 4 mentioned earlier. Here, we will tell mxModel which matrices to use. 
We will then use the mxAlgegra command to tell R how the matrices should be combined to reproduce the implied covariance matrix. We will tell R to use ML estimation with the mxMLObjective command, and we will tell it to apply the estimation to a particular matrix algebra, which we named "C". This is simply the right-hand side of the McArdle McDonald equation. Finally, we will tell R where to get the data to use in model fitting using the following code: factorModel.1 <- mxModel("Political Democracy Model", #Model Matrices mx.A, mx.S1, mx.Filter, mx.I, #Model Fitting Instructions mxAlgebra(Filter %*% solve(I-A) %*% S %*% t(solve(I - A)) %*% t(Filter), name="C"),      mxMLObjective("C", dimnames = names(PoliticalDemocracy)),    #Data to fit mxData(cov(PoliticalDemocracy), type="cov", numObs=75) ) Now, let's tell R to fit the model and summarize the results using mxRun, as follows: summary(mxRun(factorModel.1)) Running Political Democracy Model Error in summary(mxRun(factorModel.1)) : error in evaluating the argument 'object' in selecting a method for function 'summary': Error: The job for model 'Political Democracy Model' exited abnormally with the error message: Expected covariance matrix is non-positive-definite. Uh oh! We got an error message telling us that the expected covariance matrix is not positive definite. Our observed covariance matrix is positive definite but the implied covariance matrix (at least at first) is not. This is an effect of the fact that if we multiply our starting value matrices together as specified by the McArdle McDonald equation, we get a starting implied covariance matrix. If we perform an eigenvalue decomposition of this starting implied covariance matrix, then we will find that the last eigenvalue is negative. This means a negative variance does not make much sense, and this is what "not positive definite" refers to. The good news is that this is simply our starting values, so we can fix this if we modify our starting values. In this case, we can choose values of five along the diagonal of the S matrix, and get a positive definite starting implied covariance matrix. We can rerun this using the mx.S2 matrix specified earlier and the software will proceed as follows: #Rerun with a positive definite matrix   factorModel.2 <- mxModel("Political Democracy Model", #Model Matrices mx.A, mx.S2, mx.Filter, mx.I, #Model Fitting Instructions mxAlgebra(Filter %*% solve(I-A) %*% S %*% t(solve(I - A)) %*% t(Filter), name="C"),    mxMLObjective("C", dimnames = names(PoliticalDemocracy)),    #Data to fit mxData(cov(PoliticalDemocracy), type="cov", numObs=75) )   summary(mxRun(factorModel.2)) This should provide a solution. As can be seen from the previous code, the parameters solved in the model are returned as matrix components. Just like we had to figure out how to go from paths to matrices, we now have to figure out how to go from matrices to paths (the reverse problem). In the following screenshot, we show just the first few free parameters: The preceding screenshot tells us that the parameter estimated in the position of the tenth row and twelfth column in the matrix A is 2.18. This corresponds to a path from the twelfth variable in the A matrix ind60, to the 10th variable in the matrix x2. Thus, the path coefficient from ind60 to x2 is 2.18. There are a few other pieces of information here. The first one tells us that the model has not converged but is "Mx status Green." 
This means that the model was still converging when it stopped running (that is, it did not converge), but an optimal solution was still found and therefore, the results are likely reliable. Model fit information is also provided suggesting a pretty good model fit with CFI of 0.99 and RMSEA of 0.032. This was a fair amount of work, and creating model matrices by hand from path diagrams can be quite tedious. For this reason, SEM fitting programs have generally adopted the ability to fit SEM by declaring paths rather than model matrices. OpenMx has the ability to allow declaration by paths, but applying model matrices has a few advantages. Principally, we get under the hood of SEM fitting. If we step back, we can see that OpenMx actually did very little for us that is specific to SEM. We told OpenMx how we wanted matrices multiplied together and which parameters of the matrix were free to be estimated. Instead of using the RAM specification, we could have passed the matrices of the LISREL or Bentler-Weeks models with the corresponding algebra methods to recreate an implied covariance matrix. This means that if we are trying to come up with our matrix specification, reproduce prior research, or apply a new SEM matrix specification method published in the literature, OpenMx gives us the power to do it. Also, for educators wishing to teach the underlying mathematical ideas of SEM, OpenMx is a very powerful tool. Fitting SEM models using lavaan If we were to describe OpenMx as the SEM equivalent of having a well-stocked pantry and full kitchen to create whatever you want, and you have the time and know how to do it, we might regard lavaan as a large freezer full of prepackaged microwavable dinners. It does not allow quite as much flexibility as OpenMx because it sweeps much of the work that we did by hand in OpenMx under the rug. Lavaan does use an internal matrix representation, but the user never has to see it. It is this sweeping under the rug that makes lavaan generally much easier to use. It is worth adding that the list of prepackaged features that are built into lavaan with minimal additional programming challenge many commercial SEM packages. The lavaan syntax The key to describing lavaan models is the model syntax, as follows: X =~ Y: Y is a manifestation of the latent variable X Y ~ X: Y is regressed on X Y ~~ X: The covariance between Y and X can be estimated Y ~ 1: This estimates the intercept for Y (implicitly requires mean structure) Y | a*t1 + b*t2: Y has two thresholds that is a and b Y ~ a * X: Y is regressed on X with coefficient a Y ~ start(a) * X: Y is regressed on X; the starting value used for estimation is a It may not be evident at first, but this model description language actually makes lavaan quite powerful. Wherever you have seen a or b in the previous examples, a variable or constant can be used in their place. The beauty of this is that multiple parameters can be constrained to be equal simply by assigning a single parameter name to them. Using lavaan, we can fit a factor analysis model to our physical functioning dataset with only a few lines of code: phys.func.data <- read.csv('phys_func.csv')[-1] names(phys.func.data) <- LETTERS[1:20] R has a built-in vector named LETTERS, which contains all of the capital letters of the English alphabet. The lower case vector letters contains the lowercase alphabet. We will then describe our model using the lavaan syntax. Here, we have a model of three latent variables, our factors, and each of them has manifest variables. 
Let's take a look at the following example: model.definition.1 <- ' #Factors    Cognitive =~ A + Q + R + S    Legs =~ B + C + D + H + I + J + M + N    Arms =~ E + F+ G + K +L + O + P + T    #Correlations Between Factors    Cognitive ~~ Legs    Cognitive ~~ Arms    Legs ~~ Arms ' We then tell lavaan to fit the model as follows: fit.phys.func <- cfa(model.definition.1, data=phys.func.data, ordered= c('A','B', 'C','D', 'E','F','G', 'H','I','J', 'K', 'L','M','N','O','P','Q','R', 'S', 'T')) In the previous code, we add an ordered = argument, which tells lavaan that some variables are ordinal in nature. In response, lavaan estimates polychoric correlations for these variables. Polychoric correlations assume that we binned a continuous variable into discrete categories, and attempts to explicitly model correlations assuming that there is some continuous underlying variable. Part of this requires finding thresholds (placed on an arbitrary scale) between each categorical response. (for example, threshold 1 falls between the response of 1 and 2, and so on). By telling lavaan to treat some variables as categorical, lavaan will also know to use a special estimation method. Lavaan will use diagonally weighted least squares, which does not assume normality and uses the diagonals of the polychoric correlation matrix for weights in the discrepancy function. With five response options, it is questionable as to whether polychoric correlations are truly needed. Some analysts might argue that with many response options, the data can be treated as continuous, but here we use this method to show off lavaan's capabilities. All SEM models in lavaan use the lavaan command. Here, we use the cfa command, which is one of a number of wrapper functions for the lavaan command. Others include sem and growth. These commands differ in the default options passed to the lavaan command. (For full details, see the package documentation.) Summarizing the data, we can see the loadings of each item on the factor as well as the factor intercorrelations. We can also see the thresholds between each category from the polychoric correlations as follows: summary(fit.phys.func) We can also assess things such as model fit using the fitMeasures command, which has most of the popularly used fit measures and even a few obscure ones. Here, we tell lavaan to simply extract three measures of model fit as follows: fitMeasures(fit.phys.func, c('rmsea', 'cfi', 'srmr')) Collectively, these measures suggest adequate model fit. It is worth noting here that the interpretation of fit measures largely comes from studies using maximum likelihood estimation, and there is some debate as to how well these generalize other fitting methods. The lavaan package also has the capability to use other estimators that treat the data as truly continuous in nature. For this, a particular dataset is far from multivariate normal distributed, so an estimator such as ML is appropriate to use. However, if we wanted to do so, the syntax would be as follows: fit.phys.func.ML <- cfa(model.definition.1, data=phys.func.data, estimator = 'ML') Comparing OpenMx to lavaan It can be seen that lavaan has a much simpler syntax that allows to rapidly model basic SEM models. However, we were a bit unfair to OpenMx because we used a path model specification for lavaan and a matrix specification for OpenMx. The truth is that OpenMx is still probably a bit wordier than lavaan, but let's apply a path model specification in each to do a fair head-to-head comparison. 
We will use the famous Holzinger-Swineford 1939 dataset here from the lavaan package to do our modeling, as follows: hs.dat <- HolzingerSwineford1939 We will create a new dataset with a shorter name so that we don't have to keep typing HozlingerSwineford1939. Explaining an example in lavaan We will learn to fit the Holzinger-Swineford model in this section. We will start by specifying the SEM model using the lavaan model syntax: hs.model.lavaan <- ' visual =~ x1 + x2 + x3 textual =~ x4 + x5 + x6 speed   =~ x7 + x8 + x9   visual ~~ textual visual ~~ speed textual ~~ speed '   fit.hs.lavaan <- cfa(hs.model.lavaan, data=hs.dat, std.lv = TRUE) summary(fit.hs.lavaan) Here, we add the std.lv argument to the fit function, which fixes the variance of the latent variables to 1. We do this instead of constraining the first factor loading on each variable to 1. Only the model coefficients are included for ease of viewing in this book. The result is shown in the following model: > summary(fit.hs.lavaan) …                      Estimate Std.err Z-value P(>|z|) Latent variables: visual =~    x1               0.900   0.081   11.127   0.000    x2               0.498   0.077   6.429   0.000    x3              0.656   0.074   8.817   0.000 textual =~    x4               0.990   0.057   17.474   0.000    x5               1.102   0.063   17.576   0.000    x6               0.917   0.054   17.082   0.000 speed =~    x7               0.619   0.070   8.903   0.000    x8               0.731   0.066   11.090   0.000    x9               0.670   0.065   10.305   0.000   Covariances: visual ~~    textual           0.459   0.064   7.189   0.000    speed             0.471   0.073   6.461   0.000 textual ~~    speed             0.283   0.069   4.117   0.000 Let's compare these results with a model fit in OpenMx using the same dataset and SEM model. Explaining an example in OpenMx The OpenMx syntax for path specification is substantially longer and more explicit. Let's take a look at the following model: hs.model.open.mx <- mxModel("Holzinger Swineford", type="RAM",      manifestVars = names(hs.dat)[7:15], latentVars = c('visual', 'textual', 'speed'),    # Create paths from latent to observed variables mxPath(        from = 'visual',        to = c('x1', 'x2', 'x3'),    free = c(TRUE, TRUE, TRUE),    values = 1          ), mxPath(        from = 'textual',        to = c('x4', 'x5', 'x6'),        free = c(TRUE, TRUE, TRUE),        values = 1      ), mxPath(    from = 'speed',    to = c('x7', 'x8', 'x9'),    free = c(TRUE, TRUE, TRUE),    values = 1      ), # Create covariances among latent variables mxPath(    from = 'visual',    to = 'textual',    arrows=2,    free=TRUE      ), mxPath(        from = 'visual',        to = 'speed',        arrows=2,        free=TRUE      ), mxPath(        from = 'textual',        to = 'speed',        arrows=2,        free=TRUE      ), #Create residual variance terms for the latent variables mxPath(    from= c('visual', 'textual', 'speed'),    arrows=2, #Here we are fixing the latent variances to 1 #These two lines are like st.lv = TRUE in lavaan    free=c(FALSE,FALSE,FALSE),    values=1 ), #Create residual variance terms mxPath( from= c('x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'),    arrows=2, ),    mxData(        observed=cov(hs.dat[,c(7:15)]),        type="cov",        numObs=301    ) )     fit.hs.open.mx <- mxRun(hs.model.open.mx) summary(fit.hs.open.mx) Here are the results of the OpenMx model fit, which look very similar to lavaan's. This gives a long output. 
For ease of viewing, only the most relevant parts of the output are included in the following model (the last column that R prints giving the standard error of estimates is also not shown here): > summary(fit.hs.open.mx) …   free parameters:                            name matrix     row     col Estimate Std.Error 1   Holzinger Swineford.A[1,10]     A     x1 visual 0.9011177 2   Holzinger Swineford.A[2,10]     A     x2 visual 0.4987688 3   Holzinger Swineford.A[3,10]     A     x3 visual 0.6572487 4   Holzinger Swineford.A[4,11]     A     x4 textual 0.9913408 5   Holzinger Swineford.A[5,11]     A     x5 textual 1.1034381 6   Holzinger Swineford.A[6,11]     A     x6 textual 0.9181265 7   Holzinger Swineford.A[7,12]     A     x7   speed 0.6205055 8   Holzinger Swineford.A[8,12]     A     x8 speed 0.7321655 9   Holzinger Swineford.A[9,12]     A     x9   speed 0.6710954 10   Holzinger Swineford.S[1,1]     S     x1     x1 0.5508846 11   Holzinger Swineford.S[2,2]     S     x2     x2 1.1376195 12   Holzinger Swineford.S[3,3]     S    x3     x3 0.8471385 13   Holzinger Swineford.S[4,4]     S     x4     x4 0.3724102 14   Holzinger Swineford.S[5,5]     S     x5     x5 0.4477426 15   Holzinger Swineford.S[6,6]     S     x6     x6 0.3573899 16   Holzinger Swineford.S[7,7]      S     x7     x7 0.8020562 17   Holzinger Swineford.S[8,8]     S     x8     x8 0.4893230 18   Holzinger Swineford.S[9,9]     S     x9     x9 0.5680182 19 Holzinger Swineford.S[10,11]     S visual textual 0.4585093 20 Holzinger Swineford.S[10,12]     S visual   speed 0.4705348 21 Holzinger Swineford.S[11,12]     S textual   speed 0.2829848 In summary, the results agree quite closely. For example, looking at the coefficient for the path going from the latent variable visual to the observed variable x1, lavaan gives an estimate of 0.900 while OpenMx computes a value of 0.901. Summary The lavaan package is user friendly, pretty powerful, and constantly adding new features. Alternatively, OpenMx has a steeper learning curve but tremendous flexibility in what it can do. Thus, lavaan is a bit like a large freezer full of prepackaged microwavable dinners, whereas OpenMx is like a well-stocked pantry with no prepared foods but a full kitchen that will let you prepare it if you have the time and the know-how. To run a quick analysis, it is tough to beat the simplicity of lavaan, especially given its wide range of capabilities. For large complex models, OpenMx may be a better choice. The methods covered here are useful to analyze statistical relationships when one has all of the data from events that have already occurred. Resources for Article: Further resources on this subject: Creating your first heat map in R [article] Going Viral [article] Introduction to S4 Classes [article]


PostgreSQL Cookbook - High Availability and Replication

Packt
06 Feb 2015
26 min read
In this article by Chitij Chauhan, author of the book PostgreSQL Cookbook, we will talk about various high availability and replication solutions, including some popular third-party replication tools such as Slony-I and Londiste. In this article, we will cover the following recipes: Setting up hot streaming replication Replication using Slony-I Replication using Londiste The important components for any production database is to achieve fault tolerance, 24/7 availability, and redundancy. It is for this purpose that we have different high availability and replication solutions available for PostgreSQL. From a business perspective, it is important to ensure 24/7 data availability in the event of a disaster situation or a database crash due to disk or hardware failure. In such situations, it becomes critical to ensure that a duplicate copy of the data is available on a different server or a different database, so that seamless failover can be achieved even when the primary server/database is unavailable. Setting up hot streaming replication In this recipe, we are going to set up a master-slave streaming replication. Getting ready For this exercise, you will need two Linux machines, each with the latest version of PostgreSQL installed. We will be using the following IP addresses for the master and slave servers: Master IP address: 192.168.0.4 Slave IP address: 192.168.0.5 Before you start with the master-slave streaming setup, it is important that the SSH connectivity between the master and slave is setup. How to do it... Perform the following sequence of steps to set up a master-slave streaming replication: First, we are going to create a user on the master, which will be used by the slave server to connect to the PostgreSQL database on the master server: psql -c "CREATE USER repuser REPLICATION LOGIN ENCRYPTED PASSWORD 'charlie';" Next, we will allow the replication user that was created in the previous step to allow access to the master PostgreSQL server. This is done by making the necessary changes as mentioned in the pg_hba.conf file: Vi pg_hba.conf host   replication   repuser   192.168.0.5/32   md5 In the next step, we are going to configure parameters in the postgresql.conf file. These parameters need to be set in order to get the streaming replication working: Vi /var/lib/pgsql/9.3/data/postgresql.conf listen_addresses = '*' wal_level = hot_standby max_wal_senders = 3 wal_keep_segments = 8 archive_mode = on       archive_command = 'cp %p /var/lib/pgsql/archive/%f && scp %p postgres@192.168.0.5:/var/lib/pgsql/archive/%f' checkpoint_segments = 8 Once the parameter changes have been made in the postgresql.conf file in the previous step, the next step will be to restart the PostgreSQL server on the master server, in order to let the changes take effect: pg_ctl -D /var/lib/pgsql/9.3/data restart Before the slave can replicate the master, we will need to give it the initial database to build off. For this purpose, we will make a base backup by copying the primary server's data directory to the standby. 
The rsync command needs to be run as a root user:

psql -U postgres -h 192.168.0.4 -c "SELECT pg_start_backup('label', true)"
rsync -a /var/lib/pgsql/9.3/data/ 192.168.0.5:/var/lib/pgsql/9.3/data/ --exclude postmaster.pid
psql -U postgres -h 192.168.0.4 -c "SELECT pg_stop_backup()"

Once the data directory, mentioned in the previous step, is populated, the next step is to enable the following parameter in the postgresql.conf file on the slave server:

hot_standby = on

The next step will be to copy the recovery.conf.sample file to the $PGDATA location on the slave server and then configure the following parameters:

cp /usr/pgsql-9.3/share/recovery.conf.sample /var/lib/pgsql/9.3/data/recovery.conf

standby_mode = on
primary_conninfo = 'host=192.168.0.4 port=5432 user=repuser password=charlie'
trigger_file = '/tmp/trigger.replication'
restore_command = 'cp /var/lib/pgsql/archive/%f "%p"'

The next step will be to start the slave server:

service postgresql-9.3 start

Now that the preceding replication steps are set up, we will test for replication. On the master server, log in and issue the following SQL commands:

psql -h 192.168.0.4 -d postgres -U postgres -W

postgres=# create database test;
postgres=# \c test
test=# create table testtable ( testint int, testchar varchar(40) );
CREATE TABLE
test=# insert into testtable values ( 1, 'What A Sight.' );
INSERT 0 1

On the slave server, we will now check whether the newly created database and the corresponding table, created in the previous step, have been replicated:

psql -h 192.168.0.5 -d test -U postgres -W

test=# select * from testtable;
 testint |   testchar
---------+---------------------------
       1 | What A Sight.
(1 row)

How it works...

The following is the explanation for the steps performed in the preceding section.

In the initial step of the preceding section, we create a user called repuser, which will be used by the slave server to make a connection to the primary server. In the second step, we make the necessary changes in the pg_hba.conf file to allow the master server to be accessed by the slave server using the repuser user ID that was created in step 1.

We then make the necessary parameter changes on the master in step 3 of the preceding section to configure streaming replication. The following is a description of these parameters:

listen_addresses: This parameter is used to provide the IP address associated with the interface that you want to have PostgreSQL listen to. A value of * indicates all available IP addresses.
wal_level: This parameter determines the level of WAL logging done. Specify hot_standby for streaming replication.
wal_keep_segments: This parameter specifies the number of 16 MB WAL files to be retained in the pg_xlog directory. The rule of thumb is that more such files might be required to handle a large checkpoint.
archive_mode: Setting this parameter enables completed WAL segments to be sent to the archive storage.
archive_command: This parameter is basically a shell command that is executed whenever a WAL segment is completed. In our case, we are copying the file to the local machine and then using the secure copy command to send it across to the slave.
max_wal_senders: This parameter specifies the total number of concurrent connections allowed from the slave servers.
checkpoint_segments: This parameter specifies the maximum number of logfile segments between automatic WAL checkpoints.
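As an aside, and not part of the original recipe, once the slave described in the following paragraphs is attached, an easy way to confirm that streaming is actually happening is to query the pg_stat_replication view on the master. This is only a sketch; the column names below are the ones used by PostgreSQL 9.x (they were renamed in later major versions):

-- run on the master (192.168.0.4); one row should appear per connected standby
SELECT application_name, client_addr, state,
       sent_location, replay_location, sync_state
FROM pg_stat_replication;

A state value of streaming, with replay_location close to sent_location, indicates that the slave is connected and keeping up.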
Once the necessary configuration changes have been made on the master server, we restart the PostgreSQL server on the master in order to let the new configuration take effect. This is done in step 4 of the preceding section. In step 5, we build the slave by copying the primary server's data directory to it.

Now, with the data directory available on the slave, the next step is to configure it. We make the necessary replication-related parameter changes in the postgresql.conf file on the slave server. We set the following parameter on the slave:

hot_standby: This parameter determines whether you can connect and run queries while the server is in the archive recovery or standby mode.

In the next step, we configure the recovery.conf file. This is required so that the slave can start receiving logs from the master. The following parameters are configured in the recovery.conf file on the slave:

standby_mode: This parameter, when enabled, causes PostgreSQL to work as a standby in a replication configuration.
primary_conninfo: This parameter specifies the connection information used by the slave to connect to the master. For our scenario, the master server is 192.168.0.4 on port 5432, and we use the repuser user ID with the password charlie to make the connection. Remember that repuser is the user ID that was created in the initial step of the preceding section for exactly this purpose, that is, connecting to the master from the slave.
trigger_file: When a slave is configured as a standby, it continues to restore the XLOG records from the master. The trigger_file parameter specifies the file whose presence triggers the slave to switch over from its standby duties and take over as the master or primary server.

At this stage, the slave is fully configured and we can start the slave server; the replication process then begins. This is shown in step 8 of the preceding section.

In steps 9 and 10 of the preceding section, we simply test our replication. We first create a test database, then log in to it and create a table by the name testtable, and then insert some records into the testtable table. Our purpose is to see whether these changes are replicated across to the slave. To test this, we log in to the slave on the test database and query the records from the testtable table, as seen in step 10 of the preceding section. The final result is that all the records that are changed/inserted on the primary server are visible on the slave. This completes our streaming replication setup and configuration.

Replication using Slony-I

Here, we are going to set up replication using Slony-I. We will be setting up the replication of table data between two databases on the same server.

Getting ready

The steps performed in this recipe are carried out on a CentOS Version 6 machine. It is also important to remove the directives related to hot streaming replication prior to setting up replication using Slony-I.

We will first need to install Slony-I. The following steps need to be performed in order to install Slony-I:

First, go to http://slony.info/downloads/2.2/source/ and download the given software. Once you have downloaded the Slony-I software, the next step is to unzip the .tar file and then go to the newly created directory.
Before doing this, please ensure that you have the postgresql-devel package for the corresponding PostgreSQL version installed before you install Slony-I: tar xvfj slony1-2.2.3.tar.bz2  cd slony1-2.2.3 In the next step, we are going to configure, compile, and build the software: ./configure --with-pgconfigdir=/usr/pgsql-9.3/bin/ make make install How to do it... You need to perform the following sequence of steps, in order to replicate data between two tables using Slony-I replication: First, start the PostgreSQL server if you have not already started it: pg_ctl -D $PGDATA start In the next step, we will be creating two databases, test1 and test2, which will be used as the source and target databases respectively: createdb test1 createdb test2 In the next step, we will create the t_test table on the source database, test1, and insert some records into it: psql -d test1 test1=# create table t_test (id numeric primary key, name varchar);   test1=# insert into t_test values(1,'A'),(2,'B'), (3,'C'); We will now set up the target database by copying the table definitions from the test1 source database: pg_dump -s -p 5432 -h localhost test1 | psql -h localhost -p 5432 test2 We will now connect to the target database, test2, and verify that there is no data in the tables of the test2 database: psql -d test2 test2=# select * from t_test; We will now set up a slonik script for the master-slave, that is source/target, setup. In this scenario, since we are replicating between two different databases on the same server, the only different connection string option will be the database name: cd /usr/pgsql-9.3/bin vi init_master.slonik   #!/bin/sh cluster name = mycluster; node 1 admin conninfo = 'dbname=test1 host=localhost port=5432 user=postgres password=postgres'; node 2 admin conninfo = 'dbname=test2 host=localhost port=5432 user=postgres password=postgres'; init cluster ( id=1); create set (id=1, origin=1); set add table(set id=1, origin=1, id=1, fully qualified name = 'public.t_test'); store node (id=2, event node = 1); store path (server=1, client=2, conninfo='dbname=test1 host=localhost port=5432 user=postgres password=postgres'); store path (server=2, client=1, conninfo='dbname=test2 host=localhost port=5432 user=postgres password=postgres'); store listen (origin=1, provider = 1, receiver = 2);  store listen (origin=2, provider = 2, receiver = 1); We will now create a slonik script for subscription to the slave, that is, target: cd /usr/pgsql-9.3/bin vi init_slave.slonik #!/bin/sh cluster name = mycluster; node 1 admin conninfo = 'dbname=test1 host=localhost port=5432 user=postgres password=postgres'; node 2 admin conninfo = 'dbname=test2 host=localhost port=5432 user=postgres password=postgres'; subscribe set ( id = 1, provider = 1, receiver = 2, forward = no); We will now run the init_master.slonik script created in step 6 and run this on the master, as follows: cd /usr/pgsql-9.3/bin   slonik init_master.slonik We will now run the init_slave.slonik script created in step 7 and run this on the slave, that is, target: cd /usr/pgsql-9.3/bin   slonik init_slave.slonik In the next step, we will start the master slon daemon: nohup slon mycluster "dbname=test1 host=localhost port=5432 user=postgres password=postgres" & In the next step, we will start the slave slon daemon: nohup slon mycluster "dbname=test2 host=localhost port=5432 user=postgres password=postgres" & Next, we will connect to the master, that is, the test1 source database, and insert some records in the t_test table: psql -d test1 
test1=# insert into t_test values (5,'E'); We will now test for the replication by logging on to the slave, that is, the test2 target database, and see whether the inserted records in the t_test table are visible: psql -d test2   test2=# select * from t_test; id | name ----+------ 1 | A 2 | B 3 | C 5 | E (4 rows) How it works... We will now discuss the steps performed in the preceding section: In step 1, we first start the PostgreSQL server if it is not already started. In step 2, we create two databases, namely test1 and test2, that will serve as our source (master) and target (slave) databases. In step 3, we log in to the test1 source database, create a t_test table, and insert some records into the table. In step 4, we set up the target database, test2, by copying the table definitions present in the source database and loading them into test2 using the pg_dump utility. In step 5, we log in to the target database, test2, and verify that there are no records present in the t_test table because in step 4, we only extracted the table definitions into the test2 database from the test1 database. In step 6, we set up a slonik script for the master-slave replication setup. In the init_master.slonik file, we first define the cluster name as mycluster. We then define the nodes in the cluster. Each node will have a number associated with a connection string, which contains database connection information. The node entry is defined both for the source and target databases. The store_path commands are necessary, so that each node knows how to communicate with the other. In step 7, we set up a slonik script for the subscription of the slave, that is, the test2 target database. Once again, the script contains information such as the cluster name and the node entries that are designated a unique number related to connection string information. It also contains a subscriber set. In step 8, we run the init_master.slonik file on the master. Similarly, in step 9, we run the init_slave.slonik file on the slave. In step 10, we start the master slon daemon. In step 11, we start the slave slon daemon. The subsequent steps, 12 and 13, are used to test for replication. For this purpose, in step 12 of the preceding section, we first log in to the test1 source database and insert some records into the t_test table. To check whether the newly inserted records have been replicated in the target database, test2, we log in to the test2 database in step 13. The result set obtained from the output of the query confirms that the changed/inserted records on the t_test table in the test1 database are successfully replicated across the target database, test2. For more information on Slony-I replication, go to http://slony.info/documentation/tutorial.html. There's more... If you are using Slony-I for replication between two different servers, in addition to the steps mentioned in the How to do it… section, you will also have to enable authentication information in the pg_hba.conf file existing on both the source and target servers. For example, let's assume that the source server's IP is 192.168.16.44 and the target server's IP is 192.168.16.56 and we are using a user named super to replicate the data. 
If this is the situation, then in the source server's pg_hba.conf file, we will have to enter the information, as follows: host         postgres         super     192.168.16.44/32           md5 Similarly, in the target server's pg_hba.conf file, we will have to enter the authentication information, as follows: host         postgres         super     192.168.16.56/32           md5 Also, in the shell scripts that were used for Slony-I, wherever the connection information for the host is localhost that entry will need to be replaced by the source and target server's IP addresses. Replication using Londiste In this recipe, we are going to show you how to replicate data using Londiste. Getting ready For this setup, we are using the same host CentOS Linux machine to replicate data between two databases. This can also be set up using two separate Linux machines running on VMware, VirtualBox, or any other virtualization software. It is assumed that the latest version of PostgreSQL, version 9.3, is installed. We used CentOS Version 6 as the Linux operating system for this exercise. To set up Londiste replication on the Linux machine, perform the following steps: Go to http://pgfoundry.org/projects/skytools/ and download the latest version of Skytools 3.2, that is, tarball skytools-3.2.tar.gz. Extract the tarball file, as follows: tar -xvzf skytools-3.2.tar.gz Go to the new location and build and compile the software: cd skytools-3.2 ./configure --prefix=/var/lib/pgsql/9.3/Sky –with-pgconfig=/usr/pgsql-9.3/bin/pg_config   make   make install Also, set the PYTHONPATH environment variable, as shown here. Alternatively, you can also set it in the .bash_profile script: export PYTHONPATH=/opt/PostgreSQL/9.2/Sky/lib64/python2.6/site-packages/ How to do it... We are going to perform the following sequence of steps to set up replication between two different databases using Londiste. First, create the two databases between which replication has to occur: createdb node1 createdb node2 Populate the node1 database with data using the pgbench utility: pgbench -i -s 2 -F 80 node1 Add any primary key and foreign keys to the tables in the node1 database that are needed for replication. Create the following .sql file and add the following lines to it: Vi /tmp/prepare_pgbenchdb_for_londiste.sql -- add primary key to history table ALTER TABLE pgbench_history ADD COLUMN hid SERIAL PRIMARY KEY;   -- add foreign keys ALTER TABLE pgbench_tellers ADD CONSTRAINT pgbench_tellers_branches_fk FOREIGN KEY(bid) REFERENCES pgbench_branches; ALTER TABLE pgbench_accounts ADD CONSTRAINT pgbench_accounts_branches_fk FOREIGN KEY(bid) REFERENCES pgbench_branches; ALTER TABLE pgbench_history ADD CONSTRAINT pgbench_history_branches_fk FOREIGN KEY(bid) REFERENCES pgbench_branches; ALTER TABLE pgbench_history ADD CONSTRAINT pgbench_history_tellers_fk FOREIGN KEY(tid) REFERENCES pgbench_tellers; ALTER TABLE pgbench_history ADD CONSTRAINT pgbench_history_accounts_fk FOREIGN KEY(aid) REFERENCES pgbench_accounts; We will now load the .sql file created in the previous step and load it into the database: psql node1 -f /tmp/prepare_pgbenchdb_for_londiste.sql We will now populate the node2 database with table definitions from the tables in the node1 database: pg_dump -s -t 'pgbench*' node1 > /tmp/tables.sql psql -f /tmp/tables.sql node2 Now starts the process of replication. 
We will first create the londiste.ini configuration file with the following parameters in order to set up the root node for the source database, node1: Vi londiste.ini   [londiste3] job_name = first_table db = dbname=node1 queue_name = replication_queue logfile = /home/postgres/log/londiste.log pidfile = /home/postgres/pid/londiste.pid In the next step, we are going to use the londiste.ini configuration file created in the previous step to set up the root node for the node1 database, as shown here: [postgres@localhost bin]$ ./londiste3 londiste3.ini create-root node1 dbname=node1   2014-12-09 18:54:34,723 2335 WARNING No host= in public connect string, bad idea 2014-12-09 18:54:35,210 2335 INFO plpgsql is installed 2014-12-09 18:54:35,217 2335 INFO pgq is installed 2014-12-09 18:54:35,225 2335 INFO pgq.get_batch_cursor is installed 2014-12-09 18:54:35,227 2335 INFO pgq_ext is installed 2014-12-09 18:54:35,228 2335 INFO pgq_node is installed 2014-12-09 18:54:35,230 2335 INFO londiste is installed 2014-12-09 18:54:35,232 2335 INFO londiste.global_add_table is installed 2014-12-09 18:54:35,281 2335 INFO Initializing node 2014-12-09 18:54:35,285 2335 INFO Location registered 2014-12-09 18:54:35,447 2335 INFO Node "node1" initialized for queue "replication_queue" with type "root" 2014-12-09 18:54:35,465 2335 INFO Don We will now run the worker daemon for the root node: [postgres@localhost bin]$ ./londiste3 londiste3.ini worker 2014-12-09 18:55:17,008 2342 INFO Consumer uptodate = 1 In the next step, we will create a slave.ini configuration file in order to make a leaf node for the node2 target database: Vi slave.ini [londiste3] job_name = first_table_slave db = dbname=node2 queue_name = replication_queue logfile = /home/postgres/log/londiste_slave.log pidfile = /home/postgres/pid/londiste_slave.pid We will now initialize the node in the target database: ./londiste3 slave.ini create-leaf node2 dbname=node2 –provider=dbname=node1 2014-12-09 18:57:22,769 2408 WARNING No host= in public connect string, bad idea 2014-12-09 18:57:22,778 2408 INFO plpgsql is installed 2014-12-09 18:57:22,778 2408 INFO Installing pgq 2014-12-09 18:57:22,778 2408 INFO   Reading from /var/lib/pgsql/9.3/Sky/share/skytools3/pgq.sql 2014-12-09 18:57:23,211 2408 INFO pgq.get_batch_cursor is installed 2014-12-09 18:57:23,212 2408 INFO Installing pgq_ext 2014-12-09 18:57:23,213 2408 INFO   Reading from /var/lib/pgsql/9.3/Sky/share/skytools3/pgq_ext.sql 2014-12-09 18:57:23,454 2408 INFO Installing pgq_node 2014-12-09 18:57:23,455 2408 INFO   Reading from /var/lib/pgsql/9.3/Sky/share/skytools3/pgq_node.sql 2014-12-09 18:57:23,729 2408 INFO Installing londiste 2014-12-09 18:57:23,730 2408 INFO   Reading from /var/lib/pgsql/9.3/Sky/share/skytools3/londiste.sql 2014-12-09 18:57:24,391 2408 INFO londiste.global_add_table is installed 2014-12-09 18:57:24,575 2408 INFO Initializing node 2014-12-09 18:57:24,705 2408 INFO Location registered 2014-12-09 18:57:24,715 2408 INFO Location registered 2014-12-09 18:57:24,744 2408 INFO Subscriber registered: node2 2014-12-09 18:57:24,748 2408 INFO Location registered 2014-12-09 18:57:24,750 2408 INFO Location registered 2014-12-09 18:57:24,757 2408 INFO Node "node2" initialized for queue "replication_queue" with type "leaf" 2014-12-09 18:57:24,761 2408 INFO Done We will now launch the worker daemon for the target database, that is, node2: [postgres@localhost bin]$ ./londiste3 slave.ini worker 2014-12-09 18:58:53,411 2423 INFO Consumer uptodate = 1 We will now create the configuration file, that 
is pgqd.ini, for the ticker daemon: vi pgqd.ini   [pgqd] logfile = /home/postgres/log/pgqd.log pidfile = /home/postgres/pid/pgqd.pid Using the configuration file created in the previous step, we will launch the ticker daemon: [postgres@localhost bin]$ ./pgqd pgqd.ini 2014-12-09 19:05:56.843 2542 LOG Starting pgqd 3.2 2014-12-09 19:05:56.844 2542 LOG auto-detecting dbs ... 2014-12-09 19:05:57.257 2542 LOG node1: pgq version ok: 3.2 2014-12-09 19:05:58.130 2542 LOG node2: pgq version ok: 3.2 We will now add all the tables to the replication on the root node: [postgres@localhost bin]$ ./londiste3 londiste3.ini add-table --all 2014-12-09 19:07:26,064 2614 INFO Table added: public.pgbench_accounts 2014-12-09 19:07:26,161 2614 INFO Table added: public.pgbench_branches 2014-12-09 19:07:26,238 2614 INFO Table added: public.pgbench_history 2014-12-09 19:07:26,287 2614 INFO Table added: public.pgbench_tellers Similarly, add all the tables to the replication on the leaf node: [postgres@localhost bin]$ ./londiste3 slave.ini add-table –all We will now generate some traffic on the node1 source database: pgbench -T 10 -c 5 node1 We will now use the compare utility available with the londiste3 command to check the tables in both the nodes; that is, both the source database (node1) and destination database (node2) have the same amount of data: [postgres@localhost bin]$ ./londiste3 slave.ini compare   2014-12-09 19:26:16,421 2982 INFO Checking if node1 can be used for copy 2014-12-09 19:26:16,424 2982 INFO Node node1 seems good source, using it 2014-12-09 19:26:16,425 2982 INFO public.pgbench_accounts: Using node node1 as provider 2014-12-09 19:26:16,441 2982 INFO Provider: node1 (root) 2014-12-09 19:26:16,446 2982 INFO Locking public.pgbench_accounts 2014-12-09 19:26:16,447 2982 INFO Syncing public.pgbench_accounts 2014-12-09 19:26:18,975 2982 INFO Counting public.pgbench_accounts 2014-12-09 19:26:19,401 2982 INFO srcdb: 200000 rows, checksum=167607238449 2014-12-09 19:26:19,706 2982 INFO dstdb: 200000 rows, checksum=167607238449 2014-12-09 19:26:19,715 2982 INFO Checking if node1 can be used for copy 2014-12-09 19:26:19,716 2982 INFO Node node1 seems good source, using it 2014-12-09 19:26:19,716 2982 INFO public.pgbench_branches: Using node node1 as provider 2014-12-09 19:26:19,730 2982 INFO Provider: node1 (root) 2014-12-09 19:26:19,734 2982 INFO Locking public.pgbench_branches 2014-12-09 19:26:19,734 2982 INFO Syncing public.pgbench_branches 2014-12-09 19:26:22,772 2982 INFO Counting public.pgbench_branches 2014-12-09 19:26:22,804 2982 INFO srcdb: 2 rows, checksum=-3078609798 2014-12-09 19:26:22,812 2982 INFO dstdb: 2 rows, checksum=-3078609798 2014-12-09 19:26:22,866 2982 INFO Checking if node1 can be used for copy 2014-12-09 19:26:22,877 2982 INFO Node node1 seems good source, using it 2014-12-09 19:26:22,878 2982 INFO public.pgbench_history: Using node node1 as provider 2014-12-09 19:26:22,919 2982 INFO Provider: node1 (root) 2014-12-09 19:26:22,931 2982 INFO Locking public.pgbench_history 2014-12-09 19:26:22,932 2982 INFO Syncing public.pgbench_history 2014-12-09 19:26:25,963 2982 INFO Counting public.pgbench_history 2014-12-09 19:26:26,008 2982 INFO srcdb: 715 rows, checksum=9467587272 2014-12-09 19:26:26,020 2982 INFO dstdb: 715 rows, checksum=9467587272 2014-12-09 19:26:26,056 2982 INFO Checking if node1 can be used for copy 2014-12-09 19:26:26,063 2982 INFO Node node1 seems good source, using it 2014-12-09 19:26:26,064 2982 INFO public.pgbench_tellers: Using node node1 as provider 2014-12-09 
19:26:26,100 2982 INFO Provider: node1 (root) 2014-12-09 19:26:26,108 2982 INFO Locking public.pgbench_tellers 2014-12-09 19:26:26,109 2982 INFO Syncing public.pgbench_tellers 2014-12-09 19:26:29,144 2982 INFO Counting public.pgbench_tellers 2014-12-09 19:26:29,176 2982 INFO srcdb: 20 rows, checksum=4814381032 2014-12-09 19:26:29,182 2982 INFO dstdb: 20 rows, checksum=4814381032 How it works... The following is an explanation of the steps performed in the preceding section: Initially, in step 1, we create two databases, that is node1 and node2, that are used as the source and target databases, respectively, from a replication perspective. In step 2, we populate the node1 database using the pgbench utility. In step 3 of the preceding section, we add and define the respective primary key and foreign key relationships on different tables and put these DDL commands in a .sql file. In step 4, we execute these DDL commands stated in step 3 on the node1 database; thus, in this way, we force the primary key and foreign key definitions on the tables in the pgbench schema in the node1 database. In step 5, we extract the table definitions from the tables in the pgbench schema in the node1 database and load these definitions in the node2 database. We will now discuss steps 6 to 8 of the preceding section. In step 6, we create the configuration file, which is then used in step 7 to create the root node for the node1 source database. In step 8, we will launch the worker daemon for the root node. Regarding the entries mentioned in the configuration file in step 6, we first define a job that must have a name, so that distinguished processes can be easily identified. Then, we define a connect string with information to connect to the source database, that is node1, and then we define the name of the replication queue involved. Finally, we define the location of the log and pid files. We will now discuss steps 9 to 11 of the preceding section. In step 9, we define the configuration file, which is then used in step 10 to create the leaf node for the target database, that is node2. In step 11, we launch the worker daemon for the leaf node. The entries in the configuration file in step 9 contain the job_name connect string in order to connect to the target database, that is node2, the name of the replication queue involved, and the location of log and pid involved. The key part in step 11 is played by the slave, that is the target database—to find the master or provider, that is source database node1. We will now talk about steps 12 and 13 of the preceding section. In step 12, we define the ticker configuration, with the help of which we launch the ticker process mentioned in step 13. Once the ticker daemon has started successfully, we have all the components and processes setup and needed for replication; however, we have not yet defined what the system needs to replicate. In step 14 and 15, we define the tables to the replication that is set on both the source and target databases, that is node1 and node2, respectively. Finally, we will talk about steps 16 and 17 of the preceding section. Here, at this stage, we are testing the replication that was set up between the node1 source database and the node2 target database. In step 16, we generate some traffic on the node1 source database by running pgbench with five parallel database connections and generating traffic for 10 seconds. In step 17, we check whether the tables on both the source and target databases have the same data. 
For this purpose, we use the compare command on the provider and subscriber nodes and then count and checksum the rows on both sides. A partial output from the preceding section tells you that the data has been successfully replicated between all the tables that are part of the replication set up between the node1 source database and the node2 destination database, as the count and checksum of rows for all the tables on the source and target destination databases are matching: 2014-12-09 19:26:18,975 2982 INFO Counting public.pgbench_accounts 2014-12-09 19:26:19,401 2982 INFO srcdb: 200000 rows, checksum=167607238449 2014-12-09 19:26:19,706 2982 INFO dstdb: 200000 rows, checksum=167607238449   2014-12-09 19:26:22,772 2982 INFO Counting public.pgbench_branches 2014-12-09 19:26:22,804 2982 INFO srcdb: 2 rows, checksum=-3078609798 2014-12-09 19:26:22,812 2982 INFO dstdb: 2 rows, checksum=-3078609798   2014-12-09 19:26:25,963 2982 INFO Counting public.pgbench_history 2014-12-09 19:26:26,008 2982 INFO srcdb: 715 rows, checksum=9467587272 2014-12-09 19:26:26,020 2982 INFO dstdb: 715 rows, checksum=9467587272   2014-12-09 19:26:29,144 2982 INFO Counting public.pgbench_tellers 2014-12-09 19:26:29,176 2982 INFO srcdb: 20 rows, checksum=4814381032 2014-12-09 19:26:29,182 2982 INFO dstdb: 20 rows, checksum=4814381032 Summary This article demonstrates the high availability and replication concepts in PostgreSQL. After reading this chapter, you will be able to implement high availability and replication options using different techniques including streaming replication, Slony-I replication and replication using Longdiste. Resources for Article: Further resources on this subject: Running a PostgreSQL Database Server [article] Securing the WAL Stream [article] Recursive queries [article]


Three.js - Materials and Texture

Packt
06 Feb 2015
11 min read
In this article by Jos Dirksen author of the book Three.js Cookbook, we will learn how Three.js offers a large number of different materials and supports many different types of textures. These textures provide a great way to create interesting effects and graphics. In this article, we'll show you recipes that allow you to get the most out of these components provided by Three.js. (For more resources related to this topic, see here.) Using HTML canvas as a texture Most often when you use textures, you use static images. With Three.js, however, it is also possible to create interactive textures. In this recipe, we will show you how you can use an HTML5 canvas element as an input for your texture. Any change to this canvas is automatically reflected after you inform Three.js about this change in the texture used on the geometry. Getting ready For this recipe, we need an HTML5 canvas element that can be displayed as a texture. We can create one ourselves and add some output, but for this recipe, we've chosen something else. We will use a simple JavaScript library, which outputs a clock to a canvas element. The resulting mesh will look like this (see the 04.03-use-html-canvas-as-texture.html example): The JavaScript used to render the clock was based on the code from this site: http://saturnboy.com/2013/10/html5-canvas-clock/. To include the code that renders the clock in our page, we need to add the following to the head element: <script src="../libs/clock.js"></script> How to do it... To use a canvas as a texture, we need to perform a couple of steps: The first thing we need to do is create the canvas element: var canvas = document.createElement('canvas'); canvas.width=512; canvas.height=512; Here, we create an HTML canvas element programmatically and define a fixed width. Now that we've got a canvas, we need to render the clock that we use as the input for this recipe on it. The library is very easy to use; all you have to do is pass in the canvas element we just created: clock(canvas); At this point, we've got a canvas that renders and updates an image of a clock. What we need to do now is create a geometry and a material and use this canvas element as a texture for this material: var cubeGeometry = new THREE.BoxGeometry(10, 10, 10); var cubeMaterial = new THREE.MeshLambertMaterial(); cubeMaterial.map = new THREE.Texture(canvas); var cube = new THREE.Mesh(cubeGeometry, cubeMaterial); To create a texture from a canvas element, all we need to do is create a new instance of THREE.Texture and pass in the canvas element we created in step 1. We assign this texture to the cubeMaterial.map property, and that's it. If you run the recipe at this step, you might see the clock rendered on the sides of the cubes. However, the clock won't update itself. We need to tell Three.js that the canvas element has been changed. We do this by adding the following to the rendering loop: cubeMaterial.map.needsUpdate = true; This informs Three.js that our canvas texture has changed and needs to be updated the next time the scene is rendered. With these four simple steps, you can easily create interactive textures and use everything you can create on a canvas element as a texture in Three.js. How it works... How this works is actually pretty simple. Three.js uses WebGL to render scenes and apply textures. WebGL has native support for using HTML canvas element as textures, so Three.js just passes on the provided canvas element to WebGL and it is processed as any other texture. 
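If you do not want to depend on the external clock library, the same technique works with anything you draw on the canvas yourself using the standard 2D context. The following is only a minimal sketch, not part of the original recipe; it reuses the canvas and cubeMaterial variables from the steps above, and the helper name updateCanvasTexture is just illustrative:

// draw the current time onto the canvas and flag the texture as dirty
var ctx = canvas.getContext('2d');
function updateCanvasTexture() {
  ctx.fillStyle = '#000000';
  ctx.fillRect(0, 0, canvas.width, canvas.height);   // clear the previous frame
  ctx.fillStyle = '#ffffff';
  ctx.font = '48px sans-serif';
  ctx.fillText(new Date().toLocaleTimeString(), 20, canvas.height / 2);
  cubeMaterial.map.needsUpdate = true;               // tell Three.js the canvas changed
}

Call updateCanvasTexture() from your rendering loop, just before the call that renders the scene, and the text on the cube will refresh every frame.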
Making part of an object transparent You can create a lot of interesting visualizations using the various materials available with Three.js. In this recipe, we'll look at how you can use the materials available with Three.js to make part of an object transparent. This will allow you to create complex-looking geometries with relative ease. Getting ready Before we dive into the required steps in Three.js, we first need to have the texture that we will use to make an object partially transparent. For this recipe, we will use the following texture, which was created in Photoshop: You don't have to use Photoshop; the only thing you need to keep in mind is that you use an image with a transparent background. Using this texture, in this recipe, we'll show you how you can create the following (04.08-make-part-of-object-transparent.html): As you can see in the preceeding, only part of the sphere is visible, and you can look through the sphere to see the back at the other side of the sphere. How to do it... Let's look at the steps you need to take to accomplish this: The first thing we do is create the geometry. For this recipe, we use THREE.SphereGeometry: var sphereGeometry = new THREE.SphereGeometry(6, 20, 20); Just like all the other recipes, you can use whatever geometry you want. In the second step, we create the material: var mat = new THREE.MeshPhongMaterial(); mat.map = new THREE.ImageUtils.loadTexture( "../assets/textures/partial-transparency.png"); mat.transparent = true; mat.side = THREE.DoubleSide; mat.depthWrite = false; mat.color = new THREE.Color(0xff0000); As you can see in this fragment, we create THREE.MeshPhongMaterial and load the texture we saw in the Getting ready section of this recipe. To render this correctly, we also need to set the side property to THREE.DoubleSide so that the inside of the sphere is also rendered, and we need to set the depthWrite property to false. This will tell WebGL that we still want to test our vertices against the WebGL depth buffer, but we don't write to it. Often, you need to set this to false when working with more complex transparent objects or particles. Finally, add the sphere to the scene: var sphere = new THREE.Mesh(sphereGeometry, mat); scene.add(sphere); With these simple steps, you can create really interesting effects by just experimenting with textures and geometries. There's more With Three.js, it is possible to repeat textures (refer to the Setup repeating textures recipe). You can use this to create interesting-looking objects such as this: The code required to set a texture to repeat is the following: var mat = new THREE.MeshPhongMaterial(); mat.map = new THREE.ImageUtils.loadTexture( "../assets/textures/partial-transparency.png"); mat.transparent = true; mat.map.wrapS = mat.map.wrapT = THREE.RepeatWrapping; mat.map.repeat.set( 4, 4 ); mat.depthWrite = false; mat.color = new THREE.Color(0x00ff00); By changing the mat.map.repeat.set values, you define how often the texture is repeated. Using a cubemap to create reflective materials With the approach Three.js uses to render scenes in real time, it is difficult and very computationally intensive to create reflective materials. Three.js, however, provides a way you can cheat and approximate reflectivity. For this, Three.js uses cubemaps. In this recipe, we'll explain how to create cubemaps and use them to create reflective materials. Getting ready A cubemap is a set of six images that can be mapped to the inside of a cube. 
They can be created from a panorama picture and look something like this: In Three.js, we map such a map on the inside of a cube or sphere and use that information to calculate reflections. The following screenshot (example 04.10-use-reflections.html) shows what this looks like when rendered in Three.js: As you can see in the preceeding screenshot, the objects in the center of the scene reflect the environment they are in. This is something often called a skybox. To get ready, the first thing we need to do is get a cubemap. If you search on the Internet, you can find some ready-to-use cubemaps, but it is also very easy to create one yourself. For this, go to http://gonchar.me/panorama/. On this page, you can upload a panoramic picture and it will be converted to a set of pictures you can use as a cubemap. For this, perform the following steps: First, get a 360 degrees panoramic picture. Once you have one, upload it to the http://gonchar.me/panorama/ website by clicking on the large OPEN button:  Once uploaded, the tool will convert the panorama picture to a cubemap as shown in the following screenshot:  When the conversion is done, you can download the various cube map sites. The recipe in this book uses the naming convention provided by Cube map sides option, so download them. You'll end up with six images with names such as right.png, left.png, top.png, bottom.png, front.png, and back.png. Once you've got the sides of the cubemap, you're ready to perform the steps in the recipe. How to do it... To use the cubemap we created in the previous section and create reflecting material,we need to perform a fair number of steps, but it isn't that complex: The first thing you need to do is create an array from the cubemap images you downloaded: var urls = [ '../assets/cubemap/flowers/right.png', '../assets/cubemap/flowers/left.png', '../assets/cubemap/flowers/top.png', '../assets/cubemap/flowers/bottom.png', '../assets/cubemap/flowers/front.png', '../assets/cubemap/flowers/back.png' ]; With this array, we can create a cubemap texture like this: var cubemap = THREE.ImageUtils.loadTextureCube(urls); cubemap.format = THREE.RGBFormat; From this cubemap, we can use THREE.BoxGeometry and a custom THREE.ShaderMaterial object to create a skybox (the environment surrounding our meshes): var shader = THREE.ShaderLib[ "cube" ]; shader.uniforms[ "tCube" ].value = cubemap; var material = new THREE.ShaderMaterial( { fragmentShader: shader.fragmentShader, vertexShader: shader.vertexShader, uniforms: shader.uniforms, depthWrite: false, side: THREE.DoubleSide }); // create the skybox var skybox = new THREE.Mesh( new THREE.BoxGeometry( 10000, 10000, 10000 ), material ); scene.add(skybox); Three.js provides a custom shader (a piece of WebGL code) that we can use for this. As you can see in the code snippet, to use this WebGL code, we need to define a THREE.ShaderMaterial object. With this material, we create a giant THREE.BoxGeometry object that we add to scene. Now that we've created the skybox, we can define the reflecting objects: var sphereGeometry = new THREE.SphereGeometry(4,15,15); var envMaterial = new THREE.MeshBasicMaterial( {envMap:cubemap}); var sphere = new THREE.Mesh(sphereGeometry, envMaterial); As you can see, we also pass in the cubemap we created as a property (envmap) to the material. This informs Three.js that this object is positioned inside a skybox, defined by the images that make up cubemap. 
The last step is to add the object to the scene, and that's it: scene.add(sphere); In the example in the beginning of this recipe, you saw three geometries. You can use this approach with all different types of geometries. Three.js will determine how to render the reflective area. How it works... Three.js itself doesn't really do that much to render the cubemap object. It relies on a standard functionality provided by WebGL. In WebGL, there is a construct called samplerCube. With samplerCube, you can sample, based on a specific direction, which color matches the cubemap object. Three.js uses this to determine the color value for each part of the geometry. The result is that on each mesh, you can see a reflection of the surrounding cubemap using the WebGL textureCube function. In Three.js, this results in the following call (taken from the WebGL shader in GLSL): vec4 cubeColor = textureCube( tCube, vec3( -vReflect.x, vReflect.yz ) ); A more in-depth explanation on how this works can be found at http://codeflow.org/entries/2011/apr/18/advanced-webgl-part-3-irradiance-environment-map/#cubemap-lookup. There's more... In this recipe, we created the cubemap object by providing six separate images. There is, however, an alternative way to create the cubemap object. If you've got a 360 degrees panoramic image, you can use the following code to directly create a cubemap object from that image: var texture = THREE.ImageUtils.loadTexture( 360-degrees.png', new THREE.UVMapping()); Normally when you create a cubemap object, you use the code shown in this recipe to map it to a skybox. This usually gives the best results but requires some extra code. You can also use THREE.SphereGeometry to create a skybox like this: var mesh = new THREE.Mesh( new THREE.SphereGeometry( 500, 60, 40 ), new THREE.MeshBasicMaterial( { map: texture })); mesh.scale.x = -1; This applies the texture to a sphere and with mesh.scale, turns this sphere inside out. Besides reflection, you can also use a cubemap object for refraction (think about light bending through water drops or glass objects): All you have to do to make a refractive material is load the cubemap object like this: var cubemap = THREE.ImageUtils.loadTextureCube(urls, new THREE.CubeRefractionMapping()); And define the material in the following way: var envMaterial = new THREE.MeshBasicMaterial({envMap:cubemap}); envMaterial.refractionRatio = 0.95; Summary In this article, we learned about the different textures and materials supported by Three.js Resources for Article:  Further resources on this subject: Creating the maze and animating the cube [article] Working with the Basic Components That Make Up a Three.js Scene [article] Mesh animation [article]


Extending ElasticSearch with Scripting

Packt
06 Feb 2015
21 min read
In article by Alberto Paro, the author of ElasticSearch Cookbook Second Edition, we will cover about the following recipes: (For more resources related to this topic, see here.) Installing additional script plugins Managing scripts Sorting data using scripts Computing return fields with scripting Filtering a search via scripting Introduction ElasticSearch has a powerful way of extending its capabilities with custom scripts, which can be written in several programming languages. The most common ones are Groovy, MVEL, JavaScript, and Python. In this article, we will see how it's possible to create custom scoring algorithms, special processed return fields, custom sorting, and complex update operations on records. The scripting concept of ElasticSearch can be seen as an advanced stored procedures system in the NoSQL world; so, for an advanced usage of ElasticSearch, it is very important to master it. Installing additional script plugins ElasticSearch provides native scripting (a Java code compiled in JAR) and Groovy, but a lot of interesting languages are also available, such as JavaScript and Python. In older ElasticSearch releases, prior to version 1.4, the official scripting language was MVEL, but due to the fact that it was not well-maintained by MVEL developers, in addition to the impossibility to sandbox it and prevent security issues, MVEL was replaced with Groovy. Groovy scripting is now provided by default in ElasticSearch. The other scripting languages can be installed as plugins. Getting ready You will need a working ElasticSearch cluster. How to do it... In order to install JavaScript language support for ElasticSearch (1.3.x), perform the following steps: From the command line, simply enter the following command: bin/plugin --install elasticsearch/elasticsearch-lang-javascript/2.3.0 This will print the following result: -> Installing elasticsearch/elasticsearch-lang-javascript/2.3.0... Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-lang-javascript/ elasticsearch-lang-javascript-2.3.0.zip... Downloading ....DONE Installed lang-javascript If the installation is successful, the output will end with Installed; otherwise, an error is returned. To install Python language support for ElasticSearch, just enter the following command: bin/plugin -install elasticsearch/elasticsearch-lang-python/2.3.0 The version number depends on the ElasticSearch version. Take a look at the plugin's web page to choose the correct version. How it works... Language plugins allow you to extend the number of supported languages to be used in scripting. During the ElasticSearch startup, an internal ElasticSearch service called PluginService loads all the installed language plugins. In order to install or upgrade a plugin, you need to restart the node. The ElasticSearch community provides common scripting languages (a list of the supported scripting languages is available on the ElasticSearch site plugin page at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html), and others are available in GitHub repositories (a simple search on GitHub allows you to find them). The following are the most commonly used languages for scripting: Groovy (http://groovy.codehaus.org/): This language is embedded in ElasticSearch by default. It is a simple language that provides scripting functionalities. This is one of the fastest available language extensions. 
Groovy is a dynamic, object-oriented programming language with features similar to those of Python, Ruby, Perl, and Smalltalk. It also provides support to write a functional code. JavaScript (https://github.com/elasticsearch/elasticsearch-lang-javascript): This is available as an external plugin. The JavaScript implementation is based on Java Rhino (https://developer.mozilla.org/en-US/docs/Rhino) and is really fast. Python (https://github.com/elasticsearch/elasticsearch-lang-python): This is available as an external plugin, based on Jython (http://jython.org). It allows Python to be used as a script engine. Considering several benchmark results, it's slower than other languages. There's more... Groovy is preferred if the script is not too complex; otherwise, a native plugin provides a better environment to implement complex logic and data management. The performance of every language is different; the fastest one is the native Java. In the case of dynamic scripting languages, Groovy is faster, as compared to JavaScript and Python. In order to access document properties in Groovy scripts, the same approach will work as in other scripting languages: doc.score: This stores the document's score. doc['field_name'].value: This extracts the value of the field_name field from the document. If the value is an array or if you want to extract the value as an array, you can use doc['field_name'].values. doc['field_name'].empty: This returns true if the field_name field has no value in the document. doc['field_name'].multivalue: This returns true if the field_name field contains multiple values. If the field contains a geopoint value, additional methods are available, as follows: doc['field_name'].lat: This returns the latitude of a geopoint. If you need the value as an array, you can use the doc['field_name'].lats method. doc['field_name'].lon: This returns the longitude of a geopoint. If you need the value as an array, you can use the doc['field_name'].lons method. doc['field_name'].distance(lat,lon): This returns the plane distance, in miles, from a latitude/longitude point. If you need to calculate the distance in kilometers, you should use the doc['field_name'].distanceInKm(lat,lon) method. doc['field_name'].arcDistance(lat,lon): This returns the arc distance, in miles, from a latitude/longitude point. If you need to calculate the distance in kilometers, you should use the doc['field_name'].arcDistanceInKm(lat,lon) method. doc['field_name'].geohashDistance(geohash): This returns the distance, in miles, from a geohash value. If you need to calculate the same distance in kilometers, you should use doc['field_name'] and the geohashDistanceInKm(lat,lon) method. By using these helper methods, it is possible to create advanced scripts in order to boost a document by a distance that can be very handy in developing geolocalized centered applications. Managing scripts Depending on your scripting usage, there are several ways to customize ElasticSearch to use your script extensions. In this recipe, we will see how to provide scripts to ElasticSearch via files, indexes, or inline. Getting ready You will need a working ElasticSearch cluster populated with the populate script (chapter_06/populate_aggregations.sh), available at https://github.com/aparo/ elasticsearch-cookbook-second-edition. How to do it... 
To manage scripting, perform the following steps: Dynamic scripting is disabled by default for security reasons; we need to activate it in order to use dynamic scripting languages such as JavaScript or Python. To do this, we need to turn off the disable flag (script.disable_dynamic: false) in the ElasticSearch configuration file (config/elasticseach.yml) and restart the cluster. To increase security, ElasticSearch does not allow you to specify scripts for non-sandbox languages. Scripts can be placed in the scripts directory inside the configuration directory. To provide a script in a file, we'll put a my_script.groovy script in the config/scripts location with the following code content: doc["price"].value * factor If the dynamic script is enabled (as done in the first step), ElasticSearch allows you to store the scripts in a special index, .scripts. To put my_script in the index, execute the following command in the command terminal: curl -XPOST localhost:9200/_scripts/groovy/my_script -d '{ "script":"doc["price"].value * factor" }' The script can be used by simply referencing it in the script_id field; use the following command: curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{ "query": {    "match_all": {} }, "sort": {    "_script" : {      "script_id" : "my_script",      "lang" : "groovy",      "type" : "number",      "ignore_unmapped" : true,      "params" : {        "factor" : 1.1      },      "order" : "asc"    } } }' How it works... ElasticSearch allows you to load your script in different ways; each one of these methods has their pros and cons. The most secure way to load or import scripts is to provide them as files in the config/scripts directory. This directory is continuously scanned for new files (by default, every 60 seconds). The scripting language is automatically detected by the file extension, and the script name depends on the filename. If the file is put in subdirectories, the directory path becomes part of the filename; for example, if it is config/scripts/mysub1/mysub2/my_script.groovy, the script name will be mysub1_mysub2_my_script. If the script is provided via a filesystem, it can be referenced in the code via the "script": "script_name" parameter. Scripts can also be available in the special .script index. These are the REST end points: To retrieve a script, use the following code: GET http://<server>/_scripts/<language>/<id"> To store a script use the following code: PUT http://<server>/_scripts/<language>/<id> To delete a script use the following code: DELETE http://<server>/_scripts/<language>/<id> The indexed script can be referenced in the code via the "script_id": "id_of_the_script" parameter. The recipes that follow will use inline scripting because it's easier to use it during the development and testing phases. Generally, a good practice is to develop using the inline dynamic scripting in a request, because it's faster to prototype. Once the script is ready and no changes are needed, it can be stored in the index since it is simpler to call and manage. In production, a best practice is to disable dynamic scripting and store the script on the disk (generally, dumping the indexed script to disk). See also The scripting page on the ElasticSearch website at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html Sorting data using script ElasticSearch provides scripting support for the sorting functionality. 
In real world applications, there is often a need to modify the default sort by the match score using an algorithm that depends on the context and some external variables. Some common scenarios are given as follows: Sorting places near a point Sorting by most-read articles Sorting items by custom user logic Sorting items by revenue Getting ready You will need a working ElasticSearch cluster and an index populated with the script, which is available at https://github.com/aparo/ elasticsearch-cookbook-second-edition. How to do it... In order to sort using scripting, perform the following steps: If you want to order your documents by the price field multiplied by a factor parameter (that is, sales tax), the search will be as shown in the following code: curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{ "query": {    "match_all": {} }, "sort": {    "_script" : {      "script" : "doc["price"].value * factor",      "lang" : "groovy",      "type" : "number",      "ignore_unmapped" : true,    "params" : {        "factor" : 1.1      },            "order" : "asc"        }    } }' In this case, we have used a match_all query and a sort script. If everything is correct, the result returned by ElasticSearch should be as shown in the following code: { "took" : 7, "timed_out" : false, "_shards" : {    "total" : 5,    "successful" : 5,    "failed" : 0 }, "hits" : {    "total" : 1000,    "max_score" : null,    "hits" : [ {      "_index" : "test-index",      "_type" : "test-type",      "_id" : "161",      "_score" : null, "_source" : … truncated …,      "sort" : [ 0.0278578661440021 ]    }, {      "_index" : "test-index",      "_type" : "test-type",      "_id" : "634",      "_score" : null, "_source" : … truncated …,     "sort" : [ 0.08131364254827411 ]    }, {      "_index" : "test-index",      "_type" : "test-type",      "_id" : "465",      "_score" : null, "_source" : … truncated …,      "sort" : [ 0.1094966959069832 ]    } ] } } How it works... The sort scripting allows you to define several parameters, as follows: order (default "asc") ("asc" or "desc"): This determines whether the order must be ascending or descending. script: This contains the code to be executed. type: This defines the type to convert the value. params (optional, a JSON object): This defines the parameters that need to be passed. lang (by default, groovy): This defines the scripting language to be used. ignore_unmapped (optional): This ignores unmapped fields in a sort. This flag allows you to avoid errors due to missing fields in shards. Extending the sort with scripting allows the use of a broader approach to score your hits. ElasticSearch scripting permits the use of every code that you want. You can create custom complex algorithms to score your documents. There's more... 
Groovy provides a lot of built-in functions (mainly taken from Java's Math class) that can be used in scripts, as shown in the following table: Function Description time() The current time in milliseconds sin(a) Returns the trigonometric sine of an angle cos(a) Returns the trigonometric cosine of an angle tan(a) Returns the trigonometric tangent of an angle asin(a) Returns the arc sine of a value acos(a) Returns the arc cosine of a value atan(a) Returns the arc tangent of a value toRadians(angdeg) Converts an angle measured in degrees to an approximately equivalent angle measured in radians toDegrees(angrad) Converts an angle measured in radians to an approximately equivalent angle measured in degrees exp(a) Returns Euler's number raised to the power of a value log(a) Returns the natural logarithm (base e) of a value log10(a) Returns the base 10 logarithm of a value sqrt(a) Returns the correctly rounded positive square root of a value cbrt(a) Returns the cube root of a double value IEEEremainder(f1, f2) Computes the remainder operation on two arguments, as prescribed by the IEEE 754 standard ceil(a) Returns the smallest (closest to negative infinity) value that is greater than or equal to the argument and is equal to a mathematical integer floor(a) Returns the largest (closest to positive infinity) value that is less than or equal to the argument and is equal to a mathematical integer rint(a) Returns the value that is closest in value to the argument and is equal to a mathematical integer atan2(y, x) Returns the angle theta from the conversion of rectangular coordinates (x,y_) to polar coordinates (r,_theta) pow(a, b) Returns the value of the first argument raised to the power of the second argument round(a) Returns the closest integer to the argument random() Returns a random double value abs(a) Returns the absolute value of a value max(a, b) Returns the greater of the two values min(a, b) Returns the smaller of the two values ulp(d) Returns the size of the unit in the last place of the argument signum(d) Returns the signum function of the argument sinh(x) Returns the hyperbolic sine of a value cosh(x) Returns the hyperbolic cosine of a value tanh(x) Returns the hyperbolic tangent of a value hypot(x,y) Returns sqrt(x^2+y^2) without an intermediate overflow or underflow acos(a) Returns the arc cosine of a value atan(a) Returns the arc tangent of a value If you want to retrieve records in a random order, you can use a script with a random method, as shown in the following code: curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{ "query": {    "match_all": {} }, "sort": {    "_script" : {      "script" : "Math.random()",      "lang" : "groovy",      "type" : "number",      "params" : {}    } } }' In this example, for every hit, the new sort value is computed by executing the Math.random() scripting function. See also The official ElasticSearch documentation at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html Computing return fields with scripting ElasticSearch allows you to define complex expressions that can be used to return a new calculated field value. These special fields are called script_fields, and they can be expressed with a script in every available ElasticSearch scripting language. 
Getting ready You will need a working ElasticSearch cluster and an index populated with the chapter_06/populate_aggregations.sh script, which is available at https://github.com/aparo/elasticsearch-cookbook-second-edition. How to do it... In order to compute return fields with scripting, perform the following steps: Return the following script fields: "my_calc_field", which concatenates the text of the "name" and "description" fields, and "my_calc_field2", which multiplies the "price" value by the "discount" parameter. From the command line, execute the following code:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
 "query": {
    "match_all": {}
 },
 "script_fields" : {
    "my_calc_field" : {
      "script" : "doc[\"name\"].value + \" -- \" + doc[\"description\"].value"
    },
    "my_calc_field2" : {
      "script" : "doc[\"price\"].value * discount",
      "params" : {
        "discount" : 0.8
      }
    }
 }
}'
If everything works correctly, this is how the result returned by ElasticSearch should look:
{
 "took" : 4,
 "timed_out" : false,
 "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
 },
 "hits" : {
    "total" : 1000,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "4",
      "_score" : 1.0,
      "fields" : {
        "my_calc_field" : "entropic -- accusantium",
        "my_calc_field2" : 5.480038242170081
      }
    }, {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "9",
      "_score" : 1.0,
      "fields" : {
        "my_calc_field" : "frankie -- accusantium",
        "my_calc_field2" : 34.79852410178313
      }
    }, {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "11",
      "_score" : 1.0,
      "fields" : {
        "my_calc_field" : "johansson -- accusamus",
        "my_calc_field2" : 11.824173084636591
      }
    } ]
 }
}
How it works... Script fields are similar to executing an SQL function on a field during a select operation. In ElasticSearch, after the search phase is executed and the hits to be returned are calculated, if any fields (standard or script) are defined, they are calculated and returned. A script field, which can be defined in any of the supported languages, is processed by passing it the values from the document's source and, if other parameters are defined in the script (the discount factor in this example), they are passed to the script function as well. The script function is a code snippet; it can contain anything that the language allows you to write, but it must evaluate to a value (or a list of values). See also The Installing additional script plugins recipe in this article to install additional languages for scripting The Sorting using script recipe to have a reference of the extra built-in functions available in Groovy scripts Filtering a search via scripting ElasticSearch scripting allows you to extend the traditional filter with custom scripts. Using scripting to create a custom filter is a convenient way to write scripting rules that are not provided by Lucene or ElasticSearch, and to implement business logic that is not available in the query DSL. Getting ready You will need a working ElasticSearch cluster and an index populated with the chapter_06/populate_aggregations.sh script, which is available at https://github.com/aparo/elasticsearch-cookbook-second-edition. How to do it... 
In order to filter a search using a script, perform the following steps: Write a search with a script filter that keeps only the documents whose age value is greater than the parameter value:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
 "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "doc[\"age\"].value > param1",
          "params" : {
            "param1" : 80
          }
        }
      },
      "query": {
        "match_all": {}
      }
    }
 }
}'
In this example, all the documents in which the value of age is greater than param1 qualify to be returned. If everything works correctly, the result returned by ElasticSearch should be as shown here:
{
 "took" : 30,
 "timed_out" : false,
 "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
 },
 "hits" : {
    "total" : 237,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "9",
      "_score" : 1.0, "_source" : { … "age": 83, … }
    }, {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "23",
      "_score" : 1.0, "_source" : { … "age": 87, … }
    }, {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "47",
      "_score" : 1.0, "_source" : { … "age": 98, … }
    } ]
 }
}
How it works... The script filter is a language script that returns a Boolean value (true/false). For every hit, the script is evaluated, and if it returns true, the hit passes the filter. This type of scripting can only be used as a Lucene filter, not as a query, because it doesn't affect the search (the exceptions are constant_score and custom_filters_score). These are the script filter's fields: script: This contains the code to be executed params: These are optional parameters to be passed to the script lang (defaults to groovy): This defines the language of the script The script code can be any code in your preferred and supported scripting language that returns a Boolean value. There's more... Other languages are used in the same way as Groovy. For the current example, I have chosen a standard comparison that works in several languages. To execute the same script using the JavaScript language, use the following code:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
 "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "doc[\"age\"].value > param1",
          "lang":"javascript",
          "params" : {
            "param1" : 80
          }
        }
      },
      "query": {
        "match_all": {}
      }
    }
 }
}'
For Python, use the following code:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
 "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "doc[\"age\"].value > param1",
          "lang":"python",
          "params" : {
            "param1" : 80
          }
        }
      },
      "query": {
        "match_all": {}
      }
    }
 }
}'
See also The Installing additional script plugins recipe in this article to install additional languages for scripting The Sorting using script recipe in this article to get a reference of the extra built-in functions available in Groovy scripts Summary In this article, you have learned how to use scripting to extend ElasticSearch's functional capabilities using different programming languages. 
Working with Incanter Datasets

Packt
04 Feb 2015
28 min read
In this article by Eric Rochester author of the book, Clojure Data Analysis Cookbook, Second Edition, we will cover the following recipes: Loading Incanter's sample datasets Loading Clojure data structures into datasets Viewing datasets interactively with view Converting datasets to matrices Using infix formulas in Incanter Selecting columns with $ Selecting rows with $ Filtering datasets with $where Grouping data with $group-by Saving datasets to CSV and JSON Projecting from multiple datasets with $join (For more resources related to this topic, see here.) Introduction Incanter combines the power to do statistics using a fully-featured statistical language such as R (http://www.r-project.org/) with the ease and joy of Clojure. Incanter's core data structure is the dataset, so we'll spend some time in this article to look at how to use them effectively. While learning basic tools in this manner is often not the most exciting way to spend your time, it can still be incredibly useful. At its most fundamental level, an Incanter dataset is a table of rows. Each row has the same set of columns, much like a spreadsheet. The data in each cell of an Incanter dataset can be a string or a numeric. However, some operations require the data to only be numeric. First you'll learn how to populate and view datasets, then you'll learn different ways to query and project the parts of the dataset that you're interested in onto a new dataset. Finally, we'll take a look at how to save datasets and merge multiple datasets together. Loading Incanter's sample datasets Incanter comes with a set of default datasets that are useful for exploring Incanter's functions. I haven't made use of them in this book, since there is so much data available in other places, but they're a great way to get a feel of what you can do with Incanter. Some of these datasets—for instance, the Iris dataset—are widely used to teach and test statistical algorithms. It contains the species and petal and sepal dimensions for 50 irises. This is the dataset that we'll access today. In this recipe, we'll load a dataset and see what it contains. Getting ready We'll need to include Incanter in our Leiningen project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]]) We'll also need to include the right Incanter namespaces into our script or REPL: (use '(incanter core datasets)) How to do it… Once the namespaces are available, we can access the datasets easily: user=> (def iris (get-dataset :iris))#'user/iris user=> (col-names iris)[:Sepal.Length :Sepal.Width :Petal.Length :Petal.Width :Species]user=> (nrow iris)150 user=> (set ($ :Species iris))#{"versicolor" "virginica" "setosa"} How it works… We use the get-dataset function to access the built-in datasets. In this case, we're loading the Fisher's Iris dataset, sometimes called Anderson's dataset. This is a multivariate dataset for discriminant analysis. It gives petal and sepal measurements for 150 different Irises of three different species. Incanter's sample datasets cover a wide variety of topics—from U.S. arrests to plant growth and ultrasonic calibration. They can be used to test different algorithms and analyses and to work with different types of data. By the way, the names of functions should be familiar to you if you've previously used R. Incanter often uses the names of R's functions instead of using the Clojure names for the same functions. For example, the preceding code sample used nrow instead of count. 
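The same pattern works for any of the other bundled datasets. As a quick sketch (assuming here the :cars dataset, which Incanter bundles alongside :iris; the exact column names and row count can differ between Incanter versions), a REPL session looks like this:

user=> (def cars (get-dataset :cars))
#'user/cars
user=> (col-names cars)
[:speed :dist]
user=> (nrow cars)
50

Swapping in a different dataset keyword is all it takes to explore another of the samples.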
There's more... Incanter's API documentation for get-dataset (http://liebke.github.com/incanter/datasets-api.html#incanter.datasets/get-dataset) lists more sample datasets, and you can refer to it for the latest information about the data that Incanter bundles. Loading Clojure data structures into datasets While they are good for learning, Incanter's built-in datasets probably won't be that useful for your work (unless you work with irises). Other recipes cover ways to get data from CSV files and other sources into Incanter. Incanter also accepts native Clojure data structures in a number of formats. We'll take look at a couple of these in this recipe. Getting ready We'll just need Incanter listed in our project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]]) We'll also need to include this in our script or REPL: (use 'incanter.core) How to do it… The primary function used to convert data into a dataset is to-dataset. While it can convert single, scalar values into a dataset, we'll start with slightly more complicated inputs. Generally, you'll be working with at least a matrix. If you pass this to to-dataset, what do you get? user=> (def matrix-set (to-dataset [[1 2 3] [4 5 6]]))#'user/matrix-set user=> (nrow matrix-set)2user=> (col-names matrix-set)[:col-0 :col-1 :col-2] All the data's here, but it can be labeled in a better way. Does to-dataset handle maps? user=> (def map-set (to-dataset {:a 1, :b 2, :c 3}))#'user/map-set user=> (nrow map-set)1 user=> (col-names map-set)[:a :c :b] So, map keys become the column labels. That's much more intuitive. Let's throw a sequence of maps at it: user=> (def maps-set (to-dataset [{:a 1, :b 2, :c 3},                                 {:a 4, :b 5, :c 6}]))#'user/maps-setuser=> (nrow maps-set)2user=> (col-names maps-set)[:a :c :b] This is much more useful. We can also create a dataset by passing the column vector and the row matrix separately to dataset: user=> (def matrix-set-2         (dataset [:a :b :c]                         [[1 2 3] [4 5 6]]))#'user/matrix-set-2 user=> (nrow matrix-set-2)2 user=> (col-names matrix-set-2)[:c :b :a] How it works… The to-dataset function looks at the input and tries to process it intelligently. If given a sequence of maps, the column names are taken from the keys of the first map in the sequence. Ultimately, it uses the dataset constructor to create the dataset. When you want the most control, you should also use the dataset. It requires the dataset to be passed in as a column vector and a row matrix. When the data is in this format or when we need the most control—to rename the columns, for instance—we can use dataset. Viewing datasets interactively with view Being able to interact with our data programmatically is important, but sometimes it's also helpful to be able to look at it. This can be especially useful when you do data exploration. Getting ready We'll need to have Incanter in our project.clj file and script or REPL, so we'll use the same setup as we did for the Loading Incanter's sample datasets recipe, as follows. We'll also use the Iris dataset from that recipe. (use '(incanter core datasets)) How to do it… Incanter makes this very easy. 
Let's take a look at just how simple it is: First, we need to load the dataset, as follows: user=> (def iris (get-dataset :iris)) #'user/iris Then we just call view on the dataset: user=> (view iris) This function returns the Swing window frame, which contains our data, as shown in the following screenshot. This window should also be open on your desktop, although for me, it's usually hiding behind another window: How it works… Incanter's view function takes any object and tries to display it graphically. In this case, it simply displays the raw data as a table. Converting datasets to matrices Although datasets are often convenient, many times we'll want to treat our data as a matrix from linear algebra. In Incanter, matrices store a table of doubles. This provides good performance in a compact data structure. Moreover, we'll need matrices many times because some of Incanter's functions, such as trans, only operate on a matrix. Plus, it implements Clojure's ISeq interface, so interacting with matrices is also convenient. Getting ready For this recipe, we'll need the Incanter libraries, so we'll use this project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]]) We'll use the core and io namespaces, so we'll load these into our script or REPL: (use '(incanter core io)) This line binds the file name to the identifier data-file: (def data-file "data/all_160_in_51.P35.csv") How to do it… For this recipe, we'll create a dataset, convert it to a matrix, and then perform some operations on it: First, we need to read the data into a dataset, as follows: (def va-data (read-dataset data-file :header true)) Then, in order to convert it to a matrix, we just pass it to the to-matrix function. Before we do this, we'll pull out a few of the columns since matrixes can only contain floating-point numbers: (def va-matrix    (to-matrix ($ [:POP100 :HU100 :P035001] va-data))) Now that it's a matrix, we can treat it like a sequence of rows. Here, we pass it to first in order to get the first row, take in order to get a subset of the matrix, and count in order to get the number of rows in the matrix: user=> (first va-matrix) A 1x3 matrix ------------- 8.19e+03 4.27e+03 2.06e+03   user=> (count va-matrix) 591 We can also use Incanter's matrix operators to get the sum of each column, for instance. The plus function takes each row and sums each column separately: user=> (reduce plus va-matrix) A 1x3 matrix ------------- 5.43e+06 2.26e+06 1.33e+06 How it works… The to-matrix function takes a dataset of floating-point values and returns a compact matrix. Matrices are used by many of Incanter's more sophisticated analysis functions, as they're easy to work with. There's more… In this recipe, we saw the plus matrix operator. Incanter defines a full suite of these. You can learn more about matrices and see what operators are available at https://github.com/liebke/incanter/wiki/matrices. Using infix formulas in Incanter There's a lot to like about lisp: macros, the simple syntax, and the rapid development cycle. Most of the time, it is fine if you treat math operators as functions and use prefix notations, which is a consistent, function-first syntax. This allows you to treat math operators in the same way as everything else so that you can pass them to reduce, or anything else you want to do. However, we're not taught to read math expressions using prefix notations (with the operator first). 
And especially when formulas get even a little complicated, tracing out exactly what's happening can get hairy. Getting ready For this recipe we'll just need Incanter in our project.clj file, so we'll use the dependencies statement—as well as the use statement—from the Loading Clojure data structures into datasets recipe. For data, we'll use the matrix that we created in the Converting datasets to matrices recipe. How to do it… Incanter has a macro that converts a standard math notation to a lisp notation. We'll explore that in this recipe: The $= macro changes its contents to use an infix notation, which is what we're used to from math class: user=> ($= 7 * 4)28user=> ($= 7 * 4 + 3)31 We can also work on whole matrixes or just parts of matrixes. In this example, we perform a scalar multiplication of the matrix: user=> ($= va-matrix * 4)A 591x3 matrix---------------3.28e+04 1.71e+04 8.22e+03 2.08e+03 9.16e+02 4.68e+02 1.19e+03 6.52e+02 3.08e+02...1.41e+03 7.32e+02 3.72e+02 1.31e+04 6.64e+03 3.49e+03 3.02e+04 9.60e+03 6.90e+03 user=> ($= (first va-matrix) * 4)A 1x3 matrix-------------3.28e+04 1.71e+04 8.22e+03 Using this, we can build complex expressions, such as this expression that takes the mean of the values in the first row of the matrix: user=> ($= (sum (first va-matrix)) /           (count (first va-matrix)))4839.333333333333 Or we can build expressions take the mean of each column, as follows: user=> ($= (reduce plus va-matrix) / (count va-matrix))A 1x3 matrix-------------9.19e+03 3.83e+03 2.25e+03 How it works… Any time you're working with macros and you wonder how they work, you can always get at their output expressions easily, so you can see what the computer is actually executing. The tool to do this is macroexpand-1. This expands the macro one step and returns the result. It's sibling function, macroexpand, expands the expression until there is no macro expression left. Usually, this is more than we want, so we just use macroexpand-1. Let's see what these macros expand into: user=> (macroexpand-1 '($= 7 * 4))(incanter.core/mult 7 4)user=> (macroexpand-1 '($= 7 * 4 + 3))(incanter.core/plus (incanter.core/mult 7 4) 3)user=> (macroexpand-1 '($= 3 + 7 * 4))(incanter.core/plus 3 (incanter.core/mult 7 4)) Here, we can see that the expression doesn't expand into Clojure's * or + functions, but it uses Incanter's matrix functions, mult and plus, instead. This allows it to handle a variety of input types, including matrices, intelligently. Otherwise, it switches around the expressions the way we'd expect. Also, we can see by comparing the last two lines of code that it even handles operator precedence correctly. Selecting columns with $ Often, you need to cut the data to make it more useful. One common transformation is to pull out all the values from one or more columns into a new dataset. This can be useful for generating summary statistics or aggregating the values of some columns. The Incanter macro $ slices out parts of a dataset. In this recipe, we'll see this in action. Getting ready For this recipe, we'll need to have Incanter listed in our project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                [org.clojure/data.csv "0.1.2"]]) We'll also need to include these libraries in our script or REPL: (require '[clojure.java.io :as io]         '[clojure.data.csv :as csv]         '[clojure.string :as str]         '[incanter.core :as i]) Moreover, we'll need some data. 
This time, we'll use some country data from the World Bank. Point your browser to http://data.worldbank.org/country and select a country. I picked China. Under World Development Indicators, there is a button labeled Download Data. Click on this button and select CSV. This will download a ZIP file. I extracted its contents into the data/chn directory in my project. I bound the filename for the primary data file to the data-file name. How to do it… We'll use the $ macro in several different ways to get different results. First, however, we'll need to load the data into a dataset, which we'll do in steps 1 and 2: Before we start, we'll need a couple of utilities that load the data file into a sequence of maps and makes a dataset out of those: (defn with-header [coll] (let [headers (map #(keyword (str/replace % space -))                      (first coll))]    (map (partial zipmap headers) (next coll))))   (defn read-country-data [filename] (with-open [r (io/reader filename)]    (i/to-dataset      (doall (with-header                (drop 2 (csv/read-csv r))))))) Now, using these functions, we can load the data: user=> (def chn-data (read-country-data data-file)) We can select columns to be pulled out from the dataset by passing the column names or numbers to the $ macro. It returns a sequence of the values in the column: user=> (i/$ :Indicator-Code chn-data) ("AG.AGR.TRAC.NO" "AG.CON.FERT.PT.ZS" "AG.CON.FERT.ZS" … We can select more than one column by listing all of them in a vector. This time, the results are in a dataset: user=> (i/$ [:Indicator-Code :1992] chn-data)   |           :Indicator-Code |               :1992 | |---------------------------+---------------------| |           AG.AGR.TRAC.NO |             770629 | |         AG.CON.FERT.PT.ZS |                     | |           AG.CON.FERT.ZS |                     | |           AG.LND.AGRI.K2 |             5159980 | … We can list as many columns as we want, although the formatting might suffer: user=> (i/$ [:Indicator-Code :1992 :2002] chn-data)   |           :Indicator-Code |               :1992 |               :2002 | |---------------------------+---------------------+---------------------| |           AG.AGR.TRAC.NO |            770629 |                     | |         AG.CON.FERT.PT.ZS |                     |     122.73027213719 | |           AG.CON.FERT.ZS |                     |   373.087159048868 | |           AG.LND.AGRI.K2 |             5159980 |             5231970 | … How it works… The $ function is just a wrapper over Incanter's sel function. It provides a good way to slice columns out of the dataset, so we can focus only on the data that actually pertains to our analysis. There's more… The indicator codes for this dataset are a little cryptic. However, the code descriptions are in the dataset too: user=> (i/$ [0 1 2] [:Indicator-Code :Indicator-Name] chn-data)   |   :Indicator-Code |                                               :Indicator-Name | |-------------------+---------------------------------------------------------------| |   AG.AGR.TRAC.NO |                             Agricultural machinery, tractors | | AG.CON.FERT.PT.ZS |           Fertilizer consumption (% of fertilizer production) | |   AG.CON.FERT.ZS | Fertilizer consumption (kilograms per hectare of arable land) | … See also… For information on how to pull out specific rows, see the next recipe, Selecting rows with $. Selecting rows with $ The Incanter macro $ also pulls rows out of a dataset. In this recipe, we'll see this in action. 
Getting ready For this recipe, we'll use the same dependencies, imports, and data as we did in the Selecting columns with $ recipe. How to do it… Similar to how we use $ in order to select columns, there are several ways in which we can use it to select rows, shown as follows: We can create a sequence of the values of one row using $, and pass it the index of the row we want as well as passing :all for the columns: user=> (i/$ 0 :all chn-data) ("AG.AGR.TRAC.NO" "684290" "738526" "52661" "" "880859" "" "" "" "59657" "847916" "862078" "891170" "235524" "126440" "469106" "282282" "817857" "125442" "703117" "CHN" "66290" "705723" "824113" "" "151281" "669675" "861364" "559638" "191220" "180772" "73021" "858031" "734325" "Agricultural machinery, tractors" "100432" "" "796867" "" "China" "" "" "155602" "" "" "770629" "747900" "346786" "" "398946" "876470" "" "795713" "" "55360" "685202" "989139" "798506" "") We can also pull out a dataset containing multiple rows by passing more than one index into $ with a vector (There's a lot of data, even for three rows, so I won't show it here): (i/$ (range 3) :all chn-data) We can also combine the two ways to slice data in order to pull specific columns and rows. We can either pull out a single row or multiple rows: user=> (i/$ 0 [:Indicator-Code :1992] chn-data) ("AG.AGR.TRAC.NO" "770629") user=> (i/$ (range 3) [:Indicator-Code :1992] chn-data)   |   :Indicator-Code | :1992 | |-------------------+--------| |   AG.AGR.TRAC.NO | 770629 | | AG.CON.FERT.PT.ZS |       | |   AG.CON.FERT.ZS |       | How it works… The $ macro is the workhorse used to slice rows and project (or select) columns from datasets. When it's called with two indexing parameters, the first is the row or rows and the second is the column or columns. Filtering datasets with $where While we can filter datasets before we import them into Incanter, Incanter makes it easy to filter and create new datasets from the existing ones. We'll take a look at its query language in this recipe. Getting ready We'll use the same dependencies, imports, and data as we did in the Selecting columns with $ recipe. How to do it… Once we have the data, we query it using the $where function: For example, this creates a dataset with a row for the percentage of China's total land area that is used for agriculture: user=> (def land-use          (i/$where {:Indicator-Code "AG.LND.AGRI.ZS"}                    chn-data)) user=> (i/nrow land-use) 1 user=> (i/$ [:Indicator-Code :2000] land-use) ("AG.LND.AGRI.ZS" "56.2891584865366") The queries can be more complicated too. This expression picks out the data that exists for 1962 by filtering any empty strings in that column: user=> (i/$ (range 5) [:Indicator-Code :1962]          (i/$where {:1962 {:ne ""}} chn-data))   |   :Indicator-Code |             :1962 | |-------------------+-------------------| |   AG.AGR.TRAC.NO |             55360 | |   AG.LND.AGRI.K2 |           3460010 | |   AG.LND.AGRI.ZS | 37.0949187612906 | |   AG.LND.ARBL.HA |         103100000 | | AG.LND.ARBL.HA.PC | 0.154858284392508 | Incanter's query language is even more powerful than this, but these examples should show you the basic structure and give you an idea of the possibilities. How it works… To better understand how to use $where, let's break apart the last example: ($i/where {:1962 {:ne ""}} chn-data) The query is expressed as a hashmap from fields to values (highlighted). As we saw in the first example, the value can be a raw value, either a literal or an expression. This tests for inequality. 
(i/$where {:1962 {:ne ""}} chn-data) Each test pair is associated with a field in another hashmap (highlighted). In this example, both the hashmaps shown only contain one key-value pair. However, they might contain multiple pairs, which will all be ANDed together. Incanter supports a number of test operators. The basic Boolean tests are :$gt (greater than), :$lt (less than), :$gte (greater than or equal to), :$lte (less than or equal to), :$eq (equal to), and :$ne (not equal). There are also some operators that take sets as parameters: :$in (in) and :$nin (not in). The last operator, :$fn, is interesting. It allows you to use any predicate function. For example, this will randomly select approximately half of the dataset: (def random-half (i/$where {:Indicator-Code {:$fn (fn [_] (< (rand) 0.5))}} chn-data)) There's more… For full details of the query language, see the documentation for incanter.core/query-dataset (http://liebke.github.com/incanter/core-api.html#incanter.core/query-dataset). Grouping data with $group-by Datasets often come with an inherent structure. Two or more rows might have the same value in one column, and we might want to leverage that by grouping those rows together in our analysis. Getting ready First, we'll need to declare a dependency on Incanter in the project.clj file: (defproject inc-dsets "0.1.0" :dependencies [[org.clojure/clojure "1.6.0"]                  [incanter "1.5.5"]                  [org.clojure/data.csv "0.1.2"]]) Next, we'll include Incanter core and io in our script or REPL: (require '[incanter.core :as i]          '[incanter.io :as i-io]) For data, we'll use the census race data for all the states. You can download it from http://www.ericrochester.com/clj-data-analysis/data/all_160.P3.csv. These lines will load the data into the race-data name: (def data-file "data/all_160.P3.csv") (def race-data (i-io/read-dataset data-file :header true)) How to do it… Incanter lets you group rows for further analysis or to summarize them with the $group-by function. All you need to do is pass the data to $group-by with the column or function to group on: (def by-state (i/$group-by :STATE race-data)) How it works… This function returns a map where each key is a map of the fields and values represented by that grouping. For example, this is how the keys look: user=> (take 5 (keys by-state)) ({:STATE 29} {:STATE 28} {:STATE 31} {:STATE 30} {:STATE 25}) We can get the data for Virginia back out by querying the group map for state 51: user=> (i/$ (range 3) [:GEOID :STATE :NAME :POP100]            (by-state {:STATE 51}))   | :GEOID | :STATE |         :NAME | :POP100 | |---------+--------+---------------+---------| | 5100148 |     51 | Abingdon town |   8191 | | 5100180 |     51 | Accomac town |     519 | | 5100724 |     51 | Alberta town |     298 | Saving datasets to CSV and JSON Once you've done the work of slicing, dicing, cleaning, and aggregating your datasets, you might want to save them. Incanter by itself doesn't have a good way to do this. However, with the help of some Clojure libraries, it's not difficult at all. 
Getting ready We'll need to include a number of dependencies in our project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                 [org.clojure/data.csv "0.1.2"]                 [org.clojure/data.json "0.2.5"]]) We'll also need to include these libraries in our script or REPL: (require '[incanter.core :as i]          '[incanter.io :as i-io]          '[clojure.data.csv :as csv]          '[clojure.data.json :as json]          '[clojure.java.io :as io]) Also, we'll use the same data that we introduced in the Selecting columns with $ recipe. How to do it… This process is really as simple as getting the data and saving it. We'll pull out the data for the year 2000 from the larger dataset. We'll use this subset of the data in both the formats here: (def data2000 (i/$ [:Indicator-Code :Indicator-Name :2000] chn-data)) Saving data as CSV To save a dataset as a CSV, all in one statement, open a file and use clojure.data.csv/write-csv to write the column names and data to it: (with-open [f-out (io/writer "data/chn-2000.csv")] (csv/write-csv f-out [(map name (i/col-names data2000))]) (csv/write-csv f-out (i/to-list data2000))) Saving data as JSON To save a dataset as JSON, open a file and use clojure.data.json/write to serialize the file: (with-open [f-out (io/writer "data/chn-2000.json")] (json/write (:rows data2000) f-out)) How it works… For CSV and JSON, as well as many other data formats, the process is very similar. Get the data, open the file, and serialize data into it. There will be differences in how the output function wants the data (to-list or :rows), and there will be differences in how the output function is called (for instance, whether the file handle is the first or second argument). But generally, outputting datasets will be very similar and relatively simple. Projecting from multiple datasets with $join So far, we've been focusing on splitting up datasets, on dividing them into groups of rows or groups of columns with functions and macros such as $ or $where. However, sometimes we'd like to move in the other direction. We might have two related datasets and want to join them together to make a larger one. For example, we might want to join crime data to census data, or take any two related datasets that come from separate sources and analyze them together. Getting ready First, we'll need to include these dependencies in our project.clj file: (defproject inc-dsets "0.1.0" :dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                  [org.clojure/data.csv "0.1.2"]]) We'll use these statements for inclusions: (require '[clojure.java.io :as io]          '[clojure.data.csv :as csv]          '[clojure.string :as str]          '[incanter.core :as i]) For our data file, we'll use the same data that we introduced in the Selecting columns with $ recipe: China's development dataset from the World Bank. How to do it… In this recipe, we'll take a look at how to join two datasets using Incanter: To begin with, we'll load the data from the data/chn/chn_Country_en_csv_v2.csv file. We'll use the with-header and read-country-data functions that were defined in the Selecting columns with $ recipe: (def data-file "data/chn/chn_Country_en_csv_v2.csv") (def chn-data (read-country-data data-file)) Currently, the data for each row contains the data for one indicator across many years. 
However, for some analyses, it will be more helpful to have each row contain the data for one indicator for one year. To do this, let's first pull out the data from 2 years into separate datasets. Note that for the second dataset, we'll only include a column to match the first dataset (:Indicator-Code) and the data column (:2000): (def chn-1990 (i/$ [:Indicator-Code :Indicator-Name :1990]        chn-data)) (def chn-2000 (i/$ [:Indicator-Code :2000] chn-data)) Now, we'll join these datasets back together. This is contrived, but it's easy to see how we will do this in a more meaningful example. For example, we might want to join the datasets from two different countries: (def chn-decade (i/$join [:Indicator-Code :Indicator-Code]            chn-1990 chn-2000)) From this point on, we can use chn-decade just as we use any other Incanter dataset. How it works… Let's take a look at this in more detail: (i/$join [:Indicator-Code :Indicator-Code] chn-1990 chn-2000) The pair of column keywords in a vector ([:Indicator-Code :Indicator-Code]) are the keys that the datasets will be joined on. In this case, the :Indicator-Code column from both the datasets is used, but the keys can be different for the two datasets. The first column that is listed will be from the first dataset (chn-1990), and the second column that is listed will be from the second dataset (chn-2000). This returns a new dataset. Each row of this new dataset is a superset of the corresponding rows from the two input datasets. Summary In this article we have covered covers the basics of working with Incanter datasets. Datasets are the core data structures used by Incanter, and understanding them is necessary in order to use Incanter effectively. Resources for Article: Further resources on this subject: The Hunt for Data [article] Limits of Game Data Analysis [article] Clojure for Domain-specific Languages - Design Concepts with Clojure [article]
Seeing a Heartbeat with a Motion Amplifying Camera

Packt
04 Feb 2015
46 min read
Remove everything that has no relevance to the story. If you say in the first chapter that there is a rifle hanging on the wall, in the second or third chapter it absolutely must go off. If it's not going to be fired, it shouldn't be hanging there.                                                                                              —Anton Chekhov King Julian: I don't know why the sacrifice didn't work. The science seemed so solid.                                                                                             —Madagascar: Escape 2 Africa (2008) Despite their strange design and mysterious engineering, Q's gadgets always prove useful and reliable. Bond has such faith in the technology that he never even asks how to charge the batteries. One of the inventive ideas in the Bond franchise is that even a lightly equipped spy should be able to see and photograph concealed objects, anyplace, anytime. Let's consider a timeline of a few relevant gadgets in the movies: 1967 (You Only Live Twice): An X-ray desk scans guests for hidden firearms. 1979 (Moonraker): A cigarette case contains an X-ray imaging system that is used to reveal the tumblers of a safe's combination lock. 1989 (License to Kill): A Polaroid camera takes X-ray photos. Oddly enough, its flash is a visible, red laser. 1995 (GoldenEye): A tea tray contains an X-ray scanner that can photograph documents beneath the tray. 1999 (The World is Not Enough): Bond wears a stylish pair of blue-lensed glasses that can see through one layer of clothing to reveal concealed weapons. According to the James Bond Encyclopedia (2007), which is an official guide to the movies, the glasses display infrared video after applying special processing to it. Despite using infrared, they are commonly called X-ray specs, a misnomer. These gadgets deal with unseen wavelengths of light (or radiation) and are broadly comparable to real-world devices such as airport security scanners and night vision goggles. However, it remains difficult to explain how Bond's equipment is so compact and how it takes such clear pictures in diverse lighting conditions and through diverse materials. Moreover, if Bond's devices are active scanners (meaning they emit X-ray radiation or infrared light), they will be clearly visible to other spies using similar hardware. To take another approach, what if we avoid unseen wavelengths of light but instead focus on unseen frequencies of motion? Many things move in a pattern that is too fast or too slow for us to easily notice. Suppose that a man is standing in one place. If he shifts one leg more than the other, perhaps he is concealing a heavy object, such as a gun, on the side that he shifts more. We also might fail to notice deviations from a pattern. Suppose the same man has been looking straight ahead but suddenly, when he believes no one is looking, his eyes dart to one side. Is he watching someone? We can make motions of a certain frequency more visible by repeating them, like a delayed afterimage or a ghost, with each repetition being more faint (less opaque) than the last. The effect is analogous to an echo or a ripple, and it is achieved using an algorithm called Eulerian video magnification. By applying this technique in this article by Joseph Howse, author of the book OpenCV for Secret Agents, we will build a desktop app that allows us to simultaneously see the present and selected slices of the past. 
The idea of experiencing multiple images simultaneously is, to me, quite natural because for the first 26 years of my life, I had strabismus—commonly called a lazy eye—that caused double vision. A surgeon corrected my eyesight and gave me depth perception but, in memory of strabismus, I would like to name this application Lazy Eyes. (For more resources related to this topic, see here.) Let's take a closer look—or two or more closer looks—at the fast-paced, moving world that we share with all the other secret agents. Planning the Lazy Eyes app Of all our apps, Lazy Eyes has the simplest user interface. It just shows a live video feed with a special effect that highlights motion. The parameters of the effect are quite complex and, moreover, modifying them at runtime would have a big effect on performance. Thus, we do not provide a user interface to reconfigure the effect, but we do provide many parameters in code to allow a programmer to create many variants of the effect and the app. The following is a screenshot illustrating one configuration of the app. This image shows me eating cake. My hands and face are moving often and we see an effect that looks like light and dark waves rippling around the places where moving edges have been. (The effect is more graceful in a live video than in a screenshot.) For more screenshots and an in-depth discussion of the parameters, refer to the section Configuring and testing the app for various motions, later in this article. Regardless of how it is configured, the app loops through the following actions: Capturing an image. Copying and downsampling the image while applying a blur filter and optionally an edge finding filter. We will downsample using so-called image pyramids, which will be discussed in Compositing two images using image pyramids, later in this article. The purpose of downsampling is to achieve a higher frame rate by reducing the amount of image data used in subsequent operations. The purpose of applying a blur filter and optionally an edge finding filter is to create haloes that are useful in amplifying motion. Storing the downsampled copy in a history of frames, with a timestamp. The history has a fixed capacity and once it is full, the oldest frame is overwritten to make room for the new one. If the history is not yet full, we continue to the next iteration of the loop. Decomposing the history into a list of frequencies describing fluctuations (motion) at each pixel. The decomposition function is called a Fast Fourier Transform. We will discuss this in the Extracting repeating signals from video using the Fast Fourier Transform section, later in this article. Setting all frequencies to zero except a certain chosen range that interests us. In other words, filter out the data on motions that are faster or slower than certain thresholds. Recomposing the filtered frequencies into a series of images that are motion maps. Areas that are still (with respect to our chosen range of frequencies) become dark, and areas that are moving might remain bright. The recomposition function is called an Inverse Fast Fourier Transform (IFFT), and we will discuss it later alongside the FFT. Upsampling the latest motion map (again using image pyramids), intensifying it, and overlaying it additively atop the original camera image. Showing the resulting composite image. There it is—a simple plan that requires a rather nuanced implementation and configuration. Let's prepare ourselves by doing a little background research. 
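Before we do that research, here is a minimal, self-contained sketch of the temporal filtering at the heart of steps 4 to 6 of the loop. This is not the app's actual code: the real implementation uses PyFFTW and operates on pyramid-downsampled frames, whereas this placeholder uses numpy.fft on a plain array of grayscale frames, and the function name temporal_bandpass and the fps argument are inventions of mine for illustration only:

import numpy as np

def temporal_bandpass(history, fps, min_hz, max_hz):
    # history is a float32 array of shape (numFrames, height, width),
    # oldest frame first; fps is the capture rate in frames per second.
    # Frequencies (in Hz) corresponding to each slot of the FFT output.
    freqs = np.fft.fftfreq(history.shape[0], d=1.0 / fps)
    # Decompose each pixel's signal along the time axis.
    spectrum = np.fft.fft(history, axis=0)
    # Zero out every frequency outside the band of interest.
    keep = (np.abs(freqs) >= min_hz) & (np.abs(freqs) <= max_hz)
    spectrum[~keep] = 0
    # Recompose; the result is a series of motion maps.
    return np.real(np.fft.ifft(spectrum, axis=0))

Amplifying the last of those motion maps and adding it onto the current camera image (steps 7 and 8) is then just arithmetic on NumPy arrays.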
Understanding what Eulerian video magnification can do Eulerian video magnification is inspired by a model in fluid mechanics called Eulerian specification of the flow field. Let's consider a moving, fluid body, such as a river. The Eulerian specification describes the river's velocity at a given position and time. The velocity would be fast in the mountains in springtime and slow at the river's mouth in winter. Also, the velocity would be slower at a silt-saturated point at the river's bottom, compared to a point where the river's surface hits a rock and sprays. An alternative to the Eulerian specification is the Lagrangian specification, which describes the position of a given particle at a given time. A given bit of silt might make its way down from the mountains to the river's mouth over a period of many years and then spend eons drifting around a tidal basin. For a more formal description of the Eulerian specification, the Lagrangian specification, and their relationship, refer to this Wikipedia article http://en.wikipedia.org/wiki/Lagrangian_and_Eulerian_specification_of_the_flow_field. The Lagrangian specification is analogous to many computer vision tasks, in which we model the motion of a particular object or feature over time. However, the Eulerian specification is analogous to our current task, in which we model any motion occurring in a particular position and a particular window of time. Having modeled a motion from an Eulerian perspective, we can visually exaggerate the motion by overlaying the model's results for a blend of positions and times. Let's set a baseline for our expectations of Eulerian video magnification by studying other people's projects: Michael Rubenstein's webpage at MIT (http://people.csail.mit.edu/mrub/vidmag/) gives an abstract and demo videos of his team's pioneering work on Eulerian video magnification. Bryce Drennan's Eulerian-magnification library (https://github.com/brycedrennan/eulerian-magnification) implements the algorithm using NumPy, SciPy, and OpenCV. This implementation is good inspiration for us, but it is designed to process prerecorded videos and is not sufficiently optimized for real-time input. Now, let's discuss the functions that are building blocks of these projects and ours. Extracting repeating signals from video using the Fast Fourier Transform (FFT) An audio signal is typically visualized as a bar chart or wave. The bar chart or wave is high when the sound is loud and low when it is soft. We recognize that a repetitive sound, such as a metronome's beat, makes repetitive peaks and valleys in the visualization. When audio has multiple channels (being a stereo or surround sound recording), each channel can be considered as a separate signal and can be visualized as a separate bar chart or wave. Similarly, in a video, every channel of every pixel can be considered as a separate signal, rising and falling (becoming brighter and dimmer) over time. Imagine that we use a stationary camera to capture a video of a metronome. Then, certain pixel values rise and fall at a regular interval as they capture the passage of the metronome's needle. If the camera has an attached microphone, its signal values rise and fall at the same interval. Based on either the audio or the video, we can measure the metronome's frequency—its beats per minute (bpm) or its beats per second (Hertz or Hz). Conversely, if we change the metronome's bpm setting, the effect on both the audio and the video is predictable. 
From this thought experiment, we can learn that a signal—be it audio, video, or any other kind—can be expressed as a function of time and, equivalently, a function of frequency. Consider the following pair of graphs. They express the same signal, first as a function of time and then as a function of frequency. Within the time domain, we see one wide peak and valley (in other words, a tapering effect) spanning many narrow peaks and valleys. Within the frequency domain, we see a low-frequency peak and a high-frequency peak. The transformation from the time domain to the frequency domain is called the Fourier transform. Conversely, the transformation from the frequency domain to the time domain is called the inverse Fourier transform. Within the digital world, signals are discrete, not continuous, and we use the terms discrete Fourier transform (DFT) and inverse discrete Fourier transform (IDFT). There is a variety of efficient algorithms to compute the DFT or IDFT and such an algorithm might be described as a Fast Fourier Transform or an Inverse Fast Fourier Transform. For algorithmic descriptions, refer to the following Wikipedia article: http://en.wikipedia.org/wiki/Fast_Fourier_transform. The result of the Fourier transform (including its discrete variants) is a function that maps a frequency to an amplitude and phase. The amplitude represents the magnitude of the frequency's contribution to the signal. The phase represents a temporal shift; it determines whether the frequency's contribution starts on a high or a low. Typically, amplitude and phase are encoded in a complex number, a+bi, where amplitude=sqrt(a^2+b^2) and phase=atan2(a,b). For an explanation of complex numbers, refer to the following Wikipedia article: http://en.wikipedia.org/wiki/Complex_number. The FFT and IFFT are fundamental to a field of computer science called digital signal processing. Many signal processing applications, including Lazy Eyes, involve taking the signal's FFT, modifying or removing certain frequencies in the FFT result, and then reconstructing the filtered signal in the time domain using the IFFT. For example, this approach enables us to amplify certain frequencies while leaving others unchanged. Now, where do we find this functionality? Choosing and setting up an FFT library Several Python libraries provide FFT and IFFT implementations that can process NumPy arrays (and thus OpenCV images). Here are the five major contenders: NumPy: This provides FFT and IFFT implementations in a module called numpy.fft (for more information, refer to http://docs.scipy.org/doc/numpy/reference/routines.fft.html). The module also offers other signal processing functions to work with the output of the FFT. SciPy: This provides FFT and IFFT implementations in a module called scipy.fftpack (for more information refer to http://docs.scipy.org/doc/scipy/reference/fftpack.html). This SciPy module is closely based on the numpy.fft module, but adds some optional arguments and dynamic optimizations based on the input format. The SciPy module also adds more signal processing functions to work with the output of the FFT. OpenCV: This has implementations of FFT (cv2.dft) and IFT (cv2.idft). An official tutorial provides examples and a comparison to NumPy's FFT implementation at http://docs.opencv.org/doc/tutorials/core/discrete_fourier_transform/discrete_fourier_transform.html. 
OpenCV's FFT and IFT interfaces are not directly interoperable with the numpy.fft and scipy.fftpack modules that offer a broader range of signal processing functionality. (The data is formatted very differently.) PyFFTW: This is a Python wrapper (https://hgomersall.github.io/pyFFTW/) around a C library called the Fastest Fourier Transform in the West (FFTW) (for more information, refer to http://www.fftw.org/). FFTW provides multiple implementations of FFT and IFFT. At runtime, it dynamically selects implementations that are well optimized for given input formats, output formats, and system capabilities. Optionally, it takes advantage of multithreading (and its threads might run on multiple CPU cores, as the implementation releases Python's Global Interpreter Lock or GIL). PyFFTW provides optional interfaces matching NumPy's and SciPy's FFT and IFFT functions. These interfaces have a low overhead cost (thanks to good caching options that are provided by PyFFTW) and they help to ensure that PyFFTW is interoperable with a broader range of signal processing functionality as implemented in numpy.fft and scipy.fftpack. Reinka: This is a Python library for GPU-accelerated computations using either PyCUDA (http://mathema.tician.de/software/pycuda/) or PyOpenCL (http://mathema.tician.de/software/pyopencl/) as a backend. Reinka (http://reikna.publicfields.net/en/latest/) provides FFT and IFFT implementations in a module called reikna.fft. Reinka internally uses PyCUDA or PyOpenCL arrays (not NumPy arrays), but it provides interfaces for conversion from NumPy arrays to these GPU arrays and back. The converted NumPy output is compatible with other signal processing functionality as implemented in numpy.fft and scipy.fftpack. However, this compatibility comes at a high overhead cost due to locking, reading, and converting the contents of the GPU memory. NumPy, SciPy, OpenCV, and PyFFTW are open-source libraries under the BSD license. Reinka is an open-source library under the MIT license. I recommend PyFFTW because of its optimizations and its interoperability (at a low overhead cost) with all the other functionality that interests us in NumPy, SciPy, and OpenCV. For a tour of PyFFTW's features, including its NumPy- and SciPy-compatible interfaces, refer to the official tutorial at https://hgomersall.github.io/pyFFTW/sphinx/tutorial.html. Depending on our platform, we can set up PyFFTW in one of the following ways: In Windows, download and run a binary installer from https://pypi.python.org/pypi/pyFFTW. Choose the installer for either a 32-bit Python 2.7 or 64-bit Python 2.7 (depending on whether your Python installation, not necessarily your system, is 64-bit). In Mac with MacPorts, run the following command in Terminal: $ sudo port install py27-pyfftw In Ubuntu 14.10 (Utopic) and its derivatives, including Linux Mint 14.10, run the following command in Terminal: $ sudo apt-get install python-fftw3 In Ubuntu 14.04 and earlier versions (and derivatives thereof), do not use this package, as its version is too old. Instead, use the PyFFTW source bundle, as described in the last bullet of this list. In Debian Jessie, Debian Sid, and their derivatives, run the following command in Terminal: $ sudo apt-get install python-pyfftw In Debian Wheezy and its derivatives, including Raspbian, this package does not exist. Instead, use the PyFFTW source bundle, as described in the next bullet. For any other system, download the PyFFTW source bundle from https://pypi.python.org/pypi/pyFFTW. 
Decompress it and run the setup.py script inside the decompressed folder. Some old versions of the library are called PyFFTW3. We do not want PyFFTW3. However, on Ubuntu 14.10 and its derivatives, the packages are misnamed such that python-fftw3 is really the most recent packaged version (whereas python-fftw is an older PyFFTW3 version). We have our FFT and IFFT needs covered by FFTW (and if we were cowboys instead of secret agents, we could say, "Cover me!"). For additional signal processing functionality, we will use SciPy. Signal processing is not the only new material that we must learn for Lazy Eyes, so let's now look at other functionality that is provided by OpenCV. Compositing two images using image pyramids Running an FFT on a full-resolution video feed would be slow. Also, the resulting frequencies would reflect localized phenomena at each captured pixel, such that the motion map (the result of filtering the frequencies and then applying the IFFT) might appear noisy and overly sharpened. To address these problems, we want a cheap, blurry downsampling technique. However, we also want the option to enhance edges, which are important to our perception of motion. Our need for a blurry downsampling technique is fulfilled by a Gaussian image pyramid. A Gaussian filter blurs an image by making each output pixel a weighted average of many input pixels in the neighborhood. An image pyramid is a series in which each image is half the width and height of the previous image. The halving of image dimensions is achieved by decimation, meaning that every other pixel is simply omitted. A Gaussian image pyramid is constructed by applying a Gaussian filter before each decimation operation. Our need to enhance edges in downsampled images is fulfilled by a Laplacian image pyramid, which is constructed in the following manner. Suppose we have already constructed a Gaussian image pyramid. We take the image at level i+1 in the Gaussian pyramid, upsample it by duplicating the pixels, and apply a Gaussian filter to it again. We then subtract the result from the image at level i in the Gaussian pyramid to produce the corresponding image at level i of the Laplacian pyramid. Thus, the Laplacian image is the difference between a blurry, downsampled image and an even blurrier image that was downsampled, downsampled again, and upsampled. You might wonder how such an algorithm is a form of edge finding. Consider that edges are areas of local contrast, while non-edges are areas of local uniformity. If we blur a uniform area, it is still uniform—zero difference. If we blur a contrasting area, it becomes more uniform—nonzero difference. Thus, the difference can be used to find edges. The Gaussian and Laplacian image pyramids are described in detail in the journal article downloadable from http://web.mit.edu/persci/people/adelson/pub_pdfs/RCA84.pdf. This article is written by E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden on Pyramid methods in image processing, RCA Engineer, vol. 29, no. 6, November/December 1984. Besides using image pyramids to downsample the FFT's input, we also use them to upsample the most recent frame of the IFFT's output. This upsampling step is necessary to create an overlay that matches the size of the original camera image so that we can composite the two. Like in the construction of the Laplacian pyramid, upsampling consists of duplicating pixels and applying a Gaussian filter. 
OpenCV implements the relevant downsizing and upsizing functions as cv2.pyrDown and cv2.pyrUp. These functions are useful in compositing two images in general (whether or not signal processing is involved) because they enable us to soften differences while preserving edges. The OpenCV documentation includes a good tutorial on this topic at http://docs.opencv.org/trunk/doc/py_tutorials/py_imgproc/py_pyramids/py_pyramids.html. Now, we are armed with the knowledge to implement Lazy Eyes! Implementing the Lazy Eyes app Let's create a new folder for Lazy Eyes and, in this folder, create copies of or links to the ResizeUtils.py and WxUtils.py files from any of our previous Python projects. Alongside the copies or links, let's create a new file, LazyEyes.py. Edit it and enter the following import statements:

import collections
import numpy
import cv2
import threading
import timeit
import wx

import pyfftw.interfaces.cache
from pyfftw.interfaces.scipy_fftpack import fft
from pyfftw.interfaces.scipy_fftpack import ifft
from scipy.fftpack import fftfreq

import ResizeUtils
import WxUtils

Besides the modules that we have used in previous projects, we are now using the standard library's collections module for efficient collections and timeit module for precise timing. Also for the first time, we are using signal processing functionality from PyFFTW and SciPy. Like our other Python applications, Lazy Eyes is implemented as a class that extends wx.Frame. Here are the declarations of the class and its initializer:

class LazyEyes(wx.Frame):

  def __init__(self, maxHistoryLength=360,
               minHz=5.0/6.0, maxHz=1.0,
               amplification=32.0, numPyramidLevels=2,
               useLaplacianPyramid=True,
               useGrayOverlay=True,
               numFFTThreads=4, numIFFTThreads=4,
               cameraDeviceID=0, imageSize=(480, 360),
               title='Lazy Eyes'):

The initializer's arguments affect the app's frame rate and the manner in which motion is amplified. These effects are discussed in detail in the section Configuring and testing the app for various motions, later in this article. The following is just a brief description of the arguments:
maxHistoryLength is the number of frames (including the current frame and preceding frames) that are analyzed for motion.
minHz and maxHz, respectively, define the slowest and fastest motions that are amplified.
amplification is the scale of the visual effect. A higher value means motion is highlighted more brightly.
numPyramidLevels is the number of pyramid levels by which frames are downsampled before signal processing is done. Remember that each level corresponds to downsampling by a factor of 2. Our implementation assumes numPyramidLevels>0.
If useLaplacianPyramid is True, frames are downsampled using a Laplacian pyramid before the signal processing is done. The implication is that only edge motion is highlighted. Alternatively, if useLaplacianPyramid is False, a Gaussian pyramid is used, and the motion in all areas is highlighted.
If useGrayOverlay is True, frames are converted to grayscale before the signal processing is done. The implication is that motion is only highlighted in areas of grayscale contrast. Alternatively, if useGrayOverlay is False, motion is highlighted in areas that have contrast in any color channel.
numFFTThreads and numIFFTThreads, respectively, are the numbers of threads used in FFT and IFFT computations.
cameraDeviceID and imageSize are our usual capture parameters.
The initializer's implementation begins in the same way as our other Python apps.
It sets flags to indicate that the app is running and (by default) should be mirrored. It creates the capture object and configures its resolution to match the requested width and height if possible. Failing that, the device's default capture resolution is used. The relevant code is as follows:    self.mirrored = True    self._running = True    self._capture = cv2.VideoCapture(cameraDeviceID)   size = ResizeUtils.cvResizeCapture(       self._capture, imageSize)   w, h = size   self._imageWidth = w   self._imageHeight = h Next, we will determine the shape of the history of frames. It has at least 3 dimensions: a number of frames, a width, and height for each frame. The width and height are downsampled from the capture width and height based on the number of pyramid levels. If we are concerned about the color motion and not just the grayscale motion, the history also has a fourth dimension, consisting of 3 color channels. Here is the code to calculate the history's shape:    self._useGrayOverlay = useGrayOverlay   if useGrayOverlay:     historyShape = (maxHistoryLength,             h >> numPyramidLevels,             w >> numPyramidLevels)   else:     historyShape = (maxHistoryLength,             h >> numPyramidLevels,             w >> numPyramidLevels, 3) Note the use of >>, the right bit shift operator, to reduce the dimensions by a power of 2. The power is equal to the number of pyramid levels. We will store the specified maximum history length. For the frames in the history, we will create a NumPy array of the shape that we just determined. For timestamps of the frames, we will create a deque (double-ended queue), a type of collection that allows us to cheaply add or remove elements from either end:    self._maxHistoryLength = maxHistoryLength   self._history = numpy.empty(historyShape,                 numpy.float32)   self._historyTimestamps = collections.deque() We will store the remaining arguments because later, in each frame, we will pass them to the pyramid functions and signal the processing functions:    self._numPyramidLevels = numPyramidLevels   self._useLaplacianPyramid = useLaplacianPyramid    self._minHz = minHz   self._maxHz = maxHz   self._amplification = amplification    self._numFFTThreads = numFFTThreads   self._numIFFTThreads = numIFFTThreads To ensure meaningful error messages and early termination in the case of invalid arguments, we could add code such as the following for each argument: assert numPyramidLevels > 0,      'numPyramidLevels must be positive.' For brevity, such assertions are omitted from our code samples. We call the following two functions to tell PyFFTW to cache its data structures (notably, its NumPy arrays) for a period of at least 1.0 second from their last use. (The default is 0.1 seconds.) 
Caching is a critical optimization for the PyFFTW interfaces that we are using, and we will choose a period that is more than long enough to keep the cache alive from frame to frame:    pyfftw.interfaces.cache.enable()   pyfftw.interfaces.cache.set_keepalive_time(1.0) The initializer's implementation ends with code to set up a window, event bindings, a bitmap, layout, and background thread—all familiar tasks from our previous Python projects:    style = wx.CLOSE_BOX | wx.MINIMIZE_BOX |        wx.CAPTION | wx.SYSTEM_MENU |        wx.CLIP_CHILDREN   wx.Frame.__init__(self, None, title=title,             style=style, size=size)    self.Bind(wx.EVT_CLOSE, self._onCloseWindow)    quitCommandID = wx.NewId()   self.Bind(wx.EVT_MENU, self._onQuitCommand,         id=quitCommandID)   acceleratorTable = wx.AcceleratorTable([     (wx.ACCEL_NORMAL, wx.WXK_ESCAPE,       quitCommandID)   ])   self.SetAcceleratorTable(acceleratorTable)    self._staticBitmap = wx.StaticBitmap(self,                       size=size)   self._showImage(None)    rootSizer = wx.BoxSizer(wx.VERTICAL)   rootSizer.Add(self._staticBitmap)   self.SetSizerAndFit(rootSizer)    self._captureThread = threading.Thread(       target=self._runCaptureLoop)   self._captureThread.start() We must modify our usual _onCloseWindow callback to disable PyFFTW's cache. Disabling the cache ensures that resources are freed and that PyFFTW's threads terminate normally. The callback's implementation is given in the following code:  def _onCloseWindow(self, event):   self._running = False   self._captureThread.join()   pyfftw.interfaces.cache.disable()   self.Destroy() The Esc key is bound to our usual _onQuitCommand callback, which just closes the app:  def _onQuitCommand(self, event):   self.Close() The loop running on our background thread is similar to the one in our other Python apps. In each frame, it calls a helper function, _applyEulerianVideoMagnification. Here is the loop's implementation.  def _runCaptureLoop(self):   while self._running:     success, image = self._capture.read()     if image is not None:        self._applyEulerianVideoMagnification(           image)       if (self.mirrored):         image[:] = numpy.fliplr(image)     wx.CallAfter(self._showImage, image) The _applyEulerianVideoMagnification helper function is quite long so we will consider its implementation in several chunks. First, we will create a timestamp for the frame and copy the frame to a format that is more suitable for processing. Specifically, we will use a floating point array with either one gray channel or 3 color channels, depending on the configuration:  def _applyEulerianVideoMagnification(self, image):    timestamp = timeit.default_timer()    if self._useGrayOverlay:     smallImage = cv2.cvtColor(         image, cv2.COLOR_BGR2GRAY).astype(             numpy.float32)   else:     smallImage = image.astype(numpy.float32) Using this copy, we will calculate the appropriate level in the Gaussian or Laplacian pyramid:    # Downsample the image using a pyramid technique.   i = 0   while i < self._numPyramidLevels:     smallImage = cv2.pyrDown(smallImage)     i += 1   if self._useLaplacianPyramid:     smallImage[:] -=        cv2.pyrUp(cv2.pyrDown(smallImage)) For the purposes of the history and signal processing functions, we will refer to this pyramid level as "the image" or "the frame". Next, we will check the number of history frames that have been filled so far. 
If the history has more than one unfilled frame (meaning the history will still not be full after adding this frame), we will append the new image and timestamp and then return early, such that no signal processing is done until a later frame:

    historyLength = len(self._historyTimestamps)

    if historyLength < self._maxHistoryLength - 1:

      # Append the new image and timestamp to the
      # history.
      self._history[historyLength] = smallImage
      self._historyTimestamps.append(timestamp)

      # The history is still not full, so wait.
      return

If the history is just one frame short of being full (meaning the history will be full after adding this frame), we will append the new image and timestamp using the code given as follows:

    if historyLength == self._maxHistoryLength - 1:
      # Append the new image and timestamp to the
      # history.
      self._history[historyLength] = smallImage
      self._historyTimestamps.append(timestamp)

If the history is already full, we will drop the oldest image and timestamp and append the new image and timestamp using the code given as follows:

    else:
      # Drop the oldest image and timestamp from the
      # history and append the new ones.
      self._history[:-1] = self._history[1:]
      self._historyTimestamps.popleft()
      self._history[-1] = smallImage
      self._historyTimestamps.append(timestamp)

    # The history is full, so process it.

The history of image data is a NumPy array and, as such, we are using the terms "append" and "drop" loosely. A NumPy array has a fixed size, meaning it cannot grow or shrink. Moreover, we are not recreating this array because it is large and reallocating it each frame would be expensive. We are just overwriting data within the array by moving the old data leftward and copying the new data in. Based on the timestamps, we will calculate the average time per frame in the history, as seen in the following code:

    # Find the average length of time per frame.
    startTime = self._historyTimestamps[0]
    endTime = self._historyTimestamps[-1]
    timeElapsed = endTime - startTime
    timePerFrame = timeElapsed / self._maxHistoryLength
    #print 'FPS:', 1.0 / timePerFrame

We will proceed with a combination of signal processing functions, collectively called a temporal bandpass filter. This filter blocks (zeros out) some frequencies and allows others to pass (remain unchanged). Our first step in implementing this filter is to run the pyfftw.interfaces.scipy_fftpack.fft function using the history and a number of threads as arguments. Also, with the argument axis=0, we will specify that the history's first axis is its time axis:

    # Apply the temporal bandpass filter.
    fftResult = fft(self._history, axis=0,
                    threads=self._numFFTThreads)

We will pass the FFT result and the time per frame to the scipy.fftpack.fftfreq function. This function returns an array of midpoint frequencies (Hz in our case) corresponding to the indices in the FFT result. (This array answers the question, "Which frequency is the midpoint of the bin of frequencies represented by index i in the FFT?".) We will find the indices whose midpoint frequencies lie closest (minimum absolute value difference) to our initializer's minHz and maxHz parameters.
Then, we will modify the FFT result by setting the data to zero in all ranges that do not represent the frequencies of interest:

    frequencies = fftfreq(
        self._maxHistoryLength, d=timePerFrame)
    lowBound = (numpy.abs(
        frequencies - self._minHz)).argmin()
    highBound = (numpy.abs(
        frequencies - self._maxHz)).argmin()
    fftResult[:lowBound] = 0j
    fftResult[highBound:-highBound] = 0j
    fftResult[-lowBound:] = 0j

The FFT result is symmetrical: fftResult[i] and fftResult[-i] pertain to the same bin of frequencies. Thus, we will modify the FFT result symmetrically. Remember, the Fourier transform maps a frequency to a complex number that encodes an amplitude and phase. Thus, while the indices of the FFT result correspond to frequencies, the values contained at those indices are complex numbers. Zero as a complex number is written in Python as 0+0j or 0j. Having thus filtered out the frequencies that do not interest us, we will finish applying the temporal bandpass filter by passing the data to the pyfftw.interfaces.scipy_fftpack.ifft function:

    ifftResult = ifft(fftResult, axis=0,
                      threads=self._numIFFTThreads)

From the IFFT result, we will take the most recent frame. It should somewhat resemble the current camera frame, but should be black in areas that do not exhibit recent motion matching our parameters. We will multiply this filtered frame so that the non-black areas become bright. Then, we will upsample it (using a pyramid technique) and add the result to the current camera frame so that areas of motion are lit up. Here is the relevant code, which concludes the _applyEulerianVideoMagnification method:

    # Amplify the result and overlay it on the
    # original image.
    overlay = numpy.real(ifftResult[-1]) * self._amplification
    i = 0
    while i < self._numPyramidLevels:
      overlay = cv2.pyrUp(overlay)
      i += 1
    if self._useGrayOverlay:
      overlay = cv2.cvtColor(overlay,
                             cv2.COLOR_GRAY2BGR)
    cv2.convertScaleAbs(image + overlay, image)

To finish the implementation of the LazyEyes class, we will display the image in the same manner as we have done in our other Python apps. Here is the relevant method:

  def _showImage(self, image):
    if image is None:
      # Provide a black bitmap.
      bitmap = wx.EmptyBitmap(self._imageWidth,
                              self._imageHeight)
    else:
      # Convert the image to bitmap format.
      bitmap = WxUtils.wxBitmapFromCvImage(image)
    # Show the bitmap.
    self._staticBitmap.SetBitmap(bitmap)

Our module's main function just instantiates and runs the app, as seen in the following code:

def main():
  app = wx.App()
  lazyEyes = LazyEyes()
  lazyEyes.Show()
  app.MainLoop()

if __name__ == '__main__':
  main()

That's all! Run the app and stay quite still while it builds up its history of frames. Until the history is full, the video feed will not show any special effect. At the history's default length of 360 frames, it fills in about 20 seconds on my machine. Once it is full, you should see ripples moving through the video feed in areas of recent motion—or perhaps all areas if the camera is moved or the lighting or exposure is changed. The ripples will gradually settle and disappear in areas of the scene that become still, while new ripples will appear in new areas of motion. Feel free to experiment on your own. Now, let's discuss a few recipes for configuring and testing the parameters of the LazyEyes class.
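Before we turn to those recipes, here is a small, self-contained sketch of the same temporal bandpass idea applied to a synthetic one-dimensional signal. It is not part of the app: it uses plain scipy.fftpack instead of PyFFTW so that it runs without a camera or a GUI, and the frame rate, frequencies, and amplitudes in it are made up for illustration.

import numpy
from scipy.fftpack import fft, ifft, fftfreq

fps = 30.0                       # assumed capture rate, in frames per second
numFrames = 360
times = numpy.arange(numFrames) / fps

# A made-up signal: a slow 0.9 Hz oscillation (like a heartbeat or head sway)
# plus a fast 5.0 Hz oscillation (like flicker) that we want to reject.
slow = numpy.sin(2 * numpy.pi * 0.9 * times)
fast = 0.5 * numpy.sin(2 * numpy.pi * 5.0 * times)
signal = slow + fast

fftResult = fft(signal)
frequencies = fftfreq(numFrames, d=1.0/fps)

# Zero every bin whose midpoint frequency lies outside the 0.8 to 1.0 Hz band,
# treating the positive and negative halves of the spectrum symmetrically.
passBand = (numpy.abs(frequencies) >= 0.8) & (numpy.abs(frequencies) <= 1.0)
fftResult[~passBand] = 0j

filtered = numpy.real(ifft(fftResult))

# The 5.0 Hz component is rejected; what remains is essentially the 0.9 Hz wave.
print('peak of original signal: %.2f' % numpy.abs(signal).max())
print('peak of filtered signal: %.2f' % numpy.abs(filtered).max())

Lazy Eyes does the same thing for every pixel at once (the history's time axis is axis 0) and swaps in PyFFTW's threaded, cached fft and ifft for speed.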
Configuring and testing the app for various motions Currently, our main function initializes the LazyEyes object with the default parameters. By filling in the same parameter values explicitly, we would have this statement:  lazyEyes = LazyEyes(maxHistoryLength=360,           minHz=5.0/6.0, maxHz=1.0,           amplification=32.0,           numPyramidLevels=2,           useLaplacianPyramid=True,           useGrayOverlay=True,           numFFTThreads = 4,           numIFFTThreads=4,           imageSize=(480, 360)) This recipe calls for a capture resolution of 480 x 360 and a signal processing resolution of 120 x 90 (as we are downsampling by 2 pyramid levels or a factor of 4). We are amplifying the motion only at frequencies of 0.833 Hz to 1.0 Hz, only at edges (as we are using the Laplacian pyramid), only in grayscale, and only over a history of 360 frames (about 20 to 40 seconds, depending on the frame rate). Motion is exaggerated by a factor of 32. These settings are suitable for many subtle upper-body movements such as a person's head swaying side to side, shoulders heaving while breathing, nostrils flaring, eyebrows rising and falling, and eyes scanning to and fro. For performance, the FFT and IFFT are each using 4 threads. Here is a screenshot of the app that runs with these default parameters. Moments before taking the screenshot, I smiled like a comic theater mask and then I recomposed my normal expression. Note that my eyebrows and moustache are visible in multiple positions, including their current, low positions and their previous, high positions. For the sake of capturing the motion amplification effect in a still image, this gesture is quite exaggerated. However, in a moving video, we can see the amplification of more subtle movements too. Here is a less extreme example where my eyebrows appear very tall because I raised and lowered them: The parameters interact with each other in complex ways. Consider the following relationships between these parameters: Frame rate is greatly affected by the size of the input data for the FFT and IFFT functions. The size of the input data is determined by maxHistoryLength (where a shorter length provides less input and thus a faster frame rate), numPyramidLevels (where a greater level implies less input), useGrayOverlay (where True means less input), and imageSize (where a smaller size implies less input). Frame rate is also greatly affected by the level of multithreading of the FFT and IFFT functions, as determined by numFFTThreads and numIFFTThreads (a greater number of threads is faster up to some point). Frame rate is slightly affected by useLaplacianPyramid (False implies a faster frame rate), as the Laplacian algorithm requires extra steps beyond the Gaussian image. Frame rate determines the amount of time that maxHistoryLength represents. Frame rate and maxHistoryLength determine how many repetitions of motion (if any) can be captured in the minHz to maxHz range. The number of captured repetitions, together with amplification, determines how greatly a motion or a deviation from the motion will be amplified. The inclusion or exclusion of noise is affected by minHz and maxHz (depending on which frequencies of noise are characteristic of the camera), numPyramidLevels (where more implies a less noisy image), useLaplacianPyramid (where True is less noisy), useGrayOverlay (where True is less noisy), and imageSize (where a smaller size implies a less noisy image). 
The inclusion or exclusion of motion is affected by numPyramidLevels (where fewer means the amplification is more inclusive of small motions), useLaplacianPyramid (False is more inclusive of motion in non-edge areas), useGrayOverlay (False is more inclusive of motion in areas of color contrast), minHz (where a lower value is more inclusive of slow motion), maxHz (where a higher value is more inclusive of fast motion), and imageSize (where bigger size is more inclusive of small motions). Subjectively, the visual effect is always best when the frame rate is high, noise is excluded, and small motions are included. Again subjectively, other conditions for including or excluding motion (edge versus non-edge, grayscale contrast versus color contrast, and fast versus slow) are application-dependent. Let's try our hand at reconfiguring Lazy Eyes, starting with the numFFTThreads and numIFFTThreads parameters. We want to determine the numbers of threads that maximize Lazy Eyes' frame rate on your machine. The more CPU cores you have, the more threads you can gainfully use. However, experimentation is the best guide to pick a number. To get a log of the frame rate in Lazy Eyes, uncomment the following line in the _applyEulerianVideoMagnification method:    print 'FPS:', 1.0 / timePerFrame Run LazyEyes.py from the command line. Once the history fills up, the history's average FPS will be printed to the command line in every frame. Wait until this average FPS value stabilizes. It might take a minute for the average to adjust to the effect of the FFT and IFFT functions that begin running once the history is full. Take note of the FPS value, close the app, adjust the thread count parameters, and test again. Repeat until you feel that you have enough data to pick good numbers of threads to use on your hardware. By activating additional CPU cores, multithreading can cause your system's temperature to rise. As you experiment, monitor your machine's temperature, fans, and CPU usage statistics. If you become concerned, reduce the number of FFT and IFFT threads. Having a suboptimal frame rate is better than overheating of your machine. Now, experiment with other parameters to see how they affect FPS. The numPyramidLevels, useGrayOverlay, and imageSize parameters should all have a large effect. For subjectively good visual results, try to maintain a frame rate above 10 FPS. A frame rate above 15 FPS is excellent. When you are satisfied that you understand the parameters' effects on frame rate, comment out the following line again because the print statement is itself an expensive operation that can reduce frame rate:    #print 'FPS:', 1.0 / timePerFrame Let's try another recipe. Although our default recipe accentuates motion at edges that have high grayscale contrast, this next recipe accentuates motion in all areas (edge or non-edge) that have high contrast (color or grayscale). By considering 3 color channels instead of one grayscale channel, we are tripling the amount of data being processed by the FFT and IFFT. To offset this change, we will cut each dimension of the capture resolution to two third times its default value, thus reducing the amount of data to 2/3 * 2/3 = 4/9 times the default amount. As a net change, the FFT and IFFT process 3 * 4/9 = 4/3 times the default amount of data, a relatively small change. 
The following initialization statement shows our new recipe's parameters:  lazyEyes = LazyEyes(useLaplacianPyramid=False,           useGrayOverlay=False,           imageSize=(320, 240)) Note that we are still using the default values for most parameters. If you have found non-default values that work well for numFFTThreads and numIFFTThreads on your machine, enter them as well. Here are the screenshots to show our new recipe's effect. This time, let's look at a non-extreme example first. I was typing on my laptop when this was taken. Note the haloes around my arms, which move a lot when I type, and a slight distortion and discoloration of my left cheek (viewer's left in this mirrored image). My left cheek twitches a little when I think. Apparently, it is a tic—already known to my friends and family but rediscovered by me with the help of computer vision! If you are viewing the color version of this image in the e-book, you should see that the haloes around my arms take a green hue from my shirt and a red hue from the sofa. Similarly, the haloes on my cheek take a magenta hue from my skin and a brown hue from my hair. Now, let's consider a more fanciful example. If we were Jedi instead of secret agents, we might wave a steel ruler in the air and pretend it was a lightsaber. While testing the theory that Lazy Eyes could make the ruler look like a real lightsaber, I took the following screenshot. The image shows two pairs of light and dark lines in two places where I was waving the lightsaber ruler. One of the pairs of lines passes through each of my shoulders. The Light Side (light line) and the Dark Side (dark line) show opposite ends of the ruler's path as I waved it. The lines are especially clear in the color version in the e-book. Finally, the moment for which we have all been waiting is … a recipe for amplifying a heartbeat! If you have a heart rate monitor, start by measuring your heart rate. Mine is approximately 87 bpm as I type these words and listen to inspiring ballads by the Canadian folk singer Stan Rogers. To convert bpm to Hz, divide the bpm value by 60 (the number of seconds per minute), giving (87 / 60) Hz = 1.45 Hz in my case. The most visible effect of a heartbeat is that a person's skin changes color, becoming more red or purple when blood is pumped through an area. Thus, let's modify our second recipe, which is able to amplify color motions in non-edge areas. Choosing a frequency range centered on 1.45 Hz, we have the following initializer:  lazyEyes = LazyEyes(minHz=1.4, maxHz=1.5,           useLaplacianPyramid=False,          useGrayOverlay=False,           imageSize=(320, 240)) Customize minHz and maxHz based on your own heart rate. Also, specify numFFTThreads and numIFFTThreads if non-default values work best for you on your machine. Even amplified, a heartbeat is difficult to show in still images; it is much clearer in the live video while running the app. However, take a look at the following pair of screenshots. My skin in the left-hand side screenshot is more yellow (and lighter) while in the right-hand side screenshot it is more purple (and darker). For comparison, note that there is no change in the cream-colored curtains in the background. Three recipes are a good start—certainly enough to fill an episode of a cooking show on TV. Go observe some other motions in your environment, try to estimate their frequencies, and then configure Lazy Eyes to amplify them. How do they look with grayscale amplification versus color amplification? 
Edge (Laplacian) versus area (Gaussian)? Different history lengths, pyramid levels, and amplification multipliers? Check the book's support page, http://www.nummist.com/opencv, for additional recipes and feel free to share your own by mailing me at josephhowse@nummist.com. Seeing things in another light Although we began this article by presenting Eulerian video magnification as a useful technique for visible light, it is also applicable to other kinds of light or radiation. For example, a person's blood (in veins and bruises) is more visible when imaged in ultraviolet (UV) or in near infrared (NIR) than in visible light. (Skin is more transparent to UV and NIR than to visible light.) Thus, a UV or NIR video is likely to be a better input if we are trying to magnify a person's pulse. Here are some examples of NIR and UV cameras that might provide useful results, though I have not tested them:
The Pi NoIR camera (http://www.raspberrypi.org/products/pi-noir-camera/) is a consumer grade NIR camera with a MIPI interface. Here is a time lapse video showing how the Pi NoIR renders outdoor scenes: https://www.youtube.com/watch?v=LLA9KHNvUK8. The camera is designed for Raspberry Pi, and on Raspbian it has V4L-compatible drivers that are directly compatible with OpenCV's VideoCapture class. Some Raspberry Pi clones might have drivers for it too. Unfortunately, Raspberry Pi is too slow to run Eulerian video magnification in real time. However, streaming the Pi NoIR input from Raspberry Pi to a desktop, via Ethernet, might allow for a real-time solution.
The Agama V-1325R (http://www.agamazone.com/products_v1325r.html) is a consumer grade NIR camera with a USB interface. It is officially supported on Windows and Mac. Users report that it also works on Linux. It includes four NIR LEDs, which can be turned on and off via the vendor's proprietary software on Windows.
Artray offers a series of industrial grade NIR cameras called InGaAs (http://www.artray.us/ingaas.html), as well as a series of industrial grade UV cameras (http://www.artray.us/usb2_uv.html). The cameras have USB interfaces. Windows drivers and an SDK are available from the vendor. A third-party project called OpenCV ARTRAY SDK (for more information refer to https://github.com/eiichiromomma/CVMLAB/wiki/OpenCV-ARTRAY-SDK) aims to provide interoperability with at least OpenCV's C API.
Good luck and good hunting in the invisible light! Summary This article has introduced you to the relationship between computer vision and digital signal processing. We have considered a video feed as a collection of many signals—one for each channel value of each pixel—and we have understood that repetitive motions create wave patterns in some of these signals. We have used the Fast Fourier Transform and its inverse to create an alternative video stream that only sees certain frequencies of motion. Finally, we have superimposed this filtered video atop the original to amplify the selected frequencies of motion. There, we summarized Eulerian video magnification in 100 words! Our implementation adapts Eulerian video magnification to real time by running the FFT repeatedly on a sliding window of recently captured frames rather than running it once on an entire prerecorded video. We have considered optimizations such as limiting our signal processing to grayscale, recycling large data structures rather than recreating them, and using several threads.
Resources for Article: Further resources on this subject: New functionality in OpenCV 3.0 [Article] Wrapping OpenCV [Article] Camera Calibration [Article]

In the Cloud

Packt
22 Jan 2015
14 min read
In this article by Rafał Kuć, author of the book Solr Cookbook - Third Edition, we cover the cloud side of Solr—SolrCloud: setting up collections, replica configuration, distributed indexing and searching, as well as aliasing and shard manipulation. We will also learn how to create a cluster. (For more resources related to this topic, see here.) Creating a new SolrCloud cluster Imagine a situation where one day you have to set up a distributed cluster with the use of Solr. The amount of data is just too much for a single server to handle. Of course, you can just set up a second server or go for another master server with another set of data. But before Solr 4.0, you would have to take care of the data distribution yourself. In addition to this, you would also have to take care of setting up replication, data duplication, and so on. With SolrCloud you don't have to do this—you can just set up a new cluster and this article will show you how to do that. Getting ready This recipe assumes that you already have a Zookeeper cluster set up and ready for production use. How to do it... Let's assume that we want to create a cluster that will have four Solr servers. We also would like to have our data divided between the four Solr servers in such a way that we have the original data on two machines and in addition to this, we would also have a copy of each shard available in case something happens with one of the Solr instances. I also assume that we already have our Zookeeper cluster set up, ready, and available at the address 192.168.1.10 on the 9983 port. For this article, we will set up four SolrCloud nodes on the same physical machine: We will start by running an empty Solr server (without any configuration) on port 8983. We do this by running the following command (for Solr 4.x):

java -DzkHost=192.168.1.10:9983 -jar start.jar

For Solr 5, we will run the following command:

bin/solr -c -z 192.168.1.10:9983

Now we start another three nodes, each on a different port (note that different Solr instances can run on the same port, but they should be installed on different machines). We do this by running one command for each installed Solr server (for Solr 4.x):

java -Djetty.port=6983 -DzkHost=192.168.1.10:9983 -jar start.jar
java -Djetty.port=4983 -DzkHost=192.168.1.10:9983 -jar start.jar
java -Djetty.port=2983 -DzkHost=192.168.1.10:9983 -jar start.jar

For Solr 5, the commands will be as follows:

bin/solr -c -p 6983 -z 192.168.1.10:9983
bin/solr -c -p 4983 -z 192.168.1.10:9983
bin/solr -c -p 2983 -z 192.168.1.10:9983

Now we need to upload our collection configuration to ZooKeeper. Assuming that we have our configuration in /home/conf/solrconfiguration/conf, we will run the following command from the home directory of the Solr server that runs first (the zkcli.sh script can be found in the Solr deployment example in the scripts/cloud-scripts directory):

./zkcli.sh -cmd upconfig -zkhost 192.168.1.10:9983 -confdir /home/conf/solrconfiguration/conf/ -confname collection1

Now we can create our collection using the following command:

curl 'localhost:8983/solr/admin/collections?action=CREATE&name=firstCollection&numShards=2&replicationFactor=2&collection.configName=collection1'

If we now go to http://localhost:8983/solr/#/~cloud, we will see the following cluster view: As we can see, Solr has created a new collection with a proper deployment. Let's now see how it works. How it works...
We assume that we already have ZooKeeper installed—it is empty and doesn't have information about any collection, because we didn't create them. For Solr 4.x, we started by running Solr and telling it that we want it to run in SolrCloud mode. We did that by specifying the -DzkHost property and setting its value to the IP address of our ZooKeeper instance. Of course, in the production environment, you would point Solr to a cluster of ZooKeeper nodes—this is done using the same property, but the IP addresses are separated using the comma character. For Solr 5, we used the solr script provided in the bin directory. By adding the -c switch, we told Solr that we want it to run in the SolrCloud mode. The -z switch works exactly the same as the -DzkHost property for Solr 4.x—it allows you to specify the ZooKeeper host that should be used. Of course, the other three Solr nodes run exactly in the same manner. For Solr 4.x, we add the -DzkHost property that points Solr to our ZooKeeper. Because we are running all the four nodes on the same physical machine, we needed to specify the -Djetty.port property, because we can run only a single Solr server on a single port. For Solr 5, we use the -z property of the bin/solr script and we use the -p property to specify the port on which Solr should start. The next step is to upload the collection configuration to ZooKeeper. We do this because Solr will fetch this configuration from ZooKeeper when you will request the collection creation. To upload the configuration, we use the zkcli.sh script provided with the Solr distribution. We use the upconfig command (the -cmd switch), which means that we want to upload the configuration. We specify the ZooKeeper host using the -zkHost property. After that, we can say which directory our configuration is stored (the -confdir switch). The directory should contain all the needed configuration files such as schema.xml, solrconfig.xml, and so on. Finally, we specify the name under which we want to store our configuration using the -confname switch. After we have our configuration in ZooKeeper, we can create the collection. We do this by running a command to the Collections API that is available at the /admin/collections endpoint. First, we tell Solr that we want to create the collection (action=CREATE) and that we want our collection to be named firstCollection (name=firstCollection). Remember that the collection names are case sensitive, so firstCollection and firstcollection are two different collections. We specify that we want our collection to be built of two primary shards (numShards=2) and we want each shard to be present in two copies (replicationFactor=2). This means that we will have a primary shard and a single replica. Finally, we specify which configuration should be used to create the collection by specifying the collection.configName property. As we can see in the cloud, a view of our cluster has been created and spread across all the nodes. There's more... There are a few things that I would like to mention—the possibility of running a Zookeeper server embedded into Apache Solr and specifying the Solr server name. Starting an embedded ZooKeeper server You can also start an embedded Zookeeper server shipped with Solr for your test environment. In order to do this, you should pass the -DzkRun parameter instead of -DzkHost=192.168.0.10:9983, but only in the command that sends our configuration to the Zookeeper cluster. 
So the final command for Solr 4.x should look similar to this:

java -DzkRun -jar start.jar

In Solr 5.0, the same command will be as follows:

bin/solr start -c

By default, ZooKeeper will start on a port 1,000 higher than the one Solr is started on. So if you are running your Solr instance on 8983, ZooKeeper will be available at 9983. The thing to remember is that the embedded ZooKeeper should only be used for development purposes and only one node should start it. Specifying the Solr server name Solr needs each instance of SolrCloud to have a name. By default, that name is set using the IP address or the hostname, appended with the port the Solr instance is running on, and the _solr postfix. For example, if our node is running on 192.168.56.1 and port 8983, it will be called 192.168.56.1:8983_solr. Of course, Solr allows you to change that behavior by specifying the hostname. To do this, start using the -Dhost property or add the host property to solr.xml. For example, if we would like one of our nodes to have the name of server1, we can run the following command to start Solr:

java -DzkHost=192.168.1.10:9983 -Dhost=server1 -jar start.jar

In Solr 5.0, the same command would be:

bin/solr start -c -h server1

Setting up multiple collections on a single cluster Having a single collection inside the cluster is nice, but there are multiple use cases when we want to have more than a single collection running on the same cluster. For example, we might want users and books in different collections or logs from each day to be only stored inside a single collection. This article will show you how to create multiple collections on the same cluster. Getting ready This article will show you how to create a new SolrCloud cluster. We also assume that ZooKeeper is running on 192.168.1.10 and is listening on the 2181 port and that we already have four SolrCloud nodes running as a cluster. How to do it... As we already have all the prerequisites, such as ZooKeeper and Solr up and running, we need to upload our configuration files to ZooKeeper to be able to create collections: Assuming that we have our configurations in /home/conf/firstcollection/conf and /home/conf/secondcollection/conf, we will run the following commands from the home directory of the first Solr server that we ran, to upload the configuration to ZooKeeper (the zkcli.sh script can be found in the Solr deployment example in the scripts/cloud-scripts directory):

./zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /home/conf/firstcollection/conf/ -confname firstcollection
./zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /home/conf/secondcollection/conf/ -confname secondcollection

We have pushed our configurations into Zookeeper, so now we can create the collections we want. In order to do this, we use the following commands:

curl 'localhost:8983/solr/admin/collections?action=CREATE&name=firstCollection&numShards=2&replicationFactor=2&collection.configName=firstcollection'
curl 'localhost:8983/solr/admin/collections?action=CREATE&name=secondcollection&numShards=4&replicationFactor=1&collection.configName=secondcollection'

Now, just to test whether everything went well, we will go to http://localhost:8983/solr/#/~cloud. As a result, we will see the following cluster topology: As we can see, both the collections were created the way we wanted. Now let's see how that happened. How it works... We assume that we already have ZooKeeper installed—it is empty and doesn't have information about any collections, because we didn't create them.
We also assumed that we have our SolrCloud cluster configured and started. We start by uploading two configurations to ZooKeeper, one called firstcollection and the other called secondcollection. After that we are ready to create our collections. We start by creating the collection named firstCollection that is built of two primary shards and one replica. The second collection, called secondcollection is built of four primary shards and it doesn't have any replicas. We can see that easily in the cloud view of the deployment. The firstCollection collection has two shards—shard1 and shard2. Each of the shard has two physical copies—one green (which means active) and one with a black dot, which is the primary shard. The secondcollection collection is built of four physical shards—each shard has a black dot near its name, which means that they are primary shards. Splitting shards Imagine a situation where you reach a limit of your current deployment—the number of shards is just not enough. For example, the indexing throughput is lower and lower, because the disks are not able to keep up. Of course, one of the possible solutions is to spread the index across more shards; however, you already have a collection and you want to keep the data and reindexing is not an option, because you don't have the original data. Solr can help you with such situations by allowing splitting shards of already created collections. This article will show you how to do it. Getting ready This article will show you how to create a new SolrCloud cluster. We also assume that ZooKeeper is running on 192.168.1.10 and is listening on port 2181 and that we already have four SolrCloud nodes running as a cluster. How to do it... Let's assume that we already have a SolrCloud cluster up and running and it has one collection called books. So our cloud view (which is available at http://localhost:8983/solr/#/~cloud) looks as follows: We have four nodes and we don't utilize them fully. We can say that these two nodes in which we have our shards are almost fully utilized. What we can do is create a new collection and reindex the data or we can split shards of the already created collection. Let's go with the second option: We start by splitting the first shard. It is as easy as running the following command: curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=books&shard=shard1' After this, we can split the second shard by running a similar command to the one we just used: curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=books&shard=shard2' Let's take a look at the cluster cloud view now (which is available at http://localhost:8983/solr/#/~cloud): As we can see, both shards were split—shard1 was divided into shard1_0 and shard1_1 and shard2 was divided into shard2_0 and shard2_1. Of course, the data was copied as well, so everything is ready. However, the last step should be to delete the original shards. Solr doesn't delete them, because sometimes applications use shard names to connect to a given shard. However, in our case, we can delete them by running the following commands: curl 'http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=books&shard=shard1' curl 'http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=books&shard=shard2' Now if we would again look at the cloud view of the cluster, we will see the following: How it works... 
We start with a simple collection called books that is built of two primary shards and no replicas. This is the collection whose shards we will try to divide without stopping Solr. Splitting shards is very easy. We just need to run a simple command in the Collections API (the /admin/collections endpoint) and specify that we want to split a shard (action=SPLITSHARD). We also need to provide additional information such as which collection we are interested in (the collection parameter) and which shard we want to split (the shard parameter). You can see the name of the shard by looking at the cloud view or by reading the cluster state from ZooKeeper. After sending the command, Solr might force us to wait for a substantial amount of time—shard splitting takes time, especially on large collections. Of course, we can run the same command for the second shard as well. Finally, we end up with six shards—four new and two old ones. The original shard will still contain data, but it will start to re-route requests to the newly created shards. The data was split evenly between the new shards. The old shards were left in place, although they are marked as inactive and they won't have any more data indexed to them. Because we don't need them, we can just delete them using the action=DELETESHARD command sent to the same Collections API. Similar to the split shard command, we need to specify the collection name and the name of the shard that we want to delete. After we delete the initial shards, our cluster view shows only four shards, which is what we were aiming at. We can now spread the shards across the cluster. Summary In this article, we learned how to set up multiple collections. This article taught us how to increase the number of collections in a cluster. We also worked through a way to split shards. Resources for Article: Further resources on this subject: Tuning Solr JVM and Container [Article] Apache Solr PHP Integration [Article] Administrating Solr [Article]

Taming Big Data using HDInsight

Packt
22 Jan 2015
10 min read
(For more resources related to this topic, see here.) Era of Big Data In this article by Rajesh Nadipalli, the author of HDInsight Essentials Second Edition, we will take a look at the concept of Big Data and how to tame it using HDInsight. We live in a digital era and are always connected with friends and family using social media and smartphones. In 2014, every second, about 5,700 tweets were sent and 800 links were shared using Facebook, and the digital universe was about 1.7 MB per minute for every person on earth (source: IDC 2014 report). This amount of data sharing and storing is unprecedented and is contributing to what is known as Big Data. The following infographic shows you the details of our current use of the top social media sites (source: https://leveragenewagemedia.com/). Another contributor to Big Data are the smart, connected devices such as smartphones, appliances, cars, sensors, and pretty much everything that we use today and is connected to the Internet. These devices, which will soon be in trillions, continuously collect data and communicate with each other about their environment to make intelligent decisions and help us live better. This digitization of the world has added to the exponential growth of Big Data. According to the 2014 IDC digital universe report, the growth trend will continue and double in size every two years. In 2013, about 4.4 zettabytes were created and in 2020, the forecast is 44 zettabytes, which is 44 trillion gigabytes, (source: http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm). Business value of Big Data While we generated 4.4 zettabytes of data in 2013, only 5 percent of it was actually analyzed, and this is the real opportunity of Big Data. The IDC report forecasts that by 2020, we will analyze over 35 percent of the generated data by making smarter sensors and devices. This data will drive new consumer and business behavior that will drive trillions of dollars in opportunity for IT vendors and organizations analyzing this data. Let's take a look at some real use cases that have benefited from Big Data: IT systems in all major banks are constantly monitoring fraudulent activities and alerting customers within milliseconds. These systems apply complex business rules and analyze the historical data, geography, type of vendor, and other parameters based on the customer to get accurate results. Commercial drones are transforming agriculture by analyzing real-time aerial images and identifying the problem areas. These drones are cheaper and efficient than satellite imagery, as they fly under the clouds and can be used anytime. They identify the irrigation issues related to water, pests, or fungal infections thereby increasing the crop productivity and quality. These drones are equipped with technology to capture high-quality images every second and transfer them to a cloud-hosted Big Data system for further processing (reference: http://www.technologyreview.com/featuredstory/526491/agricultural-drones/). Developers of the blockbuster Halo 4 game were tasked to analyze player preferences and support an online tournament in the cloud. The game attracted over 4 million players in its first five days after its launch. The development team had to also design a solution that kept track of a leader board for the global Halo 4 Infinity challenge, which was open to all the players. The development team chose the Azure HDInsight service to analyze the massive amounts of unstructured data in a distributed manner. 
The results from HDInsight were reported using Microsoft SQL Server PowerPivot and Sharepoint, and the business was extremely happy with the response times for their queries, which were a few hours or less (source: http://www.microsoft.com/casestudies/Windows-Azure/343-Industries/343-Industries-Gets-New-User-Insights-from-Big-Data-in-the-Cloud/710000002102) Hadoop Concepts Apache Hadoop is the leading open source Big Data platform that can store and analyze massive amounts of structured and unstructured data efficiently and can be hosted on low-cost commodity hardware. There are other technologies that complement Hadoop under the Big Data umbrella, such as MongoDB (a document database), Cassandra (a wide-column store), and VoltDB (an in-memory database). This section describes Apache Hadoop core concepts and its ecosystem. A brief history of Hadoop Doug Cutting created Hadoop and named it after his kid's stuffed yellow elephant; the name has no real meaning. In 2004, the initial version of Hadoop was launched as Nutch Distributed Filesystem. In February 2006, the Apache Hadoop project was officially started as a standalone development for MapReduce and HDFS. By 2008, Yahoo adopted Hadoop as the engine of its web search with a cluster size of around 10,000. In the same year, Hadoop graduated to a top-level Apache project, confirming its success. In 2012, Hadoop 2.x was launched with YARN, enabling Hadoop to take on various types of workloads. Today, Hadoop is known by just about every IT architect and business executive as an open source Big Data platform and is used across all industries and sizes of organizations. Core components In this section, we will explore what Hadoop is actually comprised of. At the basic level, Hadoop consists of four layers:
Hadoop Common: A set of common libraries and utilities used by Hadoop modules.
Hadoop Distributed File System (HDFS): A scalable and fault tolerant distributed filesystem for data in any form. HDFS can be installed on commodity hardware and replicates data three times (which is configurable) to make the filesystem robust and tolerate partial hardware failures.
Yet Another Resource Negotiator (YARN): From Hadoop 2.0, YARN is the cluster management layer to handle various workloads on the cluster.
MapReduce: MapReduce is a framework that allows parallel processing of data in Hadoop. MapReduce breaks a job into smaller tasks and distributes the load to servers that have the relevant data. The design model is "move code and not data", making this framework efficient as it reduces the network and disk I/O required to move the data.
The following diagram shows you the high-level Hadoop 2.0 core components: The preceding diagram shows you the components that form the basic Hadoop framework. In the past few years, a vast array of new components have emerged in the Hadoop ecosystem that take advantage of YARN, making Hadoop faster, better, and suitable for various types of workloads. The following diagram shows you the Hadoop framework with these new components: Hadoop cluster layout Each Hadoop cluster has two types of machines, which are as follows:
Master nodes: This includes HDFS Name Node, HDFS Secondary Name Node, and YARN Resource Manager.
Worker nodes: This includes HDFS Data Nodes and YARN Node Managers. The data nodes and node managers are colocated for optimal data locality and performance.
A network switch interconnects the master and worker nodes.
It is recommended that you have separate servers for each of the master nodes; however, it is possible to deploy all the master nodes onto a single server for development or testing workloads. The following diagram shows you the typical cluster layout: Let's review the key functions of the master and worker nodes:
Name node: This is the master for the distributed filesystem and maintains the metadata. This metadata has the listing of all the files and the location of each block of a file, stored across the various slaves. Without a name node, HDFS is not accessible. From Hadoop 2.0 onwards, name node HA (High Availability) can be configured with active and standby servers.
Secondary name node: This is an assistant to the name node. It communicates only with the name node to take snapshots of HDFS metadata at intervals that are configured at the cluster level.
YARN resource manager: This server is a scheduler that allocates the available resources in the cluster among the competing applications.
Worker nodes: The Hadoop cluster will have several worker nodes that handle two types of functions—HDFS Data Node and YARN Node Manager. It is typical that each worker node handles both the functions for optimal data locality. This means processing happens on the data that is local to the node and follows the principle "move code and not data".
HDInsight Overview HDInsight is an enterprise-ready distribution of Hadoop that runs on Windows servers and on the Azure HDInsight cloud service (PaaS). It is a 100 percent Apache Hadoop-based service in the cloud. HDInsight was developed with the partnership of Hortonworks and Microsoft. Enterprises can now harness the power of Hadoop on Windows servers and the Windows Azure cloud service. The following are the key differentiators for the HDInsight distribution:
Enterprise-ready Hadoop: HDInsight is backed by Microsoft support, and runs on standard Windows servers. IT teams can leverage Hadoop with the Platform as a Service (PaaS) reducing the operations overhead.
Analytics using Excel: With Excel integration, your business users can visualize and analyze Hadoop data in compelling new ways with an easy-to-use familiar tool. The Excel add-ons PowerBI, PowerPivot, Power Query, and Power Map integrate with HDInsight.
Develop in your favorite language: HDInsight has powerful programming extensions for languages, including .Net, C#, Java, and more.
Scale using the cloud offering: The Azure HDInsight service enables customers to scale quickly as per the project needs and have a seamless interface between HDFS and Azure Blob storage.
Connect an on-premises Hadoop cluster with the cloud: With HDInsight, you can move Hadoop data from an on-site data center to the Azure cloud for backup, dev/test, and cloud bursting scenarios.
Includes NoSQL transactional capabilities: HDInsight also includes Apache HBase, a columnar NoSQL database that runs on top of Hadoop and allows large online transactional processing (OLTP).
HDInsight Emulator: The HDInsight Emulator tool provides a local development environment for Azure HDInsight without the need for a cloud subscription. This can be installed using Microsoft Web Platform Installer.
Summary We live in a connected digital era and are witnessing unprecedented growth of data. Organizations that are able to analyze Big Data are demonstrating significant return on investment by detecting fraud, improving operations, and reducing analysis time with a scale-out architecture. Apache Hadoop is the leading open source Big Data platform, with strong and diverse ecosystem projects that enable organizations to build a modern data architecture. At the core, Hadoop has two key components: the Hadoop Distributed File System, also known as HDFS, and a cluster resource manager known as YARN. YARN has enabled Hadoop to be a true multi-use data platform that can handle batch processing, real-time streaming, interactive SQL, and other workloads. Microsoft HDInsight is an enterprise-ready distribution of Hadoop in the cloud, developed through a partnership between Hortonworks and Microsoft. The key benefits of HDInsight include scaling up/down as required, analysis using Excel, connecting an on-premises Hadoop cluster with the cloud, and flexible programming and support for NoSQL transactional databases. Resources for Article: Further resources on this subject: Hadoop and HDInsight in a Heartbeat [article] Sizing and Configuring your Hadoop Cluster [article] Introducing Kafka [article]

Highcharts Configurations

Packt
21 Jan 2015
53 min read
This article is written by Joe Kuan, the author of Learning Highcharts 4. All Highcharts graphs share the same configuration structure and it is crucial for us to become familiar with the core components. However, it is not possible to go through all the configurations within the book. In this article, we will explore the functional properties that are most used and demonstrate them with examples. We will learn how Highcharts manages layout, and then explore how to configure axes, specify single series and multiple series data, followed by looking at formatting and styling tool tips in both JavaScript and HTML. After that, we will get to know how to polish our charts with various types of animations and apply color gradients. Finally, we will explore the drilldown interactive feature. In this article, we will cover the following topics: Understanding Highcharts layout Framing the chart with axes (For more resources related to this topic, see here.) Configuration structure In the Highcharts configuration object, the components at the top level represent the skeleton structure of a chart. The following is a list of the major components that are covered in this article: chart: This has configurations for the top-level chart properties such as layouts, dimensions, events, animations, and user interactions series: This is an array of series objects (consisting of data and specific options) for single and multiple series, where the series data can be specified in a number of ways xAxis/yAxis/zAxis: This has configurations for all the axis properties such as labels, styles, range, intervals, plotlines, plot bands, and backgrounds tooltip: This has the layout and format style configurations for the series data tool tips drilldown: This has configurations for drilldown series and the ID field associated with the main series title/subtitle: This has the layout and style configurations for the chart title and subtitle legend: This has the layout and format style configurations for the chart legend plotOptions: This contains all the plotting options, such as display, animation, and user interactions, for common series and specific series types exporting: This has configurations that control the layout and the function of print and export features For reference information concerning all configurations, go to http://api.highcharts.com. Understanding Highcharts' layout Before we start to learn how Highcharts layout works, it is imperative that we understand some basic concepts first. First, set a border around the plot area. To do that we can set the options of plotBorderWidth and plotBorderColor in the chart section, as follows:         chart: {                renderTo: 'container',                type: 'spline',                plotBorderWidth: 1,                plotBorderColor: '#3F4044'        }, The second border is set around the Highcharts container. Next, we extend the preceding chart section with additional settings:         chart: {                renderTo: 'container',                ....                borderColor: '#a1a1a1',                borderWidth: 2,                borderRadius: 3        }, This sets the container border color with a width of 2 pixels and corner radius of 3 pixels. As we can see, there is a border around the container and this is the boundary that the Highcharts display cannot exceed: By default, Highcharts displays have three different areas: spacing, labeling, and plot area. The plot area is the area inside the inner rectangle that contains all the plot graphics. 
The labeling area is the area where labels such as title, subtitle, axis title, legend, and credits go, around the plot area, so that it is between the edge of the plot area and the inner edge of the spacing area. The spacing area is the area between the container border and the outer edge of the labeling area. The following screenshot shows three different kinds of areas. A gray dotted line is inserted to illustrate the boundary between the spacing and labeling areas. Each chart label position can be operated in one of the following two layouts: Automatic layout: Highcharts automatically adjusts the plot area size based on the labels' positions in the labeling area, so the plot area does not overlap with the label element at all. Automatic layout is the simplest way to configure, but has less control. This is the default way of positioning the chart elements. Fixed layout: There is no concept of labeling area. The chart label is specified in a fixed location so that it has a floating effect on the plot area. In other words, the plot area side does not automatically adjust itself to the adjacent label position. This gives the user full control of exactly how to display the chart. The spacing area controls the offset of the Highcharts display on each side. As long as the chart margins are not defined, increasing or decreasing the spacing area has a global effect on the plot area measurements in both automatic and fixed layouts. Chart margins and spacing settings In this section, we will see how chart margins and spacing settings have an effect on the overall layout. Chart margins can be configured with the properties margin, marginTop, marginLeft, marginRight, and marginBottom, and they are not enabled by default. Setting chart margins has a global effect on the plot area, so that none of the label positions or chart spacing configurations can affect the plot area size. Hence, all the chart elements are in a fixed layout mode with respect to the plot area. The margin option is an array of four margin values covered for each direction, the same as in CSS, starting from north and going clockwise. Also, the margin option has a lower precedence than any of the directional margin options, regardless of their order in the chart section. Spacing configurations are enabled by default with a fixed value on each side. These can be configured in the chart section with the property names spacing, spacingTop, spacingLeft, spacingBottom, and spacingRight. In this example, we are going to increase or decrease the margin or spacing property on each side of the chart and observe the effect. The following are the chart settings:             chart: {                renderTo: 'container',                type: ...                marginTop: 10,                marginRight: 0,                spacingLeft: 30,                spacingBottom: 0            }, The following screenshot shows what the chart looks like: The marginTop property fixes the plot area's top border 10 pixels away from the container border. It also changes the top border into fixed layout for any label elements, so the chart title and subtitle float on top of the plot area. The spacingLeft property increases the spacing area on the left-hand side, so it pushes the y axis title further in. As it is in automatic layout (without declaring marginLeft), it also pushes the plot area's west border in. Setting marginRight to 0 will override all the default spacing on the chart's right-hand side and change it to fixed layout mode. 
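For readers who want to reproduce this experiment, the following is a self-contained sketch of the configuration discussed so far; the container id, the axis categories, and the series data are placeholder assumptions, and only the margin and spacing options matter here. The effect of the last option, spacingBottom, is described next.

var chart = new Highcharts.Chart({
    chart: {
        renderTo: 'container',      // id of an existing <div>; a placeholder assumption
        type: 'spline',
        borderWidth: 2,
        borderColor: '#a1a1a1',
        borderRadius: 3,
        plotBorderWidth: 1,
        plotBorderColor: '#3F4044',
        marginTop: 10,              // fixes the top edge of the plot area
        marginRight: 0,             // overrides the default spacing on the right-hand side
        spacingLeft: 30,            // widens the spacing area on the left
        spacingBottom: 0            // discussed next
    },
    title: { text: 'Web browsers statistics' },
    xAxis: { categories: ['2008', '2009', '2010', '2011', '2012'] },  // placeholder data
    yAxis: { title: { text: 'Percentage %' } },
    series: [{ name: 'Example', data: [30, 35, 40, 38, 42] }]         // placeholder data
});

The sketch assumes that highcharts.js is loaded and that the page contains a <div> with the id container.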
Finally, setting spacingBottom to 0 makes the legend touch the lower bar of the container, so it also stretches the plot area downwards. This is because the bottom edge is still in automatic layout even though spacingBottom is set to 0. Chart label properties Chart labels such as xAxis.title, yAxis.title, legend, title, subtitle, and credits share common property names, as follows: align: This is for the horizontal alignment of the label. Possible keywords are 'left', 'center', and 'right'. As for the axis title, it is 'low', 'middle', and 'high'. floating: This is to give the label position a floating effect on the plot area. Setting this to true will cause the label position to have no effect on the adjacent plot area's boundary. margin: This is the margin setting between the label and the side of the plot area adjacent to it. Only certain label types have this setting. verticalAlign: This is for the vertical alignment of the label. The keywords are 'top', 'middle', and 'bottom'. x: This is for horizontal positioning in relation to alignment. y: This is for vertical positioning in relation to alignment. As for the labels' x and y positioning, they are not used for absolute positioning within the chart. They are designed for fine adjustment with the label alignment. The following diagram shows the coordinate directions, where the center represents the label location: We can experiment with these properties with a simple example of the align and y position settings, by placing both title and subtitle next to each other. The title is shifted to the left with align set to 'left', whereas the subtitle alignment is set to 'right'. In order to make both titles appear on the same line, we change the subtitle's y position to 15, which is the same as the title's default y value:  title: {     text: 'Web browsers ...',     align: 'left' }, subtitle: {     text: 'From 2008 to present',     align: 'right',     y: 15 }, The following is a screenshot showing both titles aligned on the same line: In the following subsections, we will experiment with how changes in alignment for each label element affect the layout behavior of the plot area. Title and subtitle alignments Title and subtitle have the same layout properties, and the only differences are that the default values and title have the margin setting. Specifying verticalAlign for any value changes from the default automatic layout to fixed layout (it internally switches floating to true). However, manually setting the subtitle's floating property to false does not switch back to automatic layout. The following is an example of title in automatic layout and subtitle in fixed layout:     title: {       text: 'Web browsers statistics'    },    subtitle: {       text: 'From 2008 to present',       verticalAlign: 'top',       y: 60       }, The verticalAlign property for the subtitle is set to 'top', which switches the layout into fixed layout, and the y offset is increased to 60. The y offset pushes the subtitle's position further down. Due to the fact that the plot area is not in an automatic layout relationship to the subtitle anymore, the top border of the plot area goes above the subtitle. However, the plot area is still in automatic layout towards the title, so the title is still above the plot area: Legend alignment Legends show different behavior for the verticalAlign and align properties. Apart from setting the alignment to 'center', all other settings in verticalAlign and align remain in automatic positioning. 
The following is an example of a legend located on the right-hand side of the chart. The verticalAlign property is switched to the middle of the chart, where the horizontal align is set to 'right':           legend: {                align: 'right',                verticalAlign: 'middle',                layout: 'vertical'          }, The layout property is assigned to 'vertical' so that it causes the items inside the legend box to be displayed in a vertical manner. As we can see, the plot area is automatically resized for the legend box: Note that the border decoration around the legend box is disabled in the newer version. To display a round border around the legend box, we can add the borderWidth and borderRadius options using the following:           legend: {                align: 'right',                verticalAlign: 'middle',                layout: 'vertical',                borderWidth: 1,                borderRadius: 3          }, Here is the legend box with a round corner border: Axis title alignment Axis titles do not use verticalAlign. Instead, they use the align setting, which is either 'low', 'middle', or 'high'. The title's margin value is the distance between the axis title and the axis line. The following is an example of showing the y-axis title rotated horizontally instead of vertically (which it is by default) and displayed on the top of the axis line instead of next to it. We also use the y property to fine-tune the title location:             yAxis: {                title: {                    text: 'Percentage %',                    rotation: 0,                    y: -15,                    margin: -70,                    align: 'high'                },                min: 0            }, The following is a screenshot of the upper-left corner of the chart showing that the title is aligned horizontally at the top of the y axis. Alternatively, we can use the offset option instead of margin to achieve the same result. Credits alignment Credits is a bit different from other label elements. It only supports the align, verticalAlign, x, and y properties in the credits.position property (shorthand for credits: { position: … }), and is also not affected by any spacing setting. Suppose we have a graph without a legend and we have to move the credits to the lower-left area of the chart, the following code snippet shows how to do it:             legend: {                enabled: false            },            credits: {                position: {                   align: 'left'                },                text: 'Joe Kuan',                href: 'http://joekuan.wordpress.com'            }, However, the credits text is off the edge of the chart, as shown in the following screenshot: Even if we move the credits label to the right with x positioning, the label is still a bit too close to the x axis interval label. We can introduce extra spacingBottom to put a gap between both labels, as follows:             chart: {                   spacingBottom: 30,                    ....            },            credits: {                position: {                   align: 'left',                   x: 20,                   y: -7                },            },            .... The following is a screenshot of the credits with the final adjustments: Experimenting with an automatic layout In this section, we will examine the automatic layout feature in more detail. 
For the sake of simplifying the example, we will start with only the chart title and without any chart spacing settings:      chart: {         renderTo: 'container',         // border and plotBorder settings         borderWidth: 2,         .....     },     title: {            text: 'Web browsers statistics'     }, From the preceding example, the chart title should appear as expected between the container and the plot area's borders: The space between the title and the top border of the container has the default setting spacingTop for the spacing area (a default value of 10 pixels in height). The gap between the title and the top border of the plot area is the default setting for title.margin, which is 15 pixels in height. By setting spacingTop in the chart section to 0, the chart title moves up next to the container top border. Hence, the size of the plot area is automatically expanded upwards, as follows: Then, we set title.margin to 0; the plot area border moves further up, hence the height of the plot area increases further, as follows: As you may notice, there is still a gap of a few pixels between the top border and the chart title. This is actually due to the default value of the title's y position setting, which is 15 pixels, large enough for the default title font size. The following is the chart configuration for setting all the spaces between the container and the plot area to 0: chart: {     renderTo: 'container',     // border and plotBorder settings     .....     spacingTop: 0 }, title: {     text: 'Web browsers statistics',     margin: 0,     y: 0 } If we set title.y to 0, all the gap between the top edge of the plot area and the top container edge closes up. The following is the final screenshot of the upper-left corner of the chart, to show the effect. The chart title is not visible anymore as it has been shifted above the container: Interestingly, if we work backwards to the first example, the default distance between the top of the plot area and the top of the container is calculated as: spacingTop + title.margin + title.y = 10 + 15 + 15 = 40 Therefore, changing any of these three variables will automatically adjust the plot area from the top container bar. Each of these offset variables actually has its own purpose in the automatic layout. Spacing is for the gap between the container and the chart content; thus, if we want to display a chart nicely spaced with other elements on a web page, spacing elements should be used. Equally, if we want to use a specific font size for the label elements, we should consider adjusting the y offset. Hence, the labels are still maintained at a distance and do not interfere with other components in the chart. Experimenting with a fixed layout In the preceding section, we learned how the plot area dynamically adjusts itself. In this section, we will see how we can manually position the chart labels. First, we will start with the example code from the beginning of the Experimenting with an automatic layout section and set the chart title's verticalAlign to 'bottom', as follows: chart: {    renderTo: 'container',    // border and plotBorder settings    ..... }, title: {    text: 'Web browsers statistics',    verticalAlign: 'bottom' }, The chart title is moved to the bottom of the chart, next to the lower border of the container. 
Notice that this setting has changed the title into floating mode; more importantly, the legend still remains in the default automatic layout of the plot area: Be aware that we haven't specified spacingBottom, which has a default value of 15 pixels in height when applied to the chart. This means that there should be a gap between the title and the container bottom border, but none is shown. This is because the title.y position has a default value of 15 pixels in relation to spacing. According to the diagram in the Chart label properties section, this positive y value pushes the title towards the bottom border; this compensates for the space created by spacingBottom. Let's make a bigger change to the y offset position this time to show that verticalAlign is floating on top of the plot area:  title: {     text: 'Web browsers statistics',     verticalAlign: 'bottom',     y: -90 }, The negative y value moves the title up, as shown here: Now the title is overlapping the plot area. To demonstrate that the legend is still in automatic layout with regard to the plot area, here we change the legend's y position and the margin settings, which is the distance from the axis label:                legend: {                   margin: 70,                   y: -10               }, This has pushed up the bottom side of the plot area. However, the chart title still remains in fixed layout and its position within the chart hasn't been changed at all after applying the new legend setting, as shown in the following screenshot: By now, we should have a better understanding of how to position label elements, and their layout policy relating to the plot area. Framing the chart with axes In this section, we are going to look into the configuration of axes in Highcharts in terms of their functional area. We will start off with a plain line graph and gradually apply more options to the chart to demonstrate the effects. Accessing the axis data type There are two ways to specify data for a chart: categories and series data. For displaying intervals with specific names, we should use the categories field that expects an array of strings. Each entry in the categories array is then associated with the series data array. Alternatively, the axis interval values are embedded inside the series data array. Then, Highcharts extracts the series data for both axes, interprets the data type, and formats and labels the values appropriately. 
The following is a straightforward example showing the use of categories:     chart: {        renderTo: 'container',        height: 250,        spacingRight: 20    },    title: {        text: 'Market Data: Nasdaq 100'    },    subtitle: {        text: 'May 11, 2012'    },    xAxis: {        categories: [ '9:30 am', '10:00 am', '10:30 am',                       '11:00 am', '11:30 am', '12:00 pm',                       '12:30 pm', '1:00 pm', '1:30 pm',                       '2:00 pm', '2:30 pm', '3:00 pm',                       '3:30 pm', '4:00 pm' ],         labels: {             step: 3         }     },     yAxis: {         title: {             text: null         }     },     legend: {         enabled: false     },     credits: {         enabled: false     },     series: [{         name: 'Nasdaq',         color: '#4572A7',         data: [ 2606.01, 2622.08, 2636.03, 2637.78, 2639.15,                 2637.09, 2633.38, 2632.23, 2632.33, 2632.59,                 2630.34, 2626.89, 2624.59, 2615.98 ]     }] The preceding code snippet produces a graph that looks like the following screenshot: The first name in the categories field corresponds to the first value, 9:30 am, 2606.01, in the series data array, and so on. Alternatively, we can specify the time values inside the series data and use the type property of the x axis to format the time. The type property supports 'linear' (default), 'logarithmic', or 'datetime'. The 'datetime' setting automatically interprets the time in the series data into human-readable form. Moreover, we can use the dateTimeLabelFormats property to predefine the custom format for the time unit. The option can also accept multiple time unit formats. This is for when we don't know in advance how long the time span is in the series data, so each unit in the resulting graph can be per hour, per day, and so on. The following example shows how the graph is specified with predefined hourly and minute formats. The syntax of the format string is based on the PHP strftime function:     xAxis: {         type: 'datetime',          // Format 24 hour time to AM/PM          dateTimeLabelFormats: {                hour: '%I:%M %P',              minute: '%I %M'          }               },     series: [{         name: 'Nasdaq',         color: '#4572A7',         data: [ [ Date.UTC(2012, 4, 11, 9, 30), 2606.01 ],                  [ Date.UTC(2012, 4, 11, 10), 2622.08 ],                   [ Date.UTC(2012, 4, 11, 10, 30), 2636.03 ],                  .....                ]     }] Note that the x axis is in the 12-hour time format, as shown in the following screenshot: Instead, we can define the format handler for the xAxis.labels.formatter property to achieve a similar effect. Highcharts provides a utility routine, Highcharts.dateFormat, that converts the timestamp in milliseconds to a readable format. In the following code snippet, we define the formatter function using dateFormat and this.value. The keyword this is the axis's interval object, whereas this.value is the UTC time value for the instance of the interval:     xAxis: {         type: 'datetime',         labels: {             formatter: function() {                 return Highcharts.dateFormat('%I:%M %P', this.value);             }         }     }, Since the time values of our data points are in fixed intervals, they can also be arranged in a cut-down version. 
All we need is to define the starting point of time, pointStart, and the regular interval between them, pointInterval, in milliseconds: series: [{     name: 'Nasdaq',     color: '#4572A7',     pointStart: Date.UTC(2012, 4, 11, 9, 30),     pointInterval: 30 * 60 * 1000,     data: [ 2606.01, 2622.08, 2636.03, 2637.78,             2639.15, 2637.09, 2633.38, 2632.23,             2632.33, 2632.59, 2630.34, 2626.89,             2624.59, 2615.98 ] }] Adjusting intervals and background We have learned how to use axis categories and series data arrays in the last section. In this section, we will see how to format interval lines and the background style to produce a graph with more clarity. We will continue from the previous example. First, let's create some interval lines along the y axis. In the chart, the interval is automatically set to 20. However, it would be clearer to double the number of interval lines. To do that, simply assign the tickInterval value to 10. Then, we use minorTickInterval to put another line in between the intervals to indicate a semi-interval. In order to distinguish between interval and semi-interval lines, we set the semi-interval lines, minorGridLineDashStyle, to a dashed and dotted style. There are nearly a dozen line style settings available in Highcharts, from 'Solid' to 'LongDashDotDot'. Readers can refer to the online manual for possible values. The following is the first step to create the new settings:             yAxis: {                 title: {                     text: null                 },                 tickInterval: 10,                 minorTickInterval: 5,                 minorGridLineColor: '#ADADAD',                 minorGridLineDashStyle: 'dashdot'            } The interval lines should look like the following screenshot: To make the graph even more presentable, we add a striping effect with shading using alternateGridColor. Then, we change the interval line color, gridLineColor, to a similar range with the stripes. The following code snippet is added into the yAxis configuration:                 gridLineColor: '#8AB8E6',                 alternateGridColor: {                     linearGradient: {                         x1: 0, y1: 1,                         x2: 1, y2: 1                     },                     stops: [ [0, '#FAFCFF' ],                              [0.5, '#F5FAFF'] ,                              [0.8, '#E0F0FF'] ,                              [1, '#D6EBFF'] ]                   } The following is the graph with the new shading background: The next step is to apply a more professional look to the y axis line. We are going to draw a line on the y axis with the lineWidth property, and add some measurement marks along the interval lines with the following code snippet:                  lineWidth: 2,                  lineColor: '#92A8CD',                  tickWidth: 3,                  tickLength: 6,                  tickColor: '#92A8CD',                  minorTickLength: 3,                  minorTickWidth: 1,                  minorTickColor: '#D8D8D8' The tickWidth and tickLength properties add the effect of little marks at the start of each interval line. We apply the same color on both the interval mark and the axis line. Then we add the ticks minorTickLength and minorTickWidth into the semi-interval lines in a smaller size. 
This gives a nice measurement mark effect along the axis, as shown in the following screenshot: Now, we apply a similar polish to the xAxis configuration, as follows:            xAxis: {                type: 'datetime',                labels: {                    formatter: function() {                        return Highcharts.dateFormat('%I:%M %P', this.value);                    },                },                gridLineDashStyle: 'dot',                gridLineWidth: 1,                tickInterval: 60 * 60 * 1000,                lineWidth: 2,                lineColor: '#92A8CD',                tickWidth: 3,                tickLength: 6,                tickColor: '#92A8CD',            }, We set the x axis interval lines to the hourly format and switch the line style to a dotted line. Then, we apply the same color, thickness, and interval ticks as on the y axis. The following is the resulting screenshot: However, there are some defects along the x axis line. To begin with, the meeting point between the x axis and y axis lines does not align properly. Secondly, the interval labels at the x axis are touching the interval ticks. Finally, part of the first data point is covered by the y-axis line. The following is an enlarged screenshot showing the issues: There are two ways to resolve the axis line alignment problem, as follows: Shift the plot area 1 pixel away from the x axis. This can be achieved by setting the offset property of xAxis to 1. Increase the x-axis line width to 3 pixels, which is the same width as the y-axis tick interval. As for the x-axis label, we can simply solve the problem by introducing the y offset value into the labels setting. Finally, to avoid the first data point touching the y-axis line, we can impose minPadding on the x axis. What this does is to add padding space at the minimum value of the axis, the first point. The minPadding value is based on the ratio of the graph width. In this case, setting the property to 0.02 is equivalent to shifting along the x axis 5 pixels to the right (250 px * 0.02). The following are the additional settings to improve the chart:     xAxis: {         ....         labels: {                formatter: ...,                y: 17         },         .....         minPadding: 0.02,         offset: 1     } The following screenshot shows that the issues have been addressed: As we can see, Highcharts has a comprehensive set of configurable variables with great flexibility. Using plot lines and plot bands In this section, we are going to see how we can use Highcharts to place lines or bands along the axis. We will continue with the example from the previous section. Let's draw a couple of lines to indicate the day's highest and lowest index points on the y axis. The plotLines field accepts an array of object configurations for each plot line. There are no width and color default values for plotLines, so we need to specify them explicitly in order to see the line. The following is the code snippet for the plot lines:       yAxis: {               ... 
,               plotLines: [{                    value: 2606.01,                    width: 2,                    color: '#821740',                    label: {                        text: 'Lowest: 2606.01',                        style: {                            color: '#898989'                        }                    }               }, {                    value: 2639.15,                    width: 2,                    color: '#4A9338',                    label: {                        text: 'Highest: 2639.15',                        style: {                            color: '#898989'                        }                    }               }]         } The following screenshot shows what it should look like: We can improve the look of the chart slightly. First, the text label for the top plot line should not be next to the highest point. Second, the label for the bottom line should be remotely covered by the series and interval lines, as follows: To resolve these issues, we can assign the plot line's zIndex to 1, which brings the text label above the interval lines. We also set the x position of the label to shift the text next to the point. The following are the new changes:              plotLines: [{                    ... ,                    label: {                        ... ,                        x: 25                    },                    zIndex: 1                    }, {                    ... ,                    label: {                        ... ,                        x: 130                    },                    zIndex: 1               }] The following graph shows the label has been moved away from the plot line and over the interval line: Now, we are going to change the preceding example with a plot band area that shows the index change between the market's opening and closing values. The plot band configuration is very similar to plot lines, except that it uses the to and from properties, and the color property accepts gradient settings or color code. We create a plot band with a triangle text symbol and values to signify a positive close. Instead of using the x and y properties to fine-tune label position, we use the align option to adjust the text to the center of the plot area (replace the plotLines setting from the above example):               plotBands: [{                    from: 2606.01,                    to: 2615.98,                    label: {                        text: '▲ 9.97 (0.38%)',                        align: 'center',                        style: {                            color: '#007A3D'                        }                    },                    zIndex: 1,                    color: {                        linearGradient: {                            x1: 0, y1: 1,                            x2: 1, y2: 1                        },                        stops: [ [0, '#EBFAEB' ],                                 [0.5, '#C2F0C2'] ,                                 [0.8, '#ADEBAD'] ,                                 [1, '#99E699']                        ]                    }               }] The triangle is an alt-code character; hold down the left Alt key and enter 30 in the number keypad. See http://www.alt-codes.net for more details. This produces a chart with a green plot band highlighting a positive close in the market, as shown in the following screenshot: Extending to multiple axes Previously, we ran through most of the axis configurations. 
Here, we explore how we can use multiple axes, which are just an array of objects containing axis configurations. Continuing from the previous stock market example, suppose we now want to include another market index, Dow Jones, along with Nasdaq. However, both indices are different in nature, so their value ranges are vastly different. First, let's examine the outcome by displaying both indices with the common y axis. We change the title, remove the fixed interval setting on the y axis, and include data for another series:             chart: ... ,             title: {                 text: 'Market Data: Nasdaq & Dow Jones'             },             subtitle: ... ,             xAxis: ... ,             credits: ... ,             yAxis: {                 title: {                     text: null                 },                 minorGridLineColor: '#D8D8D8',                 minorGridLineDashStyle: 'dashdot',                 gridLineColor: '#8AB8E6',                 alternateGridColor: {                     linearGradient: {                         x1: 0, y1: 1,                         x2: 1, y2: 1                     },                     stops: [ [0, '#FAFCFF' ],                              [0.5, '#F5FAFF'] ,                              [0.8, '#E0F0FF'] ,                              [1, '#D6EBFF'] ]                 },                 lineWidth: 2,                 lineColor: '#92A8CD',                 tickWidth: 3,                 tickLength: 6,                 tickColor: '#92A8CD',                 minorTickLength: 3,                 minorTickWidth: 1,                 minorTickColor: '#D8D8D8'             },             series: [{               name: 'Nasdaq',               color: '#4572A7',               data: [ [ Date.UTC(2012, 4, 11, 9, 30), 2606.01 ],                          [ Date.UTC(2012, 4, 11, 10), 2622.08 ],                           [ Date.UTC(2012, 4, 11, 10, 30), 2636.03 ],                          ...                        ]             }, {               name: 'Dow Jones',               color: '#AA4643',               data: [ [ Date.UTC(2012, 4, 11, 9, 30), 12598.32 ],                          [ Date.UTC(2012, 4, 11, 10), 12538.61 ],                           [ Date.UTC(2012, 4, 11, 10, 30), 12549.89 ],                          ...                        ]             }] The following is the chart showing both market indices: As expected, the index changes that occur during the day have been normalized by the vast differences in value. Both lines look roughly straight, which falsely implies that the indices have hardly changed. Let us now explore putting both indices onto separate y axes. We should remove any background decoration on the y axis, because we now have a different range of data shared on the same background. The following is the new setup for yAxis:            yAxis: [{                  title: {                     text: 'Nasdaq'                 },               }, {                 title: {                     text: 'Dow Jones'                 },                 opposite: true             }], Now yAxis is an array of axis configurations. The first entry in the array is for Nasdaq and the second is for Dow Jones. This time, we display the axis title to distinguish between them. The opposite property is to put the Dow Jones y axis onto the other side of the graph for clarity. Otherwise, both y axes appear on the left-hand side. 
The next step is to align indices from the y-axis array to the series data array, as follows:             series: [{                 name: 'Nasdaq',                 color: '#4572A7',                 yAxis: 0,                 data: [ ... ]             }, {                 name: 'Dow Jones',                 color: '#AA4643',                 yAxis: 1,                 data: [ ... ]             }]          We can clearly see the movement of the indices in the new graph, as follows: Moreover, we can improve the final view by color-matching the series to the axis lines. The Highcharts.getOptions().colors property contains a list of default colors for the series, so we use the first two entries for our indices. Another improvement is to set maxPadding for the x axis, because the new y-axis line covers parts of the data points at the high end of the x axis:             xAxis: {                 ... ,                 minPadding: 0.02,                 maxPadding: 0.02                 },             yAxis: [{                 title: {                     text: 'Nasdaq'                 },                 lineWidth: 2,                 lineColor: '#4572A7',                 tickWidth: 3,                 tickLength: 6,                 tickColor: '#4572A7'             }, {                 title: {                     text: 'Dow Jones'                 },                 opposite: true,                 lineWidth: 2,                 lineColor: '#AA4643',                 tickWidth: 3,                 tickLength: 6,                 tickColor: '#AA4643'             }], The following screenshot shows the improved look of the chart: We can extend the preceding example and have more than a couple of axes, simply by adding entries into the yAxis and series arrays, and mapping both together. The following screenshot shows a 4-axis line graph: Summary In this article, major configuration components were discussed and experimented with, and examples shown. By now, we should be comfortable with what we have covered already and ready to plot some of the basic graphs with more elaborate styles. Resources for Article: Further resources on this subject: Theming with Highcharts [article] Integrating with other Frameworks [article] Highcharts [article]

Creating a Map

Packt
29 Dec 2014
11 min read
In this article by Thomas Newton and Oscar Villarreal, authors of the book Learning D3.js Mapping, we will cover the following topics through a series of experiments: Foundation – creating your basic map Experiment 1 – adjusting the bounding box Experiment 2 – creating choropleths Experiment 3 – adding click events to our visualization (For more resources related to this topic, see here.) Foundation – creating your basic map In this section, we will walk through the basics of creating a standard map. Let's walk through the code to get a step-by-step explanation of how to create this map. The width and height can be anything you want. Depending on where your map will be visualized (cellphones, tablets, or desktops), you might want to consider providing a different width and height: var height = 600; var width = 900; The next variable defines a projection algorithm that allows you to go from a cartographic space (latitude and longitude) to a Cartesian space (x,y)—basically a mapping of latitude and longitude to coordinates. You can think of a projection as a way to map the three-dimensional globe to a flat plane. There are many kinds of projections, but geo.mercator is normally the default value you will use: var projection = d3.geo.mercator(); var mexico = void 0; If you were making a map of the USA, you could use a better projection called albersUsa. This is to better position Alaska and Hawaii. By creating a geo.mercator projection, Alaska would render proportionate to its size, rivaling that of the entire US. The albersUsa projection grabs Alaska, makes it smaller, and puts it at the bottom of the visualization. The following screenshot is of geo.mercator:   This following screenshot is of geo.albersUsa:   The D3 library currently contains nine built-in projection algorithms. An overview of each one can be viewed at https://github.com/mbostock/d3/wiki/Geo-Projections. Next, we will assign the projection to our geo.path function. This is a special D3 function that will map the JSON-formatted geographic data into SVG paths. The data format that the geo.path function requires is named GeoJSON: var path = d3.geo.path().projection(projection); var svg = d3.select("#map")    .append("svg")    .attr("width", width)    .attr("height", height); Including the dataset The necessary data has been provided for you within the data folder with the filename geo-data.json: d3.json('geo-data.json', function(data) { console.log('mexico', data); We get the data from an AJAX call. After the data has been collected, we want to draw only those parts of the data that we are interested in. In addition, we want to automatically scale the map to fit the defined height and width of our visualization. If you look at the console, you'll see that "mexico" has an objects property. Nested inside the objects property is MEX_adm1. This stands for the administrative areas of Mexico. It is important to understand the geographic data you are using, because other data sources might have different names for the administrative areas property:   Notice that the MEX_adm1 property contains a geometries array with 32 elements. Each of these elements represents a state in Mexico. Use this data to draw the D3 visualization. var states = topojson.feature(data, data.objects.MEX_adm1); Here, we pass all of the administrative areas to the topojson.feature function in order to extract and create an array of GeoJSON objects. The preceding states variable now contains the features property. 
This features array is a list of 32 GeoJSON elements, each representing the geographic boundaries of a state in Mexico. We will set an initial scale and translation to 1 and 0,0 respectively: // Setup the scale and translate projection.scale(1).translate([0, 0]); This algorithm is quite useful. The bounding box is a spherical box that returns a two-dimensional array of min/max coordinates, inclusive of the geographic data passed: var b = path.bounds(states); To quote the D3 documentation: "The bounding box is represented by a two-dimensional array: [[left, bottom], [right, top]], where left is the minimum longitude, bottom is the minimum latitude, right is maximum longitude, and top is the maximum latitude." This is very helpful if you want to programmatically set the scale and translation of the map. In this case, we want the entire country to fit in our height and width, so we determine the bounding box of every state in the country of Mexico. The scale is calculated by taking the longest geographic edge of our bounding box and dividing it by the number of pixels of this edge in the visualization: var s = .95 / Math.max((b[1][0] - b[0][0]) / width, (b[1][1] - b[0][1]) / height); This can be calculated by first computing the scale of the width, then the scale of the height, and, finally, taking the larger of the two. All of the logic is compressed into the single line given earlier. The three steps are explained in the following image:   The value 95 adjusts the scale, because we are giving the map a bit of a breather on the edges in order to not have the paths intersect the edges of the SVG container item, basically reducing the scale by 5 percent. Now, we have an accurate scale of our map, given our set width and height. var t = [(width - s * (b[1][0] + b[0][0])) / 2, (height - s * (b[1][1] + b[0][1])) / 2]; When we scale in SVG, it scales all the attributes (even x and y). In order to return the map to the center of the screen, we will use the translate function. The translate function receives an array with two parameters: the amount to translate in x, and the amount to translate in y. We will calculate x by finding the center (topRight – topLeft)/2 and multiplying it by the scale. The result is then subtracted from the width of the SVG element. Our y translation is calculated similarly but using the bottomRight – bottomLeft values divided by 2, multiplied by the scale, then subtracted from the height. Finally, we will reset the projection to use our new scale and translation: projection.scale(s).translate(t); Here, we will create a map variable that will group all of the following SVG elements into a <g> SVG tag. This will allow us to apply styles and better contain all of the proceeding paths' elements: var map = svg.append('g').attr('class', 'boundary'); Finally, we are back to the classic D3 enter, update, and exit pattern. We have our data, the list of Mexico states, and we will join this data to the path SVG element:    mexico = map.selectAll('path').data(states.features);      //Enter    mexico.enter()        .append('path')        .attr('d', path); The enter section and the corresponding path functions are executed on every data element in the array. As a refresher, each element in the array represents a state in Mexico. The path function has been set up to correctly draw the outline of each state as well as scale and translate it to fit in our SVG container. Congratulations! You have created your first map! 
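Before moving on to the experiments, here is a condensed sketch that pulls the foundation code above into one place. It follows the same example (the #map selector, the geo-data.json file, and the MEX_adm1 object) and assumes a D3 3.x release in which the d3.json callback receives an error object as its first argument.

var width = 900, height = 600;
var projection = d3.geo.mercator();
var path = d3.geo.path().projection(projection);

var svg = d3.select('#map').append('svg')
    .attr('width', width)
    .attr('height', height);

d3.json('geo-data.json', function(error, data) {
    if (error) { return console.error(error); }

    // Extract one GeoJSON feature per Mexican state from the TopoJSON file
    var states = topojson.feature(data, data.objects.MEX_adm1);

    // Fit the whole country into the SVG: compute the scale and translation
    projection.scale(1).translate([0, 0]);
    var b = path.bounds(states);
    var s = 0.95 / Math.max((b[1][0] - b[0][0]) / width,
                            (b[1][1] - b[0][1]) / height);
    var t = [(width - s * (b[1][0] + b[0][0])) / 2,
             (height - s * (b[1][1] + b[0][1])) / 2];
    projection.scale(s).translate(t);

    // Draw one path per state
    svg.append('g').attr('class', 'boundary')
        .selectAll('path')
        .data(states.features)
        .enter()
        .append('path')
        .attr('d', path);
});

The scale and translate lines are the same fit-to-viewport calculation explained earlier; adjusting the 0.95 factor changes how much breathing room is left around the country.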
Experiment 1 – adjusting the bounding box Now that we have our foundation, let's start with our first experiment. For this experiment, we will manually zoom in to a state of Mexico using what we learned in the previous section. For this experiment, we will modify one line of code: var b = path.bounds(states.features[5]); Here, we are telling the calculation to create a boundary based on the sixth element of the features array instead of every state in the country of Mexico. The boundaries data will now run through the rest of the scaling and translation algorithms to adjust the map to the one shown in the following screenshot:   We have basically reduced the min/max of the boundary box to include the geographic coordinates for one state in Mexico (see the next screenshot), and D3 has scaled and translated this information for us automatically:   This can be very useful in situations where you might not have the data that you need in isolation from the surrounding areas. Hence, you can always zoom in to your geography of interest and isolate it from the rest. Experiment 2 – creating choropleths One of the most common uses of D3.js maps is to make choropleths. This visualization gives you the ability to discern between regions, giving them a different color. Normally, this color is associated with some other value, for instance, levels of influenza or a company's sales. Choropleths are very easy to make in D3.js. In this experiment, we will create a quick choropleth based on the index value of the state in the array of all the states. We will only need to modify two lines of code in the update section of our D3 code. Right after the enter section, add the following two lines: //Update var color = d3.scale.linear().domain([0,33]).range(['red',   'yellow']); mexico.attr('fill', function(d,i) {return color(i)}); The color variable uses another valuable D3 function named scale. Scales are extremely powerful when creating visualizations in D3; much more detail on scales can be found at https://github.com/mbostock/d3/wiki/Scales. For now, let's describe what this scale defines. Here, we created a new function called color. This color function looks for any number between 0 and 33 in an input domain. D3 linearly maps these input values to a color between red and yellow in the output range. D3 has included the capability to automatically map colors in a linear range to a gradient. This means that executing the new function, color, with 0 will return the color red, color(15) will return an orange color, and color(33) will return yellow. Now, in the update section, we will set the fill property of the path to the new color function. This will provide a linear scale of colors and use the index value i to determine what color should be returned. If the color was determined by a different value of the datum, for instance, d.sales, then you would have a choropleth where the colors actually represent sales. The preceding code should render something as follows: Experiment 3 – adding click events to our visualization We've seen how to make a map and set different colors to the different regions of this map. Next, we will add a little bit of interactivity. This will illustrate a simple reference to bind click events to maps. First, we need a quick reference to each state in the country. 
To accomplish this, we will create a new function called geoID right below the mexico variable: var height = 600; var width = 900; var projection = d3.geo.mercator(); var mexico = void 0;   var geoID = function(d) {    return "c" + d.properties.ID_1; }; This function takes in a state data element and generates a new selectable ID based on the ID_1 property found in the data. The ID_1 property contains a unique numeric value for every state in the array. If we insert this as an id attribute into the DOM, then we would create a quick and easy way to select each state in the country. The following is the geoID function, creating another function called click: var click = function(d) {    mexico.attr('fill-opacity', 0.2); // Another update!    d3.select('#' + geoID(d)).attr('fill-opacity', 1); }; This method makes it easy to separate what the click is doing. The click method receives the datum and changes the fill opacity value of all the states to 0.2. This is done so that when you click on one state and then on the other, the previous state does not maintain the clicked style. Notice that the function call is iterating through all the elements of the DOM, using the D3 update pattern. After making all the states transparent, we will set a fill-opacity of 1 for the given clicked item. This removes all the transparent styling from the selected state. Notice that we are reusing the geoID function that we created earlier to quickly find the state element in the DOM. Next, let's update the enter method to bind our new click method to every new DOM element that enter appends: //Enter mexico.enter()      .append('path')      .attr('d', path)      .attr('id', geoID)      .on("click", click); We also added an attribute called id; this inserts the results of the geoID function into the id attribute. Again, this makes it very easy to find the clicked state. The code should produce a map as follows. Check it out and make sure that you click on any of the states. You will see its color turn a little brighter than the surrounding states. Summary You learned how to build many different kinds of maps that cover different kinds of needs. Choropleths and data visualizations on maps are some of the most common geographic-based data representations that you will come across. Resources for Article: Further resources on this subject: Using Canvas and D3 [article] Interacting with your Visualization [article] Simple graphs with d3.js [article]

Evolution of Hadoop

Packt
29 Dec 2014
12 min read
In this article by Sandeep Karanth, author of the book Mastering Hadoop, we will look at Hadoop's timeline, Hadoop 2.X, and Hadoop YARN. Hadoop's timeline The following figure gives a timeline view of the major releases and milestones of Apache Hadoop. The project has been around for 8 years, but the last 4 years have seen Hadoop make giant strides in big data processing. In January 2010, Google was awarded a patent for the MapReduce technology. This technology was licensed to the Apache Software Foundation 4 months later, a shot in the arm for Hadoop. With legal complications out of the way, enterprises of all sizes were ready to embrace Hadoop. Since then, Hadoop has come up with a number of major enhancements and releases. It has given rise to businesses selling Hadoop distributions, support, training, and other services. Hadoop 1.0 releases, referred to as 1.X in this book, saw the inception and evolution of Hadoop as a pure MapReduce job-processing framework. It has exceeded expectations, with wide adoption for massive data processing. The stable 1.X release at this point in time is 1.2.1, which includes features such as append and security. Hadoop 1.X tried to stay flexible by making changes, such as HDFS append, to support online systems such as HBase. Meanwhile, big data applications evolved in range beyond MapReduce computation models. The flexibility of Hadoop 1.X releases had been stretched; it was no longer possible to widen its net to cater to the variety of applications without architectural changes. Hadoop 2.0 releases, referred to as 2.X in this book, came into existence in 2013. This release family has major changes to widen the range of applications Hadoop can solve. These releases can even increase the efficiency and mileage derived from existing Hadoop clusters in enterprises. Clearly, Hadoop is moving fast beyond MapReduce to stay the leader in massive-scale data processing, with the challenge of remaining backward compatible. It is evolving from a MapReduce-specific framework into a generic cluster-computing and storage platform. Hadoop 2.X The extensive success of Hadoop 1.X in organizations also led to an understanding of its limitations, which are as follows: Hadoop gives unprecedented access to cluster computational resources to every individual in an organization. The MapReduce programming model is simple and supports a develop once, deploy at any scale paradigm. This leads to users exploiting Hadoop for data processing jobs where MapReduce is not a good fit, for example, web servers being deployed in long-running map jobs. MapReduce is not well suited to iterative algorithms. Hacks were developed to make Hadoop run iterative algorithms. These hacks posed severe challenges to cluster resource utilization and capacity planning. Hadoop 1.X has centralized job flow control. Centralized systems are hard to scale as they are the single point of load lifting. A JobTracker failure means that all the jobs in the system have to be restarted, exerting extreme pressure on a centralized component. Integration of Hadoop with other kinds of clusters is difficult with this model. The early releases in Hadoop 1.X had a single NameNode that stored all the metadata about the HDFS directories and files. The data on the entire cluster hinged on this single point of failure. Subsequent releases had a cold standby in the form of a secondary NameNode. 
The secondary NameNode merged the edit logs and NameNode image files, periodically bringing in two benefits. One, the primary NameNode startup time was reduced as the NameNode did not have to do the entire merge on startup. Two, the secondary NameNode acted as a replica that could minimize data loss on NameNode disasters. However, the secondary NameNode (secondary NameNode is not a backup node for NameNode) was still not a hot standby, leading to high failover and recovery times and affecting cluster availability. Hadoop 1.X is mainly a Unix-based massive data processing framework. Native support on machines running Microsoft Windows Server is not possible. With Microsoft entering cloud computing and big data analytics in a big way, coupled with existing heavy Windows Server investments in the industry, it's very important for Hadoop to enter the Microsoft Windows landscape as well. Hadoop's success comes mainly from enterprise play. Adoption of Hadoop mainly comes from the availability of enterprise features. Though Hadoop 1.X tries to support some of them, such as security, there is a list of other features that are badly needed by the enterprise. Yet Another Resource Negotiator (YARN) In Hadoop 1.X, resource allocation and job execution were the responsibilities of JobTracker. Since the computing model was closely tied to the resources in the cluster, MapReduce was the only supported model. This tight coupling led to developers force-fitting other paradigms, leading to unintended use of MapReduce. The primary goal of YARN is to separate concerns relating to resource management and application execution. By separating these functions, other application paradigms can be added onboard a Hadoop computing cluster. Improvements in interoperability and support for diverse applications lead to efficient and effective utilization of resources. It integrates well with the existing infrastructure in an enterprise. Achieving loose coupling between resource management and job management should not be at the cost of loss in backward compatibility. For almost 6 years, Hadoop has been the leading software to crunch massive datasets in a parallel and distributed fashion. This means huge investments in development; testing and deployment were already in place. YARN maintains backward compatibility with Hadoop 1.X (hadoop-0.20.205+) APIs. An older MapReduce program can continue execution in YARN with no code changes. However, recompiling the older code is mandatory. Architecture overview The following figure lays out the architecture of YARN. YARN abstracts out resource management functions to a platform layer called ResourceManager (RM). There is a per-cluster RM that primarily keeps track of cluster resource usage and activity. It is also responsible for allocation of resources and resolving contentions among resource seekers in the cluster. RM uses a generalized resource model and is agnostic to application-specific resource needs. For example, RM need not know the resources corresponding to a single Map or Reduce slot. Planning and executing a single job is the responsibility of ApplicationMaster (AM). There is an AM instance per running application. For example, there is an AM for each MapReduce job. It has to request for resources from the RM, use them to execute the job, and work around failures, if any. The general cluster layout has RM running as a daemon on a dedicated machine with a global view of the cluster and its resources. 
Being a global entity, RM can ensure fairness depending on the resource utilization of the cluster resources. When requested for resources, RM allocates them dynamically as a node-specific bundle called a container. For example, 2 CPUs and 4 GB of RAM on a particular node can be specified as a container. Every node in the cluster runs a daemon called NodeManager (NM). RM uses NM as its node local assistant. NMs are used for container management functions, such as starting and releasing containers, tracking local resource usage, and fault reporting. NMs send heartbeats to RM. The RM view of the system is the aggregate of the views reported by each NM. Jobs are submitted directly to RMs. Based on resource availability, jobs are scheduled to run by RMs. The metadata of the jobs are stored in persistent storage to recover from RM crashes. When a job is scheduled, RM allocates a container for the AM of the job on a node in the cluster. AM then takes over orchestrating the specifics of the job. These specifics include requesting resources, managing task execution, optimizations, and handling tasks or job failures. AM can be written in any language, and different versions of AM can execute independently on a cluster. An AM resource request contains specifications about the locality and the kind of resource expected by it. RM puts in its best effort to satisfy AM's needs based on policies and availability of resources. When a container is available for use by AM, it can launch application-specific code in this container. The container is free to communicate with its AM. RM is agnostic to this communication. Storage layer enhancements A number of storage layer enhancements were undertaken in the Hadoop 2.X releases. The number one goal of the enhancements was to make Hadoop enterprise ready. High availability NameNode is a directory service for Hadoop and contains metadata pertaining to the files within cluster storage. Hadoop 1.X had a secondary Namenode, a cold standby that needed minutes to come up. Hadoop 2.X provides features to have a hot standby of NameNode. On the failure of an active NameNode, the standby can become the active Namenode in a matter of minutes. There is no data loss or loss of NameNode service availability. With hot standbys, automated failover becomes easier too. The key to keep the standby in a hot state is to keep its data as current as possible with respect to the active Namenode. This is achieved by reading the edit logs of the active NameNode and applying it onto itself with very low latency. The sharing of edit logs can be done using the following two methods: A shared NFS storage directory between the active and standby NameNodes: the active writes the logs to the shared location. The standby monitors the shared directory and pulls in the changes. A quorum of Journal Nodes: the active NameNode presents its edits to a subset of journal daemons that record this information. The standby node constantly monitors these journal daemons for updates and syncs the state with itself. The following figure shows the high availability architecture using a quorum of Journal Nodes. The data nodes themselves send block reports directly to both the active and standby NameNodes: Zookeeper or any other High Availability monitoring service can be used to track NameNode failures. With the assistance of Zookeeper, failover procedures to promote the hot standby as the active NameNode can be triggered. 
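To make the failover mechanics more concrete, the following is a minimal shell sketch of how an administrator might inspect and exercise NameNode failover on a cluster that has already been configured for high availability. The NameNode identifiers nn1 and nn2 are illustrative assumptions; the actual service IDs come from the cluster's dfs.ha.namenodes.* configuration.

# Check which NameNode is currently active and which is standby
# (nn1 and nn2 are hypothetical NameNode IDs defined in hdfs-site.xml)
$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2

# Trigger a graceful failover from nn1 to nn2, for example before
# planned maintenance on the node hosting nn1
$ hdfs haadmin -failover nn1 nn2

When automatic failover is enabled through ZooKeeper, the same promotion happens without operator intervention; the commands above are simply the manual counterpart of that process.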
HDFS Federation Similar to what YARN did to Hadoop's computation layer, a more generalized storage model has been implemented in Hadoop 2.X. The block storage layer has been generalized and separated out from the filesystem layer. This separation has given an opening for other storage services to be integrated into a Hadoop cluster. Previously, HDFS and the block storage layer were tightly coupled. One use case that has come forth from this generalized storage model is HDFS Federation. Federation allows multiple HDFS namespaces to use the same underlying storage. Federated NameNodes provide isolation at the filesystem level. HDFS snapshots Snapshots are point-in-time, read-only images of the entire or a particular subset of a filesystem. Snapshots are taken for three general reasons: Protection against user errors Backup Disaster recovery Snapshotting is implemented only on NameNode. It does not involve copying data from the data nodes. It is a persistent copy of the block list and file size. The process of taking a snapshot is almost instantaneous and does not affect the performance of NameNode. Other enhancements There are a number of other enhancements in Hadoop 2.X, which are as follows: The wire protocol for RPCs within Hadoop is now based on Protocol Buffers. Previously, Java serialization via Writables was used. This improvement not only eases maintaining backward compatibility, but also aids in rolling the upgrades of different cluster components. RPCs allow for client-side retries as well. HDFS in Hadoop 1.X was agnostic about the type of storage being used. Mechanical or SSD drives were treated uniformly. The user did not have any control on data placement. Hadoop 2.X releases in 2014 are aware of the type of storage and expose this information to applications as well. Applications can use this to optimize their data fetch and placement strategies. HDFS append support has been brought into Hadoop 2.X. HDFS access in Hadoop 1.X releases has been through HDFS clients. In Hadoop 2.X, support for NFSv3 has been brought into the NFS gateway component. Clients can now mount HDFS onto their compatible local filesystem, allowing them to download and upload files directly to and from HDFS. Appends to files are allowed, but random writes are not. A number of I/O improvements have been brought into Hadoop. For example, in Hadoop 1.X, clients collocated with data nodes had to read data via TCP sockets. However, with short-circuit local reads, clients can directly read off the data nodes. This particular interface also supports zero-copy reads. The CRC checksum that is calculated for reads and writes of data has been optimized using the Intel SSE4.2 CRC32 instruction. Support enhancements Hadoop is also widening its application net by supporting other platforms and frameworks. One dimension we saw was onboarding of other computational models with YARN or other storage systems with the Block Storage layer. The other enhancements are as follows: Hadoop 2.X supports Microsoft Windows natively. This translates to a huge opportunity to penetrate the Microsoft Windows server land for massive data processing. This was partially possible because of the use of the highly portable Java programming language for Hadoop development. The other critical enhancement was the generalization of compute and storage management to include Microsoft Windows. As part of Platform-as-a-Service offerings, cloud vendors give out on-demand Hadoop as a service. 
OpenStack support in Hadoop 2.X makes it conducive for deployment in elastic and virtualized cloud environments. Summary In this article, we saw the evolution of Hadoop and some of its milestones and releases. We went into depth on Hadoop 2.X and the changes it brings into Hadoop. The key takeaways from this article are: In over 6 years of its existence, Hadoop has become the number one choice as a framework for massively parallel and distributed computing. The community has been shaping Hadoop to gear up for enterprise use. In 1.X releases, HDFS append and security, were the key features that made Hadoop enterprise-friendly. Hadoop's storage layer was enhanced in 2.X to separate the filesystem from the block storage service. This enables features such as supporting multiple namespaces and integration with other filesystems. 2.X shows improvements in Hadoop storage availability and snapshotting. Resources for Article: Further resources on this subject: Securing the Hadoop Ecosystem [article] Sizing and Configuring your Hadoop Cluster [article] HDFS and MapReduce [article]


Analyzing Data

Packt
24 Dec 2014
13 min read
In this article by Amarpreet Singh Bassan and Debarchan Sarkar, authors of Mastering SQL Server 2014 Data Mining, we will begin our discussion with an introduction to the data mining life cycle, and this article will focus on its first three stages. You are expected to have basic understanding of the Microsoft business intelligence stack and familiarity of terms such as extract, transform, and load (ETL), data warehouse, and so on. (For more resources related to this topic, see here.) Data mining life cycle Before going into further details, it is important to understand the various stages of the data mining life cycle. The data mining life cycle can be broadly classified into the following steps: Understanding the business requirement. Understanding the data. Preparing the data for the analysis. Preparing the data mining models. Evaluating the results of the analysis prepared with the models. Deploying the models to the SQL Server Analysis Services Server. Repeating steps 1 to 6 in case the business requirement changes. Let's look at each of these stages in detail. The first and foremost task that needs to be well defined even before beginning the mining process is to identify the goals. This is a crucial part of the data mining exercise and you need to understand the following questions: What and whom are we targeting? What is the outcome we are targeting? What is the time frame for which we have the data and what is the target time period that our data is going to forecast? What would the success measures look like? Let's define a classic problem and understand more about the preceding questions. We can use them to discuss how to extract the information rather than spending our time on defining the schema. Consider an instance where you are a salesman for the AdventureWorks Cycle company, and you need to make predictions that could be used in marketing the products. The problem sounds simple and straightforward, but any serious data miner would immediately come up with many questions. Why? The answer lies in the exactness of the information being searched for. Let's discuss this in detail. The problem statement comprises the words predictions and marketing. When we talk about predictions, there are several insights that we seek, namely: What is it that we are predicting? (for example: customers, product sales, and so on) What is the time period of the data that we are selecting for prediction? What time period are we going to have the prediction for? What is the expected outcome of the prediction exercise? From the marketing point of view, several follow-up questions that must be answered are as follows: What is our target for marketing, a new product or an older product? Is our marketing strategy product centric or customer centric? Are we going to market our product irrespective of the customer classification, or are we marketing our product according to customer classification? On what timeline in the past is our marketing going to be based on? We might observe that there are many questions that overlap the two categories and therefore, there is an opportunity to consolidate the questions and classify them as follows: What is the population that we are targeting? What are the factors that we will actually be looking at? What is the time period of the past data that we will be looking at? What is the time period in the future that we will be considering the data mining results for? Let's throw some light on these aspects based on the AdventureWorks example. 
We will get answers to the preceding questions and arrive at a more refined problem statement. What is the population that we are targeting? The target population might be classified according to the following aspects: Age Salary Number of kids What are the factors that we are actually looking at? They might be classified as follows: Geographical location: The people living in hilly areas would prefer All Terrain Bikes (ATB) and the population on plains would prefer daily commute bikes. Household: The people living in posh areas would look for bikes with the latest gears and also look for accessories that are state of the art, whereas people in the suburban areas would mostly look for budgetary bikes. Affinity of components: The people who tend to buy bikes would also buy some accessories. What is the time period of the past data that we would be looking at? Usually, the data that we get is quite huge and often consists of the information that we might very adequately label as noise. In order to sieve effective information, we will have to determine exactly how much into the past we should look; for example, we can look at the data for the past year, past two years, or past five years. We also need to decide the future data that we will consider the data mining results for. We might be looking at predicting our market strategy for an upcoming festive season or throughout the year. We need to be aware that market trends change and so does people's needs and requirements. So we need to keep a time frame to refresh our findings to an optimal; for example, the predictions from the past 5 years data can be valid for the upcoming 2 or 3 years depending upon the results that we get. Now that we have taken a closer look into the problem, let's redefine the problem more accurately. AdventureWorks has several stores in various locations and based on the location, we would like to get an insight on the following: Which products should be stocked where? Which products should be stocked together? How much of the products should be stocked? What is the trend of sales for a new product in an area? It is not necessary that we will get answers to all the detailed questions but even if we keep looking for the answers to these questions, there would be several insights that we will get, which will help us make better business decisions. Staging data In this phase, we collect data from all the sources and dump them into a common repository, which can be any database system such as SQL Server, Oracle, and so on. Usually, an organization might have various applications to keep track of the data from various departments, and it is quite possible that all these applications might use a different database system to store the data. Thus, the staging phase is characterized by dumping the data from all the other data storage systems to a centralized repository. Extract, transform, and load This term is most common when we talk about data warehouse. As it is clear, ETL has the following three parts: Extract: The data is extracted from a different source database and other databases that might contain the information that we seek Transform: Some transformation is applied to the data to fit the operational needs, such as cleaning, calculation, removing duplicates, reformatting, and so on Load: The transformed data is loaded into the destination data store database We usually believe that the ETL is only required till we load the data onto the data warehouse but this is not true. 
ETL can be used anywhere that we feel the need to do some transformation of data as shown in the following figure: Data warehouse As evident from the preceding figure, the next stage is the data warehouse. The AdventureWorksDW database is the outcome of the ETL applied to the staging database, which is AdventureWorks. We will now discuss the concepts of data warehousing and some best practices and then relate to these concepts with the help of AdventureWorksDW database. Measures and dimensions There are a few common terminologies you will encounter as you enter the world of data warehousing. They are as follows: Measure: Any business entity that can be aggregated or whose values can be ascertained in a numerical value is termed as measure, for example, sales, number of products, and so on Dimension: This is any business entity that lends some meaning to the measures, for example, in an organization, the quantity of goods sold is a measure but the month is a dimension Schema A schema, basically, determines the relationship of the various entities with each other. There are essentially two types of schema, namely: Star schema: This is a relationship where the measures have a direct relationship with the dimensions. Let's look at an instance wherein a seller has several stores that sell several products. The relationship of the tables based on the star schema will be as shown in the following screenshot: Snowflake schema: This is a relationship wherein the measures may have a direct and indirect relationship with the dimensions. We will be designing a snowflake schema if we want a more detailed drill down of the data. Snowflake schema usually would involve hierarchies, as shown in the following screenshot: Data mart While a data warehouse is a more organization-wide repository of data, extracting data from such a huge repository might well be an uphill task. We segregate the data according to the department or the specialty that the data belongs to, so that we have much smaller sections of the data to work with and extract information from. We call these smaller data warehouses data marts. Let's consider the sales for AdventureWorks cycles. To make any predictions on the sales of AdventureWorks, we will have to group all the tables associated with the sales together in a data mart. Based on the AdventureWorks database, we have the following table in the AdventureWorks sales data mart. 
The Internet sales facts table has the following data: [ProductKey][OrderDateKey][DueDateKey][ShipDateKey][CustomerKey][PromotionKey][CurrencyKey][SalesTerritoryKey][SalesOrderNumber][SalesOrderLineNumber][RevisionNumber][OrderQuantity][UnitPrice][ExtendedAmount][UnitPriceDiscountPct][DiscountAmount][ProductStandardCost][TotalProductCost][SalesAmount][TaxAmt][Freight][CarrierTrackingNumber][CustomerPONumber][OrderDate][DueDate][ShipDate] From the preceding column, we can easily identify that if we need to separate the tables to perform the sales analysis alone, we can safely include the following: Product: This provides the following data: [ProductKey][ListPrice] Date: This provides the following data: [DateKey] Customer: This provides the following data: [CustomerKey] Currency: This provides the following data: [CurrencyKey] Sales territory: This provides the following data: [SalesTerritoryKey] The preceding data will provide the relevant dimensions and the facts that are already contained in the FactInternetSales table and hence, we can easily perform all the analysis pertaining to the sales of the organization. Refreshing data Based on the nature of the business and the requirements of the analysis, refreshing of data can be done either in parts wherein new or incremental data is added to the tables, or we can refresh the entire data wherein the tables are cleaned and filled with new data, which consists of the old and new data. Let's discuss the preceding points in the context of the AdventureWorks database. We will take the employee table to begin with. The following is the list of columns in the employee table: [BusinessEntityID],[NationalIDNumber],[LoginID],[OrganizationNode],[OrganizationLevel],[JobTitle],[BirthDate],[MaritalStatus],[Gender],[HireDate],[SalariedFlag],[VacationHours],[SickLeaveHours],[CurrentFlag],[rowguid],[ModifiedDate] Considering an organization in the real world, we do not have a large number of employees leaving and joining the organization. So, it will not really make sense to have a procedure in place to reload the dimensions, prior to SQL 2008. When it comes to managing the changes in the dimensions table, Slowly Changing Dimensions (SCD) is worth a mention. We will briefly look at the SCD here. There are three types of SCD, namely: Type 1: The older values are overwritten by new values Type 2: A new row specifying the present value for the dimension is inserted Type 3: The column specifying TimeStamp from which the new value is effective is updated Let's take the example of HireDate as a method of keeping track of the incremental loading. We will also have to maintain a small table that will keep a track of the data that is loaded from the employee table. So, we create a table as follows: Create table employee_load_status(HireDate DateTime,LoadStatus varchar); The following script will load the employee table from the AdventureWorks database to the DimEmployee table in the AdventureWorksDW database: With employee_loaded_date(HireDate) as(select ISNULL(Max(HireDate),to_date('01-01-1900','MM-DD-YYYY')) fromemployee_load_status where LoadStatus='success'Union AllSelect ISNULL(min(HireDate),to_date('01-01-1900','MM-DD-YYYY')) fromemployee_load_status where LoadStatus='failed')Insert into DimEmployee select * from employee where HireDate>=(select Min(HireDate) from employee_loaded_date); This will reload all the data from the date of the first failure till the present day. A similar procedure can be followed to load the fact table but there is a catch. 
If we look at the sales table in the AdventureWorks table, we see the following columns: [BusinessEntityID],[TerritoryID],[SalesQuota],[Bonus],[CommissionPct],[SalesYTD],[SalesLastYear],[rowguid],[ModifiedDate] The SalesYTD column might change with every passing day, so do we perform a full load every day or do we perform an incremental load based on date? This will depend upon the procedure used to load the data in the sales table and the ModifiedDate column. Assuming the ModifiedDate column reflects the date on which the load was performed, we also see that there is no table in the AdventureWorksDW that will use the SalesYTD field directly. We will have to apply some transformation to get the values of OrderQuantity, DateOfShipment, and so on. Let's look at this with a simpler example. Consider we have the following sales table: Name SalesAmount Date Rama 1000 11-02-2014 Shyama 2000 11-02-2014 Consider we have the following fact table: id SalesAmount Datekey We will have to think of whether to apply incremental load or a complete reload of the table based on our end needs. So the entries for the incremental load will look like this: id SalesAmount Datekey ra 1000 11-02-2014 Sh 2000 11-02-2014 Ra 4000 12-02-2014 Sh 5000 13-02-2014 Also, a complete reload will appear as shown here: id TotalSalesAmount Datekey Ra 5000 12-02-2014 Sh 7000 13-02-2014 Notice how the SalesAmount column changes to TotalSalesAmount depending on the load criteria. Summary In this article, we've covered the first three steps of any data mining process. We've considered the reasons why we would want to undertake a data mining activity and identified the goal we have in mind. We then looked to stage the data and cleanse it. Resources for Article: Further resources on this subject: Hadoop and SQL [Article] SQL Server Analysis Services – Administering and Monitoring Analysis Services [Article] SQL Server Integration Services (SSIS) [Article]

Cassandra High Availability: Replication

Packt
24 Dec 2014
5 min read
This article by Robbie Strickland, the author of Cassandra High Availability, describes the data replication architecture used in Cassandra. Replication is perhaps the most critical feature of a distributed data store, as it would otherwise be impossible to make any sort of availability guarantee in the face of a node failure. As you already know, Cassandra employs a sophisticated replication system that allows fine-grained control over replica placement and consistency guarantees. In this article, we'll explore Cassandra's replication mechanism in depth. Let's start with the basics: how Cassandra determines the number of replicas to be created and where to locate them in the cluster. We'll begin the discussion with a feature that you'll encounter the very first time you create a keyspace: the replication factor. (For more resources related to this topic, see here.) The replication factor On the surface, setting the replication factor seems to be a fundamentally straightforward idea. You configure Cassandra with the number of replicas you want to maintain (during keyspace creation), and the system dutifully performs the replication for you, thus protecting you when something goes wrong. So by defining a replication factor of three, you will end up with a total of three copies of the data. There are a number of variables in this equation. Let's start with the basic mechanics of setting the replication factor. Replication strategies One thing you'll quickly notice is that the semantics to set the replication factor depend on the replication strategy you choose. The replication strategy tells Cassandra exactly how you want replicas to be placed in the cluster. There are two strategies available: SimpleStrategy: This strategy is used for single data center deployments. It is fine to use this for testing, development, or simple clusters, but discouraged if you ever intend to expand to multiple data centers (including virtual data centers such as those used to separate analysis workloads). NetworkTopologyStrategy: This strategy is used when you have multiple data centers, or if you think you might have multiple data centers in the future. In other words, you should use this strategy for your production cluster. SimpleStrategy As a way of introducing this concept, we'll start with an example using SimpleStrategy. The following Cassandra Query Language (CQL) block will allow us to create a keyspace called AddressBook with three replicas: CREATE KEYSPACE AddressBookWITH REPLICATION = {   'class' : 'SimpleStrategy',   'replication_factor' : 3}; The data is assigned to a node via a hash algorithm, resulting in each node owning a range of data. Let's take another look at the placement of our example data on the cluster. Remember the keys are first names, and we determined the hash using the Murmur3 hash algorithm. The primary replica for each key is assigned to a node based on its hashed value. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive). While using SimpleStrategy, Cassandra will locate the first replica on the owner node (the one determined by the hash algorithm), then walk the ring in a clockwise direction to place each additional replica, as follows: Additional replicas are placed in adjacent nodes when using manually assigned tokens In the preceding diagram, the keys in bold represent the primary replicas (the ones placed on the owner nodes), with subsequent replicas placed in adjacent nodes, moving clockwise from the primary. 
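As a quick way to see this placement in practice, the nodetool utility that ships with Cassandra can report which nodes hold the replicas for a given key. The following is a minimal sketch; the users table name and the partition key Dave are hypothetical and stand in for whatever table and key exist in your AddressBook keyspace.

# List the nodes holding replicas for the key 'Dave'
# in the (hypothetical) users table of the AddressBook keyspace
$ nodetool getendpoints AddressBook users Dave

The output is simply the address of each replica node, one per line, which makes it easy to confirm that the cluster is placing the number of copies you expect.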
Although each node owns a set of keys based on its token range(s), there is no concept of a master replica. In Cassandra, unlike many other database designs, every replica is equal. This means reads and writes can be made to any node that holds a replica of the requested key.

If you have a small cluster where all nodes reside in a single rack inside one data center, SimpleStrategy will do the job. This makes it the right choice for local installations, development clusters, and other similar simple environments where expansion is unlikely, because there is no need to configure a snitch (which will be covered later in this section). For production clusters, however, it is highly recommended that you use NetworkTopologyStrategy instead. This strategy provides a number of important features for more complex installations where availability and performance are paramount.

NetworkTopologyStrategy

When it's time to deploy your live cluster, NetworkTopologyStrategy offers two additional properties that make it more suitable for this purpose:

Rack awareness: Unlike SimpleStrategy, which places replicas naively, this feature attempts to ensure that replicas are placed in different racks, thus preventing service interruption or data loss due to failures of switches, power, cooling, and other similar events that tend to affect single racks of machines.

Configurable snitches: A snitch helps Cassandra to understand the topology of the cluster. There are a number of snitch options for any type of network configuration.

Here's a basic example of a keyspace using NetworkTopologyStrategy:

CREATE KEYSPACE AddressBook
WITH REPLICATION = {
   'class' : 'NetworkTopologyStrategy',
   'dc1' : 3,
   'dc2' : 2
};

In this example, we're telling Cassandra to place three replicas in a data center called dc1 and two replicas in a second data center called dc2.

Summary

In this article, we introduced the foundational concepts of replication and consistency. In our discussion, we outlined the importance of the relationship between replication factor and consistency level, and their impact on performance, data consistency, and availability. By now, you should be able to make sound decisions specific to your use cases. This article might serve as a handy reference in the future as it can be challenging to keep all these details in mind.

Resources for Article:

Further resources on this subject:

An overview of architecture and modeling in Cassandra [Article]
Basic Concepts and Architecture of Cassandra [Article]
About Cassandra [Article]


Hadoop and SQL

Packt
23 Dec 2014
61 min read
In this article by Garry Turkington and Gabriele Modena, the authors of the book Learning Hadoop 2, we look at SQL on Hadoop. MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. However, it does require a different mindset and some training and experience in breaking analytics processing into a series of map and reduce steps. There are several products built atop Hadoop that provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This article will explore the other most common abstraction implemented atop Hadoop: SQL. (For more resources related to this topic, see here.)

In this article, we will cover the following topics:

What the use cases for SQL on Hadoop are and why it is so popular
HiveQL, the SQL dialect introduced by Apache Hive
Using HiveQL to perform SQL-like analysis of the Twitter dataset
How HiveQL can approximate common features of relational databases such as joins and views
How HiveQL allows the incorporation of user-defined functions into its queries
How SQL on Hadoop complements Pig
Other SQL-on-Hadoop products such as Impala and how they differ from Hive

Why SQL on Hadoop

Until now, we have seen how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL.

Back in 2008, Facebook released Hive, the first widely used implementation of SQL on Hadoop. Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user. This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who's familiar with SQL can use Hive. The combination of these attributes means that HiveQL is often used as a tool by business and data analysts to perform ad hoc queries on the data stored on HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of user-defined functions, enabling the base SQL dialect to be customized with business-specific functionality.

Other SQL-on-Hadoop solutions

Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. Later in this article, we will also discuss Impala, released in 2013 and already a very popular tool, particularly for low-latency queries. There are others, but we will mostly discuss Hive and Impala as they have been the most successful. While introducing the core features and capabilities of SQL on Hadoop, however, we will give examples using Hive; even though Hive and Impala share many SQL features, they also have numerous differences. We don't want to constantly have to caveat each new feature with exactly how it is supported in Hive compared to Impala.
We'll generally be looking at aspects of the feature set that is common to both, but if you use both products, it's important to read the latest release notes to understand the differences. Prerequisites Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this article. We'll create a modified version of a former Pig script as the main functionality for this. The script in this article assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows: -- load JSON data tweets = load '$inputDir' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad'); -- Tweets tweets_tsv = foreach tweets { generate    (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,    (chararray)$0#'id_str', (chararray)$0#'text' as text,    (chararray)$0#'in_reply_to', (boolean)$0#'retweeted' as is_retweeted, (chararray)$0#'user'#'id_str' as user_id, (chararray)$0#'place'#'id' as place_id; } store tweets_tsv into '$outputDir/tweets' using PigStorage('u0001'); -- Places needed_fields = foreach tweets {    generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,      (chararray)$0#'id_str' as id_str, $0#'place' as place; } place_fields = foreach needed_fields { generate    (chararray)place#'id' as place_id,    (chararray)place#'country_code' as co,    (chararray)place#'country' as country,    (chararray)place#'name' as place_name,    (chararray)place#'full_name' as place_full_name,    (chararray)place#'place_type' as place_type; } filtered_places = filter place_fields by co != ''; unique_places = distinct filtered_places; store unique_places into '$outputDir/places' using PigStorage('u0001');   -- Users users = foreach tweets {    generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)$0#'id_str' as id_str, $0#'user' as user; } user_fields = foreach users {    generate    (chararray)CustomFormatToISO(user#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)user#'id_str' as user_id, (chararray)user#'location' as user_location, (chararray)user#'name' as user_name, (chararray)user#'description' as user_description, (int)user#'followers_count' as followers_count, (int)user#'friends_count' as friends_count, (int)user#'favourites_count' as favourites_count, (chararray)user#'screen_name' as screen_name, (int)user#'listed_count' as listed_count;   } unique_users = distinct user_fields; store unique_users into '$outputDir/users' using PigStorage('u0001'); Have a look at the following code: $ pig –f extract_for_hive.pig –param inputDir=<json input> -param outputDir=<output path> The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to unicode value U0001, or you can also use Ctrl +C + A. This is often used as a separator in Hive tables and will be particularly useful to us as our tweet data could contain tabs in other fields. Overview of Hive We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. 
In this example, and in the remainder of the article, we will assume that queries are typed into the shell that can be invoked by executing the hive command. Even though the classic CLI tool for Hive was the tool with the same name, it is specifically called hive (all lowercase); recently a client called Beeline also became available and will likely be the preferred CLI client in the near future. When importing any new data into Hive, there is generally a three-stage process, as follows: Create the specification of the table into which the data is to be imported Import the data into the created table Execute HiveQL queries against the table Most of the HiveQL statements are direct analogues to similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this article, but if you need a refresher, there are numerous good online learning resources. Hive gives a structured query view of our data, and to enable that, we must first define the specification of the table's columns and import the data into the table before we can execute any queries. A table specification is generated using a CREATE statement that specifies the table name, the name and types of its columns, and some metadata about how the table is stored: CREATE table tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE; The statement creates a new table tweet defined by a list of names for columns in the dataset and their data type. We specify that fields are delimited by a tab character t and that the format used to store data is TEXTFILE. Data can be imported from a location in HDFS tweets/ into hive using the LOAD DATA statement: LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets; By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later. Once data has been imported into Hive, we can run queries against it. For instance: SELECT COUNT(*) FROM tweets; The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase. If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times. The nature of Hive tables Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point. 
Neither the CREATE TABLE nor the LOAD DATA statement truly creates concrete table data as such; instead, they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table but actually residing on HDFS. Even though the HiveQL statements refer to a specific table structure, it is Hive's responsibility to generate code that correctly maps this to the actual on-disk format in which the data files are stored. This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and will then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises.

Hive architecture

Until version 2, Hadoop was primarily a batch system. MapReduce jobs tend to have high latency and overhead derived from submission and scheduling. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later.

Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster, either to native MapReduce in Hadoop 1 or to the MapReduce Application Master running on YARN in Hadoop 2. Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password.

HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2. HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2.

Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively. In the examples we saw before and in the remainder of this article, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with Beeline. For compatibility and maturity reasons (Beeline being relatively new), both tools are available on Cloudera and most other major distributions. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded mode with the following command:

$ beeline -u jdbc:hive2://

Data types

HiveQL supports many of the common data types provided by standard database systems.
These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. We can distinguish Hive data types into the following five broad categories: Numeric: tinyint, smallint, int, bigint, float, double, and decimal Date and time: timestamp and date String: string, varchar, and char Collections: array, map, struct, and uniontype Misc: boolean, binary, and NULL DDL statements HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. The following SHOW [DATABASES, TABLES, VIEWS] statement displays the databases currently available within a data warehouse and which table and view metadata is present within the database currently in use: CREATE DATABASE twitter; SHOW databases; USE twitter; SHOW TABLES; The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception. Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports unicode characters in column names. Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally. The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code: CREATE EXTERNAL TABLE tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/tweets'; This table will be created in metastore, but the data will not be copied into the /user/hive/warehouse directory. Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse. The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we might want to create a view to isolate retweets from other messages, as follows: CREATE VIEW retweets COMMENT 'Tweets that have been retweeted' AS SELECT * FROM tweets WHERE retweeted = true; Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views. 
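Before moving on to the DROP and ALTER statements, it is worth a short illustration of the collection types introduced earlier in this section. The following is a minimal HiveQL sketch; the user_profiles table, its columns, and the chosen delimiters are hypothetical and exist only to show how array, map, and struct columns are declared and queried:

CREATE TABLE user_profiles (
  user_id string,
  interests array<string>,
  counters map<string, int>,
  address struct<city:string, country:string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

-- Index into collections with [], map keys, and dot notation
SELECT user_id,
  interests[0],
  counters['logins'],
  address.city
FROM user_profiles
LIMIT 10;

-- Flatten an array into one row per element
SELECT user_id, interest
FROM user_profiles
LATERAL VIEW explode(interests) i AS interest;

The COLLECTION ITEMS and MAP KEYS clauses tell Hive how the nested values are delimited within each field, which is what allows a plain text file to back these richer column types.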
The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected. Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment or to add and replace columns. When adding columns, it is important to remember that only metadata will be changed and not the dataset itself. This means that if we were to add a column in the middle of the table, which didn't exist in older files, then, while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format. Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view. File formats and storage The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH. Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause. Hive currently uses the following FileFormat classes to read and write HDFS files: TextInputFormat and HiveIgnoreKeyTextOutputFormat: These will read/write data in plain text file format SequenceFileInputFormat and SequenceFileOutputFormat: These classes read/write data in the Hadoop SequenceFile format Additionally, the following SerDe classes can be used to serialize and deserialize data: MetadataTypedColumnsetSerDe: This will read/write delimited records such as CSV or tab-separated records ThriftSerDe, and DynamicSerDe: These will read/write Thrift objects JSON As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe JSON SerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules. We can use either module to load JSON tweets without any need for preprocessing and just define a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde. As with any third-party module, we load the SerDe JARS into Hive with the following code: ADD JAR JAR json-serde-1.3-jar-with-dependencies.jar; Then, we issue the usual create statement, as follows: CREATE EXTERNAL TABLE tweets (    contributors string,    coordinates struct <      coordinates: array <float>,      type: string>,    created_at string,    entities struct <      hashtags: array <struct <            indices: array <tinyint>,            text: string>>, … ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE LOCATION 'tweets'; With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). 
In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'. Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary. In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen name and description fields of a user object with the following code: SELECT user.screen_name, user.description FROM tweets_json LIMIT 10; Avro AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format. Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose. This dataset was created using Pig's AvroStorage class, which generated the following schema: { "type":"record", "name":"record", "fields": [    {"name":"topic","type":["null","int"]},    {"name":"source","type":["null","int"]},    {"name":"rank","type":["null","float"]} ] } The structure is quite self-explanatory. The table structure is captured in an Avro record, which contains header information (a name and optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string. For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and this is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use the following schema. With this definition, we can now create a Hive table that uses this schema for its table specification, as follows: CREATE EXTERNAL TABLE tweets_pagerank ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'='{    "type":"record",    "name":"record",    "fields": [        {"name":"topic","type":["null","int"]},        {"name":"source","type":["null","int"]},        {"name":"rank","type":["null","float"]}    ] }') STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '${data}/ch5-pagerank'; Then, look at the following table definition from within Hive (note also that HCatalog): DESCRIBE tweets_pagerank; OK topic                 int                   from deserializer   source               int                   from deserializer   rank                 float                 from deserializer In the ddl, we told Hive that data is stored in the Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro embedded in avro.schema.literal. Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. 
Create the preceding schema in a file called pagerank.avsc; this is the standard file extension for Avro schemas. Then place it on HDFS; we want to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc'). If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. Within Hive, for example, on the Cloudera CDH5 VM:

ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar;

We can also use this table like any other. For instance, we can query the data to select the user and topic pairs with a high PageRank:

SELECT source, topic FROM tweets_pagerank WHERE rank >= 0.9;

We will see how Avro and avro.schema.url play an instrumental role in enabling schema migrations.

Columnar stores

Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats. If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in a SequenceFile, each full row and all its columns will be read from the disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest. Traditional relational databases also store data on a row basis, and a type of database called columnar changed this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns will be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries.

Queries

Tables can be queried using the familiar SELECT … FROM statement. The WHERE clause allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records. For instance, the following code returns the top 10 most prolific users in the dataset:

SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt DESC LIMIT 10;

The following are the top 10 most prolific users in the dataset:

NULL          7091
1332188053    4
959468857     3
1367752118    3
362562944     3
58646041      3
2375296688    3
1468188529    3
37114209      3
2385040940    3

This allows us to identify the number of tweets, 7,091, with no user object. We can improve the readability of the Hive output with the following setting:

SET hive.cli.print.header=true;

This will instruct hive, though not beeline, to print column names as part of the output. You can add the command to the .hiverc file usually found in the root of the executing user's home directory to have it apply to all hive CLI sessions. HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into Hive using external tables.
We first create a user table to store user data, as follows:

CREATE EXTERNAL TABLE user (
created_at string,
user_id string,
`location` string,
name string,
description string,
followers_count bigint,
friends_count bigint,
favourites_count bigint,
screen_name string,
listed_count bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/users';

We then create a place table to store location data, as follows:

CREATE EXTERNAL TABLE place (
place_id string,
country_code string,
country string,
`name` string,
full_name string,
place_type string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/places';

We can use the JOIN operator to display the names of the 10 most prolific users, as follows:

SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt
FROM tweets
JOIN user ON user.user_id = tweets.user_id
GROUP BY tweets.user_id, user.user_id, user.name
ORDER BY cnt DESC LIMIT 10;

Only equality, outer, and left (semi) joins are supported in Hive. Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets table. Alternatively, we can rewrite the previous query as follows:

SELECT tweets.user_id, u.name, COUNT(*) AS cnt
FROM tweets
JOIN (SELECT user_id, name FROM user GROUP BY user_id, name) u
ON u.user_id = tweets.user_id
GROUP BY tweets.user_id, u.name
ORDER BY cnt DESC LIMIT 10;

Instead of directly joining the user table, we execute a subquery, as follows:

SELECT user_id, name FROM user GROUP BY user_id, name;

The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries, historically permitting a subquery only in the FROM clause of a SELECT statement. Hive 0.13 added limited support for subqueries within the WHERE clause as well. HiveQL is an ever-evolving rich language, a full exposition of which is beyond the scope of this article. A description of its query and DDL capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

Structuring Hive tables for given workloads

Often, Hive isn't used in isolation; instead, tables are created with particular workloads in mind or invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios.

Partitioning a table

With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning. When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS. It's important to understand the cardinality of the partition column. With too few distinct values, the benefits are reduced as the files are still very large. If there are too many values, then queries might need a large number of files to be scanned to access all the required data. Perhaps the most common partition key is one based on date. We could, for example, partition our user table from earlier based on the created_at column, that is, the date the user was first registered.
Note that since partitioning a table by definition affects its file structure, we create this table now as a non-external one, as follows:

CREATE TABLE partitioned_user (
created_at string,
user_id string,
`location` string,
name string,
description string,
followers_count bigint,
friends_count bigint,
favourites_count bigint,
screen_name string,
listed_count bigint
)
PARTITIONED BY (created_at_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

To load data into a partition, we can explicitly give a value for the partition in which to insert the data, as follows:

INSERT INTO TABLE partitioned_user
PARTITION( created_at_date = '2014-01-01')
SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count
FROM user;

This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=5000;

The first two statements enable all partitions (nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node. We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row:

INSERT INTO TABLE partitioned_user
PARTITION( created_at_date )
SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) as created_at_date
FROM user;

Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause. Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table, as in the preceding code. We use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string. Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/key=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-01. If data is added directly to the filesystem, for instance, by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition:

ALTER TABLE <table_name> ADD PARTITION <location>;

Using the MSCK REPAIR TABLE <table_name>; statement, all metadata for all partitions not currently present in the metastore will be added. On EMR, this is equivalent to executing the following code:

ALTER TABLE <table_name> RECOVER PARTITIONS;

Notice that both statements will also work with EXTERNAL tables. In the following article, we will see how this pattern can be exploited to create flexible and interoperable pipelines.

Overwriting and updating data

Partitioning is also useful when we need to update a portion of a table.
Normally a statement of the following form will replace all the data for the destination table:

INSERT OVERWRITE TABLE <table> …

If OVERWRITE is omitted, then each INSERT statement will add additional data to the table. Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched. If we perform an INSERT OVERWRITE statement (or a LOAD OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement. We can also add caveats to the SELECT statement. Say, for example, we only wanted to update data for a certain month:

INSERT INTO TABLE partitioned_user
PARTITION (created_at_date)
SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) as created_at_date
FROM user
WHERE to_date(created_at) BETWEEN '2014-03-01' and '2014-03-31';

Bucketing and sorting

Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure. Let's create bucketed versions of our tweets and user tables; note the additional CLUSTERED BY and SORTED BY clauses in the following CREATE TABLE statements:

CREATE TABLE bucketed_tweets (
tweet_id string,
text string,
in_reply_to string,
retweeted boolean,
user_id string,
place_id string
)
PARTITIONED BY (created_at string)
CLUSTERED BY(user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

CREATE TABLE bucketed_user (
user_id string,
`location` string,
name string,
description string,
followers_count bigint,
friends_count bigint,
favourites_count bigint,
screen_name string,
listed_count bigint
)
PARTITIONED BY (created_at string)
CLUSTERED BY(user_id) SORTED BY(name) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

Note that we changed the tweets table to also be partitioned; you can only bucket a table that is partitioned. Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table:

SET hive.enforce.bucketing=true;

Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT…SELECT… syntax to populate the bucketed table. When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause. One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example.
So, for example, any query of the following form would be vastly improved:

SET hive.optimize.bucketmapjoin=true;
SELECT …
FROM bucketed_user u JOIN bucketed_tweets t
ON u.user_id = t.user_id;

With the join being performed on the column used to bucket the table, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id values in both tables. When determining which rows to match, only those in the corresponding bucket need to be compared, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with, say, one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second.

Sampling data

Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size. For a non-bucketed table, you can sample in a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table:

SELECT max(friends_count)
FROM user
TABLESAMPLE(BUCKET 2 OUT OF 64 ON name);

In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column. It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function. Though it works, this is highly inefficient as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure the number of buckets sampled is equal to or a multiple of the buckets in the table, then Hive will only read the buckets in question. The following code is representative of this case:

SELECT MAX(friends_count)
FROM bucketed_user
TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);

In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition. A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as either a percentage of the table, as an absolute data size, or as a number of rows (in each block).
The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data, or 100 rows per split, respectively:

TABLESAMPLE(0.5 PERCENT)
TABLESAMPLE(1G)
TABLESAMPLE(100 ROWS)

If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input format and file formats that are supported.

Writing scripts

We can place Hive commands in a file and run them with the -f option in the hive CLI utility:

$ cat show_tables.hql
show tables;
$ hive -f show_tables.hql

We can parameterize HiveQL statements by means of the hiveconf mechanism. This allows us to specify an environment variable name at the point it is used rather than at the point of invocation. For example:

$ cat show_tables2.hql
show tables like '${hiveconf:TABLENAME}';
$ hive -hiveconf TABLENAME=user -f show_tables2.hql

The variable can also be set within the Hive script or an interactive session:

SET TABLENAME='user';

The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows:

$ cat show_tables3.hql
show tables like '${hivevar:TABLENAME}';
$ hive -hivevar TABLENAME=user -f show_tables3.hql

Or we can set the variable interactively:

SET hivevar:TABLENAME='user';

Hive and Amazon Web Services

With ElasticMapReduce as the AWS Hadoop-on-demand service, it is of course possible to run Hive on an EMR cluster. But it is also possible to use Amazon storage services, particularly S3, from any Hadoop cluster, be it within EMR or your own local cluster.

Hive and S3

It is possible to specify a default filesystem other than HDFS for Hadoop, and S3 is one option. But it doesn't have to be an all-or-nothing thing; it is possible to have specific tables stored in S3. The data for these tables will be retrieved into the cluster to be processed, and any resulting data can either be written to a different S3 location (the same table cannot be the source and destination of a single query) or onto HDFS. We can take a file of our tweet data and place it onto a location in S3 with a command such as the following:

$ aws s3 cp tweets.tsv s3://<bucket-name>/tweets/

We first need to specify the access key and secret access key that can access the bucket. This can be done in three ways:

Set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey to the appropriate values in the Hive CLI
Set the same values in hive-site.xml, though note that this limits use of S3 to a single set of credentials
Specify the table location explicitly in the table URL, that is, s3n://<access key>:<secret access key>@<bucket>/<path>

Then we can create a table referencing this data, as follows:

CREATE TABLE remote_tweets (
created_at string,
tweet_id string,
text string,
in_reply_to string,
retweeted boolean,
user_id string,
place_id string
)
CLUSTERED BY(user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://<bucket-name>/tweets';

This can be an incredibly effective way of pulling S3 data into a local Hadoop cluster for processing. In order to use AWS credentials in the URI of an S3 location, regardless of how the parameters are passed, the secret and access keys must not contain /, +, or = characters. If necessary, a new set of credentials can be generated from the IAM console at https://console.aws.amazon.com/iam/.
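Once defined, remote_tweets can be queried like any other table, with Hive streaming the S3 objects into the cluster as the query runs. A common pattern is to materialize the remote data into a locally stored table before doing repeated processing; the following is a minimal sketch of that idea (the local_tweets table name is ours and not part of the original examples):

-- Copy the S3-backed data into a table stored on the cluster's HDFS
CREATE TABLE local_tweets AS
SELECT * FROM remote_tweets;

-- Subsequent queries then hit local storage only
SELECT COUNT(*) FROM local_tweets;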
In theory, you can just leave the data in the external table and refer to it when needed to avoid WAN data transfer latencies (and costs), even though it often makes sense to pull the data into a local table and do future processing from there. If the table is partitioned, then you might find yourself retrieving a new partition each day, for example.

Hive on ElasticMapReduce

On one level, using Hive within Amazon ElasticMapReduce is just the same as everything discussed in this article. You can create a persistent cluster, log in to the master node, and use the Hive CLI to create tables and submit queries. Doing all this will use the local storage on the EC2 instances for the table data. Not surprisingly, jobs on EMR clusters can also refer to tables whose data is stored on S3 (or DynamoDB). And, not surprisingly, Amazon has made extensions to its version of Hive to make all this very seamless. It is quite simple from within an EMR job to pull data from a table stored in S3, process it, write any intermediate data to the EMR local storage, and then write the output results into S3, DynamoDB, or one of a growing list of other AWS services. The pattern mentioned earlier where new data is added to a new partition directory for a table each day has proved very effective in S3; it is often the storage location of choice for large and incrementally growing datasets. There is a syntax difference when using EMR; instead of the MSCK command mentioned earlier, the command to update a Hive table with new data added to a partition directory is as follows:

ALTER TABLE <table-name> RECOVER PARTITIONS;

Consult the EMR documentation for the latest enhancements at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html. Also, consult the broader EMR documentation. In particular, the integration points with other AWS services are an area of rapid growth.

Extending HiveQL

The HiveQL language can be extended by means of plugins and third-party functions. In Hive, there are three types of functions characterized by the number of rows they take as input and produce as output:

User Defined Functions (UDFs): These are simpler functions that act on one row at a time.
User Defined Aggregate Functions (UDAFs): These functions take multiple rows as input and generate a single output row. These are aggregate functions to be used in conjunction with a GROUP BY statement (similar to COUNT(), AVG(), MIN(), MAX(), and so on).
User Defined Table Functions (UDTFs): These take a single row as input and generate a logical table comprised of multiple rows that can be used in join expressions.

These APIs are provided only in Java. For other languages, it is possible to stream data through a user-defined script using the TRANSFORM, MAP, and REDUCE clauses that act as a frontend to Hadoop's streaming capabilities. Two APIs are available to write UDFs. A simple API, org.apache.hadoop.hive.ql.exec.UDF, can be used for functions that read and return basic writable types. A richer API, which provides support for data types other than writable, is available via the org.apache.hadoop.hive.ql.udf.generic.GenericUDF class. We'll now illustrate how org.apache.hadoop.hive.ql.exec.UDF can be used to implement a string to ID function similar to the one we used in Iterative Computation with Spark to map hashtags to integers in Pig.
Building a UDF with this API only requires extending the UDF class and writing an evaluate() method, as follows:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class StringToInt extends UDF {
    public Integer evaluate(Text input) {
        if (input == null)
            return null;
        // Map the string to an integer ID via its hash code
        String str = input.toString();
        return str.hashCode();
    }
}

The function takes a Text object as input and maps it to an integer value with the hashCode() method. The source code of this function can be found at https://github.com/learninghadoop2/book-examples/ch7/udf/com/learninghadoop2/hive/udf/StringToInt.java. A more robust hash function should be used in production. We compile the class and archive it into a JAR file, as follows:

$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/udf/StringToInt.java
$ jar cvf myudfs-hive.jar com/learninghadoop2/hive/udf/StringToInt.class

Before being able to use it, a UDF must be registered in Hive with the following commands:

ADD JAR myudfs-hive.jar;
CREATE TEMPORARY FUNCTION string_to_int AS 'com.learninghadoop2.hive.udf.StringToInt';

The ADD JAR statement adds a JAR file to the distributed cache. The CREATE TEMPORARY FUNCTION <function> AS <class> statement registers a function in Hive that implements a given Java class. The function will be dropped once the Hive session is closed. As of Hive 0.13, it is possible to create permanent functions whose definition is kept in the metastore using CREATE FUNCTION …. Once registered, StringToInt can be used in a Hive query just as any other function. In the following example, we first extract a list of hashtags from the tweet's text by applying the regexp_extract function. Then, we use string_to_int to map each tag to a numerical ID:

SELECT unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag) AS tag_id FROM
   (
       SELECT regexp_extract(text,
           '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') as hashtag
       FROM tweets
       GROUP BY regexp_extract(text,
       '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags
GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag);

We can use the preceding query to create a lookup table, as follows:

CREATE TABLE lookuptable (tag string, tag_id bigint);

INSERT OVERWRITE TABLE lookuptable
SELECT unique_hashtags.hashtag,
   string_to_int(unique_hashtags.hashtag) as tag_id
FROM (
   SELECT regexp_extract(text,
       '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag
         FROM tweets
         GROUP BY regexp_extract(text,
           '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)')
   ) unique_hashtags
GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag);

Programmatic interfaces

In addition to the hive and beeline command-line tools, it is possible to submit HiveQL queries to the system via the JDBC and Thrift programmatic interfaces. Support for ODBC was bundled in older versions of Hive, but as of Hive 0.12, it needs to be built from scratch. More information on this process can be found at https://cwiki.apache.org/confluence/display/Hive/HiveODBC.

JDBC

A Hive client written using JDBC APIs looks exactly the same as a client program written for other database systems (for example, MySQL). The following is a sample Hive client program using JDBC APIs. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/ch7/clients/com/learninghadoop2/hive/client/HiveJdbcClient.java.
import java.sql.*;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    // connection string
    public static String URL = "jdbc:hive2://localhost:10000";

    // Show all tables in the default database
    public static String QUERY = "show tables";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        }
        catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection(URL);
        Statement stmt = con.createStatement();

        ResultSet resultSet = stmt.executeQuery(QUERY);
        while (resultSet.next()) {
            System.out.println(resultSet.getString(1));
        }
    }
}

The URL part is the JDBC URI that describes the connection end point. The format for establishing a remote connection is jdbc:hive2://<host>:<port>/<database>. Connections in embedded mode can be established by not specifying a host or port (jdbc:hive2://). The hive and hive2 parts indicate the driver to be used when connecting to HiveServer and HiveServer2, respectively. The QUERY statement contains the HiveQL query to be executed. Hive's JDBC interface exposes only the default database. In order to access other databases, you need to reference them explicitly in the underlying queries using the <database>.<table> notation. First we load the HiveServer2 JDBC driver org.apache.hive.jdbc.HiveDriver. Use org.apache.hadoop.hive.jdbc.HiveDriver to connect to HiveServer. Then, as with any other JDBC program, we establish a connection to URL and use it to instantiate a Statement class. We execute QUERY, with no authentication, and store the output dataset into the ResultSet object. Finally, we scan resultSet and print its content to the command line. Compile and execute the example with the following commands:

$ javac HiveJdbcClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar:. com.learninghadoop2.hive.client.HiveJdbcClient

Thrift

Thrift provides lower-level access to Hive and has a number of advantages over the JDBC implementation of HiveServer. Primarily, it allows multiple connections from the same client, and it allows programming languages other than Java to be used with ease. With HiveServer2, it is a less commonly used option but still worth mentioning for compatibility. A sample Thrift client implemented using the Java API can be found at https://github.com/learninghadoop2/book-examples/ch7/clients/com/learninghadoop2/hive/client/HiveThriftClient.java. This client can be used to connect to HiveServer, but due to protocol differences, the client won't work with HiveServer2. In the example, we define a getClient() method that takes as input the host and port of a HiveServer service and returns an instance of org.apache.hadoop.hive.service.ThriftHive.Client.
A client is obtained by first instantiating a socket connection, org.apache.thrift.transport.TSocket, to the HiveServer service, and by specifying a protocol, org.apache.thrift.protocol.TBinaryProtocol, to serialize and transmit data, as follows:

       TSocket transport = new TSocket(host, port);
       transport.setTimeout(TIMEOUT);
       transport.open();
       TBinaryProtocol protocol = new TBinaryProtocol(transport);
       client = new ThriftHive.Client(protocol);

Finally, we call getClient() from the main method and use the client to execute a query against an instance of HiveServer running on localhost on port 11111, as follows:

     public static void main(String[] args) throws Exception {
         Client client = getClient("localhost", 11111);
         client.execute("show tables");
         List<String> results = client.fetchAll();
         for (String result : results) {
             System.out.println(result);
         }
     }

Make sure that HiveServer is running on port 11111, and if not, start an instance with the following command:

$ sudo hive --service hiveserver -p 11111

Compile and execute the HiveThriftClient.java example with the following commands:

$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/client/HiveThriftClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*:. com.learninghadoop2.hive.client.HiveThriftClient

Stinger initiative

Hive has remained very successful and capable since its earliest releases, particularly in its ability to provide SQL-like processing on enormous datasets. But other technologies did not stand still, and Hive acquired a reputation of being relatively slow, particularly in regard to lengthy startup times on large jobs and its inability to give quick responses to conceptually simple queries. These perceived limitations were less due to Hive itself and more a consequence of the fact that translating SQL queries into the MapReduce model carries much built-in inefficiency when compared to other ways of implementing a SQL query. Particularly in regard to very large datasets, MapReduce saw lots of I/O (and consequently time) spent writing out the results of one MapReduce job just to have them read by another. As discussed in Processing - MapReduce and Beyond, this is a major driver in the design of Tez, which can schedule work on a Hadoop cluster as a graph of tasks that does not require inefficient writes and reads between the tasks in the graph. The following is a query we will consider on the MapReduce framework versus Tez:

SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;

The following figure contrasts the execution plan for the preceding query on the MapReduce framework versus Tez:

(Figure: Hive on MapReduce versus Tez)

In plain MapReduce, two jobs are created for the GROUP BY and JOIN clauses. The first job is composed of a set of MapReduce tasks that read data from the disk to carry out grouping. The reducers write intermediate results to the disk so that output can be synchronized. The mappers in the second job read the intermediate results from the disk as well as data from table b. The combined dataset is then passed to the reducer where shared keys are joined. Were we to execute an ORDER BY statement, this would have resulted in a third job and further MapReduce passes. The same query is executed on Tez as a single job by a single set of Map tasks that read data from the disk; grouping and joining are pipelined across the reducers, avoiding the intermediate I/O.
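If you want to see these differences on your own cluster, Hive's EXPLAIN statement prints the planned stages for a query (and, when Tez is enabled, the Tez task graph) without executing it. A quick sketch using the join from above:

-- Show the execution plan without running the query
EXPLAIN
SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;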
Alongside these architectural limitations, there were quite a few areas around SQL language support that could also provide better efficiency, and in early 2013, the Stinger initiative was launched with an explicit goal of making Hive over 100 times as fast and with much richer SQL support. Hive 0.13 has all the features of the three phases of Stinger, resulting in a much more complete SQL dialect. Also, Tez is offered as a more efficient execution framework in addition to the MapReduce-based implementation atop YARN. With Tez as the execution engine, Hive is no longer limited to a series of linear MapReduce jobs and can instead build a processing graph where any given step can, for example, stream results to multiple sub-steps. To take advantage of the Tez framework, there is a new Hive variable setting, as follows:

set hive.execution.engine=tez;

This setting relies on Tez being installed on the cluster; it is available in source form from http://tez.incubator.apache.org or in several distributions, though at the time of writing, not Cloudera, due to its support of Impala. The alternative value is mr, which uses the classic MapReduce model (atop YARN), so it is possible in a single installation to compare the performance of Hive with and without Tez.

Impala

Hive is not the only product providing the SQL-on-Hadoop capability. The second most widely used is likely Impala, announced in late 2012 and released in spring 2013. Though originally developed internally within Cloudera, its source code is periodically pushed to an open source Git repository (https://github.com/cloudera/impala). Impala was created out of the same perception of Hive's weaknesses that led to the Stinger initiative. Impala also took some inspiration from Google Dremel (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf), which was first openly described by a paper published in 2010. Dremel was built at Google to address the gap between the need for very fast queries on very large datasets and the high latency inherent in the existing MapReduce model underpinning Hive at the time. Dremel was a sophisticated approach to this problem that, rather than building mitigations atop MapReduce such as those implemented by Hive, instead created a new service that accessed the same data stored in HDFS. Dremel also benefited from significant work to optimize the storage format of the data in a way that made it more amenable to very fast analytic queries.

The architecture of Impala

The basic architecture has three main components: the Impala daemons, the state store, and the clients. Recent versions have added additional components that improve the service, but we'll focus on the high-level architecture. The Impala daemon (impalad) should be run on each host where a DataNode process is managing HDFS data. Note that impalad does not access the filesystem blocks through the full HDFS FileSystem API; instead, it uses a feature called short-circuit reads to make data access more efficient. When a client submits a query, it can do so to any of the running impalad processes, and this one will become the coordinator for the execution of that query. The key aspect of Impala's performance is that for each query, it generates custom native code, which is then pushed to and executed by all the impalad processes on the system.
This highly optimized code performs the query on the local data, and each impalad then returns its subset of the result set to the coordinator node, which performs the final data consolidation to produce the final result. This type of architecture should be familiar to anyone who has worked with any of the (usually commercial and expensive) Massively Parallel Processing (MPP) data warehouse solutions available today; MPP is the term used for this type of shared-nothing, scale-out architecture. As the cluster runs, the state store ensures that each impalad process is aware of all the others and provides a view of the overall cluster health.

Co-existing with Hive

Impala, as a newer product, tends to have a more restricted set of SQL data types and supports a more constrained dialect of SQL than Hive. It is, however, expanding this support with each new release. Refer to the Impala documentation (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/impala.html) to get an overview of the current level of support. Impala supports the metastore mechanism used by Hive as a store of the metadata surrounding its table structure and storage. This means that on a cluster with an existing Hive setup, it should be immediately possible to use Impala as it will access the same metastore and therefore provide access to the same tables available in Hive. But be warned that the differences in SQL dialect and data types might cause unexpected results when working in a combined Hive and Impala environment. Some queries might work on one but not the other, they might show very different performance characteristics (more on this later), or they might actually give different results. This last point might become apparent when using data types such as float and double that are simply treated differently in the underlying systems (Hive is implemented in Java while Impala is written in C++). As of version 1.2, Impala supports UDFs written in both C++ and Java, although C++ is strongly recommended as a much faster solution. Keep this in mind if you are looking to share custom functions between Hive and Impala.

A different philosophy

When Impala was first released, its greatest benefit was in how it truly enabled what is often called speed of thought analysis. Queries could be returned sufficiently fast that an analyst could explore a thread of analysis in a completely interactive fashion without having to wait for minutes at a time for each query to complete. It's fair to say that most adopters of Impala were at times stunned by its performance, especially when compared to the version of Hive shipping at the time. The Impala focus has remained mostly on these shorter queries, and this does impose some limitations on the system. Impala tends to be quite memory-heavy as it relies on in-memory processing to achieve much of its performance. If a query requires a working dataset larger than the memory available on the executing nodes, then that query will simply fail in versions of Impala before 2.0. Comparing the work on Stinger to Impala, it could be argued that Impala has a much stronger focus on excelling in the shorter (and arguably more common) queries that support interactive data analysis. Many business intelligence tools and services are now certified to run directly on Impala. The Stinger initiative has put less effort into making Hive just as fast in the area where Impala excels but has instead improved Hive (to varying degrees) for all workloads.
Impala is still developing at a fast pace and Stinger has put additional momentum into Hive, so it is most likely wise to consider both products and determine which best meets the performance and functionality requirements of your projects and workflows. It should also be kept in mind that there are competitive commercial pressures shaping the direction of Impala and Hive. Impala was created and is still driven by Cloudera, the most popular vendor of Hadoop distributions. The Stinger initiative, though contributed to by many companies as diverse as Microsoft (yes, really!) and Intel, was led by Hortonworks, probably the second largest vendor of Hadoop distributions. The fact is that if you are using the Cloudera distribution of Hadoop, then some of the core features of Hive might be slower to arrive, whereas Impala will always be up-to-date. Conversely, if you use another distribution, you might get the latest Hive release, but that might either have an older Impala or, as is currently the case, you might have to download and install it yourself. A similar situation has arisen with the Parquet and ORC file formats mentioned earlier. Parquet is preferred by Impala and developed by a group of companies led by Cloudera, while ORC is preferred by Hive and is championed by Hortonworks. Unfortunately, the reality is that Parquet support is often very quick to arrive in the Cloudera distribution but less so in, say, the Hortonworks distribution, where the ORC file format is preferred. These themes are a little concerning; although competition in this space is a good thing, and arguably the announcement of Impala helped energize the Hive community, there is a greater risk than in the past that your choice of distribution will have a large impact on the tools and file formats that are fully supported. Hopefully, the current situation is just an artifact of where we are in the development cycles of all these new and improved technologies, but do consider your choice of distribution carefully in relation to your SQL-on-Hadoop needs.

Drill, Tajo, and beyond

You should also consider that SQL on Hadoop no longer only refers to Hive or Impala. Apache Drill (http://drill.incubator.apache.org) is a fuller implementation of the Dremel model first described by Google. Although Impala implements the Dremel architecture across HDFS data, Drill looks to provide similar functionality across multiple data sources. It is still in its early stages, but if your needs are broader than what Hive or Impala provides, it might be worth considering. Tajo (http://tajo.apache.org) is another Apache project that seeks to be a full data warehouse system on Hadoop data. With an architecture similar to that of Impala, it offers a much richer system with components such as multiple optimizers and ETL tools that are commonplace in traditional data warehouses but less frequently bundled in the Hadoop world. It has a much smaller user base but has been used by certain companies very successfully for a significant length of time, and might be worth considering if you need a fuller data warehousing solution. Other products are also emerging in this space, and it's a good idea to do some research. Hive and Impala are awesome tools, but if you find that they don't meet your needs, then look around; something else might.

Summary

In its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer.
Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world. HiveQL is an implementation of SQL on Hadoop and was the primary focus of this article. In regard to HiveQL and its implementations, we covered the following topics:

How HiveQL provides a logical model atop data stored in HDFS, in contrast to relational databases where the table structure is enforced in advance
How HiveQL supports many standard SQL data types and commands, including joins and views
The ETL-like features offered by HiveQL, including the ability to import data into tables and optimize the table structure through partitioning and similar mechanisms
How HiveQL offers the ability to extend its core set of operators with user-defined code, and how this contrasts with the Pig UDF mechanism
The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez
The broader ecosystem around HiveQL that now includes products such as Impala, Tajo, and Drill, and how each of these focuses on specific areas in which to excel

With Pig and Hive, we've introduced alternative models to process MapReduce data, but so far we've not looked at another question: what approaches and tools are required to actually allow this massive dataset being collected in Hadoop to remain useful and manageable over time? In the next article, we'll take a slight step up the abstraction hierarchy and look at how to manage the life cycle of this enormous data asset.