Packt
06 Feb 2015
30 min read

Structural Equation Modeling and Confirmatory Factor Analysis

In this article by Paul Gerrard and Radia M. Johnson, the authors of Mastering Scientific Computation with R, we'll discuss the fundamental ideas underlying structural equation modeling (SEM), which are often overlooked in other books discussing SEM in R, and then delve into how SEM is done in R. We will then discuss two R packages, OpenMx and lavaan. OpenMx lets us directly apply our discussion of the linear algebra underlying SEM, so we will go over it first. We will then discuss lavaan, which is probably more user friendly because it sweeps the matrices and linear algebra representations under the rug so that they are invisible unless the user really goes looking for them. Both packages continue to be developed, and there will always be some features better supported in one of these packages than in the other.

SEM model fitting and estimation methods

To ultimately find a good solution, software has to use trial and error to come up with an implied covariance matrix that matches the observed covariance matrix as well as possible. The question is, what does "as well as possible" mean? The answer is that the software must try to minimize some particular criterion, usually some sort of discrepancy function. Just what that criterion is depends on the estimation method used. The most commonly used estimation methods in SEM include:

• Ordinary least squares (OLS), also called unweighted least squares
• Generalized least squares (GLS)
• Maximum likelihood (ML)

There are a number of other estimation methods as well, some of which can be done in R, but here we will stick with describing the most common ones. In general, OLS is the simplest and computationally cheapest estimation method. GLS is computationally more demanding, and ML is the most computationally intensive of the three. We will see why this is as we discuss the details of these estimation methods.
Any SEM estimation method seeks to estimate model parameters that recreate the observed covariance matrix as well as possible. To evaluate how closely an implied covariance matrix matches an observed covariance matrix, we need a discrepancy function. If we assume multivariate normality of the observed variables, the following function (shown here in its standard form) can be used to assess discrepancy:

$$F = \tfrac{1}{2}\,\mathrm{tr}\left\{\left[(R - C)\,V\right]^{2}\right\}$$

In the preceding equation, R is the observed covariance matrix, C is the implied covariance matrix, and V is a weight matrix. The tr function refers to the trace function, which sums the elements of the main diagonal. The choice of V varies based on the SEM estimation method:

• For OLS, $V = I$
• For GLS, $V = R^{-1}$

With ML estimation, we seek to minimize one of a number of similar criteria, such as the following:

$$F_{ML} = \ln|C| - \ln|R| + \mathrm{tr}\left(R\,C^{-1}\right) - n$$

In the preceding equation, n is the number of variables. There are a couple of points worth noting here. GLS estimation inverts the observed covariance matrix, something computationally demanding with large matrices, but something that must only be done once. In contrast, ML requires inversion of the implied covariance matrix, which changes with each iteration. Thus, each iteration requires the computationally demanding step of matrix inversion. With modern fast computers, this difference may not be noticeable, but with large SEM models, it might start to be quite time-consuming.

Assessing SEM model fit

The final question in an SEM model is how well the model explains the data. This is answered with the use of SEM measures of fit. Most of these measures are based on a chi-square distribution. The fit criterion for GLS and ML (as well as a number of other estimation procedures such as asymptotic distribution-free methods), multiplied by N - 1, is approximately chi-square distributed. Here, the capital N represents the number of observations in the dataset, as opposed to the lowercase n, which gives the number of variables.
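The estimation criteria described above can be sketched in a few lines of base R. This is a minimal illustration, assuming the usual textbook forms of the weighted least-squares and ML discrepancy functions (spelled out in the code comments). The function names and the 2 x 2 matrices are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Weighted least-squares discrepancy: 0.5 * tr(D %*% D), with D = (R - C) %*% V.
# R = observed covariance matrix, C = implied covariance matrix, V = weight matrix.
wls_discrepancy <- function(R, C, V) {
  D <- (R - C) %*% V
  0.5 * sum(diag(D %*% D))
}

# OLS uses the identity matrix as the weight; GLS uses the inverse of R.
ols_discrepancy <- function(R, C) wls_discrepancy(R, C, diag(nrow(R)))
gls_discrepancy <- function(R, C) wls_discrepancy(R, C, solve(R))

# ML discrepancy: log|C| - log|R| + tr(R %*% C^-1) - n, with n variables.
# Note that C (which changes at every iteration) is the matrix being inverted.
ml_discrepancy <- function(R, C) {
  n <- nrow(R)
  log(det(C)) - log(det(R)) + sum(diag(R %*% solve(C))) - n
}

# Hypothetical observed and implied covariance matrices
R_obs <- matrix(c(1.0, 0.5,
                  0.5, 1.0), nrow = 2, byrow = TRUE)
C_imp <- matrix(c(1.0, 0.4,
                  0.4, 1.0), nrow = 2, byrow = TRUE)

f_ols <- ols_discrepancy(R_obs, C_imp)
f_gls <- gls_discrepancy(R_obs, C_imp)
f_ml  <- ml_discrepancy(R_obs, C_imp)

# Approximate chi-square statistic for a hypothetical sample of N observations
N <- 100
chisq_ml <- (N - 1) * f_ml
```

All three functions return zero when the implied matrix equals the observed one and grow as the two matrices diverge, which is the property an optimizer exploits when it searches for parameter estimates.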
We compute degrees of freedom as the number of known variances and covariances (that is, the total number of values in one triangle of the observed covariance matrix, diagonal included) minus the number of estimated parameters. This gives rise to the first test statistic for SEM models: a chi-square significance test that compares our chi-square value with the minimum chi-square threshold needed to achieve statistical significance. As with conventional chi-square testing, a chi-square value that is higher than this threshold rejects the null hypothesis. In most experimental science, such a rejection supports the hypothesis of the experiment. This is not the case in SEM, where the null hypothesis is that the model fits the data. Thus, a non-significant chi-square is an indicator of model fit, whereas a significant chi-square rejects model fit. A notable limitation of this is that a greater sample size (greater N) will increase the chi-square value and will therefore increase the power to reject model fit. Thus, conventional chi-square testing will tend to support models developed in small samples and reject models developed in large samples.

The choice and interpretation of fit measures is a contentious one in the SEM literature. However, as can be seen, chi-square has limitations. As such, other model fit criteria were developed that do not penalize models fit to large samples (some may penalize models fit to small samples, though). There are over a dozen indices, but the most common fit indices and their interpretation are as follows:

• Comparative fit index: In this index, a higher value is better. Conventionally, a value greater than 0.9 was considered an indicator of good model fit, but some might argue that a value of at least 0.95 is needed. This index is relatively insensitive to sample size.
• Root mean square error of approximation: A value under 0.08 (smaller is better) is often considered necessary to achieve model fit. However, this fit measure is quite sensitive to sample size, penalizing small-sample studies.
• Tucker-Lewis index (non-normed fit index): This is interpreted in a similar manner to the comparative fit index. It is also not very sensitive to sample size.
• Standardized root mean square residual: In this index, a lower value is better. A value of 0.06 or less is considered necessary for model fit. This index may also penalize small samples.

In the next section, we will show you how to actually fit SEM models in R and how to evaluate fit using fit measures.

Using OpenMx and matrix specification of an SEM

We went through the basic principles of SEM and discussed the basic computational approach by which it can be achieved. SEM remains an active area of research (with an entire journal devoted to it, Structural Equation Modeling), so there are many additional peculiarities, but rather than delving into all of them, we will start by actually fitting an SEM model in R. At the time of writing, OpenMx is not in the CRAN repository, but it is easily obtainable from the OpenMx website by typing the following in R and then loading the package:

    source('http://openmx.psyc.virginia.edu/getOpenMx.R')
    library(OpenMx)

Summarizing the OpenMx approach

In this example, we will use OpenMx by specifying matrices as mentioned earlier. To fit an OpenMx model, we need to first specify the model and then tell the software to attempt to fit the model. Model specification involves four components:

1. Specifying the model matrices; this has two parts:
   • Declaring starting values for the estimation
   • Declaring which values can be estimated and which are fixed
2. Telling OpenMx the algebraic relationship of the matrices that should produce an implied covariance matrix
3. Giving an instruction for the model fitting criterion
4. Providing a source of data

The R commands that correspond to each of these steps are:

1. mxMatrix
2. mxAlgebra
3. mxMLObjective
4. mxData

We will then pass the objects created with each of these commands to mxModel to create an SEM model.
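Before turning to the OpenMx objects themselves, it may help to see the second component, the algebra that turns model matrices into an implied covariance matrix, in plain base R. The sketch below uses a hypothetical one-factor model with three indicators; the loadings, variances, and object names are invented for illustration, while the matrix expression has the same form as the one later passed to mxAlgebra in this article.

```r
# Hypothetical one-factor model: variables 1-3 are observed, variable 4 is latent.
loadings <- c(1.0, 0.8, 0.6)

# A holds directed paths: A[i, j] is the path from variable j to variable i.
A <- matrix(0, nrow = 4, ncol = 4)
A[1:3, 4] <- loadings

# S holds variances and covariances: residual variances of 0.5 for the
# observed variables and a variance of 1 for the latent variable.
S <- diag(c(0.5, 0.5, 0.5, 1.0))

# Filter keeps the observed variables (rows 1-3) and drops the latent one.
Filter <- cbind(diag(3), 0)
I_mat  <- diag(4)

# Implied covariance matrix of the observed variables
inv_IA <- solve(I_mat - A)
C <- Filter %*% inv_IA %*% S %*% t(inv_IA) %*% t(Filter)

# For this simple model the result should equal
# loadings %*% t(loadings) * latent variance + residual variances.
expected <- tcrossprod(loadings) * 1.0 + diag(0.5, 3)
```

Fitting the model then amounts to adjusting the free entries of A and S until C is as close as possible to the observed covariance matrix under the chosen discrepancy function.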
Explaining an entire example

First, to make things simple, we will store the FALSE and TRUE logical values in single-letter variables, which will be convenient when we have matrices full of TRUE and FALSE values:

    F <- FALSE
    T <- TRUE

Specifying the model matrices

Specifying matrices is done with the mxMatrix function, which returns an MxMatrix object. (Note that the object name starts with a capital "M" while the function name starts with a lowercase "m".) Specifying an MxMatrix is much like specifying a regular R matrix, but an MxMatrix has some additional components. The most notable difference is that two different matrices are used to create an MxMatrix. The first is a matrix of starting values, and the second is a matrix that tells which starting values are free to be estimated and which are not. If a starting value is not freely estimable, then it is a fixed constant. Since the actual starting values that we choose do not matter too much in this case, we will just pick 1 as the starting value for all parameters that we would like to be estimated.
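The pairing of a starting-value matrix with a logical "free" matrix can be illustrated in base R before using mxMatrix. This is only a sketch of the idea, not of OpenMx internals; the object names and the estimates written back at the end are hypothetical.

```r
# Loadings of three indicators on one factor (a 3 x 1 matrix of starting values).
start_values <- matrix(c(1, 1, 1), nrow = 3, ncol = 1)

# The first loading is fixed at 1 to set the scale; the other two are free.
is_free <- matrix(c(FALSE, TRUE, TRUE), nrow = 3, ncol = 1)

# Free cells are the parameters an optimizer is allowed to change.
free_params <- start_values[is_free]

# Fixed cells stay at their starting value for the whole estimation.
fixed_cells <- start_values[!is_free]

# Writing (hypothetical) estimates back only touches the free cells.
estimates <- start_values
estimates[is_free] <- c(0.8, 0.6)
```

The same logic scales to the 14 x 14 matrices used in this example: every TRUE in the free matrix marks one parameter to be estimated, and every FALSE marks a constant.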
Let's take a look at the following example:

    mx.A <- mxMatrix(
        type = "Full", nrow = 14, ncol = 14,
        # Provide the starting values
        values = c(
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0
        ),
        # Tell R which values are free to be estimated
        free = c(
            F, F, F, F, F, F, F, F, F, F, F, F, F, F,
            F, F, F, F, F, F, F, F, F, F, F, F, T, F,
            F, F, F, F, F, F, F, F, F, F, F, F, T, F,
            F, F, F, F, F, F, F, F, F, F, F, F, T, F,
            F, F, F, F, F, F, F, F, F, F, F, F, F, F,
            F, F, F, F, F, F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, F, F, F, F, F,
            F, F, F, F, F, F, F, F, F, F, F, T, F, F,
            F, F, F, F, F, F, F, F, F, F, F, T, F, F,
            F, F, F, F, F, F, F, F, F, F, F, F, F, F,
            F, F, F, F, F, F, F, F, F, F, F, T, F, F,
            F, F, F, F, F, F, F, F, F, F, F, T, T, F
        ),
        byrow = TRUE,
        # Provide a matrix name that will be used in model fitting
        name = "A"
    )

We will now apply this same technique to the S matrix. Here, we will create two S matrices, S1 and S2. They differ only in the starting values that they supply. We will later try to fit an SEM model using the first matrix, and then use the second to address problems with the first. The difference is that S1 uses starting variances of 1 on the diagonal, and S2 uses starting variances of 5.
Here, we will use the "Symm" matrix type, which is a symmetric matrix. We could use the "Full" matrix type, but by using "Symm", we are saved from typing all of the symmetric values in the upper half of the matrix. Let's take a look at the following matrix:

    mx.S1 <- mxMatrix("Symm", nrow = 14, ncol = 14,
        values = c(
            1,
            0, 1,
            0, 0, 1,
            0, 1, 0, 1,
            1, 0, 0, 0, 1,
            0, 1, 0, 0, 0, 1,
            0, 0, 1, 0, 0, 0, 1,
            0, 0, 0, 1, 0, 1, 0, 1,
            0, 0, 0, 0, 0, 0, 0, 0, 1,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
        ),
        free = c(
            T,
            F, T,
            F, F, T,
            F, T, F, T,
            T, F, F, F, T,
            F, T, F, F, F, T,
            F, F, T, F, F, F, T,
            F, F, F, T, F, T, F, T,
            F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, F, F, F, F, T
        ),
        byrow = TRUE,
        name = "S"
    )

    # The alternative, S2 matrix:
    mx.S2 <- mxMatrix("Symm", nrow = 14, ncol = 14,
        values = c(
            5,
            0, 5,
            0, 0, 5,
            0, 1, 0, 5,
            1, 0, 0, 0, 5,
            0, 1, 0, 0, 0, 5,
            0, 0, 1, 0, 0, 0, 5,
            0, 0, 0, 1, 0, 1, 0, 5,
            0, 0, 0, 0, 0, 0, 0, 0, 5,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 5,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5
        ),
        free = c(
            T,
            F, T,
            F, F, T,
            F, T, F, T,
            T, F, F, F, T,
            F, T, F, F, F, T,
            F, F, T, F, F, F, T,
            F, F, F, T, F, T, F, T,
            F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, F, F, F, T,
            F, F, F, F, F, F, F, F, F, F, F, F, F, T
        ),
        byrow = TRUE,
        name = "S"
    )

And finally, we will create our filter and identity matrices the same way, as follows:

    mx.Filter <- mxMatrix("Full", nrow = 11, ncol = 14,
        values = c(
            1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0
        ),
        free = FALSE,
        name = "Filter",
        byrow = TRUE
    )

    mx.I <- mxMatrix("Full", nrow = 14, ncol = 14,
        values = c(
            1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
        ),
        free = FALSE,
        byrow = TRUE,
        name = "I"
    )

Fitting the model

Now, it is time to declare the model that we would like to fit using the mxModel command. This part covers steps 2 through 4 mentioned earlier. Here, we will tell mxModel which matrices to use. We will then use the mxAlgebra command to tell R how the matrices should be combined to reproduce the implied covariance matrix. We will tell R to use ML estimation with the mxMLObjective command, and we will tell it to apply the estimation to a particular matrix algebra, which we named "C".
This is simply the right-hand side of the McArdle-McDonald equation. Finally, we will tell R where to get the data to use in model fitting using the following code:

    library(lavaan)  # provides the PoliticalDemocracy dataset

    factorModel.1 <- mxModel("Political Democracy Model",
        # Model matrices
        mx.A,
        mx.S1,
        mx.Filter,
        mx.I,
        # Model fitting instructions
        mxAlgebra(Filter %*% solve(I - A) %*% S %*% t(solve(I - A)) %*% t(Filter), name = "C"),
        mxMLObjective("C", dimnames = names(PoliticalDemocracy)),
        # Data to fit
        mxData(cov(PoliticalDemocracy), type = "cov", numObs = 75)
    )

Now, let's tell R to fit the model and summarize the results using mxRun, as follows:

    summary(mxRun(factorModel.1))

    Running Political Democracy Model
    Error in summary(mxRun(factorModel.1)) :
      error in evaluating the argument 'object' in selecting a method for function 'summary': Error: The job for model 'Political Democracy Model' exited abnormally with the error message: Expected covariance matrix is non-positive-definite.

Uh oh! We got an error message telling us that the expected covariance matrix is not positive definite. Our observed covariance matrix is positive definite, but the implied covariance matrix (at least at first) is not. This happens because multiplying our starting value matrices together as specified by the McArdle-McDonald equation gives us a starting implied covariance matrix. If we perform an eigenvalue decomposition of this starting implied covariance matrix, we will find that the last eigenvalue is negative. A negative eigenvalue implies a negative variance, which does not make much sense, and this is what "not positive definite" refers to. The good news is that the problem lies only in our starting values, so we can fix it by modifying them. In this case, we can choose values of 5 along the diagonal of the S matrix and get a positive definite starting implied covariance matrix.
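The role of the eigenvalues can be seen on a small scale in base R. The sketch below does not rebuild the full 14 x 14 starting matrices. It takes only the block of S for y2, y4, y6, and y8, assuming the lower triangle listed earlier is read row by row (so that the starting covariances of 1 link y2-y4, y2-y6, y4-y8, and y6-y8), and compares starting variances of 1 and 5. The helper names are hypothetical, and the check illustrates the mechanism rather than replicating what OpenMx does internally.

```r
# Build the 4 x 4 block of starting values for y2, y4, y6, and y8.
build_block <- function(variance) {
  vars <- c("y2", "y4", "y6", "y8")
  block <- diag(variance, 4)
  dimnames(block) <- list(vars, vars)
  pairs <- rbind(c("y2", "y4"), c("y2", "y6"), c("y4", "y8"), c("y6", "y8"))
  for (k in seq_len(nrow(pairs))) {
    block[pairs[k, 1], pairs[k, 2]] <- 1  # starting covariance of 1
    block[pairs[k, 2], pairs[k, 1]] <- 1  # keep the block symmetric
  }
  block
}

# A symmetric matrix is positive definite when all eigenvalues are positive.
is_positive_definite <- function(M, tol = 1e-8) {
  all(eigen(M, symmetric = TRUE, only.values = TRUE)$values > tol)
}

eig_var1 <- eigen(build_block(1), symmetric = TRUE, only.values = TRUE)$values
eig_var5 <- eigen(build_block(5), symmetric = TRUE, only.values = TRUE)$values

pd_var1 <- is_positive_definite(build_block(1))
pd_var5 <- is_positive_definite(build_block(5))
```

With variances of 1, covariances of 1 imply perfect correlations that cannot all hold at once, and the smallest eigenvalue of the block drops below zero. Raising the variances to 5 shrinks the implied correlations to 0.2 and all eigenvalues become positive.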
We can rerun this using the mx.S2 matrix specified earlier, and the software will proceed as follows:

    # Rerun with a positive definite matrix
    factorModel.2 <- mxModel("Political Democracy Model",
        # Model matrices
        mx.A,
        mx.S2,
        mx.Filter,
        mx.I,
        # Model fitting instructions
        mxAlgebra(Filter %*% solve(I - A) %*% S %*% t(solve(I - A)) %*% t(Filter), name = "C"),
        mxMLObjective("C", dimnames = names(PoliticalDemocracy)),
        # Data to fit
        mxData(cov(PoliticalDemocracy), type = "cov", numObs = 75)
    )

    summary(mxRun(factorModel.2))

This should provide a solution. The parameters solved in the model are returned as matrix components. Just like we had to figure out how to go from paths to matrices, we now have to figure out how to go from matrices to paths (the reverse problem). In the following screenshot, we show just the first few free parameters:

[Screenshot: first few free parameters from the model summary]

The preceding screenshot tells us that the parameter estimated in the tenth row and twelfth column of the matrix A is 2.18. This corresponds to a path from the twelfth variable in the A matrix, ind60, to the tenth variable in the matrix, x2. Thus, the path coefficient from ind60 to x2 is 2.18.

There are a few other pieces of information here. The first one tells us that the model has not converged but is "Mx status Green." This means that the model was still converging when it stopped running (that is, it did not converge), but an optimal solution was still found and therefore the results are likely reliable. Model fit information is also provided, suggesting a pretty good model fit with a CFI of 0.99 and an RMSEA of 0.032.

This was a fair amount of work, and creating model matrices by hand from path diagrams can be quite tedious. For this reason, SEM fitting programs have generally adopted the ability to fit SEM by declaring paths rather than model matrices. OpenMx also allows declaration by paths, but applying model matrices has a few advantages.
Principally, we get under the hood of SEM fitting. If we step back, we can see that OpenMx actually did very little for us that is specific to SEM. We told OpenMx how we wanted matrices multiplied together and which parameters of the matrices were free to be estimated. Instead of using the RAM specification, we could have passed the matrices of the LISREL or Bentler-Weeks models with the corresponding algebra to recreate an implied covariance matrix. This means that if we are trying to come up with our own matrix specification, reproduce prior research, or apply a new SEM matrix specification method published in the literature, OpenMx gives us the power to do it. Also, for educators wishing to teach the underlying mathematical ideas of SEM, OpenMx is a very powerful tool.

Fitting SEM models using lavaan

If we were to describe OpenMx as the SEM equivalent of a well-stocked pantry and full kitchen, where you can create whatever you want provided you have the time and know-how, we might regard lavaan as a large freezer full of prepackaged microwavable dinners. It does not allow quite as much flexibility as OpenMx because it sweeps much of the work that we did by hand in OpenMx under the rug. Lavaan does use an internal matrix representation, but the user never has to see it. It is this sweeping under the rug that makes lavaan generally much easier to use. It is worth adding that the list of prepackaged features built into lavaan, available with minimal additional programming, rivals that of many commercial SEM packages.
The lavaan syntax

The key to describing lavaan models is the model syntax, as follows:

• X =~ Y: Y is a manifestation of the latent variable X
• Y ~ X: Y is regressed on X
• Y ~~ X: The covariance between Y and X can be estimated
• Y ~ 1: This estimates the intercept for Y (implicitly requires mean structure)
• Y | a*t1 + b*t2: Y has two thresholds, a and b
• Y ~ a * X: Y is regressed on X with coefficient a
• Y ~ start(a) * X: Y is regressed on X; the starting value used for estimation is a

It may not be evident at first, but this model description language actually makes lavaan quite powerful. Wherever you have seen a or b in the previous examples, a variable or constant can be used in its place. The beauty of this is that multiple parameters can be constrained to be equal simply by assigning a single parameter name to them.

Using lavaan, we can fit a factor analysis model to our physical functioning dataset with only a few lines of code:

    phys.func.data <- read.csv('phys_func.csv')[-1]
    names(phys.func.data) <- LETTERS[1:20]

R has a built-in vector named LETTERS, which contains all of the capital letters of the English alphabet. The built-in vector letters contains the lowercase alphabet.

We will then describe our model using the lavaan syntax. Here, we have a model of three latent variables, our factors, and each of them has manifest variables. Let's take a look at the following example:

    model.definition.1 <- '
        # Factors
        Cognitive =~ A + Q + R + S
        Legs =~ B + C + D + H + I + J + M + N
        Arms =~ E + F + G + K + L + O + P + T
        # Correlations between factors
        Cognitive ~~ Legs
        Cognitive ~~ Arms
        Legs ~~ Arms
    '

We then tell lavaan to fit the model as follows:

    fit.phys.func <- cfa(model.definition.1, data = phys.func.data,
        ordered = c('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
                    'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T'))

In the previous code, we add an ordered = argument, which tells lavaan that some variables are ordinal in nature.
In response, lavaan estimates polychoric correlations for these variables. Polychoric correlations assume that we binned a continuous variable into discrete categories, and attempt to explicitly model correlations assuming that there is some continuous underlying variable. Part of this requires finding thresholds (placed on an arbitrary scale) between each pair of adjacent categorical responses (for example, threshold 1 falls between the responses of 1 and 2, and so on). By telling lavaan to treat some variables as categorical, lavaan will also know to use a special estimation method. Lavaan will use diagonally weighted least squares, which does not assume normality and uses the diagonals of the polychoric correlation matrix for weights in the discrepancy function. With five response options, it is questionable whether polychoric correlations are truly needed. Some analysts might argue that with many response options, the data can be treated as continuous, but here we use this method to show off lavaan's capabilities.

All SEM models in lavaan use the lavaan command. Here, we use the cfa command, which is one of a number of wrapper functions for the lavaan command. Others include sem and growth. These commands differ in the default options passed to the lavaan command. (For full details, see the package documentation.)

Summarizing the fitted model, we can see the loadings of each item on its factor as well as the factor intercorrelations. We can also see the thresholds between each category from the polychoric correlations as follows:

    summary(fit.phys.func)

We can also assess things such as model fit using the fitMeasures command, which has most of the popularly used fit measures and even a few obscure ones. Here, we tell lavaan to simply extract three measures of model fit as follows:

    fitMeasures(fit.phys.func, c('rmsea', 'cfi', 'srmr'))

Collectively, these measures suggest adequate model fit.
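The idea of thresholds on an underlying continuous scale, mentioned above in connection with polychoric correlations, can be illustrated for a single item in base R. This is a simplified sketch: it assumes a standard normal latent variable and places each threshold at the normal quantile of the cumulative response proportion. The response counts are hypothetical, and lavaan's own estimation of thresholds and polychoric correlations involves more than this.

```r
# Hypothetical responses to one five-category item (100 respondents)
responses <- rep(1:5, times = c(10, 20, 40, 20, 10))

# Cumulative proportion of respondents at or below each category
cum_prop <- cumsum(table(responses)) / length(responses)

# A threshold sits between adjacent categories, so five categories give four
# thresholds; the last cumulative proportion (1.0) is dropped.
thresholds <- qnorm(cum_prop[-length(cum_prop)])
names(thresholds) <- paste0("t", seq_along(thresholds))
```

Here t1 marks the point on the latent scale that separates a response of 1 from a response of 2, and so on. Because the hypothetical counts are symmetric, the thresholds are symmetric around zero.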
It is worth noting here that the interpretation of fit measures largely comes from studies using maximum likelihood estimation, and there is some debate as to how well these generalize to other fitting methods.

The lavaan package also has the capability to use other estimators that treat the data as truly continuous in nature. This particular dataset is far from multivariate normally distributed, so an estimator such as ML is not really appropriate to use. However, if we wanted to do so, the syntax would be as follows:

    fit.phys.func.ML <- cfa(model.definition.1, data = phys.func.data, estimator = 'ML')

Comparing OpenMx to lavaan

It can be seen that lavaan has a much simpler syntax that allows us to rapidly specify basic SEM models. However, we were a bit unfair to OpenMx because we used a path model specification for lavaan and a matrix specification for OpenMx. The truth is that OpenMx is still probably a bit wordier than lavaan, but let's apply a path model specification in each to do a fair head-to-head comparison.

We will use the famous Holzinger-Swineford 1939 dataset from the lavaan package to do our modeling. We will copy it to a new object with a shorter name so that we don't have to keep typing HolzingerSwineford1939:

    hs.dat <- HolzingerSwineford1939

Explaining an example in lavaan

We will learn to fit the Holzinger-Swineford model in this section. We will start by specifying the SEM model using the lavaan model syntax:

    hs.model.lavaan <- '
        visual  =~ x1 + x2 + x3
        textual =~ x4 + x5 + x6
        speed   =~ x7 + x8 + x9

        visual ~~ textual
        visual ~~ speed
        textual ~~ speed
    '

    fit.hs.lavaan <- cfa(hs.model.lavaan, data = hs.dat, std.lv = TRUE)
    summary(fit.hs.lavaan)

Here, we add the std.lv argument to the fit function, which fixes the variance of the latent variables to 1. We do this instead of constraining the first factor loading on each latent variable to 1. Only the model coefficients are included for ease of viewing in this book.
The result is shown in the following output:

    > summary(fit.hs.lavaan)
    …
                       Estimate  Std.err  Z-value  P(>|z|)
    Latent variables:
      visual =~
        x1                0.900    0.081   11.127    0.000
        x2                0.498    0.077    6.429    0.000
        x3                0.656    0.074    8.817    0.000
      textual =~
        x4                0.990    0.057   17.474    0.000
        x5                1.102    0.063   17.576    0.000
        x6                0.917    0.054   17.082    0.000
      speed =~
        x7                0.619    0.070    8.903    0.000
        x8                0.731    0.066   11.090    0.000
        x9                0.670    0.065   10.305    0.000

    Covariances:
      visual ~~
        textual           0.459    0.064    7.189    0.000
        speed             0.471    0.073    6.461    0.000
      textual ~~
        speed             0.283    0.069    4.117    0.000

Let's compare these results with a model fit in OpenMx using the same dataset and SEM model.

Explaining an example in OpenMx

The OpenMx syntax for path specification is substantially longer and more explicit.
Let's take a look at the following model:

    hs.model.open.mx <- mxModel("Holzinger Swineford",
        type = "RAM",
        manifestVars = names(hs.dat)[7:15],
        latentVars = c('visual', 'textual', 'speed'),
        # Create paths from latent to observed variables
        mxPath(
            from = 'visual',
            to = c('x1', 'x2', 'x3'),
            free = c(TRUE, TRUE, TRUE),
            values = 1
        ),
        mxPath(
            from = 'textual',
            to = c('x4', 'x5', 'x6'),
            free = c(TRUE, TRUE, TRUE),
            values = 1
        ),
        mxPath(
            from = 'speed',
            to = c('x7', 'x8', 'x9'),
            free = c(TRUE, TRUE, TRUE),
            values = 1
        ),
        # Create covariances among latent variables
        mxPath(
            from = 'visual',
            to = 'textual',
            arrows = 2,
            free = TRUE
        ),
        mxPath(
            from = 'visual',
            to = 'speed',
            arrows = 2,
            free = TRUE
        ),
        mxPath(
            from = 'textual',
            to = 'speed',
            arrows = 2,
            free = TRUE
        ),
        # Create variance terms for the latent variables
        mxPath(
            from = c('visual', 'textual', 'speed'),
            arrows = 2,
            # Here we are fixing the latent variances to 1
            # These two lines are like std.lv = TRUE in lavaan
            free = c(FALSE, FALSE, FALSE),
            values = 1
        ),
        # Create residual variance terms for the observed variables
        mxPath(
            from = c('x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'),
            arrows = 2
        ),
        mxData(
            observed = cov(hs.dat[, c(7:15)]),
            type = "cov",
            numObs = 301
        )
    )

    fit.hs.open.mx <- mxRun(hs.model.open.mx)
    summary(fit.hs.open.mx)

Here are the results of the OpenMx model fit, which look very similar to lavaan's. This gives a long output.
For ease of viewing, only the most relevant parts of the output are included here (the last column that R prints, giving the standard error of the estimates, is not shown):

    > summary(fit.hs.open.mx)
    …
    free parameters:
                               name matrix     row     col  Estimate
    1   Holzinger Swineford.A[1,10]      A      x1  visual 0.9011177
    2   Holzinger Swineford.A[2,10]      A      x2  visual 0.4987688
    3   Holzinger Swineford.A[3,10]      A      x3  visual 0.6572487
    4   Holzinger Swineford.A[4,11]      A      x4 textual 0.9913408
    5   Holzinger Swineford.A[5,11]      A      x5 textual 1.1034381
    6   Holzinger Swineford.A[6,11]      A      x6 textual 0.9181265
    7   Holzinger Swineford.A[7,12]      A      x7   speed 0.6205055
    8   Holzinger Swineford.A[8,12]      A      x8   speed 0.7321655
    9   Holzinger Swineford.A[9,12]      A      x9   speed 0.6710954
    10   Holzinger Swineford.S[1,1]      S      x1      x1 0.5508846
    11   Holzinger Swineford.S[2,2]      S      x2      x2 1.1376195
    12   Holzinger Swineford.S[3,3]      S      x3      x3 0.8471385
    13   Holzinger Swineford.S[4,4]      S      x4      x4 0.3724102
    14   Holzinger Swineford.S[5,5]      S      x5      x5 0.4477426
    15   Holzinger Swineford.S[6,6]      S      x6      x6 0.3573899
    16   Holzinger Swineford.S[7,7]      S      x7      x7 0.8020562
    17   Holzinger Swineford.S[8,8]      S      x8      x8 0.4893230
    18   Holzinger Swineford.S[9,9]      S      x9      x9 0.5680182
    19 Holzinger Swineford.S[10,11]      S  visual textual 0.4585093
    20 Holzinger Swineford.S[10,12]      S  visual   speed 0.4705348
    21 Holzinger Swineford.S[11,12]      S textual   speed 0.2829848

In summary, the results agree quite closely. For example, looking at the coefficient for the path going from the latent variable visual to the observed variable x1, lavaan gives an estimate of 0.900 while OpenMx computes a value of 0.901.

Summary

The lavaan package is user friendly, pretty powerful, and constantly adding new features.
OpenMx, in contrast, has a steeper learning curve but tremendous flexibility in what it can do. Thus, lavaan is a bit like a large freezer full of prepackaged microwavable dinners, whereas OpenMx is like a well-stocked pantry with no prepared foods but a full kitchen that will let you prepare whatever you want if you have the time and the know-how. To run a quick analysis, it is tough to beat the simplicity of lavaan, especially given its wide range of capabilities. For large complex models, OpenMx may be a better choice. The methods covered here are useful for analyzing statistical relationships when one has all of the data from events that have already occurred.
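As a closing check on the head-to-head comparison, the nine factor loadings reported by the two packages can be compared numerically. The estimates below are copied from the two outputs shown earlier. The rescaling step rests on an assumption that the article does not state: that the lavaan fit works from a covariance matrix normalized by N, whereas cov() in R (used for the OpenMx fit) divides by N - 1.

```r
# Factor loadings copied from the lavaan and OpenMx summaries above
lavaan_est <- c(x1 = 0.900, x2 = 0.498, x3 = 0.656,
                x4 = 0.990, x5 = 1.102, x6 = 0.917,
                x7 = 0.619, x8 = 0.731, x9 = 0.670)
openmx_est <- c(x1 = 0.9011177, x2 = 0.4987688, x3 = 0.6572487,
                x4 = 0.9913408, x5 = 1.1034381, x6 = 0.9181265,
                x7 = 0.6205055, x8 = 0.7321655, x9 = 0.6710954)

# Largest absolute disagreement between the two sets of loadings
max_abs_diff <- max(abs(openmx_est - lavaan_est))

# Assumption: the gap comes from dividing by N versus N - 1. With latent
# variances fixed at 1, loadings scale with the square root of that ratio.
n_obs <- 301
rescaled <- openmx_est / sqrt(n_obs / (n_obs - 1))
matches_after_rescaling <- all(abs(round(rescaled, 3) - lavaan_est) < 1e-9)
```

The loadings differ by less than 0.002 before any adjustment, and under the stated assumption the rescaled OpenMx loadings round to the lavaan values, which suggests the two packages are fitting the same model.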

Guest Contributor
27 Sep 2018
7 min read

9 recommended blockchain online courses

Blockchain is reshaping the world as we know it. And we are not speaking metaphorically, because the new technology is really influencing everything from online security and data management to governance and smart contracting. Statistical reports support these claims. According to one study, the blockchain universe grows by over 40% annually, while almost 70% of banks are already experimenting with this technology.

IT experts at the Editing AussieWritings.com Services claim that the potential in this field is almost limitless: “Blockchain offers a myriad of practical possibilities, so you definitely want to get acquainted with it more thoroughly.”

Developers who are curious about blockchain can turn it into a lucrative career opportunity, since it gives them the chance to master the art of cryptography, hierarchical distribution, growth metrics, transparent management, and much more. There were 5,743 mostly full-time job openings calling for blockchain skills in the last 12 months, a 320% increase, while the biggest freelancing website, Upwork, reported more than 6,000% year-over-year growth. In this post, we will recommend our 9 best blockchain online courses. Let’s take a look!

Udemy

Udemy offers users one of the most comprehensive blockchain learning sources. The target audience is people who have heard a little bit about the latest developments in this field but want to understand more. This online course can help you to fully understand how the blockchain works, as well as get to grips with all that surrounds it. Udemy breaks down the course into several less complicated units, allowing you to figure out this complex system rather easily. It costs $19.99, but you can probably get it with a 40% discount. The one downside is that content quality and subject scope can vary depending on the instructor, but user reviews are a good way to gauge quality.
Each tutorial lasts approximately 30 minutes, but it also depends on your own tempo and style of work.

Pluralsight

Pluralsight is an excellent beginner-level blockchain course. It comes in three versions: Blockchain Fundamentals, Surveying Blockchain Technologies for Enterprise, and Introduction to Bitcoin and Decentralized Technology. Course duration varies from 80 to 200 minutes depending on the package. The price of Pluralsight is $29 a month or $299 a year. With either option, you are granted access to the entire library of documents, including course discussions, learning paths, channels, skill assessments, and other similar tools.

Packt Publishing

Packt Publishing has a wide portfolio of learning products on blockchain for varying levels of experience, from beginners to experts. What’s even more interesting is that you can choose your learning format: books, ebooks, videos, courses, and live courses. Or you could simply subscribe to MAPT, their library, to gain access to all products at a reasonable price of $29 monthly or $150 annually. It offers several books and videos on the leading blockchain technology. You can purchase 5 blockchain titles at a discounted rate of $50. Here’s the list of top blockchain courses offered by Packt Publishing:

  • Exploring Blockchain and Crypto-currencies: You will gain a foundational understanding of blockchain and crypto-currencies through various use cases.
  • Building Blockchain Projects: You will be able to develop real-time practical DApps with Ethereum and JavaScript.
  • Mastering Blockchain - Second Edition: You can learn about cryptography and cryptocurrencies, so you can build highly secure, decentralized applications and conduct trusted in-app transactions.
  • Hands-On Blockchain with Hyperledger: This book will help you leverage the power of Hyperledger Fabric to develop blockchain-based distributed ledgers with ease.
  • Learning Blockchain Application Development [video]: This interactive video will help you learn to build smart contracts and DApps on Ethereum.
  • Create Ethereum and Blockchain Applications using Solidity [video]: This video will help you learn about Ethereum, Solidity, DAO, ICO, Bitcoin, Altcoin, Website Security, Ripple, Litecoin, Smart Contracts, and Apps.

Cryptozombies

Cryptozombies is an online blockchain course based on gamification elements. The tool teaches you to write smart contracts in Solidity by building your own crypto-collectibles game. It is entirely Ethereum-focused, but you don’t need any previous experience to understand how Solidity works. A step-by-step guide explains even the smallest details, so you can quickly learn to create your own fully functional blockchain-based game. The best thing about Cryptozombies is that you can test it for free and give up in case you don’t like it.

Coursera

The blockchain is the epicenter of the cryptocurrency world, so it’s necessary to study it if you want to deal with Bitcoin and other digital currencies. Coursera is the leading online resource in the field of virtual currencies, so you might want to check it out. After a course like Blockchain Specialization, you’ll know everything you need to be able to separate fact from fiction when reading claims about Bitcoin and other cryptocurrencies. You’ll have the conceptual foundations you need to engineer secure software that interacts with the Bitcoin network. And you’ll be able to integrate ideas from Bitcoin into your own projects. The course has 4 parts spanning 4 weeks, but you can take each part separately. The price depends on the level and features you choose.

LinkedIn Learning (formerly known as Lynda)

LinkedIn Learning (what used to be Lynda) doesn't offer a specific blockchain course, but it does have a wide range of industry-related learning sources.
A search for ‘blockchain’ will present you with almost 100 relevant video courses. You can find all sorts of lessons here, from beginner to expert levels. Lynda allows you to customize the selection according to video duration, authors, software, subjects, etc. You can access the library for $15 a month.

B9Lab

B9Lab ETH-25 Certified Online Ethereum Developer Course is another course that promotes blockchain technology aimed at the Ethereum platform. It’s a 12-week in-depth learning solution that targets experienced programmers. B9Lab introduces everything there is to know about blockchain and how to build useful applications. Participants are taught about the Ethereum platform, the programming language Solidity, how to use web3 and the Truffle framework, and how to tie everything together. The price is €1450, or about $1700.

IBM

IBM made a self-paced blockchain course, titled Blockchain Essentials, that lasts over two hours. The video lectures and lab in this course help you learn about blockchain for business and explore key use cases that demonstrate how the technology adds value. You can learn how to leverage blockchain benefits, transform your business with the new technology, and transfer assets. Besides that, you get a nice wrap-up and a quiz to test your knowledge upon completion. IBM’s course is free of charge.

Khan Academy

Khan Academy is the last, but certainly not the least important, online course on our list. It gives users a comprehensive overview of blockchain-powered systems, particularly Bitcoin. Using this platform, you can learn more about cryptocurrency transactions, security, proof of work, etc. As an online education platform, Khan Academy won’t cost you a dime.

Blockchain is a groundbreaking technology that opens new boundaries in almost every field of business. It directly influences financial markets, data management, digital security, and a variety of other industries.
In this post, we presented the 9 best blockchain online courses you should try. These sources can teach you everything there is to know about the blockchain basics. Take some time to check them out and you won’t regret it!

Author Bio: Olivia is a passionate blogger who writes on topics of digital marketing, career, and self-development. She constantly tries to learn something new and to share this experience on various websites. Connect with her on Facebook and Twitter.

Packt
19 Aug 2011
3 min read

Oracle E-Business Suite: Creating Bank Accounts and Cash Forecasts

Oracle E-Business Suite 12 Financials Cookbook: Take the hard work out of your daily interactions with E-Business Suite financials by using the 50+ recipes from this cookbook

Introduction

The liquidity of an organization is managed in Oracle Cash Management; this includes the reconciliation of the cashbook to the bank statements, and forecasting future cash requirements. In this article, we will look at how to create bank accounts and cash forecasts. Cash Management integrates with Payables, Receivables, Payroll, Treasury, and General Ledger.

Let's start by looking at the cash management process: The Bank generates statements. The statements are sent to the organization electronically or by post. The Treasury Administrator loads the bank statement into Cash Management and verifies it. The statements can also be manually entered into Cash Management. The loaded statements are reconciled to the cash book transactions. The results are reviewed, and amended if required. The Treasury Administrator creates the journals for transactions in the General Ledger.

Creating bank accounts

Oracle Cash Management provides us with the functionality to create bank accounts. In this recipe, we will create a bank account for a bank called Shepherds Bank, for one of its branches called the Kings Cross branch.

Getting ready

Log in to Oracle E-Business Suite R12 with the username and password assigned to you by the system administrator. If you are working on the Vision demonstration database, you can use OPERATIONS/WELCOME as the USERNAME/PASSWORD. We also need to create a bank before we can create the bank account.

Let's look at how to create a bank and the branch: Select the Cash Management responsibility. Navigate to Setup | Banks | Banks. In the Banks tab, click on the Create button. Select the Create new bank option. In the Country field, enter United States. In the Bank Name field, enter Shepherds Bank.
In the Bank Number field, enter JN316. Click on the Finish button.

Let's create the branch and the address: Click the Create Branch icon. The Country and the Bank Name are automatically entered. Click on the Continue button. In the Branch Name field, enter Kings Cross. Select ABA as the Branch Type. Click on the Save and Next button to create the Branch address. In the Branch Address form, click on the Create button. In the Country field, enter United States. In the Address Line 1 field, enter 4234 Red Eagle Road. In the City field, enter Sacred Heart. In the County field, enter Renville. In the State field, enter MN. In the Postal Code field, enter 56285. Ensure that the Status field is Active. Click on the Apply button. Click on the Finish button.

Packt
20 May 2016
28 min read

Visualizations Using CCC

In this article by Miguel Gaspar, the author of the book Learning Pentaho CTools, you will learn about the Charts Component Library in detail. The Charts Components Library is not really a Pentaho plugin, but instead is a chart library that Webdetails created some years ago and that Pentaho started to use in the Analyzer visualizations. It allows a great level of customization by changing the properties that are applied to the charts, and it integrates perfectly with CDF, CDE, and CDA.

The dashboards that Webdetails creates make use of the CCC charts, usually with a great level of customization. Customizing them is a way to make them fancy and really good-looking and, even more importantly, a way to create a visualization that best fits the customer's or end user's needs. We really should be focused on having the best visualizations for the end user, and CCC is one of the best ways to achieve this, but to do this you need to have a very deep knowledge of the library and know how to get amazing results.

I think I could write an entire book just about CCC, and in this article I will only be able to cover a small part of what I like, but I will try to focus on the basics and give you some tips and tricks that could make a difference. I'll be happy if I can give you some directions that you can follow, so that you can keep searching and learning about CCC.

An important part of CCC is understanding some properties such as the series in rows or the crosstab mode, because that is where people usually struggle at the start. When you can't find a property to change some styling, functionality, or behavior of the charts, you might find a way to extend the options by using something called extension points, so we will also cover them.

I also find the interaction within the dashboard to be an important feature. So we will look at how to use it, and you will see that it's very simple.
In this article, you will learn how to:

  • Understand the properties needed to adapt the chart to your data source results
  • Use the properties of a CCC chart
  • Create a CCC chart by using the JavaScript library
  • Make use of internationalization of CCC charts
  • See how to handle clicks on charts
  • Scale the base axis
  • Customize the tooltips

Some background on CCC

CCC is built on top of Protovis, a JavaScript library that allows you to produce visualizations based on simple marks such as bars, dots, and lines, among others, which are created through dynamic properties based on the data to be represented. You can get more information on this at: http://mbostock.github.io/protovis/.

If you want to extend the charts with some elements that are not available, you can, but it would be useful to have an idea about how Protovis works. CCC has a great website, which is available at http://www.webdetails.pt/ctools/ccc/, where you can see some samples including the source code. On the page, you can edit the code, change some properties, and click the apply button. If the code is valid, you will see your chart update. As well as that, it provides documentation for almost all of the properties and options that CCC makes available.

Making use of the CCC library in a CDF dashboard

As CCC is a chart library, you can use it as you would on any other webpage, in the same way as the samples on the CCC webpages. But CDF also provides components that you can use to place a CCC chart on a dashboard and fully integrate it with the life cycle of the dashboard.
To use a CCC chart on a CDF dashboard, the HTML that is invoked from the XCDF file would look like the following (as we already covered how to build a CDF dashboard, I will not focus on that, and will mainly focus on the JavaScript code):

    <div class="row">
      <div class="col-xs-12">
        <div id="chart"></div>
      </div>
    </div>
    <script language="javascript" type="text/javascript">
      require(['cdf/Dashboard.Bootstrap', 'cdf/components/CccBarChartComponent'],
        function(Dashboard, CccBarChartComponent) {
          var dashboard = new Dashboard();
          var chart = new CccBarChartComponent({
            type: "cccBarChart",
            name: "cccChart",
            executeAtStart: true,
            htmlObject: "chart",      // id of the element where the chart is rendered
            chartDefinition: {
              height: 200,
              path: "/public/…/queries.cda",
              dataAccessId: "totalSalesQuery",
              crosstabMode: true,
              seriesInRows: false,
              timeSeries: false,
              plotFrameVisible: false,
              compatVersion: 2
            }
          });
          dashboard.addComponent(chart);
          dashboard.init();
        });
    </script>

The most important thing here is the use of the CCC chart component, which in this example is a bar chart. We can see this from the object that we are instantiating, CccBarChartComponent, and also from the type, which is cccBarChart. The previous dashboard will execute the query specified as dataAccessId of the CDA file set on the property path, and render the chart on the dashboard. We are also saying that its data comes from the query in crosstab mode, but the base axis should not be a timeSeries. The series are in the columns, but don't worry about this, as we'll be covering it later.

The existing CCC components that you are able to use out of the box inside CDF dashboards are as follows. Don't forget that CCC has plenty of charts, so the sample images that you will see in the following table are just one example of the type of charts you can achieve.
    CCC Component                   Chart Type             Sample Chart
    CccAreaChartComponent           cccAreaChart
    CccBarChartComponent            cccBarChart            http://www.webdetails.pt/ctools/ccc/#type=bar
    CccBoxplotChartComponent        cccBoxplotChart        http://www.webdetails.pt/ctools/ccc/#type=boxplot
    CccBulletChartComponent         cccBulletChart         http://www.webdetails.pt/ctools/ccc/#type=bullet
    CccDotChartComponent            cccDotChart            http://www.webdetails.pt/ctools/ccc/#type=dot
    CccHeatGridChartComponent       cccHeatGridChart       http://www.webdetails.pt/ctools/ccc/#type=heatgrid
    CccLineChartComponent           cccLineChart           http://www.webdetails.pt/ctools/ccc/#type=line
    CccMetricDotChartComponent      cccMetricDotChart      http://www.webdetails.pt/ctools/ccc/#type=metricdot
    CccMetricLineChartComponent     cccMetricLineChart
    CccNormalizedBarChartComponent  cccNormalizedBarChart
    CccParCoordChartComponent       cccParCoordChart
    CccPieChartComponent            cccPieChart            http://www.webdetails.pt/ctools/ccc/#type=pie
    CccStackedAreaChartComponent    cccStackedAreaChart    http://www.webdetails.pt/ctools/ccc/#type=stackedarea
    CccStackedDotChartComponent     cccStackedDotChart
    CccStackedLineChartComponent    cccStackedLineChart    http://www.webdetails.pt/ctools/ccc/#type=stackedline
    CccSunburstChartComponent       cccSunburstChart       http://www.webdetails.pt/ctools/ccc/#type=sunburst
    CccTreemapAreaChartComponent    cccTreemapAreaChart    http://www.webdetails.pt/ctools/ccc/#type=treemap
    CccWaterfallAreaChartComponent  cccWaterfallAreaChart  http://www.webdetails.pt/ctools/ccc/#type=waterfall

In the sample code, you will find a property called compatVersion that has a value of 2 set. This will make CCC work as a revamped version that delivers more options and a lot of improvements, and is easier to use.

Mandatory and desirable properties

Besides name, datasource, and htmlObject, there are other chart properties that are mandatory. The height is really important, because if you don't set the height of the chart, the chart will not fit in the dashboard.
The height should also be specified in pixels. If you don't set the width of the component, the chart will take the width of the element where it is rendered or, to be more precise, the width of the HTML element with the name specified in the htmlObject property.

The seriesInRows, crosstabMode, and timeSeries properties are optional, but depending on the kind of chart you are generating, you might want to specify them. The use of these properties becomes clear if we can also see the output of the queries we are executing. We need to get deeper into the properties that are related to the mapping of data to visual elements.

Mapping data

We need to be aware of the way that data mapping is done in the chart. You can understand how it works if you imagine the data input as a table. CCC can receive the data in two different structures: relational and crosstab. If CCC receives data as a crosstab, it will translate it to a relational structure. You can see this in the following examples.

Crosstab

The following table is an example of the crosstab data structure:

                  Column Data 1      Column Data 2
    Row Data 1    Measure Data 1.1   Measure Data 1.2
    Row Data 2    Measure Data 2.1   Measure Data 2.2

Creating crosstab queries

To create a crosstab query, you can usually do this with grouping when using SQL, or just use MDX, which allows us to easily specify a set for the columns and for the rows.

Just by looking at the previous and following examples, you should be able to understand that in the crosstab structure (the previous), columns and rows are part of the result set, while in the relational format (the following), the column headers are not part of the result set, but are part of the metadata that is returned from the query.
The relational format is as follows:

    Column          Row          Value
    Column Data 1   Row Data 1   Measure Data 1.1
    Column Data 2   Row Data 1   Measure Data 1.2
    Column Data 1   Row Data 2   Measure Data 2.1
    Column Data 2   Row Data 2   Measure Data 2.2

The preceding two data structures represent the options when setting the properties crosstabMode and seriesInRows.

The crosstabMode property

To better understand these concepts, we will make use of a real example. The crosstabMode property is easy to understand when comparing the two tables that represent the results of two queries.

Non-crosstab (Relational):

    Markets   Sales
    APAC      1281705
    EMEA      50028224
    Japan     503957
    NA        3852061

Crosstab:

    Markets   2003    2004    2005
    APAC      3529    5938    3411
    EMEA      16711   23630   9237
    Japan     2851    1692    380
    NA        13348   18157   6447

In the first table, you can find the values of sales for each of the territories. The values presented depend on only one variable, the territory. We can say that we are able to get all the information just by looking at the rows, where we can see a direct connection between markets and the sales value.

In the second table, you will find a value for each territory/year, meaning that the values presented in the matrix are dependent on two variables, which are the territory in the rows and the years in the columns. Here we need both the rows and the columns to know what each one of the values represents. Relevant information can be found in the rows and the columns, so this is a crosstab.

Crosstabs display the joint distribution of two or more variables, and are usually represented in the form of a contingency table in a matrix. When the result of a query is dependent only on one variable, then you should set the crosstabMode property to false.
When it is dependent on two or more variables, you should set the crosstabMode property to true; otherwise, CCC will just use the first two columns, as in the non-crosstab example.

The seriesInRows property

Now let's use the same example where we have a crosstab. The image for this example shows two charts: the one on the left is a crosstab with the series in the rows, and the one on the right is also a crosstab, but the series are not in the rows (the series are in the columns). When crosstabMode is set to true, it means that the measure column title can be translated as a series or a category, and that's determined by the property seriesInRows. If this property is set to true, then it will read the series from the rows; otherwise, it will read the series from the columns.

If crosstabMode is set to false, the community chart component expects a row to correspond to exactly one data point, and two or three columns can be returned. When three columns are returned, they can be category, series, and data, or series, category, and data, and that's determined by the seriesInRows property. When set to true, CCC will expect the structure to have three columns such as category, series, and data. When it is set to false, it will expect them to be series, category, and data.

A simple table should give you a quicker reference, so here goes:

    crosstabMode   seriesInRows   Description
    true           true           The column titles will act as category values, while the series values are represented as data points of the first column.
    true           false          The column titles will act as series values, while the category values are represented as data points of the first column.
    false          true           Each row is one data point; three columns are expected in the order category, series, and data.
    false          false          Each row is one data point; three columns are expected in the order series, category, and data.
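To make the relationship between the two structures more concrete, here is a minimal sketch in plain JavaScript of how a crosstab could be unpivoted into one row per data point. This is not CCC's own code: the function name crosstabToRelational, the input layout (a header array plus an array of rows), and the output objects are assumptions made for illustration. The seriesInRows argument only mimics the behavior described above for crosstab data.

```javascript
// Hypothetical helper (not part of CCC): unpivot a crosstab into relational rows.
// `header` holds the column titles; each row is [rowLabel, measure1, measure2, ...].
function crosstabToRelational(header, rows, seriesInRows) {
  var columnTitles = header.slice(1); // drop the title of the first column
  var result = [];
  rows.forEach(function (row) {
    var rowLabel = row[0];
    columnTitles.forEach(function (title, i) {
      var value = row[i + 1];
      // seriesInRows = true:  series come from the rows, categories from the column titles
      // seriesInRows = false: series come from the column titles, categories from the rows
      result.push(seriesInRows
        ? { series: rowLabel, category: title, value: value }
        : { series: title, category: rowLabel, value: value });
    });
  });
  return result;
}

// The crosstab sales example from this article.
var header = ['Markets', '2003', '2004', '2005'];
var rows = [
  ['APAC', 3529, 5938, 3411],
  ['EMEA', 16711, 23630, 9237],
  ['Japan', 2851, 1692, 380],
  ['NA', 13348, 18157, 6447]
];

var points = crosstabToRelational(header, rows, false);
console.log(points.length); // 12 data points: 4 markets x 3 years
console.log(points[4]);     // { series: '2004', category: 'EMEA', value: 23630 }
```

Calling the same function with seriesInRows set to true swaps the roles, so the markets become the series and the years become the categories.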
The timeSeries and timeSeriesFormat properties

The timeSeries property defines whether the data to be represented by the chart is discrete or continuous. If we want to present some values over time, then the timeSeries property should be set to true. When we set the chart to be timeSeries, we also need to set another property to tell CCC how it should interpret the dates that come from the query. Check out the following image for timeSeries and timeSeriesFormat:

The result of one of the queries has the year and the abbreviated month name separated by -, like 2015-Nov. For the chart to understand it as a date, we need to specify the format by setting the property timeSeriesFormat, which in our example would be %Y-%b, where %Y is the year represented by four digits, and %b is the abbreviated month name.

The format should be specified using the Protovis format, which follows the same format as strftime in the C programming language, aside from some unsupported options. To find out what options are available, you should take a look at the documentation, which you will find at: https://mbostock.github.io/protovis/jsdoc/symbols/pv.Format.date.html.

Making use of CCC in CDE

There are a lot of properties that will use a default value, and you can find out about them by looking at the documentation or inspecting the code that is generated by CDE when you use the chart components. By looking at the console log of your browser, you should also be able to get some information about the properties being used by default and/or see whether you are using a property that does not fit your needs.

The use of CCC charts in CDE is simpler, just because you may not need to code. I am only saying may because, to achieve quicker results, you may apply some code and make it easier to share properties among different charts or types of charts.
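As a rough illustration of sharing properties among charts, the following plain JavaScript sketch keeps the common options in one object and merges them with per-chart overrides. The names sharedChartOptions and withSharedOptions, as well as the override values, are hypothetical and chosen only for this example; the property names themselves (height, plotFrameVisible, compatVersion, crosstabMode, seriesInRows, timeSeries, timeSeriesFormat) are the ones already used in this article.

```javascript
// Hypothetical shared defaults, reused by every chart on the dashboard.
var sharedChartOptions = {
  height: 200,
  plotFrameVisible: false,
  compatVersion: 2
};

// Merge the shared defaults with chart-specific overrides into a new object,
// so that neither input object is modified.
function withSharedOptions(overrides) {
  return Object.assign({}, sharedChartOptions, overrides);
}

var barOptions = withSharedOptions({ crosstabMode: true, seriesInRows: false });
var lineOptions = withSharedOptions({ timeSeries: true, timeSeriesFormat: '%Y-%b', height: 300 });

console.log(barOptions.height);  // 200, taken from the shared defaults
console.log(lineOptions.height); // 300, the override wins
```

Each merged object could then serve as the chartDefinition of one chart, so a change to the shared defaults reaches all charts at once.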
To use a CCC chart, you just need to select the property that you need to change and set its value by using the dropdown or by just typing the value:

The previous image shows a group of properties with the respective values on the right side.

One of the best ways to start to get used to the CCC properties is to use the CCC page available as part of the Webdetails page: http://www.webdetails.pt/ctools/ccc. There you will find samples and the properties that are being used for each of the charts. You can use the dropdown to select different kinds of charts from all those that are available inside CCC. You also have the ability to change the properties and update the chart to check the result immediately. What I usually do, as it's easier and faster, is to change the properties there, check the results, and then apply the necessary values for each of the properties to the CCC charts inside the dashboards. Following the samples, you will also find documentation about the properties, separated by sections of the chart, and after that you will find the extension points. On the site, when you click on a property or option, you will be redirected to another page where you will find the documentation and how to use it.

Changing properties in the preExecution or postFetch

We are able to change the properties for the charts, as with any other component. Inside the preExecution, this refers to the component itself, so we will have access to the chart's main object, which we can also manipulate to add, remove, and change options.
For instance, you can apply the following code:

    function() {
        // Properties to add to, or change in, the chart definition
        var cdProps = {
            dotsVisible: true,
            plotFrame_strokeStyle: '#bbbbbb',
            colors: ['#005CA7', '#FFC20F', '#333333', '#68AC2D']
        };
        // Deep-merge the properties into the existing chart definition
        $.extend(true, this.chartDefinition, cdProps);
    }

What we are doing is creating an object with all the properties that we want to add or change for the chart, and then extending the chartDefinition (where the properties or options are). This is what we are doing with the jQuery function extend.

Use the CCC website and make your life easier

This way of applying options makes it easier to set the properties. Just change or add the properties that you need, test them, and when you're happy with the result, copy them into the object that will extend/overwrite the chart options. Just keep in mind that the properties you change directly in the editor will be overwritten by the ones defined in the preExecution, if they match each other of course.

Why is this important? It's because not all the properties that you can apply to CCC are exposed in CDE, so you can use the preExecution to set those properties.

Handling the click event

One important thing about the charts is that they allow interaction. CCC provides a way to handle some events in the chart, and click is one of those events. To have it working, we need to change two properties: clickable, which needs to be set to true, and clickAction, where we need to write a function with the code to be executed when a click happens. The function receives one argument that is usually referred to as a scene. The scene is an object that has a lot of information about the context where the event happened. From the object you will have access to vars, another object where we can find the series and the categories where the click happened.
We can use the function to get the series/categories being clicked and perform a fireChange that can trigger updates on other components:

    function(scene) {
        var series = "Series:" + scene.atoms.series.label;
        var category = "Category:" + scene.vars.category.label;
        var value = "Value:" + scene.vars.value.label;
        Logger.log(category + "&" + value);
        Logger.log(series);
    }

In the previous code example, you can find the function to handle the click action for a CCC chart. When the click happens, the code is executed, and a variable with the clicked series is taken from scene.atoms.series.label. As well as this, the clicked category is taken from scene.vars.category.label, and the value that crosses the same series/category from scene.vars.value.label. This is valid for a crosstab, but you will not find the series when it's non-crosstab.

You can think of a scene as describing one instance of visual representation. It is generally local to each panel or section of the chart, and it's represented by a group of variables that are organized hierarchically. Depending on the scene, it may contain one or many datums. And you must be asking what the hell a datum is. A datum represents a row, so it contains values for multiple columns. We can also see from the example that we are referring to atoms, which hold at least a value, a label, and a key of a column. To get a better understanding of what I am talking about, you should set a breakpoint anywhere in the code of the previous function and explore the scene object.
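If a debugger is not at hand, the same field accesses can also be tried outside the dashboard. The following is only a sketch: mockScene is a hand-built object that imitates just the fields used above (vars.category, vars.value, and atoms.series), whereas a real CCC scene carries far more information, and describeClick is a hypothetical helper name.

```javascript
// Hypothetical helper: build a readable description of a clicked data point.
// It tolerates a missing series, as happens with non-crosstab data.
function describeClick(scene) {
  var parts = [
    'Category:' + scene.vars.category.label,
    'Value:' + scene.vars.value.label
  ];
  var seriesAtom = scene.atoms && scene.atoms.series;
  if (seriesAtom) {
    parts.push('Series:' + seriesAtom.label);
  }
  return parts.join(' & ');
}

// Hand-built stand-in for a scene; this is NOT the real CCC object.
var mockScene = {
  vars: {
    category: { value: '2004', label: '2004' },
    value: { value: 23630, label: '23630' }
  },
  atoms: {
    series: { value: 'EMEA', label: 'EMEA' }
  }
};

console.log(describeClick(mockScene));
// Category:2004 & Value:23630 & Series:EMEA
```

Guarding the series access in this way lets one handler serve both crosstab and non-crosstab charts.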
In the click handler shown earlier, you would be able to access the category, the series labels, and the value, as you can see in the following table:

                Crosstab                                                  Non-crosstab
    Value       scene.vars.value.label or scene.getValue();               scene.vars.value.label or scene.getValue();
    Category    scene.vars.category.label or scene.getCategoryLabel();    scene.vars.category.label or scene.getCategoryLabel();
    Series      scene.atoms.series.label or scene.getSeriesLabel()

For instance, if you add the previous function code to a chart that is a crosstab where the categories are the years and the series are the territories, and you click on the chart, the output would be something like:

    [info] WD: Category:2004 & Value:23630
    [info] WD: Series:EMEA

This means that you clicked on the year 2004 for EMEA. EMEA sales for the year 2004 were 23,630.

If you replace the Logger functions with fireChange as follows, you will be able to make use of the label/value of the clicked category to render other components and some details about them:

    this.dashboard.fireChange("parameter", scene.vars.category.label);

Internationalization of CCC Charts

We already saw that the values coming from the database should not need to be translated. There are some ways in Pentaho to do this, but we may still need to set the title of a chart, where the title should also be internationalized. Another case is when you have dates where the month is represented by numbers in the base axis, but you want to display the month's abbreviated name. This name could also be translated to different languages, which is not hard.
For the title, subtitle, and legend, the way to do it is to set the properties on preExecution. First, you will need to define the properties files for the internationalization and set the properties/translations:

var cd = this.chartDefinition;
cd.title = this.dashboard.i18nSupport.prop('BOTTOMCHART.TITLE');

To change the title of the chart based on the language defined, we need a JavaScript instruction to look up the text. We can't use the title property on the chart for this, because it only lets you define a string. If you put the previous example code in the preExecution of the chart, you will be able to.

It may also make sense to change not only the titles but, for instance, to internationalize the month names as well. If you are getting data like 2004-02, this may correspond to a time series format of %Y-%m. If that's the case and you want to display the abbreviated month name, then you may use the baseAxisTickFormatter and the dateFormat function from the dashboard utilities, also known as Utils. The code to write inside the preExecution would be like:

var cd = this.chartDefinition;
cd.baseAxisTickFormatter = function(label) {
    // 'YYYY-MM' parses labels such as 2004-02 (in moment, lowercase 'mm' means minutes)
    return Utils.dateFormat(moment(label, 'YYYY-MM'), 'MMM/YYYY');
};

The preceding code uses the baseAxisTickFormatter, which allows you to write a function that receives one argument, named label in the code because it holds the label of each base axis tick. We are using the dateFormat method and moment to format and return the abbreviated month name followed by the year. You can find out which language is defined and in use by running the following instruction:

moment.locale();

If you need to, you can change the language.

Format a base axis label based on the scale

When you are working with a time series chart, you may want to set a different format for the base axis labels.
Let's suppose you want to have a chart that is listening to a time selector. If you select a year's worth of data to be displayed on the chart, you are certainly not interested in seeing the minutes on the date label. However, if you want to display the last hour, the ticks of the base axis need to be presented in minutes. There is an extension point we can use to get a conditional format based on the scale of the base axis. The extension point is baseAxisScale_tickFormatter, and it can be used as in the following code:

baseAxisScale_tickFormatter: function(value, dateTickPrecision) {
    switch(dateTickPrecision) {
        case pvc.time.intervals.y:
            return format_date_year_tick(value);
        case pvc.time.intervals.m:
            return format_date_month_tick(value);
        case pvc.time.intervals.H:
            return format_date_hour_tick(value);
        default:
            return format_date_default_tick(value);
    }
}

It accepts a function with two arguments, the value to be formatted and the tick precision, and should return the formatted label to be presented on each tick of the base axis. The previous code shows how the function is used. You can see a switch that applies a different format depending on the base axis scale, calling a function in each case. The functions in the code are not predefined; we need to write them, or the code that does the formatting, ourselves. For example, such a function could use the Utils dateFormat function to return the formatted value to the chart.
The following table shows the intervals that can be used when verifying which time intervals are being displayed on the chart:

Interval   Description    Number representing the interval
y          Year           31536e6
m          Month          2592e6
d30        30 days        2592e6
d7         7 days         6048e5
d          Day            864e5
H          Hour           36e5
m          Minute         6e4
s          Second         1e3
ms         Milliseconds   1

Customizing tooltips

CCC lets you change the default tooltip format using the tooltipFormat property. We can change it to look like the right side of the following image. You can compare it to the one on the left, which is the default:

The default tooltip format might change depending on the chart type, but also on some options that you apply to the chart, mainly crosstabMode and seriesInRows. The property accepts a function that receives one argument, the scene, which has a structure similar to the one already covered for the click event. You should return the HTML to be shown on the dashboard when we hover over the chart. In the previous image, the chart on the left side shows the default tooltip, and the one on the right a different tooltip. That's because the following code was applied:

tooltipFormat: function(scene){
    var year = scene.atoms.series.label;
    var territory = scene.atoms.category.value;
    var sales = Utils.numberFormat(scene.vars.value.value, "#.00A");
    var html = '<html>' +
        '<div>Sales for ' + year + ' at ' + territory + ':' + sales + '</div>' +
        '</html>';
    return html;
}

The code is pretty self-explanatory. First we set some variables, such as the year, the territory, and the sales value, which we need to present inside the tooltip. As in the click event, we get the labels/value from the scene, which might depend on the properties we set for the chart. For the sales, we also abbreviate the number, using two decimal places. Last, we build the HTML to be displayed when we hover over the chart.
You can also change the base axis tooltip

Just as we do for the tooltip shown when hovering over the values represented in the chart, we can also customize the base axis tooltip with baseAxisTooltip; just don't forget that baseAxisTooltipVisible must be set to true (the default value). Getting the values to show is pretty similar. It can get more complex, though not much more, when we also want, for instance, to display the total value of sales for one year or for the territory. Based on that, we could also present the percentage relative to the total. We should use the property as explained earlier.

The previous image is one example of how we can customize a tooltip. In this case, we are showing the value, but also the percentage that the hovered-over territory represents (as a percentage of all the years) and the percentage for the hovered-over year (as a percentage of all the territories):

tooltipFormat: function(scene){
    var year = scene.getSeriesLabel();
    var territory = scene.getCategoryLabel();
    var value = scene.getValue();
    var sales = Utils.numberFormat(value, "#.00A");
    var totals = {};
    _.each(scene.chart().data._datums, function(element) {
        var value = element.atoms.value.value;
        totals[element.atoms.category.label] =
            (totals[element.atoms.category.label]||0)+value;
        totals[element.atoms.series.label] =
            (totals[element.atoms.series.label]||0)+value;
    });
    var categoryPerc = Utils.numberFormat(value/totals[territory], "0.0%");
    var seriesPerc = Utils.numberFormat(value/totals[year], "0.0%");
    var html = '<html>' +
        '<div class="value">'+sales+'</div>' +
        '<div class="dValue">Sales for '+territory+' in '+year+'</div>' +
        '<div class="bar">'+
            '<div class="pPerc">'+categoryPerc+' of '+territory+'</div>'+
            '<div class="partialBar" style="width:'+categoryPerc+'"></div>'+
        '</div>' +
        '<div class="bar">'+
            '<div class="pPerc">'+seriesPerc+' of '+year+'</div>'+
            '<div class="partialBar" style="width:'+seriesPerc+'"></div>'+
        '</div>' +
        '</html>';
    return html;
}

The first
lines of the code are pretty similar, except that we are using scene.getSeriesLabel() in place of scene.atoms.series.label. They do the same thing, so these are only different ways to get the values/labels. Then the totals are calculated by iterating over all the elements of scene.chart().data._datums, which returns the logical/relational table, a combination of the territories, years, and values. The last part just builds the HTML with all the values and labels that we already got from the scene. There are multiple ways to get the values you need, for instance to customize the tooltip; you just need to explore the hierarchical structure of the scene and get used to it. The image that you are seeing also presents a different style, and that should be done using CSS. You can add CSS for your dashboard and change the style of the tooltip, not just the format.

Styling tooltips

When we want to style a tooltip, we may want to use the browser's developer tools to check the classes or names and the CSS properties already applied, but it's hard because the popup does not stay still. We can change the tooltipDelayOut property and increase it from its default value of 80 to 1000 or more, depending on the time you need. When you want to apply some styles to the tooltips of a particular chart, you can do so by setting a CSS class on the tooltip. For that, use the tooltipClassName property to set the class name to be added and later used in the CSS.

Summary

In this article, we provided a quick overview of how to use CCC in CDF and CDE dashboards and showed you what kinds of charts are available. We covered some of the base options as well as some advanced options that you might use to get a more customized visualization.

Packt
21 Sep 2015
18 min read

Scraping the Data

In this article by Richard Lawson, author of the book Web Scraping with Python, we will first cover a browser extension called Firebug Lite to examine a web page, which you may already be familiar with if you have a web development background. Then, we will walk through three approaches to extract data from a web page using regular expressions, Beautiful Soup and lxml. Finally, the article will conclude with a comparison of these three scraping alternatives. (For more resources related to this topic, see here.) Analyzing a web page To understand how a web page is structured, we can try examining the source code. In most web browsers, the source code of a web page can be viewed by right-clicking on the page and selecting the View page source option: The data we are interested in is found in this part of the HTML: <table> <tr id="places_national_flag__row"><td class="w2p_fl"><label for="places_national_flag" id="places_national_flag__label">National Flag: </label></td><td class="w2p_fw"><img src="/places/static/images/flags/gb.png" /></td><td class="w2p_fc"></td></tr> … <tr id="places_neighbours__row"><td class="w2p_fl"><label for="places_neighbours" id="places_neighbours__label">Neighbours: </label></td><td class="w2p_fw"><div><a href="/iso/IE">IE </a></div></td><td class="w2p_fc"></td></tr></table> This lack of whitespace and formatting is not an issue for a web browser to interpret, but it is difficult for us. To help us interpret this table, we will use the Firebug Lite extension, which is available for all web browsers at https://getfirebug.com/firebuglite. Firefox users can install the full Firebug extension if preferred, but the features we will use here are included in the Lite version. 
Now, with Firebug Lite installed, we can right-click on the part of the web page we are interested in scraping and select Inspect with Firebug Lite from the context menu, as shown here:

This will open a panel showing the surrounding HTML hierarchy of the selected element:

In the preceding screenshot, the country attribute was clicked on, and the Firebug panel makes it clear that the country area figure is included within a <td> element of class w2p_fw, which is the child of a <tr> element of ID places_area__row. We now have all the information needed to scrape the area data.

Three approaches to scrape a web page

Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular BeautifulSoup module, and finally with the powerful lxml module.

Regular expressions

If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/2/howto/regex.html. To scrape the area using regular expressions, we will first try matching the contents of the <td> element, as follows:

>>> import re
>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> html = download(url)
>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)
['<img src="/places/static/images/flags/gb.png" />', '244,820 square kilometres', '62,348,447', 'GB', 'United Kingdom', 'London', '<a href="/continent/EU">EU</a>', '.uk', 'GBP', 'Pound', '44', '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA', '^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2} [A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2}) |([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$', 'en-GB,cy-GB,gd', '<div><a href="/iso/IE">IE </a></div>']

This result shows that the <td class="w2p_fw"> tag is used for multiple country attributes.
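Since the download() helper is not defined in this excerpt, the same findall() call can be tried on a small inline string. The following is only a sketch in Python 3 syntax: SAMPLE_HTML is a made-up fragment that imitates two rows of the country table, not the real page. It also shows why the pattern uses the non-greedy (.*?) group rather than a greedy one.

```python
import re

# Made-up fragment that mimics two rows of the country table (an assumption,
# not the real page).
SAMPLE_HTML = (
    '<table>'
    '<tr id="places_area__row"><td class="w2p_fl">Area: </td>'
    '<td class="w2p_fw">244,820 square kilometres</td></tr>'
    '<tr id="places_population__row"><td class="w2p_fl">Population: </td>'
    '<td class="w2p_fw">62,348,447</td></tr>'
    '</table>'
)

# (.*?) is non-greedy: each match stops at the first closing </td>.
values = re.findall(r'<td class="w2p_fw">(.*?)</td>', SAMPLE_HTML)
print(values)

# A greedy (.*) would instead run on to the last </td> in the string,
# swallowing both rows into a single match.
greedy = re.findall(r'<td class="w2p_fw">(.*)</td>', SAMPLE_HTML)
print(len(greedy))
```

With the non-greedy group, each table cell comes back as its own list item, which is what makes indexing into the result meaningful.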
To isolate the area, we can select the second element, as follows:

>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
'244,820 square kilometres'

This solution works but could easily fail if the web page is updated. Consider what happens if the website is updated and the area data is no longer in the second matching cell. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data in the future, we want our solution to be as robust against layout changes as possible. To make this regular expression more robust, we can include the parent <tr> element, which has an ID, so it ought to be unique:

>>> re.findall('<tr id="places_area__row"><td class="w2p_fl"><label for="places_area" id="places_area__label">Area: </label></td><td class="w2p_fw">(.*?)</td>', html)
['244,820 square kilometres']

This iteration is better; however, there are many other ways the web page could be updated that would still break the regular expression. For example, double quotation marks might be changed to single, extra space could be added between the <td> tags, or the area_label could be changed. Here is an improved version that tries to support these various possibilities:

>>> re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)[0]
'244,820 square kilometres'

This regular expression is more future-proof but is difficult to construct and is becoming unreadable. Also, there are still other minor layout changes that would break it, such as a title attribute being added to the <td> tag. From this example, it is clear that regular expressions provide a simple way to scrape data but are too brittle and will easily break when a web page is updated. Fortunately, there are better solutions.

Beautiful Soup

Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate content.
If you do not already have it installed, the latest version can be installed using this command: pip install beautifulsoup4 The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Most web pages do not contain perfectly valid HTML and Beautiful Soup needs to decide what is intended. For example, consider this simple web page of a list with missing attribute quotes and closing tags:       <ul class=country> <li>Area <li>Population </ul> If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping. Let us see how Beautiful Soup handles this: >>> from bs4 import BeautifulSoup >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' >>> # parse the HTML >>> soup = BeautifulSoup(broken_html, 'html.parser') >>> fixed_html = soup.prettify() >>> print fixed_html <html> <body> <ul class="country"> <li>Area</li> <li>Population</li> </ul> </body> </html> Here, BeautifulSoup was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. Now, we can navigate to the elements we want using the find() and find_all() methods: >>> ul = soup.find('ul', attrs={'class':'country'}) >>> ul.find('li') # returns just the first match <li>Area</li> >>> ul.find_all('li') # returns all matches [<li>Area</li>, <li>Population</li>] Beautiful Soup overview Here are the common methods and parameters you will use when scraping web pages with Beautiful Soup: BeautifulSoup(markup, builder): This method creates the soup object. The markup parameter can be a string or file object, and builder is the library that parses the markup parameter. find_all(name, attrs, text, **kwargs): This method returns a list of elements matching the given tag name, dictionary of attributes, and text. The contents of kwargs are used to match attributes. 
find(name, attrs, text, **kwargs): This method is the same as find_all(), except that it returns only the first match. If no element matches, it returns None.
prettify(): This method returns the parsed HTML in an easy-to-read format with indentation and line breaks.

For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/. Now, using these techniques, here is a full example to extract the area from our example country:

>>> from bs4 import BeautifulSoup
>>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
>>> html = download(url)
>>> soup = BeautifulSoup(html)
>>> # locate the area row
>>> tr = soup.find(attrs={'id':'places_area__row'})
>>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the area tag
>>> area = td.text # extract the text from this tag
>>> print area
244,820 square kilometres

This code is more verbose than regular expressions but easier to construct and understand. Also, we no longer need to worry about minor layout changes, such as extra whitespace or tag attributes.

Lxml

Lxml is a Python wrapper on top of the libxml2 XML parsing library written in C, which makes it faster than Beautiful Soup but also harder to install on some computers. The latest installation instructions are available at http://lxml.de/installation.html. As with Beautiful Soup, the first step is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML:

>>> import lxml.html
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> tree = lxml.html.fromstring(broken_html) # parse the HTML
>>> fixed_html = lxml.html.tostring(tree, pretty_print=True)
>>> print fixed_html
<ul class="country">
    <li>Area</li>
    <li>Population</li>
</ul>

As with BeautifulSoup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags.
After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup. Instead, we will use CSS selectors here and in future examples, because they are more compact. Also, some readers will already be familiar with them from their experience with jQuery selectors. Here is an example using the lxml CSS selectors to extract the area data: >>> tree = lxml.html.fromstring(html) >>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0] >>> area = td.text_content() >>> print area 244,820 square kilometres The key line with the CSS selector is highlighted. This line finds a table row element with the places_area__row ID, and then selects the child table data tag with the w2p_fw class. CSS selectors CSS selectors are patterns used for selecting elements. Here are some examples of common selectors you will need: Select any tag: * Select by tag <a>: a Select by class of "link": .link Select by tag <a> with class "link": a.link Select by tag <a> with ID "home": a#home Select by child <span> of tag <a>: a > span Select by descendant <span> of tag <a>: a span Select by tag <a> with attribute title of "Home": a[title=Home] The CSS3 specification was produced by the W3C and is available for viewing at http://www.w3.org/TR/2011/REC-css3-selectors-20110929/. Lxml implements most of CSS3, and details on unsupported features are available at https://pythonhosted.org/cssselect/#supported-selectors. Note that, internally, lxml converts the CSS selectors into an equivalent XPath. Comparing performance To help evaluate the trade-offs of the three scraping approaches described in this article, it would help to compare their relative efficiency. Typically, a scraper would extract multiple fields from a web page. So, for a more realistic comparison, we will implement extended versions of each scraper that extract all the available data from a country's web page. 
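As a brief aside on the note that lxml converts CSS selectors into an equivalent XPath, the correspondence can be sketched with the standard library alone. This is an illustration under stated assumptions, not lxml's actual translation: it uses xml.etree.ElementTree (which supports only a subset of XPath and needs well-formed markup), and FRAGMENT is a made-up, well-formed stand-in for the country table.

```python
import xml.etree.ElementTree as ET

# Well-formed, made-up fragment (an assumption): ElementTree is an XML parser
# and, unlike lxml.html, will not repair broken HTML.
FRAGMENT = (
    '<table>'
    '<tr id="places_area__row">'
    '<td class="w2p_fl">Area: </td>'
    '<td class="w2p_fw">244,820 square kilometres</td>'
    '</tr>'
    '</table>'
)

root = ET.fromstring(FRAGMENT)

# CSS:   tr#places_area__row > td.w2p_fw
# XPath: a tr with that id, then a direct td child with that class attribute.
td = root.find(".//tr[@id='places_area__row']/td[@class='w2p_fw']")
area = td.text
print(area)
```

One difference worth keeping in mind: the attribute test used here matches the whole class attribute exactly, whereas the CSS selector td.w2p_fw also matches elements that carry additional classes.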
To get started, we need to return to Firebug to check the format of the other country features, as shown here:

Firebug shows that each table row has an ID starting with places_ and ending with __row. Then, the country data is contained within these rows in the same format as the earlier area example. Here are implementations that use this information to extract all of the available country data:

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')

import re
def re_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).groups()[0]
    return results

from bs4 import BeautifulSoup
def bs_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table').find('tr', id='places_%s__row' % field).find('td', class_='w2p_fw').text
    return results

import lxml.html
def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
    return results

Scraping results

Now that we have complete implementations for each scraper, we will test their relative performance with this snippet:

import time
NUM_ITERATIONS = 1000 # number of times to test each scraper
html = download('http://example.webscraping.com/places/view/United-Kingdom-239')
for name, scraper in [('Regular expressions', re_scraper), ('BeautifulSoup', bs_scraper), ('Lxml', lxml_scraper)]:
    # record start time of scrape
    start = time.time()
    for i in range(NUM_ITERATIONS):
        if scraper == re_scraper:
            re.purge()
        result = scraper(html)
        # check scraped result is as expected
        assert(result['area'] == '244,820 square kilometres')
    # record end time of scrape and output the total
    end = time.time()
    print '%s: %.2f seconds' % (name, end - start)

This example will run each scraper 1000 times, check whether the scraped results are as expected, and then print the total time taken. Note the line calling re.purge(); by default, the regular expression module caches compiled patterns, and this cache needs to be cleared to make a fair comparison with the other scraping approaches. Here are the results from this script on my computer:

$ python performance.py
Regular expressions: 5.50 seconds
BeautifulSoup: 42.84 seconds
Lxml: 7.06 seconds

The results on your computer will quite likely be different because of the different hardware used. However, the relative difference between the approaches should be similar. The results show that Beautiful Soup is over six times slower than the other two approaches when used to scrape our example web page. This result could be anticipated because lxml and the regular expression module were written in C, while BeautifulSoup is pure Python. An interesting fact is that lxml performed comparably to regular expressions, even though lxml has the additional overhead of having to parse the input into its internal format before searching for elements. When scraping many features from a web page, this initial parsing overhead is spread across more fields and lxml becomes even more competitive. It really is an amazing module!

Overview

The following table summarizes the advantages and disadvantages of each approach to scraping:

Scraping approach     Performance   Ease of use   Ease of installation
Regular expressions   Fast          Hard          Easy (built-in module)
Beautiful Soup        Slow          Easy          Easy (pure Python)
Lxml                  Fast          Easy          Moderately difficult

If the bottleneck to your scraper is downloading web pages rather than extracting data, it would not be a problem to use a slower approach, such as Beautiful Soup. Or, if you just need to scrape a small amount of data and want to avoid additional dependencies, regular expressions might be an appropriate choice.
However, in general, lxml is the best choice for scraping, because it is fast and robust, while regular expressions and Beautiful Soup are only useful in certain niches. Adding a scrape callback to the link crawler Now that we know how to scrape the country data, we can integrate this into the link crawler. To allow reusing the same crawling code to scrape multiple websites, we will add a callback parameter to handle the scraping. A callback is a function that will be called after certain events (in this case, after a web page has been downloaded). This scrape callback will take a url and html as parameters and optionally return a list of further URLs to crawl. Here is the implementation, which is simple in Python: def link_crawler(..., scrape_callback=None): … links = [] if scrape_callback: links.extend(scrape_callback(url, html) or []) … The new code for the scraping callback function are highlighted in the preceding snippet. Now, this crawler can be used to scrape multiple websites by customizing the function passed to scrape_callback. Here is a modified version of the lxml example scraper that can be used for the callback function: def scrape_callback(url, html): if re.search('/view/', url): tree = lxml.html.fromstring(html) row = [tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content() for field in FIELDS] print url, row This callback function would scrape the country data and print it out. 
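The callback mechanism described above can be tried end to end without network access. What follows is a minimal sketch in Python 3 syntax, not the book's link crawler: PAGES is a hypothetical in-memory site that stands in for downloads, this link_crawler() is a simplified stand-in with a made-up signature, and a regular expression replaces lxml so that only the standard library is needed.

```python
import re

# Hypothetical in-memory "site" standing in for real downloads (an assumption).
PAGES = {
    '/index': '<a href="/view/1">one</a> <a href="/view/2">two</a>',
    '/view/1': '<td class="w2p_fw">Area A</td>',
    '/view/2': '<td class="w2p_fw">Area B</td>',
}


def link_crawler(seed_url, scrape_callback=None):
    """Visit every page reachable from seed_url, calling the callback per page."""
    queue, seen = [seed_url], {seed_url}
    while queue:
        url = queue.pop()
        html = PAGES.get(url, '')
        links = re.findall(r'href="(.*?)"', html)
        if scrape_callback:
            # The callback may return extra URLs to crawl, or None.
            links.extend(scrape_callback(url, html) or [])
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)


rows = []


def scrape_callback(url, html):
    # Only the /view/ pages carry data worth scraping.
    if re.search('/view/', url):
        rows.append((url, re.findall(r'<td class="w2p_fw">(.*?)</td>', html)))


link_crawler('/index', scrape_callback=scrape_callback)
print(sorted(rows))
```

Because the crawler only knows that it has something callable taking url and html, swapping in a different scraping function (or a callable object) requires no change to the crawling code.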
Usually, when scraping a website, we want to reuse the data, so we will extend this example to save results to a CSV spreadsheet, as follows:

import csv
import re
import lxml.html

class ScrapeCallback:
    def __init__(self):
        self.writer = csv.writer(open('countries.csv', 'w'))
        self.fields = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')
        self.writer.writerow(self.fields)

    def __call__(self, url, html):
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            row = []
            for field in self.fields:
                row.append(tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content())
            self.writer.writerow(row)

To build this callback, a class was used instead of a function so that the state of the csv writer could be maintained. This csv writer is instantiated in the constructor, and then written to multiple times in the __call__ method. Note that __call__ is a special method that is invoked when an object is "called" as a function, which is how the scrape_callback is used in the link crawler. This means that scrape_callback(url, html) is equivalent to calling scrape_callback.__call__(url, html). For further details on Python's special class methods, refer to https://docs.python.org/2/reference/datamodel.html#special-method-names.

This code shows how to pass this callback to the link crawler:

link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=ScrapeCallback())

Now, when the crawler is run with this callback, it will save results to a CSV file that can be viewed in an application such as Excel or LibreOffice:

Success! We have completed our first working scraper.

Summary

In this article, we walked through a variety of ways to scrape data from a web page.
Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and BeautifulSoup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, and we will use it in future examples.

Pravin Dhandre
11 Jan 2018
12 min read

Working with Spark’s graph processing library, GraphFrames

This article is an excerpt from the book Apache Spark 2 for Beginners by Rajanarayanan Thottuvaikkatumana. The author presents a learner's guide for Python and Scala developers to develop large-scale and distributed data processing applications in the business environment.

In this post we will see how a Spark user can work with Spark’s most popular graph processing package, GraphFrames. We will also explore how you can benefit from running queries and finding insightful patterns through graphs.

Of Spark's graph processing options, the Spark GraphX library has the least programming language support: Scala is the only programming language it supports. GraphFrames is a new graph processing library available as an external Spark package developed by Databricks, University of California, Berkeley, and Massachusetts Institute of Technology, built on top of Spark DataFrames. Since it is built on top of DataFrames, all the operations that can be done on DataFrames are potentially possible on GraphFrames, with support for programming languages such as Scala, Java, Python, and R with a uniform API. For the same reason, the persistence of data, support for numerous data sources, and powerful graph queries in Spark SQL are additional benefits users get for free.

Just like the Spark GraphX library, in GraphFrames the data is stored in vertices and edges. The vertices and edges use DataFrames as the data structure. The first use case covered in the beginning of this chapter is used again to elucidate GraphFrames-based graph processing.

Please note that GraphFrames is an external Spark package. It has some incompatibility with Spark 2.0. Because of that, the following code snippets will not work with Spark 2.0. They work with Spark 1.6. Refer to the GraphFrames website to check Spark 2.0 support.

At the Scala REPL prompt of Spark 1.6, try the following statements.
Since GraphFrames is an external Spark package, while bringing up the appropriate REPL, the library has to be imported and the following command is used in the terminal prompt to fire up the REPL and make sure that the library is loaded without any error messages: $ cd $SPARK_1.6__HOME $ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6 Ivy Default Cache set to: /Users/RajT/.ivy2/cache The jars for the packages stored in: /Users/RajT/.ivy2/jars :: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2- SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml graphframes#graphframes added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found graphframes#graphframes;0.1.0-spark1.6 in list :: resolution report :: resolve 153ms :: artifacts dl 2ms :: modules in use: graphframes#graphframes;0.1.0-spark1.6 from list in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 1 | 0 | 0 | 0 || 1 | 0 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 1 already retrieved (0kB/5ms) 16/07/31 09:22:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 1.6.1 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. SQL context available as sqlContext. 
scala> import org.graphframes._ import org.graphframes._ scala> import org.apache.spark.rdd.RDD import org.apache.spark.rdd.RDD scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> import org.apache.spark.graphx._ import org.apache.spark.graphx._ scala> //Create a DataFrame of users containing tuple values with a mandatory Long and another String type as the property of the vertex scala> val users = sqlContext.createDataFrame(List((1L, "Thomas"),(2L, "Krish"),(3L, "Mathew"))).toDF("id", "name") users: org.apache.spark.sql.DataFrame = [id: bigint, name: string] scala> //Created a DataFrame for Edge with String type as the property of the edge scala> val userRelationships = sqlContext.createDataFrame(List((1L, 2L, "Follows"),(1L, 2L, "Son"),(2L, 3L, "Follows"))).toDF("src", "dst", "relationship") userRelationships: org.apache.spark.sql.DataFrame = [src: bigint, dst: bigint, relationship: string] scala> val userGraph = GraphFrame(users, userRelationships) userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, name: string], e:[src: bigint, dst: bigint, relationship: string]) scala> // Vertices in the graph scala> userGraph.vertices.show() +---+------+ | id| name| +---+------+ | 1|Thomas| | 2| Krish| | 3|Mathew| +---+------+ scala> // Edges in the graph scala> userGraph.edges.show() +---+---+------------+ |src|dst|relationship| +---+---+------------+ | 1| 2| Follows| | 1| 2| Son| | 2| 3| Follows| +---+---+------------+ scala> //Number of edges in the graph scala> val edgeCount = userGraph.edges.count() edgeCount: Long = 3 scala> //Number of vertices in the graph scala> val vertexCount = userGraph.vertices.count() vertexCount: Long = 3 scala> //Number of edges coming to each of the vertex. scala> userGraph.inDegrees.show() +---+--------+ | id|inDegree| +---+--------+ | 2| 2| | 3| 1| +---+--------+ scala> //Number of edges going out of each of the vertex. 
scala> userGraph.outDegrees.show() +---+---------+ | id|outDegree| +---+---------+ | 1| 2| | 2| 1| +---+---------+ scala> //Total number of edges coming in and going out of each vertex. scala> userGraph.degrees.show() +---+------+ | id|degree| +---+------+ | 1| 2| | 2| 3| | 3| 1| +---+------+ scala> //Get the triplets of the graph scala> userGraph.triplets.show() +-------------+----------+----------+ | edge| src| dst| +-------------+----------+----------+ |[1,2,Follows]|[1,Thomas]| [2,Krish]| | [1,2,Son]|[1,Thomas]| [2,Krish]| |[2,3,Follows]| [2,Krish]|[3,Mathew]| +-------------+----------+----------+ scala> //Using the DataFrame API, apply filter and select only the needed edges scala> val numFollows = userGraph.edges.filter("relationship = 'Follows'").count() numFollows: Long = 2 scala> //Create an RDD of users containing tuple values with a mandatory Long and another String type as the property of the vertex scala> val usersRDD: RDD[(Long, String)] = sc.parallelize(Array((1L, "Thomas"), (2L, "Krish"),(3L, "Mathew"))) usersRDD: org.apache.spark.rdd.RDD[(Long, String)] = ParallelCollectionRDD[54] at parallelize at <console>:35 scala> //Created an RDD of Edge type with String type as the property of the edge scala> val userRelationshipsRDD: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "Follows"), Edge(1L, 2L, "Son"),Edge(2L, 3L, "Follows"))) userRelationshipsRDD: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[55] at parallelize at <console>:35 scala> //Create a graph containing the vertex and edge RDDs as created before scala> val userGraphXFromRDD = Graph(usersRDD, userRelationshipsRDD) userGraphXFromRDD: org.apache.spark.graphx.Graph[String,String] = org.apache.spark.graphx.impl.GraphImpl@77a3c614 scala> //Create the GraphFrame based graph from Spark GraphX based graph scala> val userGraphFrameFromGraphX: GraphFrame = GraphFrame.fromGraphX(userGraphXFromRDD) userGraphFrameFromGraphX: org.graphframes.GraphFrame = 
GraphFrame(v:[id: bigint, attr: string], e:[src: bigint, dst: bigint, attr: string]) scala> userGraphFrameFromGraphX.triplets.show() +-------------+----------+----------+ | edge| src| dst| +-------------+----------+----------+ |[1,2,Follows]|[1,Thomas]| [2,Krish]| | [1,2,Son]|[1,Thomas]| [2,Krish]| |[2,3,Follows]| [2,Krish]|[3,Mathew]| +-------------+----------+----------+ scala> // Convert the GraphFrame based graph to a Spark GraphX based graph scala> val userGraphXFromGraphFrame: Graph[Row, Row] = userGraphFrameFromGraphX.toGraphX userGraphXFromGraphFrame: org.apache.spark.graphx.Graph[org.apache.spark.sql.Row,org.apache.spark.sql .Row] = org.apache.spark.graphx.impl.GraphImpl@238d6aa2 When creating DataFrames for the GraphFrame, the only thing to keep in mind is that there are some mandatory columns for the vertices and the edges. In the DataFrame for vertices, the id column is mandatory. In the DataFrame for edges, the src and dst columns are mandatory. Apart from that, any number of arbitrary columns can be stored with both the vertices and the edges of a GraphFrame. In the Spark GraphX library, the vertex identifier must be a long integer, but the GraphFrame doesn't have any such limitations and any type is supported as the vertex identifier. Readers should already be familiar with DataFrames; any operation that can be done on a DataFrame can be done on the vertices and edges of a GraphFrame. All the graph processing algorithms supported by Spark GraphX are supported by GraphFrames as well. The Python version of GraphFrames has fewer features. Since Python is not a supported programming language for the Spark GraphX library, GraphFrame to GraphX and GraphX to GraphFrame conversions are not supported in Python. Since readers are familiar with the creation of DataFrames in Spark using Python, the Python example is omitted here. 
Moreover, there are some pending defects in the GraphFrames API for Python, and not all the features demonstrated previously using Scala function properly in Python at the time of writing.

Understanding GraphFrames queries

The Spark GraphX library is an RDD-based graph processing library, whereas GraphFrames is a Spark DataFrame-based graph processing library that is available as an external package. Spark GraphX supports many graph processing algorithms; GraphFrames supports not only graph processing algorithms, but also graph queries. The major difference between the two is that graph processing algorithms are used to process the data hidden in a graph data structure, while graph queries are used to search for patterns in that data. In GraphFrame parlance, graph queries are also known as motif finding. Motif finding has wide applications in genetics and other biological sciences that deal with sequence motifs.

As a use case, take users following each other in a social media application. Users have relationships between them. In the previous sections, these relationships were modeled as graphs. In real-world use cases, such graphs can become really huge, and if there is a need to find users with relationships between them in both directions, it can be expressed as a pattern in a graph query, and such relationships can be found using easy programmatic constructs. The following demonstration models the relationships between the users in a GraphFrame and runs a pattern search on it.
At the Scala REPL prompt of Spark 1.6, try the following statements: $ cd $SPARK_1.6_HOME $ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6 Ivy Default Cache set to: /Users/RajT/.ivy2/cache The jars for the packages stored in: /Users/RajT/.ivy2/jars :: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2- SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml graphframes#graphframes added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found graphframes#graphframes;0.1.0-spark1.6 in list :: resolution report :: resolve 145ms :: artifacts dl 2ms :: modules in use: graphframes#graphframes;0.1.0-spark1.6 from list in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 1 | 0 | 0 | 0 || 1 | 0 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 1 already retrieved (0kB/5ms) 16/07/29 07:09:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 1.6.1 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. SQL context available as sqlContext. 
scala> import org.graphframes._ import org.graphframes._ scala> import org.apache.spark.rdd.RDD import org.apache.spark.rdd.RDD scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> import org.apache.spark.graphx._ import org.apache.spark.graphx._ scala> //Create a DataFrame of users containing tuple values with a mandatory String field as id and another String type as the property of the vertex. Here it can be seen that the vertex identifier is no longer a long integer. scala> val users = sqlContext.createDataFrame(List(("1", "Thomas"),("2", "Krish"),("3", "Mathew"))).toDF("id", "name") users: org.apache.spark.sql.DataFrame = [id: string, name: string] scala> //Create a DataFrame for Edge with String type as the property of the edge scala> val userRelationships = sqlContext.createDataFrame(List(("1", "2", "Follows"),("2", "1", "Follows"),("2", "3", "Follows"))).toDF("src", "dst", "relationship") userRelationships: org.apache.spark.sql.DataFrame = [src: string, dst: string, relationship: string] scala> //Create the GraphFrame scala> val userGraph = GraphFrame(users, userRelationships) userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: string, name: string], e:[src: string, dst: string, relationship: string]) scala> // Search for pairs of users who are following each other scala> // In other words the query can be read like this. Find the list of users having a pattern such that user u1 is related to user u2 using the edge e1 and user u2 is related to the user u1 using the edge e2. When a query is formed like this, the result will list with columns u1, u2, e1 and e2. When modelling real-world use cases, more meaningful variables can be used suitable for the use case. 
scala> val graphQuery = userGraph.find("(u1)-[e1]->(u2); (u2)-[e2]->(u1)") graphQuery: org.apache.spark.sql.DataFrame = [e1: struct<src:string,dst:string,relationship:string>, u1: struct<id:string,name:string>, u2: struct<id:string,name:string>, e2: struct<src:string,dst:string,relationship:string>] scala> graphQuery.show() +-------------+----------+----------+-------------+ | e1| u1| u2| e2| +-------------+----------+----------+-------------+ |[1,2,Follows]|[1,Thomas]| [2,Krish]|[2,1,Follows]| |[2,1,Follows]| [2,Krish]|[1,Thomas]|[1,2,Follows]| +-------------+----------+----------+-------------+

Note that the columns in the graph query result are formed from the elements given in the search pattern, and that the patterns themselves can be composed freely. Note also the data type of the graph query result: it is a DataFrame object, which gives great flexibility in processing the query results with the familiar Spark SQL library. The biggest limitation of the Spark GraphX library is that its API is not available in popular programming languages such as Python and R. Since GraphFrames is a DataFrame-based library, once it matures it will enable graph processing in all the programming languages supported by DataFrames. As an external Spark package, GraphFrames is a potential candidate for inclusion in Spark itself. To learn more about designing and developing data processing applications using Spark and the family of libraries built on top of it, check out the book Apache Spark 2 for Beginners.
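To make the semantics of the mutual-follow motif concrete, here is a minimal plain-Scala sketch that reproduces the same two matches without Spark. It is not the GraphFrames implementation: `MutualFollowSketch`, `User`, `Relationship`, and `mutualFollows` are hypothetical names introduced only for illustration, and the nested loop is a naive stand-in for what the library does with DataFrame joins.

```scala
// Plain-Scala illustration (no Spark required) of what the motif
// "(u1)-[e1]->(u2); (u2)-[e2]->(u1)" matches. All names are hypothetical.
object MutualFollowSketch {
  final case class User(id: String, name: String)
  final case class Relationship(src: String, dst: String, relationship: String)

  // Same vertices and edges as in the REPL session above.
  val users: List[User] =
    List(User("1", "Thomas"), User("2", "Krish"), User("3", "Mathew"))
  val relationships: List[Relationship] = List(
    Relationship("1", "2", "Follows"),
    Relationship("2", "1", "Follows"),
    Relationship("2", "3", "Follows"))

  // Returns one (e1, u1, u2, e2) tuple per pair of edges that point
  // in opposite directions between the same two users.
  def mutualFollows(vs: List[User], es: List[Relationship])
      : List[(Relationship, User, User, Relationship)] = {
    val byId = vs.map(u => u.id -> u).toMap
    for {
      e1 <- es
      e2 <- es
      if e1.dst == e2.src && e2.dst == e1.src
      u1 <- byId.get(e1.src).toList
      u2 <- byId.get(e1.dst).toList
    } yield (e1, u1, u2, e2)
  }

  def main(args: Array[String]): Unit =
    mutualFollows(users, relationships).foreach(println)
}
```

As in the query result above, the pair Thomas and Krish appears twice, once for each assignment of u1 and u2, while Mathew does not appear because no edge leaves vertex 3.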
Packt
16 Dec 2014
9 min read

Ridge Regression

In this article by Patrick R. Nicolas, the author of the book Scala for Machine Learning, we will cover the basics of ridge regression. The purpose of regression is to minimize a loss function, the residual sum of squares (RSS) being the one commonly used. The problem of overfitting can be addressed by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization.

Ln roughness penalty

Regularization consists of adding a penalty function J(w) to the loss function (or RSS in the case of a regressive classifier) in order to prevent the model parameters (or weights) from reaching high values. A model that fits a training set very well tends to have many feature variables with relatively large weights; penalizing those weights is known as shrinkage. Practically, shrinkage consists of adding a function with model parameters as an argument to the loss function:

$$\hat{w} = \arg\min_{w} \left\{ RSS(w) + \lambda J(w) \right\}$$

where λ ≥ 0 is the penalty factor. The penalty function is completely independent from the training set {x,y}. The penalty term is usually expressed as a power function of the norm of the model parameters (or weights) wd. For a model of dimension D, the generic Lp-norm is defined as follows:

$$\|w\|_p = \left( \sum_{d=1}^{D} |w_d|^p \right)^{1/p}$$

Notation
Regularization applies to the parameters or weights associated with an observation. In order to be consistent with our notation, w0 being the intercept value, the regularization applies to the parameters w1 …wd.

The two most commonly used penalty functions for regularization are L1 and L2.

Regularization in machine learning
The regularization technique is not specific to linear or logistic regression. Any algorithm that minimizes a loss function such as the residual sum of squares, for example a support vector machine or a feed-forward neural network, can be regularized by adding a roughness penalty function to the loss. The L1 regularization applied to the linear regression is known as the Lasso regularization.
The ridge regression is a linear regression that uses the L2 regularization penalty. You may wonder which regularization makes sense for a given training set. In a nutshell, L2 and L1 regularizations differ in terms of computation efficiency, estimation, and feature selection (refer to the 13.3 L1 regularization: basics section in the book Machine Learning: A Probabilistic Perspective, and the Feature selection, L1 vs. L2 regularization, and rotational invariance paper available at http://www.machinelearning.org/proceedings/icml2004/papers/354.pdf). The various differences between the two regularizations are as follows:

Model estimation: L1 generates a sparser estimation of the regression parameters than L2. For a large non-sparse dataset, L2 has a smaller estimation error than L1.
Feature selection: L1 is more effective than L2 in reducing the regression weights for features with high values. Therefore, L1 is a reliable feature selection tool.
Overfitting: Both L1 and L2 reduce the impact of overfitting. However, L1 has a significant advantage in overcoming overfitting (or excessive complexity of a model) for the same reason it is more appropriate for selecting features.
Computation: L2 is conducive to a more efficient computation model. The sum of the loss function and the L2 penalty w² is a continuous and differentiable function for which the first and second derivatives can be computed (convex minimization). The L1 term is the sum of |wi| and is therefore not differentiable at wi = 0.

Terminology
The ridge regression is sometimes called the penalized least squares regression. The L2 regularization is also known as weight decay.

Let's implement the ridge regression, and then evaluate the impact of the L2-norm penalty factor.
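The two penalty terms compared above can be sketched in a few lines of plain Scala. This is only an illustration of the standard definitions, not code from the book: `PenaltySketch`, `lpNorm`, `l1Penalty`, and `l2Penalty` are hypothetical names, and the weights are assumed to exclude the intercept w0.

```scala
// Illustrative sketch only: the object and method names are hypothetical
// and are not part of the book's library or of Apache Commons Math.
object PenaltySketch {
  // Generic Lp norm: (sum_d |w_d|^p)^(1/p)
  def lpNorm(w: Array[Double], p: Double): Double =
    math.pow(w.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)

  // L1 (Lasso) penalty: lambda * sum_d |w_d|
  def l1Penalty(w: Array[Double], lambda: Double): Double =
    lambda * w.map(x => math.abs(x)).sum

  // L2 (ridge) penalty: lambda * sum_d w_d^2
  def l2Penalty(w: Array[Double], lambda: Double): Double =
    lambda * w.map(x => x * x).sum

  def main(args: Array[String]): Unit = {
    val w = Array(3.0, -4.0) // weights w1..wd, intercept w0 excluded
    println(lpNorm(w, 2.0))
    println(l1Penalty(w, 0.5)) // 0.5 * (3 + 4)
    println(l2Penalty(w, 0.5)) // 0.5 * (9 + 16)
  }
}
```

The L2 penalty grows quadratically with each weight, so it punishes a few large weights more heavily than the L1 penalty does, while the L1 penalty treats every unit of weight the same.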
Ridge regression

The ridge regression is a multivariate linear regression with an L2 norm penalty term. For n observations, it can be calculated as follows:

$$\hat{w} = \arg\min_{w} \left\{ \sum_{i=1}^{n} \left( y_i - w_0 - \sum_{d=1}^{D} w_d x_{id} \right)^2 + \lambda \sum_{d=1}^{D} w_d^2 \right\}$$

The computation of the ridge regression parameters requires solving a system of linear equations, as in the linear regression. The matrix representation of the ridge regression closed form is as follows:

$$\hat{w} = (X^T X + \lambda I)^{-1} X^T y$$

I is the identity matrix. The system is solved using the QR decomposition, as shown here:

$$X^T X + \lambda I = QR \quad\Rightarrow\quad R\,\hat{w} = Q^T X^T y$$

Implementation

The implementation of the ridge regression adds the L2 regularization term to the multiple linear regression computation of the Apache Commons Math library. The methods of RidgeRegression have the same signature as their ordinary least squares counterparts. However, the class has to inherit the abstract base class AbstractMultipleLinearRegression in Apache Commons Math and override the generation of the QR decomposition to include the penalty term, as shown in the following code:

class RidgeRegression[T <% Double](val xt: XTSeries[Array[T]],
                                   val y: DblVector,
                                   val lambda: Double)
    extends AbstractMultipleLinearRegression
    with PipeOperator[Array[T], Double] {
  private var qr: QRDecomposition = null
  private[this] val model: Option[RegressionModel] = …
  …
}

Besides the input time series xt and the labels y, the ridge regression requires the lambda factor of the L2 penalty term. Instantiating the class trains the model.
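As a point of reference for the closed form described above, the following self-contained sketch computes the ridge weights by solving (XᵀX + λI) w = Xᵀy directly. It is a textbook illustration under stated assumptions, not the article's implementation: it uses Gaussian elimination rather than the Apache Commons Math QR decomposition, it omits the intercept column for brevity, and `RidgeSketch`, `solve`, and `ridge` are hypothetical names.

```scala
// Textbook closed-form ridge regression in plain Scala (hypothetical names).
object RidgeSketch {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // Solves A w = b by Gaussian elimination with partial pivoting.
  def solve(a: Mat, b: Vec): Vec = {
    val n = b.length
    val m = a.map(_.clone) // work on copies, leave the inputs untouched
    val r = b.clone
    for (col <- 0 until n) {
      // Swap in the row with the largest pivot for numerical stability
      val pivot = (col until n).maxBy(i => math.abs(m(i)(col)))
      val tmpRow = m(col); m(col) = m(pivot); m(pivot) = tmpRow
      val tmpVal = r(col); r(col) = r(pivot); r(pivot) = tmpVal
      for (row <- col + 1 until n) {
        val f = m(row)(col) / m(col)(col)
        for (k <- col until n) m(row)(k) -= f * m(col)(k)
        r(row) -= f * r(col)
      }
    }
    // Back substitution on the upper-triangular system
    val w = new Array[Double](n)
    for (i <- n - 1 to 0 by -1) {
      val s = (i + 1 until n).map(k => m(i)(k) * w(k)).sum
      w(i) = (r(i) - s) / m(i)(i)
    }
    w
  }

  // Ridge weights: w = (X^T X + lambda I)^-1 X^T y
  def ridge(x: Mat, y: Vec, lambda: Double): Vec = {
    val d = x.head.length
    val xtx = Array.tabulate(d, d) { (i, j) =>
      x.map(row => row(i) * row(j)).sum + (if (i == j) lambda else 0.0)
    }
    val xty = Array.tabulate(d)(i => x.indices.map(n => x(n)(i) * y(n)).sum)
    solve(xtx, xty)
  }

  def main(args: Array[String]): Unit = {
    val x = Array(Array(1.0, 0.0), Array(0.0, 1.0))
    val y = Array(2.0, 4.0)
    println(ridge(x, y, 0.0).mkString(", ")) // lambda = 0: least squares
    println(ridge(x, y, 1.0).mkString(", ")) // lambda = 1: weights halved
  }
}
```

With this toy input XᵀX is the identity, so λ = 0 returns the least squares weights (2, 4) and λ = 1 halves them to (1, 2), which shows the shrinkage effect of the penalty term in isolation.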
The steps to create the ridge regression model are as follows:

1. Extract the Q and R matrices for the input values, newXSampleData (line 1)
2. Compute the weights using the calculateBeta method defined in the base class (line 2)
3. Return the regression weights from calculateBeta together with the RSS computed from the residuals, calculateResiduals

private val model: Option[RegressionModel] = {
  this.newXSampleData(xt.toDblMatrix) //1
  newYSampleData(y)
  val _rss = calculateResiduals.toArray.map(x => x*x).sum
  val wRss = (calculateBeta.toArray, _rss) //2
  Some(RegressionModel(wRss._1, wRss._2))
}

The QR decomposition in the AbstractMultipleLinearRegression base class does not include the penalty term (line 3); the identity matrix with the lambda factor in the diagonal has to be added to the matrix to be decomposed (line 4).

override protected def newXSampleData(x: DblMatrix): Unit = {
  super.newXSampleData(x)   //3
  val xtx: RealMatrix = getX
  val nFeatures = xt(0).size
  Range(0, nFeatures).foreach(i =>
    xtx.setEntry(i, i, xtx.getEntry(i, i) + lambda)) //4
  qr = new QRDecomposition(xtx)
}

The regression weights are computed by solving the system of linear equations using substitution on the QR matrices. This overrides the calculateBeta function from the base class:

override protected def calculateBeta: RealVector = qr.getSolver().solve(getY())

Test case

The objective of the test case is to identify the impact of the L2 penalization on the RSS value, and then compare the predicted values with the original values. Let's consider the first test case, the regression on the daily price variation of the Copper ETF (symbol: CU) using the stock daily volatility and volume as features.
The extraction of observations is implemented in the same way as for the least squares regression:

val src = DataSource(path, true, true, 1)
val price = src |> YahooFinancials.adjClose
val volatility = src |> YahooFinancials.volatility
val volume = src |> YahooFinancials.volume //1

val _price = price.get.toArray
val deltaPrice = XTSeries[Double](_price
                               .drop(1)
                               .zip(_price.take(_price.size -1))
                               .map( z => z._1 - z._2)) //2
val data = volatility.get
                     .zip(volume.get)
                     .map(z => Array[Double](z._1, z._2)) //3
val features = XTSeries[DblVector](data.take(data.size-1))
val regression = new RidgeRegression[Double](features, deltaPrice, lambda) //4

regression.rss match {
  case Some(rss) => Display.show(rss, logger) //5
  ….

The observed data, ETF daily price, and the features (volatility and volume) are extracted from the source src (line 1). The daily price change, deltaPrice, is computed using a combination of the Scala take and drop methods (line 2). The features vector is created by zipping volatility and volume (line 3). The model is created by instantiating the RidgeRegression class (line 4). The RSS value, rss, is finally displayed (line 5).

The RSS value, rss, is plotted for different values of lambda <= 1.0 in the following graph:

Graph of RSS versus Lambda for Copper ETF

The residual sum of squares decreases as λ increases. The curve seems to be reaching for a minimum around λ=1. The case of λ = 0 corresponds to the least squares regression. Next, let's plot the RSS value for λ varying between 1 and 100:

Graph RSS versus large value Lambda for Copper ETF

This time around, RSS increases with λ before reaching a maximum for λ > 60. This behavior is consistent with other findings (refer to Lecture 5: Model selection and assessment, a lecture by H. Bravo and R.
Irizarry from the Department of Computer Science, University of Maryland, in 2010, available at http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/selection.pdf). As λ increases, the overfitting gets more expensive, and therefore, the RSS value increases. The regression weights can simply be output as follows:

regression.weights.get

Let's plot the predicted price variation of the Copper ETF using the ridge regression with different values of lambda (λ):

Graph of ridge regression on Copper ETF price variation with variable Lambda

The original price variation of the Copper ETF, Δ = price(t+1)-price(t), is plotted as λ = 0. The predicted values for λ = 0.8 are very similar to the original data: they follow the pattern of the original data with a reduction of the large variations (peaks and troughs). The predicted values for λ = 5 correspond to a smoothed dataset. The pattern of the original data is preserved but the magnitude of the price variation is significantly reduced. The reader is invited to apply the more elaborate K-fold validation routine and compute precision, recall, and F1 measure to confirm the findings.

Summary

The ridge regression is a powerful alternative to the more common least squares regression because it reduces the risk of overfitting. Contrary to the Naïve Bayes classifiers, it does not require conditional independence of the model features.

Packt
30 Oct 2013
16 min read

Getting Started with Pentaho Data Integration

Pentaho Data Integration and Pentaho BI Suite

Before introducing PDI, let’s talk about Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are:

Analysis: The analysis engine serves multidimensional analysis. It’s provided by the Mondrian OLAP server.
Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on), from different kinds of sources.
Data Mining: Data mining is used for running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to the Weka Project.
Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). The Community Dashboard Framework (CDF), a plugin developed by the community and integrated in the Pentaho BI Suite, allows the creation of interesting dashboards including charts, reports, analysis views, and other Pentaho content, without much effort.
Data Integration: Data integration is used to integrate scattered information from different sources (applications, databases, files, and so on), and make the integrated information available to the final user.

All of this functionality can be used standalone or integrated. In order to run analysis, reports, and so on, integrated as a suite, you have to use the Pentaho BI Platform. The platform has a solution engine, and offers critical services, for example, authentication, scheduling, security, and web services. This set of software and services forms a complete BI Platform, which makes Pentaho Suite the world’s leading open source Business Intelligence Suite.

Exploring the Pentaho Demo

The Pentaho BI Platform Demo is a pre-configured installation that allows you to explore several capabilities of the Pentaho platform.
It includes sample reports, cubes, and dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kinds of scale replicas of vehicles. The following screenshot is a sample dashboard available in the demo:

The Pentaho BI Platform Demo is free and can be downloaded from http://sourceforge.net/projects/pentaho/files/. Under the Business Intelligence Server folder, look for the latest stable version. You can find out more about Pentaho BI Suite Community Edition at http://community.pentaho.com/projects/bi_platform. There is also an Enterprise Edition of the platform with additional features and support. You can find more on this at www.pentaho.org.

Pentaho Data Integration

Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is no exception: Pentaho Data Integration is the new name for the business intelligence tool born as Kettle. The name Kettle didn’t come from the recursive acronym Kettle Extraction, Transportation, Transformation, and Loading Environment it has now. It came from KDE Extraction, Transportation, Transformation, and Loading Environment, since the tool was planned to be written on top of KDE, a Linux desktop environment, as mentioned in the introduction of the article. In April 2006, the Kettle project was acquired by the Pentaho Corporation, and Matt Casters, the Kettle founder, also joined the Pentaho team as a Data Integration Architect. When Pentaho announced the acquisition, James Dixon, Chief Technology Officer, said: "We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community."
By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project. Since then, the tool has grown without pause. Every few months a new release becomes available, bringing users improvements in performance and existing functionality, new functionality, better ease of use, and great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

June 2006: PDI 2.3 is released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. The version included, among other changes, enhancements for large-scale environments and multilingual capabilities.
February 2007: Almost seven months after the last major revision, PDI 2.4 is released, including remote execution and clustering support, enhanced database support, and a single designer for jobs and transformations, the two main kinds of elements you design in Kettle.
May 2007: PDI 2.5 is released, including many new features, the most relevant being the advanced error handling.
November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to gain massive performance. The look and feel had also changed completely.
October 2008: PDI 3.1 arrives, bringing a tool that was easier to use, and with a lot of new functionality as well.
April 2009: PDI 3.2 is released with a really large number of changes for a minor version: new functionality, visualization and performance improvements, and a huge number of bug fixes. The main change in this version was the incorporation of dynamic clustering.
June 2010: PDI 4.0 is released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus was on several visual improvements such as the mouseover assistance that you will experiment with soon.
November 2010: PDI 4.1 is released with many bug fixes.
August 2011: PDI 4.2 comes to light not only with a large number of bug fixes, but also with a lot of improvements and new features. In particular, several of them were related to the work with repositories.
April 2012: PDI 4.3 is released, also with a lot of fixes and a number of improvements and new features.
November 2012: PDI 4.4 is released. This version incorporates a lot of enhancements and new features. In this version there is a special emphasis on Big Data, that is, the ability to read, search, and in general transform large and complex collections of datasets.
2013: PDI 5.0 will be released, delivering interesting low-level features such as step load balancing, job transactions, and restartability.

Using PDI in real-world scenarios

Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data. In fact, PDI does not only serve as a data integrator or an ETL tool. PDI is such a powerful tool that it is common to see it used for these and for many other purposes. Here are some examples.

Loading data warehouses or datamarts

The loading of a data warehouse or a datamart involves many steps, and there are many variants depending on the business area or business rules. But in every case, without exception, the process involves the following steps:

Extracting information from one or more databases, text files, XML files, and other sources. The extract process may include the task of validating and discarding data that doesn’t match expected patterns or rules.
Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.
Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed.
Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with Kettle:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP (Enterprise Resource Planning) application and a CRM (Customer Relationship Management) application, though they’re not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data. Some conversions, validation, and transport of data have to be done. Kettle is meant to do all of those tasks.

Data cleansing

It’s important and even critical that data be correct and accurate for the efficiency of business, to generate trustworthy conclusions in data mining or statistical studies, and to succeed when integrating data. Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying whether the data meets certain rules, discarding or correcting records that don’t follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible thanks to its vast set of transformation and validation capabilities.

Migrating information

Think of a company of any size that uses a commercial ERP application. One day the owners realize that the licenses are consuming an important share of its budget. So they decide to migrate to an open source ERP. The company will no longer have to pay licenses, but if they want to change, they will have to migrate the information. Obviously, it is not an option to start from scratch, nor to type the information by hand.
Kettle makes the migration possible thanks to its ability to interact with most kind of sources and destinations such as plain files, commercial and free databases, and spreadsheets, among others. Exporting data Data may need to be exported for numerous reasons: To create detailed business reports To allow communication between different departments within the same company To deliver data from your legacy systems to obey government regulations, and so on Kettle has the power to take raw data from the source and generate these kind of ad-hoc reports. Integrating PDI along with other Pentaho tools The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used embedded as part of a process or a dataflow. Some examples are pre-processing data for an online report, sending mails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on. The use of PDI integrated with other tools is beyond the scope of this article. If you are interested, you can find more information on this subject in the Pentaho Data Integration 4 Cookbook by Packt Publishing at http://www.packtpub.com/pentaho-data-integration-4-cookbook/book. Installing PDI In order to work with PDI, you need to install the software. It’s a simple task, so let’s do it now. Time for action – installing PDI These are the instructions to install PDI, for whatever operating system you may be using. The only prerequisite to install the tool is to have JRE 6.0 installed. If you don’t have it, please download it from www.javasoft.com and install it before proceeding. Once you have checked the prerequisite, follow these steps: Go to the download page at http://sourceforge.net/projects/pentaho/files/Data Integration. Choose the newest stable release. At this time, it is 4.4.0, as shown in the following screenshot: Download the file that matches your platform. The preceding screenshot should help you. 
Unzip the downloaded file in a folder of your choice, for example, c:/util/kettle or /home/pdi_user/kettle. If your system is Windows, you are done. Under Unix-like environments, you have to make the scripts executable. Assuming that you chose /home/pdi_user/kettle as the installation folder, execute:
cd /home/pdi_user/kettle
chmod +x *.sh
In Mac OS you have to give execute permissions to the JavaApplicationStub file. Look for this file; it is located in Data Integration 32-bit.app/Contents/MacOS, or Data Integration 64-bit.app/Contents/MacOS, depending on your system.
What just happened?
You have installed the tool in just a few minutes. Now, you have all you need to start working.
Launching the PDI graphical designer – Spoon
Now that you’ve installed PDI, you must be eager to do some work with data. That will be possible only inside a graphical environment. PDI has a desktop designer tool named Spoon. Let’s launch Spoon and see what it looks like.
Time for action – starting and customizing Spoon
In this section, you are going to launch the PDI graphical designer and get familiar with its main features. Start Spoon. If your system is Windows, run Spoon.bat. You can just double-click on the Spoon.bat icon, or on Spoon if your Windows system doesn’t show extensions for known file types. Alternatively, open a command window (select Run in the Windows start menu and execute cmd), and run Spoon.bat in the terminal. On other platforms such as Unix or Linux, open a terminal window and type spoon.sh. If you didn’t make spoon.sh executable, you may type sh spoon.sh. Alternatively, if you work on Mac OS, you can execute the JavaApplicationStub file, or click on the Data Integration 32-bit.app or Data Integration 64-bit.app icon. As soon as Spoon starts, a dialog window appears asking for the repository connection data. Click on the Cancel button. A small window labeled Spoon tips... appears. You may want to navigate through various tips before starting.
Eventually, close the window and proceed. Finally, the main window shows up. A Welcome! window appears with some useful links for you to see. Close the window. You can open it later from the main menu. Click on Options... from the menu Tools. A window appears where you can change various general and visual characteristics. Uncheck the highlighted checkboxes, as shown in the following screenshot: Select the tab window Look & Feel. Change the Grid size and Preferred Language settings as shown in the following screenshot: Click on the OK button. Restart Spoon in order to apply the changes. You should not see the repository dialog, or the Welcome! window. You should see the following screenshot full of French words instead: What just happened? You ran for the first time Spoon, the graphical designer of PDI. Then you applied some custom configuration. In the Option… tab, you chose not to show the repository dialog or the Welcome! window at startup. From the Look & Feel configuration window, you changed the size of the dotted grid that appears in the canvas area while you are working. You also changed the preferred language. These changes were applied as you restarted the tool, not before. The second time you launched the tool, the repository dialog didn’t show up. When the main window appeared, all of the visible texts were shown in French which was the selected language, and instead of the Welcome! window, there was a blank screen. You didn’t see the effect of the change in the Grid option. You will see it only after creating or opening a transformation or job, which will occur very soon! Spoon Spoon, the tool you’re exploring in this section, is the PDI’s desktop design tool. With Spoon, you design, preview, and test all your work, that is, Transformations and Jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots. Setting preferences in the Options window In the earlier section, you changed some preferences in the Options window. 
There are several look and feel characteristics you can modify beyond those you changed. Feel free to experiment with these settings. Remember to restart Spoon in order to see the changes applied. In particular, please take note of the following suggestion about the configuration of the preferred language. If you choose a preferred language other than English, you should select a different language as an alternative. If you do so, every name or description not translated to your preferred language, will be shown in the alternative language. One of the settings that you changed was the appearance of the Welcome! window at startup. The Welcome! window has many useful links, which are all related with the tool: wiki pages, news, forum access, and more. It’s worth exploring them. You don’t have to change the settings again to see the Welcome! window. You can open it by navigating to Help | Welcome Screen. Storing transformations and jobs in a repository The first time you launched Spoon, you chose not to work with repositories. After that, you configured Spoon to stop asking you for the Repository option. You must be curious about what the repository is and why we decided not to use it. Let’s explain it. As we said, the results of working with PDI are transformations and jobs. In order to save the transformations and jobs, PDI offers two main methods: Database repository: When you use the database repository method, you save jobs and transformations in a relational database specially designed for this purpose. Files: The files method consists of saving jobs and transformations as regular XML files in the filesystem, with extension KJB and KTR respectively. It’s not allowed to mix the two methods in the same project. That is, it makes no sense to mix jobs and transformations in a database repository with jobs and transformations stored in files. Therefore, you must choose the method when you start the tool. 
By clicking on Cancel in the repository window, you are implicitly saying that you will work with the files method. Why did we choose not to work with repositories or, in other words, to work with the files method? Mainly for two reasons:
Working with files is more natural and practical for most users.
Working with a database repository requires at least some database knowledge, and access to a database engine from your computer. Although it would be an advantage for you to meet both preconditions, you may not meet them.
There is a third method called File repository, which is a mix of the two above: it’s a repository of jobs and transformations stored in the filesystem. Between the File repository and the files method, the latter is the more broadly used. Therefore, throughout this article we will use the files method.
Creating your first transformation
Until now, you’ve seen the very basic elements of Spoon. You must be waiting to do some interesting task beyond looking around. It’s time to create your first transformation.
Packt
19 Feb 2015
17 min read

Visualize This!

In this article, Michael Phillips, the author of the book TIBCO Spotfire: A Comprehensive Primer, discusses how human beings are fundamentally visual in the way they process information. The invention of writing was as much about visually representing our thoughts to others as it was about record keeping and accountancy. In the modern world, we are bombarded with formalized visual representations of information, from the ubiquitous opinion poll pie chart to clever and sophisticated infographics. The website http://data-art.net/resources/history_of_vis.php provides an informative and entertaining quick history of data visualization. If you want a truly breathtaking demonstration of the power of data visualization, seek out Hans Rosling's The best stats you've ever seen at http://ted.com. We will spend time getting to know some of Spotfire's data capabilities. It's important that you continue to think about data: how it's structured, how it's related, and where it comes from. Building good visualizations requires visual imagination, but it also requires data literacy. This article is all about getting you to think about the visualization of information and empowering you to use Spotfire to do so. Apart from learning the basic features and properties of the various Spotfire visualization types, there is much more to learn about the seamless interactivity that Spotfire allows you to build into your analyses. We will be taking a close look at 7 of the 16 visualization types provided by Spotfire; these 7 are the most commonly used.
We will cover the following topics: Displaying information quickly in tabular form Enriching your visualizations with color categorization Visualizing categorical information using bar charts Dividing a visualization across a trellis grid Key Spotfire concept—marking Visualizing trends using line charts Visualizing proportions using pie charts Visualizing relationships using scatter plots Visualizing hierarchical relationships using treemaps Key Spotfire concept—filters Enhancing tabular presentations using graphical tables Now let's have some fun! Displaying information quickly in tabular form While working through the data examples, we used the Spotfire Table visualization, but now we're going to take a closer look. People will nearly always want to see the "underlying data", the details behind any visualization you create. The Table visualization meets this need. It's very important not to confuse a table in the general data sense with the Spotfire Table visualization; the underlying data table remains immutable and complete in the background. The Table visualization is a highly manipulatable view of the underlying data table and should be treated as a visualization, not a data table. The data used here is BaseballPlayerData.xls There is always more than one way to do the same thing in Spotfire, and this is particularly true for the manipulation of visualizations. Let's start with some very quick manipulations: First, insert a table visualization by going to the Insert menu, selecting New Visualization, and then Table. To move a column, left-click on the column name, hold, and drag it. To sort by a column, left-click on the column name. To sort by more than one column, left-click on the first column name and then press Shift + left-click on the subsequent columns in order of sort precedence. To widen or narrow a column, hover the mouse over the right-hand edge of the column title until you see the cursor change to a two-way arrow, and then click and drag it. 
These and other properties of the Table visualization are also accessed via visualization properties. As you work through the various Spotfire visualizations, you'll notice that some types have more options than others, but there are common trends and an overall consistency in conventions. Visualization properties can be opened in a number of ways: By right-clicking on the visualization, a table in this case, and selecting Properties. By going to the Edit menu and selecting Visualization Properties. By clicking on the Visualization Properties icon, as shown in the following screenshot, in the icon tray below the main menu bar. It's beyond the scope of this book to explore every property and option. The context-sensitive help provided by Spotfire is excellent and explains all the options in glorious detail. I'd like to highlight four important properties of the Table visualization: The General property allows you to change the table visualization title, not the name of the underlying data table. It also allows you to hide the title altogether. The Data property allows you to switch the underlying data table, if you have more than one table loaded into your analysis. The Columns property allows you to hide columns and order the columns you do want to show. The Show/Hide Items property allows you to limit what is shown by a rule you define, such as top five hitters. After clicking on the Add button, you select the relevant column from a dropdown list, choose Rule type (Top), and finally, choose Value for the rule (5). The resulting visualization will only show the rows of data that meet the rule you defined. Enriching your visualizations with color categorization Color is a strong feature in Spotfire and an important visualization tool, often underestimated by report creators. 
It can be seen as merely a nice-to-have customization, but paying attention to color can be the difference between creating a stimulating and intuitive data visualization rather than an uninspiring and even confusing corporate report. Take some pride and care in the visual aesthetics of your analytics creations! Let's take a look at the color properties of the Table visualization. Open the Table visualization properties again, select Colors, and then Add the column Runs. Now, you can play with a color gradient, adding points by clicking on the Add Point button and customizing the colors. It's as easy as left-clicking on any color box and then selecting from a prebuilt palette or going into a full RGB selection dialog by choosing More Colors…. The result is a heatmap type effect for runs scored, with yellow representing low run totals, transitioning to green as the run total approaches the average value in the data, and becoming blue for the highest run totals. Visualizing categorical information using bar charts We saw how the Table visualization is perfect for showing and ordering detailed information. It's quite similar to a spreadsheet. The Bar Chart visualization is very good for visualizing categorical information, that is, where you have categories with supporting hard numbers—sales by region, for example. The region is the category, whereas the sales is the hard number or fact. Bar charts are typically used to show a distribution. Depending on your data or your analytic requirement, the bars can be ordered by value, placed side by side, stacked on top of each other, or arranged vertically or horizontally. There is a special case of the category and value combination and that is where you want to plot the frequencies of a set of numerical values. This type of bar chart is referred to as a histogram, and although it is number against number, it is still, in essence, a distribution plot. 
It is very common in fact to transform the continuous number range in such cases into a set of discrete bins or categories for the plot. For example, you could take some demographic data and plot age as the category and the number of people at that age as the value (the frequency) on a bar chart. The result, for a general population, would approach a bell-shaped curve. Let's create a bar chart using the baseball data. The data we will use is BaseballPlayerData.xls, which you can download from http://www.insidespotfire.com. Create a new page by right-clicking on any page tab and selecting New Page. You can also select New Page from the Insert menu or click on the new page icon in the icon bar below the main menu. Create a Bar Chart visualization by left-clicking on the bar chart icon or by selecting New Visualization and then Bar Chart from the Insert menu. Spotfire will automatically create a default chart, that is, rarely exactly what you want, so the next step is to configure the chart. Two distributions might be interesting to look at: the distribution of home runs across all the teams and the distribution of player salaries across all the teams. The axes are easy to change; simply use the axes selectors.   If the bars are vertical, it means that the category—Team, in our case—should be on the horizontal axis, with the value—Home Runs or Salary—on the vertical axis, representing the height of the bars.   We're going to pick Home Runs from the vertical axis selector and then an appropriate aggregation dropdown, which is highlighted in red in the screenshot. Sum would be a valid option, but let's go with Avg (Average). Similarly, select Team from the horizontal axis dropdown selector. The vertical, or value, axis must be an aggregation because there is more than one home run value for each category. You must decide if you want a sum, an average, a minimum, and so on. You can modify the visualization properties just as you did for the Table visualization. 
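The need for an aggregation on the value axis can be illustrated outside Spotfire with a small R sketch. The data frame `players` and its column names are hypothetical; the real BaseballPlayerData.xls columns may be named differently.

```r
# Hypothetical player-level rows: several home run values per team.
players <- data.frame(
  Team     = c("A", "A", "B", "B", "B"),
  HomeRuns = c(10, 20, 5, 15, 28),
  stringsAsFactors = FALSE
)

# One bar per team needs one value per team, so the rows must be aggregated.
avg_hr <- aggregate(HomeRuns ~ Team, data = players, FUN = mean)  # like Avg
sum_hr <- aggregate(HomeRuns ~ Team, data = players, FUN = sum)   # like Sum

print(avg_hr)
print(sum_hr)
```

The two results rank the teams by different quantities, which is why the choice between a sum and an average has to be made deliberately.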
Some of the options are the same; some are specific to the bar chart. We're going to select the Sort bars by value option in the Appearance property. This will order the bars in descending order of value. We're also going to check the option Vertically under Scale labels | Show labels for the Category Axis property. There are two more actions to perform: create an identical bar chart except with average salary as the value axis, and give each bar chart an appropriate title (Visualization Properties|General|Title:). To copy an existing visualization, simply right-click on it and select Duplicate Visualization. We can now compare the distribution of home run average and salary average across all the baseball teams, but there's a better way to do this in a single visualization using color. Close the salary distribution bar chart by left-clicking on X in the upper right-hand corner of the visualization (X appears when you hover the mouse) or right-clicking on the visualization and selecting Close. Now, open the home run bar chart visualization properties, go to the Colors property, and color by Avg(Salary). Select a Gradient color mode, and add a median point by clicking on the Add Point button and selecting Median from the dropdown list of options on the added point. Finally, choose a suitable heat map range of colors; something like blue (min) through pale yellow (median) through red (max). You will still see the distribution of home runs across the baseball teams, but now you will have a superimposed salary heat map. Texas and Cleveland appear to be getting much more bang for their buck than the NY Yankees. Dividing a visualization across a trellis grid Trellising, whereby you divide a series of visualizations into individual panels, is a useful technique when you want to subdivide your analysis. In the example we've been working with, we might, for instance, want to split the visualization by league. 
Open the visualization properties for the home runs distribution bar chart colored by salary and select the Trellis property. Go to Panels and split by League (use the dropdown column selector). Spotfire allows you to build layers of information with even basic visualizations such as the bar chart. In one chart, we see the home run distribution by team, salary distribution by team, and breakdown by league. Key Spotfire concept – marking It's time to introduce one of the most important Spotfire concepts, called marking, which is central to the interactivity that makes Spotfire such a powerful analysis tool. Marking refers to the action of selecting data in a visualization. Every element you see is selectable, or markable, that is, a single row or multiple rows in a table, a single bar or multiple bars in a bar chart. You need to understand two aspects to marking. First, there is the visual effect, or color(s) you see, when you mark (select) visualization elements. Second, there is the behavior that follows marking: what happens to data and the display of data when you mark something. How to change the marking color From Spotfire v5.5 onward, you can choose, on a visualization-by-visualization basis, two distinct visual effects for marking: Use a separate color for marked items: all marked items are uniformly colored with the marking color, and all unmarked items retain their existing color. Keep existing color attributes and fade out unmarked items: all marked items keep their existing color, and all unmarked items also keep their existing color but with a high degree of color fade applied, leaving the marked items strongly highlighted. The second option is not available in versions older than v5.5 but is the default option in Versions 5.5 onward. The setting is made in the visualization's Appearance property by checking or unchecking the option Use separate color for marked items. 
The default color when using a separate color for marked items is dark green, but this can be changed by going to Edit|Document Properties|Markings|Edit. The new option has the advantage of retaining any underlying coloring you defined, but you might not like how the rest of the chart is washed out. Which approach you choose depends on what information you think is critical for your particular situation. When you create a new analysis, a default marking is created and applied to every visualization you create. You can change the color of the marking in Document Properties, which is found in the Edit menu. Just open Document Properties, click on the Markings tab, select the marking, click on the Edit button, and change the color. You can also create as many markings as you need, giving them convenient names for reference purposes, but we'll just focus on using one for now.
How to set the marking behavior of a visualization
Marking behavior depends fundamentally on data relationships. The data within a single data table is intrinsically related; the data in separate data tables must be explicitly related before you configure marking behavior for visualizations based on separate datasets. When you mark something in a visualization, five things can happen depending on the data involved and how you configured your visualizations:

Conditions: Two visualizations with the same underlying data table (they can be on different pages in the analysis file) and the same marking scheme applied.
Behavior: Marking data on one visualization will automatically mark the same data on the other.

Conditions: Two visualizations with related underlying data tables and the same marking scheme applied.
Behavior: The same as the previous condition's behavior, but subject to differences in data granularity. For example, marking a baseball team in one visualization will mark all the team's players in another visualization that is based on a more detailed table related by team.

Conditions: Two visualizations with the same or related data tables where one has been configured with data dependency on the marking in the other.
Behavior: Nothing will display in the marking-dependent visualization other than what is marked in the reference visualization.

Conditions: Visualizations with unrelated underlying data tables.
Behavior: No marking interaction will occur, and the visualizations will mark completely independently of one another.

Conditions: Two visualizations with the same underlying data table or related data tables and with different marking schemes applied.
Behavior: Marking data on one visualization will not show on the other because the marking schemes are different.

Here's how we set these behaviors: Open the visualization properties of the bar chart we have been working with and navigate to the Data property. You'll notice that two settings refer to marking: Marking and Limit data using markings. Use the dropdown under Marking to select the marking to be used for the visualization. Having no marking is an option. Visualizations with the same marking will display synchronous selection, subject to the data relation conditions described earlier. The options under Limit data using markings determine how the visualization will be limited by marking elsewhere in the analysis. The default here is no dependency. If you select a marking, then the visualization will only display data selected elsewhere with that marking. Avoid using the same marking for both Marking and Limit data using markings. If you are using the limit data setting, select no marking, or create a second marking and select it under Marking. You're possibly a bit confused by now. Fortunately, marking is much harder to describe than to use! Let's build a tangible example. We'll start a new analysis, so close any analysis you have open and create a new one, loading the player-level baseball data (BaseballPlayerData.xls). Add two bar charts and a table.
You can rearrange the layout by left-clicking on the title bar of a visualization, holding, and dragging it. Position the visualizations any way you wish; for example, you can place the two bar charts side by side, with the table below them spanning both. Save your analysis file at this point and at regular intervals. It's good practice to save regularly as you build an analysis. It will save you a lot of grief if your PC fails in any way. There is no autosave function in Spotfire.

For the first bar chart, set the following visualization properties:
Property: Value
General | Title: Home Runs
Data | Marking: Marking
Data | Limit data using markings: Nothing checked
Appearance | Orientation: Vertical bars
Appearance | Sort bars by value: Check
Category Axis | Columns: Team
Value Axis | Columns: Avg(Home Runs)
Colors | Columns: Avg(Salary)
Colors | Color mode: Gradient; Add Point for median; Max = strong red; Median = pale yellow; Min = strong blue
Labels | Show labels for: Marked Rows
Labels | Types of labels | Complete bar: Check

For the second bar chart, set the following visualization properties:
Property: Value
General | Title: Roster
Data | Marking: Marking
Data | Limit data using markings: Nothing checked
Appearance | Orientation: Horizontal bars
Appearance | Sort bars by value: Check
Category Axis | Columns: Team
Value Axis | Columns: Count(Player Name)
Colors | Columns: Position
Colors | Color mode: Categorical

For the table, set the following visualization properties:
Property: Value
General | Title: Details
Data | Marking: (None)
Data | Limit data using markings: Check Marking
Columns: Team, Player Name, Games Played, Home Runs, Salary, Position

Now start selecting visualization elements with your mouse. You can click on elements such as bars or segments of bars, or you can click and drag a rectangular block around multiple elements.
When you select a bar on the Home Runs bar chart, the corresponding team bar is automatically selected on the Roster bar chart, and details for all the players in that team display in the Details table. When you select a bar segment on the Roster bar chart, the corresponding team bar is automatically selected on the Home Runs bar chart, and only players in the selected position for the selected team appear in the details. There are some very useful additional functions associated with marking, and you can access these by right-clicking on a marked item. They are Unmark, Invert, Delete, Filter To, and Filter Out. You can also unmark by left-clicking on any blank space in the visualization. Play with this analysis file until you are comfortable with the marking concept and functionality.
Summary
This article is a small taste of the book TIBCO Spotfire: A Comprehensive Primer. You've seen how the Table visualization is an easy and traditional way to display detailed information in tabular form and how the Bar Chart visualization is excellent for visualizing categorical information, such as distributions. You've learned how to enrich visualizations with color categorization and how to divide a visualization across a trellis grid. You've also been introduced to the key Spotfire concept of marking. Apart from gaining a functional understanding of these Spotfire concepts and techniques, you should have gained some insight into the science and art of data visualization.
Packt
30 Jul 2013
6 min read

First steps with R

Obtaining and installing R
You can obtain R by downloading it from CRAN, which is reachable from the R website (http://www.r-project.org/). The Comprehensive R Archive Network (CRAN) is a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. On the R website, you can also find information about R, some technical manuals, the R Journal, and details about the packages developed for R and stored in the CRAN repositories. The functionality of the R environment can be expanded with software libraries, which can be installed and loaded when needed. These libraries, or packages, are collections of source code and other additional files that, once installed in R, can be loaded into the workspace via a call to the library() function. For example, the following code loads the package lattice:
> library(lattice)
An R installation contains one or more libraries of packages. Some of these packages are part of the basic installation and are loaded automatically as soon as the session is started. Others can be installed from CRAN, the official R repository, or downloaded and installed manually.
Interacting with the console
As soon as you start R, you will see that a workspace is open; you can see a screenshot of the R Console window in the image below. The workspace is the environment in which you are working, where you will load your data and create your variables. The screen prompt > is the R prompt that waits for commands. At the prompt, you can type any function or command, or you can use R to perform basic calculations. R uses the usual symbols for addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (^). Parentheses ( ) can be used to specify the order of operations.
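As a small sketch of these operators, the following lines can be typed at the prompt (the variable names are only illustrative):

```r
# Basic arithmetic with the usual operators.
total      <- 2 + 3       # addition
difference <- 7 - 4       # subtraction
product    <- 6 * 5       # multiplication
quotient   <- 9 / 2       # division
power      <- 2 ^ 3       # exponentiation

# Parentheses change the order of operations.
with_parens    <- (2 + 3) * 4   # 20
without_parens <- 2 + 3 * 4     # 14

print(c(with_parens, without_parens))
```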
R also provides %% for taking the modulus and %/% for integer division. Comments in R start with the character #, so everything after that character up to the end of the line is ignored by R. R has a number of built-in functions, for example, sin(x), cos(x), and tan(x) (all in radians), exp(x), log(x), and sqrt(x). Some special constants such as pi are also predefined. You can see an example of the use of such a function in the following code:
> exp(2.5)
[1] 12.18249
Understanding R objects
In every computer language, variables provide a means of accessing the data stored in memory. R does not provide direct access to the computer’s memory but rather provides a number of specialized data structures called objects. These objects are referred to through symbols or variables.
Vectors
The basic object in R is the vector; even scalars are vectors of length one. Vectors can be thought of as a series of data of the same class. There are six basic vector types (called atomic vectors): logical, integer, real (double), complex, string (or character), and raw. Integer and real represent numeric objects; logicals are Boolean data with the possible values TRUE or FALSE. Among these atomic vectors, the most common ones are logical, string, and numeric (integer and real). There are several ways to create vectors. For instance, the operator : (colon) is a sequence-generating operator; it creates sequences by incrementing or decrementing by one.
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> 5:-6
 [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5 -6
If the interval between the numbers is not one, you can use the seq() function. Here is an example:
> seq(from=2, to=2.5, by=0.1)
[1] 2.0 2.1 2.2 2.3 2.4 2.5
One of the more important features of R is the possibility of passing entire vectors as arguments to functions, thus avoiding explicit loops.
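To make the point about avoiding loops concrete, here is a minimal sketch that computes the same square roots twice, once with an explicit loop and once with a single vectorized call (the variable names are illustrative):

```r
x <- c(1, 4, 9, 16)

# Loop version: process one element at a time.
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

# Vectorized version: pass the whole vector to the function.
roots_vec <- sqrt(x)

# Both approaches give the same result.
print(identical(roots_loop, roots_vec))
```

The vectorized form is shorter and is the idiomatic way to write this kind of computation in R.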
Most functions in R accept a vector as an argument; some examples follow:

> x <- c(12,10,4,6,9)
> max(x)
[1] 12
> min(x)
[1] 4
> mean(x)
[1] 8.2

Matrices and arrays

In R, matrix notation extends to elements of any kind; for example, it is possible to have a matrix of character strings. Matrices and arrays are basically vectors with a dimension attribute. The matrix() function creates matrices. By default, it fills the matrix by column; alternatively, you can ask it to fill the matrix by row:

> matrix(1:9,nrow=3,byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Lists

A list in R is a collection of different objects. One of the main advantages of lists is that the objects contained within a list may be of different types, for example, numeric and character values. To define a list, you simply provide the objects that you want to include as arguments of the list() function.

Data frame

A data frame corresponds to a data set; it is basically a special list in which all the elements have the same length. Elements may be of different types in different columns, but within the same column all the elements are of the same type. You can easily create data frames using the data.frame() function, and a specific column can be recalled using the $ operator.

Top features you'll want to know about

In addition to basic object creation and manipulation, many more complex tasks can be performed with R, ranging from data manipulation and programming to statistical analysis and the production of very high quality graphs. Some of the most useful features are: data input and output; flow control (for, if…else, while); creating your own functions; debugging functions and handling exceptions; and plotting data.

Summary

In this article we saw what R is, how to obtain and install it, and how to interact with the console.
We also looked at a few R objects and at the top features you will want to know about.
Packt
21 Oct 2009
6 min read

Setting up the most Popular Journal Articles in your Personalized Community in Liferay Portal

Personal community is a dynamic feature of Liferay Portal. By default, the personal community is a portal-wide setting that affects all users. It would be nice to have more features in the personal community, such as showing the most popular journal articles. This article by Jonas Yuan explains how to set up the most popular journal articles in your personalized community and how to view the counter for other assets.

In a website, we will have a lot of journal articles (that is, web content) for a given article type. For example, for the article type Article Content, we will have articles talking about a product family. We may want to know how many times end users read each article. It would also be nice to show the most popular articles (for example, the TOP 10 articles) for this article type.

As shown in the following screenshot, a journal article My EDI Product I is shown via the portlet Ext Web Content Display. Ratings and comments on this article are also shown, along with the medium-size image, polls, and related content of the article. A view counter for the article is displayed under the ratings. Moreover, the most popular articles are listed with article title and number of views under the related content. All these articles belong to the article type article-content. That is, the article in the current portlet Ext Web Content Display shows the most popular articles only for the article type article-content.

Of course, you can customize the portlet Web Content Display directly by changing its JSP files. For demo purposes, we will implement the view counter in the portlet Ext Web Content Display. We will then implement the most popular articles via VM services and article templates. In addition, we will analyze the view counter for other assets such as Image Gallery images, Document Library documents, Wiki articles, Blog entries, Message Boards threads, and so on.
Adding a view counter in the Web Content Display portlet

First of all, let's add a view counter in the Ext Web Content Display portlet. As the view counter function for assets (including journal articles) is provided in the model TagsAssetModel of the com.liferay.portlet.tags.model package in the /portal/portal-service/src folder, we can use this feature in the portlet directly. To do so, use the following steps:

Create a folder journal_content in the folder /ext/ext-web/docroot/html/portlet/.

Copy the JSP file view.jsp in the folder /portal/portal-web/docroot/html/portlet/ to the folder /ext/ext-web/docroot/html/portlet/journal_content and open it.

Add the line <%@ page import="com.liferay.portlet.tags.model.TagsAsset" %> after the line <%@ include file="/html/portlet/journal_content/init.jsp" %>, and check the following lines:

    JournalArticleDisplay articleDisplay = (JournalArticleDisplay)
        request.getAttribute(WebKeys.JOURNAL_ARTICLE_DISPLAY);
    if (articleDisplay != null) {
        TagsAssetLocalServiceUtil.incrementViewCounter(
            JournalArticle.class.getName(),
            articleDisplay.getResourcePrimKey());
    }

Then add the following lines after the line <c:if test="<%=enableComments %>"> and save it:

    <span class="view-count">
    <%
    TagsAsset asset = TagsAssetLocalServiceUtil.getAsset(
        JournalArticle.class.getName(),
        articleDisplay.getResourcePrimKey());
    %>
    <c:choose>
        <c:when test="<%= asset.getViewCount() == 1 %>">
            <%= asset.getViewCount() %> <liferay-ui:message key="view" />,
        </c:when>
        <c:when test="<%= asset.getViewCount() > 1 %>">
            <%= asset.getViewCount() %> <liferay-ui:message key="views" />,
        </c:when>
    </c:choose>
    </span>

The code above shows a way to increase the view counter via the TagsAssetLocalServiceUtil.incrementViewCounter method. This method takes two parameters, className and classPK, as inputs. For the current journal article, the two parameters are JournalArticle.class.getName() and articleDisplay.getResourcePrimKey().
The code then displays the view count through the TagsAssetLocalServiceUtil.getAsset method. Similarly, this method takes two parameters, className and classPK, as inputs. This approach is useful for other assets too, as the className parameter could refer to Image Gallery, Document Library, Wiki, Blogs, Message Boards, Bookmark, and so on.

Setting up VM service

We can set up the VM service to show the most popular articles by adding a getMostPopularArticles method to the custom velocity tool ExtVelocityToolUtil. To do so, first add the following method in the ExtVelocityToolService interface:

    public List<TagsAsset> getMostPopularArticles(String companyId,
        String groupId, String type, int limit);

Then add an implementation of the getMostPopularArticles method in the ExtVelocityToolServiceImpl class as follows:

    public List<TagsAsset> getMostPopularArticles(String companyId,
        String groupId, String type, int limit) {

        List<TagsAsset> results =
            Collections.synchronizedList(new ArrayList<TagsAsset>());

        // Subquery: primary keys of journal articles that share the company
        // and group of the outer tagsasset row. Note that the article type is
        // hard-coded here; use eq(type) to honor the method argument.
        DynamicQuery dq0 = DynamicQueryFactoryUtil.forClass(
                JournalArticle.class, "journalarticle")
            .setProjection(ProjectionFactoryUtil.property("resourcePrimKey"))
            .add(PropertyFactoryUtil.forName("journalarticle.companyId")
                .eqProperty("tagsasset.companyId"))
            .add(PropertyFactoryUtil.forName("journalarticle.groupId")
                .eqProperty("tagsasset.groupId"))
            .add(PropertyFactoryUtil.forName("journalarticle.type")
                .eq("article-content"));

        // Outer query: assets whose classPK is in the subquery,
        // most viewed first.
        DynamicQuery query = DynamicQueryFactoryUtil.forClass(
                TagsAsset.class, "tagsasset")
            .add(PropertyFactoryUtil.forName("tagsasset.classPK").in(dq0))
            .addOrder(OrderFactoryUtil.desc("tagsasset.viewCount"));

        try {
            List<Object> assets =
                TagsAssetLocalServiceUtil.dynamicQuery(query);
            int index = 0;
            for (Object obj : assets) {
                TagsAsset asset = (TagsAsset) obj;
                results.add(asset);
                index++;
                // Stop once the requested number of articles is collected.
                if (index == limit) break;
            }
        } catch (Exception e) {
            // On failure, return whatever has been collected so far.
            return results;
        }
        return results;
    }

The preceding code shows a way to get the most popular articles by company ID, group ID, and article type, limited to a maximum number of articles. The DynamicQuery API allows us to leverage the existing mapping definitions through access to the Hibernate session. For example, DynamicQuery dq0 selects the journal articles by companyId, groupId, and type; DynamicQuery query selects the tagsassets whose classPK exists in DynamicQuery dq0; and the tagsassets are ordered by viewCount in descending order.

Finally, add the following method in ExtVelocityToolUtil to register the above method:

    public List<TagsAsset> getMostPopularArticles(String companyId,
        String groupId, String type, int limit) {
        return _extVelocityToolService.getMostPopularArticles(
            companyId, groupId, type, limit);
    }

The code above shows a generic approach to get the TOP 10 articles for any article type. Of course, you can extend this approach to find the TOP 10 assets of other kinds. These can include Image Gallery images, Document Library documents, Wiki articles, Blog entries, Message Boards threads, Bookmark entries, slideshows, videos, games, video queues, video lists, playlists, and so on. You may want to practice building such a TOP 10 assets feature.

Building article template for the most popular journal articles

We have added a view counter on journal articles. We have also built a VM service for the most popular articles. Now let's build an article template for them.

Setting up the default article type

As mentioned earlier, there is a set of journal article types, for example, announcements, blogs, general, news, press-release, updates, article-tout, article-content, and so on. In real cases, only some of these types will require a view counter, for example, article-content. Let's configure the default article type for the most popular articles.
We can add the following line at the end of portal-ext.properties:

    ext.most_popular_articles.article_type=article-content

The line above sets the default article type for most_popular_articles to article-content.
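The ranking that getMostPopularArticles asks of the database (filter by article type, order by view count in descending order, stop at a limit) can be tried out without a Liferay installation. The following is a minimal plain-Java sketch of that logic only; ArticleStat and MostPopularSketch are hypothetical names invented for this illustration, the sample titles and counts are made up, and nothing here calls the Liferay API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Plain-Java stand-in for the ranking done by getMostPopularArticles.
// ArticleStat and MostPopularSketch are hypothetical, not Liferay classes.
final class MostPopularSketch {

    // Minimal value holder: title, article type, and number of views.
    static final class ArticleStat {
        final String title;
        final String type;
        final int viewCount;

        ArticleStat(String title, String type, int viewCount) {
            this.title = title;
            this.type = type;
            this.viewCount = viewCount;
        }
    }

    // Keeps the articles of the given type, orders them by view count
    // (highest first), and returns at most `limit` of them.
    static List<ArticleStat> mostPopular(List<ArticleStat> all,
                                         String type, int limit) {
        List<ArticleStat> matching = new ArrayList<ArticleStat>();
        for (ArticleStat article : all) {
            if (article.type.equals(type)) {
                matching.add(article);
            }
        }
        Collections.sort(matching, new Comparator<ArticleStat>() {
            public int compare(ArticleStat a, ArticleStat b) {
                return Integer.compare(b.viewCount, a.viewCount);
            }
        });
        int end = Math.max(0, Math.min(limit, matching.size()));
        return new ArrayList<ArticleStat>(matching.subList(0, end));
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        List<ArticleStat> all = Arrays.asList(
            new ArticleStat("My EDI Product I", "article-content", 12),
            new ArticleStat("Press note", "press-release", 40),
            new ArticleStat("My EDI Product II", "article-content", 31));

        for (ArticleStat a : mostPopular(all, "article-content", 10)) {
            System.out.println(a.title + ": " + a.viewCount + " views");
        }
    }
}
```

In the portal itself the filtering and ordering are pushed down to the database through DynamicQuery, which avoids loading every asset into memory; the in-memory version above is only meant to make the expected result easy to check.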
Packt
29 Mar 2010
7 min read

Configuring and Formatting iReport Elements

A complete report is structured by composing a set of sections called bands. Each band has its own configurable height and a particular position in the structure, and is used for a particular purpose. The available bands are: Title, Page Header, Column Header, Detail 1, Column Footer, Page Footer, Last Page Footer, and Summary. A report structured with bands is shown in the following screenshot:

Besides the mentioned bands, there are two special bands: Background and No Data.

Band: Description
Title: Is the first band of the report and is printed only once. Title can be shown on a new page. You can configure this from the report properties discussed in the previous section of this chapter. Just to review: go to report Properties | More... and check the Title on a new page checkbox.
Page Header: Is printed on each page of the report and is used for setting up the page header.
Column Header: Is printed on each page, if there is a detail band on that page. This band is used for the column heading.
Detail: Is printed repeatedly, once for each row in the data source. In the List of Products report, it is printed for each product record.
Column Footer: Is printed on each page, if there is a detail band on that page. This band is used for the column footer. If Float column footer in the report Properties is checked, then the column footer is shown just below the last data of the column; otherwise it is shown at the bottom of the page (above the page footer).
Page Footer: Is printed on each page except the last page, if Last Page Footer is set. If Last Page Footer is not set, then it is printed on the last page also. This band is a good place to insert page numbers.
Last Page Footer: Is printed only on the last page as a page footer.
Summary: Is printed only once at the end of the report. It can be printed on a separate page if it is configured from the report Properties. In the following chapters, we will produce some reports where you will learn about the suitability of this band.
Background: Is used for setting a page background. For example, we may want a watermark image for the report pages.
No Data: Is printed when no data is available for the report, if it is set as the When no data option in the report Properties.

Showing/hiding bands and inserting elements

Now, we are going to configure the report bands (setting height, visibility, and so on) and format the report elements.

Select Column Footer from the Report Inspector. You will see the Column Footer - Properties on the right of the designer.

Type 25 in the Band height field. Press Enter. Now you can see the Column Footer band in your report, which was invisible before you set the band height. A band becomes invisible in the report if its height is set to zero. We have already learned how to change the height of a band. We can also make a band invisible using the Print When Expression option. If we write new Boolean(false) in the Print When Expression of a band, the band becomes invisible, even though its height is greater than zero. If we write new Boolean(true), the band is visible. It is true by default.

Drag a Static Text element from the Palette window and drop it on the Column Footer band.

Double-click on Static Text and type End of Record, replacing the text Static Text.

Select the static text element (End of Record). Go to Format | Position and then choose Center. Now the element is positioned in the center of the Column Footer band.

In the same way, insert two Line elements. Place one element at the left and another at the right of the static text.

Select both the lines. Go to Format | Position, and then choose Center Vertically. The lines are now vertically centered in the Column Footer.

Select both the lines and go to Format | Size and then choose Same Width. Now both the lines are equal in width.

Select the static text element (End of Record) and the left line. Now go to Format | Position and choose Join Sides Right. This moves the line to the right, and it is now connected to the static text element.

Repeat the previous step for the right line and finally choose Join Sides Left. Now the line has moved to the left and is connected with the static text element.

In the same way, change the column headers as you want by double-clicking the labels on the Column Header band. Now, the columns may be Product Code, Name, and Description.

Now your report design should look like the following screenshot:

Preview the report, and you will see the lines and static text (End of Record) at the bottom of the column. By default, the Column Footer is placed at the bottom of the page. To show the Column Footer just below the table of data, the Float column footer option must be enabled from the report Properties window.

Sizing elements

We can increase or decrease the size of an element by dragging it with the mouse. Sometimes, we need to set the size of an element automatically based on the sizes of other elements. There are various options for setting the size of an element automatically. These options are available in the format menu (Format | Size).

Size Options: Description
Same Width: Makes the selected elements the same width. The width of the element that you select first is used as the new width of the selected elements.
Same Width (max): The width of the largest of the selected elements is set as the width of all the selected elements.
Same Width (min): The width of the smallest of the selected elements is set as the width of all the selected elements.
Same Height: Makes the selected elements the same height. The height of the element that you select first is used as the new height of the selected elements.
Same Height (max): The height of the largest of the selected elements is set as the height of all the selected elements.
Same Height (min): The height of the smallest of the selected elements is set as the height of all the selected elements.
Same Size: Both the width and the height of the selected elements become the same.

Position: Description
Center Horizontally (band/cell based): The selected element is placed in the center of the band horizontally.
Center Vertically (band/cell based): The selected element is placed in the center of the band vertically.
Center (in band/cell): The selected element is placed in the center of the band both horizontally and vertically.
Center (in background): If the Background band is visible and the element is on the Background band, then it is placed in the center both horizontally and vertically.
Join Left: Joins two elements. For joining, one element is moved to the left.
Join Right: Joins two elements. For joining, one element is moved to the right.
Align to Left Margin: The selected element is joined with the left margin of the report.
Align to Right Margin: The selected element is joined with the right margin of the report.
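The two visibility rules described in the Showing/hiding bands section (a band with zero height is invisible, and a Print When Expression that evaluates to false hides the band even when its height is greater than zero) can be summarized in a few lines of Java. This is only a sketch of the rule as stated in the text: BandVisibilitySketch and isVisible are hypothetical names, not part of iReport or the JasperReports API, and a null expression is used here to model the default of true.

```java
// Hypothetical model of the band visibility rule described in the text;
// it does not use or mirror the JasperReports API.
final class BandVisibilitySketch {

    // printWhen == null models an empty Print When Expression,
    // which the text says is treated as true by default.
    static boolean isVisible(int bandHeight, Boolean printWhen) {
        boolean expression = (printWhen == null) || printWhen.booleanValue();
        return bandHeight > 0 && expression;
    }

    public static void main(String[] args) {
        System.out.println(isVisible(0, null));           // zero height
        System.out.println(isVisible(25, null));          // default expression
        System.out.println(isVisible(25, Boolean.FALSE)); // hidden by expression
        System.out.println(isVisible(25, Boolean.TRUE));  // explicitly shown
    }
}
```

The sketch uses the constants Boolean.FALSE and Boolean.TRUE, which are the usual way to write these values in current Java; the text's new Boolean(false) and new Boolean(true) express the same values.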
Packt
17 Jul 2010
3 min read

Use of Stylesheets for Report Designing using BIRT

Stylesheets

BIRT, being a web-based reporting environment, takes a page from general web development toolkits by importing stylesheets. However, BIRT stylesheets function slightly differently from regular stylesheets in a web development environment. We are going to add on to the Customer Orders report we have been working with, and will create some styles that will be used in this report.

Open Customer Order.rptDesign.

Right-click on the getCustomerInformation dataset and choose Insert into Layout. Modify the table visually to look like the next figure.

Create a new dataset called getCustomerOrders using the following query: //insert code 1

Link the dataset parameter to rprmCustomerID.

Save the dataset, right-click on it, and select Insert to layout.

Select the first ORDERNUMBER column. Under the Property Editor, select Advanced.

In the Property Editor, go to the Suppress duplicates option, and change it to true. This will prevent the OrderNumber data item from repeating the value it displays down the page.

In the Outline, right-click on Styles and choose New Style….

In the Pre-Defined Style drop down, choose table-header. A predefined style corresponds to an element type that is already defined in the BIRT report. Selecting a predefined style affects every element of that type within a report. In this case, for every table in the report, the table header will have this style applied.

Under the Font section, apply the following settings: Font: Sans-Serif; Font Color: White; Size: Large; Weight: Bold.

Under the Background section, set the Background Color to Black.

Click OK.

Now, when we run the report, we can see that the header line is formatted with a black background and white font.

Custom stylesheets

In the example we just saw, we didn't have to apply this style to any element; it was automatically applied to the header of the order details table as it was using a predefined style.
This would be the case for any table that has its header row populated, and the same applies to any of the predefined styles in BIRT. So next, let's look at a custom-defined style and apply it to our customer information table.

Right-click on the Styles section under the Outline tab and create a new style.

Under the Custom Style textbox, enter CustomerHeaderInfo.

Under the Font section, enter the following information: Font: Sans Serif; Color: White; Size: Large; Weight: Bold.

Under the Background section, set the Background Color to Gray.

Under the Box section, enter 1 points for all sections.

Under the Border section, enter the following information: Style (All): Solid; Color (All): White; Width (All): Thin.

Click OK and then click Save.

Select the table which contains the customer information. Select the first column.

Under the Property Editor, in the list box for the Styles, select CustomerHeaderInfo. The preview report will look like the following screenshot:

Right-click on the Styles section, and create a new custom style called CustomerHeaderData.

Under Box, put in 1 points for all fields.

Under Border, enter the following information: Style – Top: Solid; Style – Bottom: Solid; Color (All): Gray.

Click OK.

Select the Customer Information table. Select the second column.

Right-click on the column selector and select Style | Apply Style | CustomerHeaderData.

The finished report should look something like the next screenshot:
Packt
14 Jul 2011
13 min read

Integrating Kettle and the Pentaho Suite

Pentaho Data Integration 4 Cookbook: Over 70 recipes to solve ETL problems using Pentaho Kettle

Introduction

Kettle, also known as PDI, is mostly used as a stand-alone application. However, it is not an isolated tool, but part of the Pentaho Business Intelligence Suite. As such, it can also interact with other components of the suite; for example, as the data source for a report, or as part of a bigger process. This chapter shows you how to run Kettle jobs and transformations in that context.

The article assumes a basic knowledge of the Pentaho BI platform and the tools that make up the Pentaho Suite. If you are not familiar with these tools, it is recommended that you visit the wiki page (wiki.pentaho.com) or the Pentaho BI Suite Community Edition (CE) site: http://community.pentaho.com/. As another option, you can get the Pentaho Solutions book (Wiley) by Roland Bouman and Jos van Dongen, which gives you a good introduction to the whole suite.

A sample transformation

The different recipes in this article show you how to run Kettle transformations and jobs integrated with several components of the Pentaho BI suite. In order to focus on the integration itself rather than on Kettle development, we have created a sample transformation named weather.ktr that will be used throughout the different recipes.

The transformation receives the name of a city as the first parameter from the command line, for example Madrid, Spain. Then, it consumes a web service to get the current weather conditions and the forecast for the next five days for that city. The transformation has a couple of named parameters:

The following diagram shows what the transformation looks like:

It receives the command-line argument and the named parameters, calls the service, and retrieves the information in the desired scales for temperature and wind speed. You can download the transformation from the book's site and test it.
Do a preview on the next_days, current_conditions, and current_conditions_normalized steps to see what the results look like.

The following is a sample preview of the next_days step:

The following is a sample preview of the current_conditions step:

Finally, the following screenshot shows you a sample preview of the current_conditions_normalized step:

There is also another transformation named weather_np.ktr. This transformation does exactly the same, but it reads the city as a named parameter instead of reading it from the command line. The Getting ready sections of each recipe will tell you which of these transformations will be used.

Avoiding consuming the web service

It may happen that you do not want to consume the web service (for example, because of delays), or that you cannot do it (for example, if you do not have Internet access). Besides, if you call a free web service like this too often, your IP might be banned from the service. Don't worry. Along with the sample transformations on the book's site, you will find another version of the transformations that, instead of using the web service, reads sample fictional data from a file containing the forecast for over 250 cities. The transformations are weather (file version).ktr and weather_np (file version).ktr. Feel free to use these transformations instead. You should not have any trouble, as the parameters and the metadata of the data retrieved are exactly the same as in the transformations explained earlier.

If you use transformations that do not call the web service, remember that they rely on the file with the fictional data (weatheroffline.txt). Wherever you copy the transformations, do not forget to copy that file as well.

Creating a Pentaho report with data coming from PDI

The Pentaho Reporting Engine allows designing, creating, and distributing reports in various popular formats (HTML, PDF, and so on) from different kinds of sources (JDBC, OLAP, XML, and so on).
There are occasions where you need other kinds of sources such as text files or Excel files, or situations where you must process the information before using it in a report. In those cases, you can use the output of a Kettle transformation as the source of your report. This recipe shows you this capability of the Pentaho Reporting Engine.

For this recipe, you will develop a very simple report: the report will ask for a city and a temperature scale and will report the current conditions in that city. The temperature will be expressed in the selected scale.

Getting ready

A basic understanding of the Pentaho Report Designer tool is required in order to follow this recipe. You should be able to create a report, add parameters, build a simple report, and preview the final result.

Regarding the software, you will need the Pentaho Report Designer. You can download the latest version from the following URL: http://sourceforge.net/projects/pentaho/files/Report%20Designer/

You will also need the sample transformation weather.ktr. The sample transformation has a couple of UDJE steps. These steps rely on the Janino library. In order to be able to run the transformation from Report Designer, you will have to copy the janino.jar file from the Kettle libext directory into the Report Designer lib directory.

How to do it...

In the first part of the recipe, you will create the report and define the parameters for the report: the city and the temperature scale.

Launch Pentaho Report Designer and create a new blank report.

Add two mandatory parameters: a parameter named city_param, with Lisbon, Portugal as Default Value, and a parameter named scale_param, which accepts two possible values: C meaning Celsius or F meaning Fahrenheit.

Now, you will define the data source for the report:

In the Data menu, select Add Data Source and then Pentaho Data Integration.

Click on the Add a new query button. A new query named Query 1 will be added.
Give the query a proper name, for example, forecast.

Click on the Browse button. Browse to the sample transformation and select it. The Steps listbox will be populated with the names of the steps in the transformation.

Select the step current_conditions. So far, you have the following:

The specification of the transformation file name with the complete path will work only inside Report Designer. Before publishing the report, you should edit the file name (C:\Pentaho\reporting\weather.ktr in the preceding example) and leave just a path relative to the directory where the report is to be published (for example, reports\weather.ktr).

Click on Preview; you will see an empty resultset. The important thing here is that the headers should be the same as the output fields of the current_conditions step: city, observation_time, weatherDesc, and so on.

Now, close that window and click on Edit Parameters.

You will see two grids: Transformation Parameter and Transformation Arguments. Fill in the grids as shown in the following screenshot. You can type the values or select them from the available drop-down lists:

Close the Pentaho Data Integration Data Source window. You should have the following:

The data coming from Kettle is ready to be used in your report.

Build the report layout: drag and drop some fields into the canvas and arrange them as you please. Provide a title as well. The following screenshot is a sample report you can design:

Now, you can do a Print Preview. The sample report above will look like the one shown in the following screenshot:

Note that the output of the current_conditions step has just one row. If you choose the next_days or the current_conditions_normalized step as the data source instead, then the result will have several rows. In that case, you could design a report by columns: one column for each field.

How it works...
Using the output of a Kettle transformation as the data source of a report is very useful because you can take advantage of all the functionality of the PDI tool. For instance, in this case you built a report based on the result of consuming a web service. You could not have done this with Pentaho Report Designer alone.

In order to use the output of your Kettle transformation, you just added a Pentaho Data Integration data source. You selected the transformation to run and the step that would deliver your data.

In order to be executed, your transformation needs a command-line parameter: the name of the city. The transformation also defines two named parameters: the temperature scale and the wind scale. From the Pentaho Report Designer you provided both a value for the city and a value for the temperature scale. You did it by filling in the Edit Parameter setting window inside the Pentaho Data Integration Data Source window. Note that you did not supply a value for the SPEED named parameter, but that is not necessary because Kettle uses the default value.

As you can see in the recipe, the data source created by the report engine has the same structure as the data coming from the selected step: the same fields with the same names, same data types, and in the same order. Once you configured this data source, you were able to design your report as you would have done with any other kind of data source.

Finally, when you are done and want to publish your report on the server, do not forget to fix the path as explained in the recipe: the File should be specified with a path relative to the solution folder. For example, suppose that your report will be published in my_solution/reports, and you put the transformation file in my_solution/reports/resources. In that case, for File, you should type resources/ plus the name of the transformation.

There's more...

Pentaho Reporting is a suite of Java projects built for report generation.
The suite is made up of the Pentaho Reporting Engine and a set of tools such as the Report Designer (the tool used in this recipe), Report Design Wizard, and Pentaho's web-based Ad Hoc Reporting user interface.

In order to be able to run transformations, the Pentaho Reporting software includes the Kettle libraries. To avoid any inconvenience, be sure that the versions of the libraries included are the same as or newer than the version of Kettle you are using. For instance, Pentaho Reporting 3.8 includes Kettle 4.1.2 libraries. If you are using a different version of Pentaho Reporting, then you can verify the Kettle version by looking in the lib folder inside the reporting installation folder. You should look for files named kettle-core-<version>.jar, kettle-db-<version>.jar, and kettle-engine-<version>.jar.

Besides, if the transformations you want to use as data sources rely on external libraries, then you have to copy the proper jar files from the Kettle libext directory into the Report Designer lib folder, just as you did with the janino.jar file in the recipe.

For more information about Pentaho Reporting, just visit the following wiki website: http://wiki.pentaho.com/display/Reporting/Pentaho+Reporting+Community+Documentation

Alternatively, you can get the book Pentaho Reporting 3.5 for Java Developers (Packt Publishing) by Will Gorman.

Configuring the Pentaho BI Server for running PDI jobs and transformations

The Pentaho BI Server is a collection of software components that provide the architecture and infrastructure required to build business intelligence solutions. With the Pentaho BI Server, you are able to run reports, visualize dashboards, schedule tasks, and more. Among these tasks, there is the ability to run Kettle jobs and transformations. This recipe shows you the minor changes you might have to make in order to be able to run Kettle jobs and transformations.
Getting ready

In order to follow this recipe, you will need some experience with the Pentaho BI Server. For configuring the Pentaho BI Server, you obviously need the software. You can download the latest version of the Pentaho BI Server from the following URL: http://sourceforge.net/projects/pentaho/files/Business%20Intelligence%20Server/

Make sure you download the distribution that matches your platform. If you intend to run jobs and transformations from a Kettle repository, then make sure you have the name of the repository and proper credentials (user and password).

How to do it...

Carry out the following steps:

1. If you intend to run a transformation or a job from a file, skip to the How it works section.
2. Edit the settings.xml file located in the biserver-ce/pentaho-solutions/system/kettle folder inside the Pentaho BI Server installation folder.
3. In the repository.type tag, replace the default value files with rdbms. Provide the name of your Kettle repository and the user and password, as shown in the following example:

   <kettle-repository>
     <!-- The values within <properties> are passed directly to the Kettle Pentaho components. -->
     <!-- This is the location of the Kettle repositories.xml file, leave empty if the default is used: $HOME/.kettle/repositories.xml -->
     <repositories.xml.file></repositories.xml.file>
     <repository.type>rdbms</repository.type>
     <!-- The name of the repository to use -->
     <repository.name>pdirepo</repository.name>
     <!-- The name of the repository user -->
     <repository.userid>dev</repository.userid>
     <!-- The password -->
     <repository.password>1234</repository.password>
   </kettle-repository>

4. Start the server. It will be ready to run jobs and transformations from your Kettle repository.

How it works...

If you want to run Kettle transformations and jobs, then the Pentaho BI Server already includes the Kettle libraries. The server is ready to run both jobs and transformations from files.
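The settings.xml edit described in the steps above can also be scripted. The following is only a minimal Python sketch: the function name set_repository_settings is hypothetical, and the sample XML is an assumed shape of the default file (with repository.type set to files), not a copy of the file shipped with the server. The repository name and credentials are the ones used in the recipe's example.

```python
import xml.etree.ElementTree as ET


def set_repository_settings(xml_text, name, user, password):
    """Return the settings XML with the repository fields filled in.

    Hypothetical helper: comments in the input are dropped by the
    default ElementTree parser during this round trip.
    """
    root = ET.fromstring(xml_text)
    wanted = {
        "repository.type": "rdbms",  # value used in the recipe
        "repository.name": name,
        "repository.userid": user,
        "repository.password": password,
    }
    # Walk every element and overwrite the text of the tags we care about.
    for element in root.iter():
        if element.tag in wanted:
            element.text = wanted[element.tag]
    return ET.tostring(root, encoding="unicode")


# Assumed layout of the default file; only repository.type = files
# is confirmed by the recipe text.
sample = """<kettle-repository>
  <repositories.xml.file></repositories.xml.file>
  <repository.type>files</repository.type>
  <repository.name></repository.name>
  <repository.userid></repository.userid>
  <repository.password></repository.password>
</kettle-repository>"""

updated = set_repository_settings(sample, "pdirepo", "dev", "1234")
print(updated)
```

Working on a copy of the file and reviewing the result before restarting the server is the safer way to use a script like this, because the original comments are not preserved.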
If you intend to use a repository, then you have to provide the repository settings. In order to do this, you just have to edit the settings.xml file, as you did in the recipe.

There's more...

To avoid any inconvenience, be sure that the versions of the included libraries are the same as or newer than the version of Kettle you are using. For instance, Pentaho BI Server 3.7 includes Kettle 4.1 libraries. If you are using a different version of the server, then you can verify the Kettle version by looking in the following folder:

biserver-ce/tomcat/webapps/pentaho/WEB-INF/lib

This folder is inside the server installation folder. You should look for files named kettle-core-TRUNK-SNAPSHOT.jar, kettle-db-TRUNK-SNAPSHOT.jar, and kettle-engine-TRUNK-SNAPSHOT.jar. Unzip any of them and look for the META-INF/MANIFEST.MF file. There, you will find the Kettle version. You will see a line like this: Implementation-Version: 4.1.0.

There is an even easier way: In the Pentaho User Console (PUC), look for the option 2. Get Environment Information inside the Data Integration with Kettle folder of the BI Developer Examples solution; run it and you will get detailed information about the Kettle environment. For your information, the transformation that is run behind the scenes is GetPDIEnvironment.ktr, located in the biserver-ce/pentaho-solutions/bi-developers/etl folder.
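The manual check described above (unzip a Kettle jar and read the Implementation-Version line in its manifest) can be sketched in Python with the standard zipfile module. The function name kettle_version_from_jar is hypothetical, and the demo builds a tiny stand-in jar in a temporary folder so that the sketch runs without a server installation; with a real installation you would pass the path of one of the jars in the lib folder instead.

```python
import os
import tempfile
import zipfile


def kettle_version_from_jar(jar_path):
    """Return the Implementation-Version from a jar's manifest, or None."""
    # A jar is a zip archive, so the manifest can be read directly.
    with zipfile.ZipFile(jar_path) as jar:
        raw = jar.read("META-INF/MANIFEST.MF")
    manifest = raw.decode("utf-8", errors="replace")
    for line in manifest.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Implementation-Version":
            return value.strip()
    return None


# Demo with a stand-in jar (an assumption for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    demo_jar = os.path.join(tmp, "kettle-core-TRUNK-SNAPSHOT.jar")
    with zipfile.ZipFile(demo_jar, "w") as jar:
        jar.writestr(
            "META-INF/MANIFEST.MF",
            "Manifest-Version: 1.0\nImplementation-Version: 4.1.0\n",
        )
    print(kettle_version_from_jar(demo_jar))  # prints 4.1.0
```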

Savia Lobo
18 Mar 2019
9 min read

The U.S. DoD wants to dominate Russia and China in Artificial Intelligence. Last week gave us a glimpse into that vision.

In a hearing on March 12, the sub-committee on emerging threats and capabilities received testimonies on Artificial Intelligence Initiatives within the Department of Defense (DoD). The panel included Peter Highnam, Deputy Director of the Defense Advanced Research Projects Agency; Michael Brown, DoD Defense Innovation Unit Director; and Lieutenant General John Shanahan, director of the Joint Artificial Intelligence Center (JAIC). The panel broadly testified to senators that AI will significantly transform DoD’s capabilities and that it is critical the U.S. remain competitive with China and Russia in developing AI applications.

Dr. Peter T. Highnam on DARPA’s achievements and future goals

Dr. Peter T. Highnam, Deputy Director, Defense Advanced Research Projects Agency, talked about DARPA’s significant role in the development of AI technologies that have produced game-changing capabilities for the Department of Defense and beyond. In his testimony, he mentions, “DARPA’s AI Next effort is simply a continuing part of its historic investment in the exploration and advancement of AI technologies.”

Dr. Highnam highlighted different waves of AI technologies. The first wave, which was nearly 70 years ago, emphasized handcrafted knowledge, and computer scientists constructed so-called expert systems that captured the rules that the system could then apply to situations of interest. However, handcrafting rules was costly and time-consuming.

The second wave brought in machine learning, which applies statistical and probabilistic methods to large data sets to create generalized representations that can be applied to future samples. However, this requires training deep learning (artificial) neural networks on a variety of classification and prediction tasks, which works only when adequate historical data is available. Therein lies the rub: collecting, labelling, and vetting the data on which to train is prohibitively costly and time-consuming too.
He says, “DARPA envisions a future in which machines are more than just tools that execute human programmed rules or generalize from human-curated data sets. Rather, the machines DARPA envisions will function more as colleagues than as tools.” Towards this end, DARPA is focusing its investments on a “third wave” of AI technologies that brings forth machines that can reason in context. Incorporating these technologies in military systems that collaborate with warfighters will facilitate better decisions in complex, time-critical, battlefield environments; enable a shared understanding of massive, incomplete, and contradictory information; and empower unmanned systems to perform critical missions safely and with high degrees of autonomy.

DARPA’s more than $2 billion “AI Next” campaign, announced in September 2018, includes providing robust foundations for second wave technologies, aggressively applying the second wave AI technologies into appropriate systems, and exploring and creating third wave AI science and technologies. DARPA’s third wave research efforts will forge new theories and methods that will make it possible for machines to adapt contextually to changing situations, advancing computers from tools to true collaborative partners. Furthermore, the agency will be fearless about exploring these new technologies and their capabilities (DARPA’s core function), pushing critical frontiers ahead of our nation’s adversaries.

For more detail, read Dr. Peter T. Highnam’s complete statement.

Michael Brown on the Defense Innovation Unit’s (DIU) efforts in Artificial Intelligence

Michael Brown, Director of the Defense Innovation Unit, started the talk by highlighting how heavily China and Russia are investing to become dominant in AI.
“By 2025, China will aim to achieve major breakthroughs in AI and increase its domestic market to reach $59.6 billion (RMB 400 billion). To achieve these targets, China’s National Development and Reform Commission (China’s industrial policy-making agency) funded the creation of a national AI laboratory, and Chinese local governments have pledged more than $7 billion in AI funding”, Brown said in his statement. He said that Chinese firms are, in a way, leveraging U.S. talent by setting up research institutes in the United States, investing in U.S. AI-related startups and firms, recruiting U.S.-based talent, and forming commercial and academic partnerships.

Brown said that DIU will engage with DARPA and JAIC (Joint Artificial Intelligence Center) and also make its commercial knowledge and relationships with potential vendors available to any of the Services and Service Labs. DIU also anticipates that, through its close partnership with the JAIC, it will be at the leading edge of the Department’s National Mission Initiatives (NMIs), proving that commercial technology can be applied to critical national security challenges via accelerated prototypes that lay the groundwork for future scaling through JAIC.

“DIU looks to bring in key elements of AI development pursued by the commercial sector, which relies heavily on continuous feedback loops, vigorous experimentation using data, and iterative development, all to achieve the measurable outcome, mission impact”, Brown mentions. DIU’s AI portfolio team combines depth of commercial AI, machine learning, and data science experience from the commercial sector with military operators. The team has specifically prioritized projects that address three major impact areas or use cases that employ AI technology:

Computer vision

The DIU is prototyping computer vision algorithms in humanitarian assistance and disaster recovery scenarios.
“This use of AI holds the potential to automate post-disaster assessments and accelerate search and rescue efforts on a global scale”, Brown said in his statement.

Large dataset analytics and predictions

DIU is prototyping predictive maintenance applications for Air Force and Army platforms. DIU plans to partner with JAIC to scale this solution across multiple aircraft platforms, as well as ground vehicles, beginning with DIU’s complementary predictive maintenance project focusing on the Army’s Bradley Fighting Vehicle. Brown says this is one of DIU’s highest priority projects for FY19 given its enormous potential for impact on readiness and reducing costs.

Strategic reasoning

DIU is prototyping an application from Project VOLTRON that leverages AI to reason about high-level strategic questions, map probabilistic chains of events, and develop alternative strategies. This will make DoD owned systems more resilient to cyber attacks and inform program offices of configuration errors faster and with fewer errors than humans.

To learn more about what DIU plans in partnership with DARPA and JAIC, read Michael Brown’s complete testimony.

Lieutenant General Jack Shanahan on making JAIC “AI-Ready”

Lieutenant General Jack Shanahan, Director, Joint Artificial Intelligence Center, touches upon how the JAIC is partnering with the Under Secretary of Defense (USD) Research & Engineering (R&E), the role of the Military Services, the Department’s initial focus areas for AI delivery, and how JAIC is supporting whole-of-government efforts in AI. “To derive maximum value from AI application throughout the Department, JAIC will operate across an end-to-end lifecycle of problem identification, prototyping, integration, scaling, transition, and sustainment.
Emphasizing commerciality to the maximum extent practicable, JAIC will partner with the Services and other components across the Joint Force to systematically identify, prioritize, and select new AI mission initiatives”, Shanahan mentions in his testimony.

The AI capability delivery efforts that will go through this lifecycle fall into two categories: National Mission Initiatives (NMI) and Component Mission Initiatives (CMI). An NMI is an operational or business reform joint challenge, typically identified from the National Defense Strategy’s key operational problems and requiring multi-service innovation, coordination, and the parallel introduction of new technology and new operating concepts. A CMI, on the other hand, is a component-level challenge that can be solved through AI. JAIC will work closely with individual components on CMIs to help identify, shape, and accelerate their Component-specific AI deployments through: funding support; usage of common foundational tools, libraries, cloud infrastructure; application of best practices; partnerships with industry and academia; and so on. The Component will be responsible for identifying and implementing the organizational structure required to accomplish its project in coordination and partnership with the JAIC.

The following are some of JAIC’s early NMIs, intended to deliver mission impact at speed, demonstrate the proof of concept for the JAIC operational model, enable rapid learning and iterative process refinement, and build JAIC’s library of reusable tools while validating its enterprise cloud architecture.

Perception

Improve the speed, completeness, and accuracy of Intelligence, Surveillance, Reconnaissance (ISR) Processing, Exploitation, and Dissemination (PED). Shanahan says Project Maven’s efforts are included here.
Predictive Maintenance (PMx)

Provide computational tools to decision-makers to help them better forecast, diagnose, and manage maintenance issues to increase availability, improve operational effectiveness, and ensure safety, at a reduced cost.

Humanitarian Assistance/Disaster Relief (HA/DR)

Reduce the time associated with search and discovery, resource allocation decisions, and executing rescue and relief operations to save lives and livelihoods during disaster operations. Here, JAIC plans to apply lessons learned and reusable tools from Project Maven to field AI capabilities in support of federal responses to events such as wildfires and hurricanes, where DoD plays a supporting role.

Cyber Sensemaking

Detect and deter advanced adversarial cyber actors who infiltrate and operate within the DoD Information Network (DoDIN) to increase DoDIN security, safeguard sensitive information, and allow warfighters and engineers to focus on strategic analysis and response.

Shanahan states, “Under the DoD CIO’s authorities and as delineated in the JAIC establishment memo, JAIC will coordinate all DoD AI-related projects above $15 million annually.” “It does mean that we will start to ensure, for example, that they begin to leverage common tools and libraries, manage data using best practices, reflect a common governance framework, adhere to rigorous testing and evaluation methodologies, share lessons learned, and comply with architectural principles and standards that enable scale”, he further added.

For more detail, read Lieutenant General Jack Shanahan’s complete testimony. To learn more about this news, watch the entire hearing on 'Artificial Intelligence Initiatives within the Department of Defense'.

So, you want to learn artificial intelligence. Here’s how you do it.
What can happen when artificial intelligence decides on your loan request
Mozilla partners with Ubisoft to Clever-Commit its code, an artificial intelligence assisted assistant