
How-To Tutorials - Data

Probabilistic Graphical Models in R

Packt
14 Apr 2016
18 min read
In this article, David Bellot, author of the book Learning Probabilistic Graphical Models in R, explains that among all the predictions made about the 21st century, we may not have expected that we would collect such a formidable amount of data about everything, every day, and everywhere in the world. The past years have seen an incredible explosion of data collection about our world and our lives, and technology is the main driver of what we can certainly call a revolution. We live in the age of information. However, collecting data is worth little if we don't exploit it and try to extract knowledge out of it.

At the beginning of the 20th century, with the birth of statistics, the world was all about collecting data and making statistics. Back then, the only reliable tools were pencil and paper and, of course, the eyes and ears of the observers. Scientific observation was still in its infancy, despite the prodigious developments of the 19th century. More than a hundred years later, we have computers, electronic sensors, and massive data storage, and we are able to store huge amounts of data continuously, not only about our physical world but also about our lives, mainly through the use of social networks, the Internet, and mobile phones. Moreover, the density of our storage technology has increased so much that we can nowadays store months, if not years, of data in a very small volume that fits in the palm of a hand.

Among all the tools and theories that have been developed to analyze, understand, and manipulate data, probability and statistics have become some of the most used. Within this field, we are interested in a special, versatile, and powerful class of models called probabilistic graphical models (PGMs for short). A probabilistic graphical model is a tool to represent beliefs and uncertain knowledge about facts and events using probabilities. It is also one of the most advanced machine learning techniques in use today and has many industrial success stories. PGMs can deal with our imperfect knowledge about the world, because our knowledge is always limited: we can't observe everything, and we can't represent the entire universe in a computer. We are intrinsically limited as human beings, and so are our computers. With probabilistic graphical models, we can build simple learning algorithms or complex expert systems. With new data, we can improve these models and refine them, and we can also infer new information or make predictions about unseen situations and events.

Seen from the point of view of mathematics, a probabilistic graphical model is a way to represent a probability distribution over several variables, called a joint probability distribution. In a PGM, the knowledge about how variables relate is represented with a graph, that is, nodes connected by edges, each with a specific meaning.

Let's consider an example from the medical world: how to diagnose a cold. This is only an example and by no means medical advice; it is deliberately oversimplified. We define several random variables, such as the following:

- Se: the season of the year
- N: the patient's nose is blocked
- H: the patient has a headache
- S: the patient sneezes regularly
- C: the patient coughs
- Cold: the patient has a cold

Because each of the symptoms can exist at different degrees, it is natural to represent them as random variables.
For example, if the patient's nose is a bit blocked, we will assign a probability of, say, 60% to this variable, that is, P(N = blocked) = 0.6 and P(N = not blocked) = 0.4. In this example, the joint probability distribution P(Se, N, H, S, C, Cold) will require 4 × 2^5 = 128 values in total (4 values for the season and 2 values for each of the other random variables). That's quite a lot, and honestly, it is quite difficult to determine things such as the probability that the nose is not blocked, the patient has a headache, the patient sneezes, and so on.

However, we can say that a headache is not directly related to a cough or a blocked nose, except when the patient has a cold. Indeed, the patient can have a headache for many other reasons. Moreover, we can say that the season has quite a direct effect on sneezing, a blocked nose, or a cough, but little or no direct effect on headaches. In a probabilistic graphical model, we represent these dependency relationships with a graph, as follows, where each random variable is a node in the graph and each relationship is an arrow between two nodes. In the graph that follows, there is a direct correspondence between each node and each variable of the probabilistic graphical model, and also a direct correspondence between the arrows and the way we can simplify the joint probability distribution in order to make it tractable.

Using a graph as a model to simplify a complex (and sometimes complicated) distribution has numerous benefits:

- As we observed in the previous example, and in general when we model a problem, each random variable interacts directly with only a small subset of the other random variables. This promotes more compact and tractable models.
- The knowledge and dependencies represented in a graph are easy to understand and communicate.
- The graph induces a compact representation of the joint probability distribution, and it is easy to make computations with it.
- Algorithms for inference and learning can use graph theory and its associated algorithms to improve and facilitate all the inference and learning procedures. Compared to working with the raw joint probability distribution, using a PGM speeds up computations by several orders of magnitude.

The junction tree algorithm

The junction tree algorithm is one of the main algorithms for doing inference on a PGM. Its name arises from the fact that, before doing the numerical computations, we transform the graph of the PGM into a tree with a set of properties that allow for efficient computation of posterior probabilities. One of its main advantages is that the algorithm computes not only the posterior distribution of the variables in the query, but also the posterior distributions of all the other variables that are not observed. Therefore, for the same computational price, one can obtain any posterior distribution. Implementing a junction tree algorithm is a complex task, but fortunately several R packages contain a full implementation, for example, gRain.

Let's say we have several variables A, B, C, D, E, and F. For the sake of simplicity, we will consider each variable to be binary, so that we won't have too many values to deal with.
We will assume the following factorization, which is exactly what the conditional probability tables defined below encode:

P(A, B, C, D, E, F) = P(F) P(C|F) P(E|F) P(A|C) P(D|E) P(B|A, D)

This is represented by the following graph. We first start by loading the gRain package into R:

library(gRain)

Then, we create our set of random variables from A to F:

val = c("true", "false")
F = cptable(~F, values=c(10,90), levels=val)
C = cptable(~C|F, values=c(10,90,20,80), levels=val)
E = cptable(~E|F, values=c(50,50,30,70), levels=val)
A = cptable(~A|C, values=c(50,50,70,30), levels=val)
D = cptable(~D|E, values=c(60,40,70,30), levels=val)
B = cptable(~B|A:D, values=c(60,40,70,30,20,80,10,90), levels=val)

The cptable function creates a conditional probability table, which is a factor for discrete variables. The probabilities associated with each variable are purely subjective and only serve the purpose of the example.

The next step is to compute the junction tree. In most packages, computing the junction tree is done by calling one function, because the algorithm does everything at once:

plist = compileCPT(list(F, E, C, A, D, B))
plist

We also check whether the list of variables is correctly compiled into a probabilistic graphical model; from the previous code we obtain:

CPTspec with probabilities:
 P( F )
 P( E | F )
 P( C | F )
 P( A | C )
 P( D | E )
 P( B | A D )

This is indeed the factorization of our distribution, as stated earlier. If we want to check further, we can look at the conditional probability tables of a few variables:

print(plist$F)
print(plist$B)

F
 true false
  0.1   0.9

, , D = true
       A
B       true false
  true   0.6   0.7
  false  0.4   0.3

, , D = false
       A
B       true false
  true   0.2   0.1
  false  0.8   0.9

The second output is a bit more complex, but if you look carefully, you will see that you have two distributions, P(B|A, D=true) and P(B|A, D=false), which is a more readable presentation of P(B|A, D).

We finally create the graph and run the junction tree algorithm by calling this:

jtree = grain(plist)

Again, when we check the result, we obtain:

jtree
Independence network: Compiled: FALSE Propagated: FALSE
  Nodes: chr [1:6] "F" "E" "C" "A" "D" "B"

We only need to compute the junction tree once. Then, all queries can be computed with the same junction tree. Of course, if you change the graph, you need to recompute the junction tree. Let's perform a few queries:

querygrain(jtree, nodes=c("F"), type="marginal")
$F
F
 true false
  0.1   0.9

Of course, if you ask for the marginal distribution of F, you will obtain the initial conditional probability table, because F has no parents.

querygrain(jtree, nodes=c("C"), type="marginal")
$C
C
 true false
 0.19  0.81

This is more interesting, because it computes the marginal of C while we only stated the conditional distribution of C given F. We didn't need an algorithm as complex as the junction tree to compute such a small marginal; the variable elimination algorithm we saw earlier would be enough too. But if you ask for the marginal of B, then variable elimination will not work because of the loop in the graph.
However, the junction tree will give the following:

querygrain(jtree, nodes=c("B"), type="marginal")
$B
B
    true    false
0.478564 0.521436

And we can ask for more complex distributions, such as the joint distribution of B and A:

querygrain(jtree, nodes=c("A","B"), type="joint")
       B
A           true    false
  true  0.309272 0.352728
  false 0.169292 0.168708

In fact, any combination can be given, such as A, B, and C:

querygrain(jtree, nodes=c("A","B","C"), type="joint")
, , B = true
       A
C           true    false
  true  0.044420 0.047630
  false 0.264852 0.121662

, , B = false
       A
C           true    false
  true  0.050580 0.047370
  false 0.302148 0.121338

Now, we want to observe a variable and compute the posterior distribution. Let's say F=true, and we want to propagate this information down to the rest of the network:

jtree2 = setEvidence(jtree, evidence=list(F="true"))

We can ask for any joint or marginal distribution now:

querygrain(jtree, nodes=c("A"), type="marginal")
$A
A
 true false
0.662 0.338

querygrain(jtree2, nodes=c("A"), type="marginal")
$A
A
 true false
 0.68  0.32

Here, we see that knowing F=true changed the marginal distribution of A from its previous marginal (the second query uses jtree2, the tree with the evidence set). And we can query any other variable:

querygrain(jtree, nodes=c("B"), type="marginal")
$B
B
    true    false
0.478564 0.521436

querygrain(jtree2, nodes=c("B"), type="marginal")
$B
B
  true  false
0.4696 0.5304

Learning

Building a probabilistic graphical model generally requires three steps: defining the random variables, which are the nodes of the graph; defining the structure of the graph; and finally defining the numerical parameters of each local distribution. So far, the last step has been done manually, and we gave numerical values to each local probability distribution by hand. In many cases, we have access to a wealth of data and can find the numerical values of those parameters with a method called parameter learning. In other fields, it is also called parameter fitting or model calibration.

Learning parameters can be done with several approaches, and there is no ultimate solution to the problem, because it depends on the goal the model's user wants to reach. Nevertheless, it is common to use the notion of the maximum likelihood of a model and also the maximum a posteriori. As you are now used to the notions of the prior and posterior of a distribution, you can already guess what a maximum a posteriori estimate does. Many algorithms are used, among which we can cite the Expectation Maximization (EM) algorithm, which computes the maximum likelihood of a model even when data is missing or variables are not observed at all. It is a very important algorithm, especially for mixture models.

A graphical model of a linear model

PGMs can be used to represent standard statistical models and then extend them. One famous example is the linear regression model. We can visualize the structure of a linear model and better understand the relationships between the variables. The linear model captures the relationship between observable variables x and a target variable y. This relation is modeled by a set of parameters, θ. Recall the distribution of y for each data point indexed by i, written out below. Here, X_i is a row vector whose first element is always 1, to capture the intercept of the linear model.
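A standard way to write this distribution, and the likelihood mentioned in the next paragraph, is the following (treat the exact notation as an assumption: the article only describes θ in words, as the intercept, the coefficients β, and the variance σ²):

$$ y_i \sim \mathcal{N}\!\left(X_i \beta,\ \sigma^2\right), $$

so that, for N independent observations, the likelihood of the whole dataset is

$$ L(\theta) = \prod_{i=1}^{N} p\!\left(y_i \mid X_i, \theta\right) = \prod_{i=1}^{N} \mathcal{N}\!\left(y_i \mid X_i \beta,\ \sigma^2\right). $$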
The parameter θ in the following graph is itself composed of the intercept, the coefficient β for each component of X, and the variance σ² in the distribution of y_i. The PGM for an observation of a linear model can be represented as follows. This decomposition leads us to a second version of the graphical model, in which we explicitly separate the components of θ. In a PGM, when a rectangle is drawn around a set of nodes with a number in a corner (N, for example), it means that the enclosed subgraph is repeated that many times. The likelihood function of a linear model is the product over the data points shown above, and it can be represented as a PGM. The vector β can also be decomposed into its univariate components. In this last iteration of the graphical model, we see that the parameter β could have a prior probability on it instead of being fixed. In fact, the variance parameter can also be considered a random variable. For the time being, we will keep it fixed.

Latent Dirichlet Allocation

The last model we want to show in this article is called Latent Dirichlet Allocation (LDA). It is a generative model that can be represented as a graphical model. It is based on the same idea as the mixture model, with one notable exception: in this model, we assume that the data points might be generated by a combination of clusters, and not just one cluster at a time, as was the case before. The LDA model is primarily used in text analysis and classification.

Let's consider that a text document is composed of words making up sentences and paragraphs. To simplify the problem, we can say that each sentence or paragraph is about one specific topic, such as science, animals, sports, and so on. Topics can also be more specific, such as a cat topic or a European soccer topic. Therefore, there are words that are more likely to come from specific topics. For example, the word cat is likely to come from the cat topic, and the word stadium is likely to come from the European soccer topic. However, the word ball should come with a higher probability from the European soccer topic, but it is not unlikely to come from the cat topic, because cats like to play with balls too. So, it seems the word ball might belong to two topics at the same time, with different degrees of certainty. Other words, such as table, will presumably belong equally to both topics, and to other topics as well. They are very generic, except, of course, if we introduce another topic such as furniture.

A document is a collection of words, so a document can have complex relationships with a set of topics. But in the end, it is more likely to see words coming from the same topic, or the same few topics, within a paragraph, and to some extent within the whole document. In general, we model a document with a bag-of-words model, that is, we consider a document to be a randomly generated set of words, drawn from a specific distribution over words. If this distribution is uniform over all the words, then the document will be purely random, without a specific meaning. However, if this distribution has a specific form, with more probability mass on related words, then the collection of words generated by this model will have a meaning. Of course, generating documents is not really the application we have in mind for such a model. What we are interested in is the analysis of documents, their classification, and automatic understanding.

Let's say we have a categorical variable (in other words, a histogram) representing the probability of appearance of each word from a dictionary.
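Under this simple bag-of-words view, a document made of words w_1, ..., w_N then has probability

$$ p(w_1, \ldots, w_N) = \prod_{j=1}^{N} p(w_j), $$

with each word drawn independently from that categorical distribution (this formula is not written out in the article; it is only a restatement of the description above). LDA, described next, replaces this single fixed word distribution with a mixture over topics.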
Usually, in this kind of model, we restrict ourselves to meaningful words only and remove the small words, such as and, to, but, the, a, and so on. These words are usually called stop words.

Let w_j be the jth word in a document. The following three graphs show the progression from representing a single document (the left-most graph) to representing a collection of documents (the third graph). Let θ be a distribution over topics; then, in the second graph from the left, we extend the model by first choosing the topic that will be selected at any time and then generating a word out of it. Therefore, the variable z_i now becomes the variable z_ij, that is, topic i is selected for word j. We can go even further and decide that we want to model a collection of documents, which seems natural if we consider that we have a big dataset. Assuming that documents are i.i.d., the next step (the third graph) is a PGM that represents the generative model for M documents.

Because the distribution over topics is categorical, we want to be Bayesian about it, mainly because it helps the model not to overfit and because we consider the selection of topics for a document to be a random process. Moreover, we want to apply the same treatment to the word variable by having a Dirichlet prior. This prior is used to avoid assigning zero probability to unobserved words; it smooths the distribution of words per topic. A uniform Dirichlet prior induces a uniform prior distribution on all the words. Therefore, the final graph on the right represents the complete model. This is quite a complex graphical model, but techniques have been developed to fit its parameters and use it.

If we follow this graphical model carefully, we have a process that generates documents based on a certain set of topics:

- α chooses the set of topics for a document
- From θ, we generate a topic z_ij
- From this topic, we generate a word w_j

In this model, only the words are observable. All the other variables have to be determined without observation, exactly as in the other mixture models. So, documents are represented as random mixtures over latent topics, in which each topic is represented as a distribution over words. The distribution of a topic mixture based on this graphical model can be written as follows (reconstructed here in the standard LDA notation, with β denoting the per-topic word distributions):

p(θ, z, w | α, β) = p(θ | α) ∏_{j=1..N} p(z_j | θ) p(w_j | z_j, β)

You can see in this formula that for each word we select a topic, hence the product from 1 to N. Integrating over θ and summing over z, the marginal distribution of a document is as follows:

p(w | α, β) = ∫ p(θ | α) ( ∏_{j=1..N} Σ_{z_j} p(z_j | θ) p(w_j | z_j, β) ) dθ

The final distribution can be obtained by taking the product of the marginal distributions of the single documents, so as to get the distribution over a collection of documents (assuming that documents are independently and identically distributed). Here, D is the collection of documents:

p(D | α, β) = ∏_{d=1..M} p(w_d | α, β)

The main problem to be solved now is how to compute the posterior distribution over θ and z given a document. By applying the Bayes formula, we know the following:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

Unfortunately, this is intractable because of the normalization factor in the denominator. The original paper on LDA therefore introduces a technique called variational inference, which aims at transforming a complex Bayesian inference problem into a simpler approximation that can be solved as a (convex) optimization problem. This technique is the third approach to Bayesian inference and has been used on many other problems.

Summary

The probabilistic graphical model framework offers a powerful and versatile way to develop and extend many probabilistic models using an elegant graph-based formalism.
It has many applications in, for example, biology, genomics, medicine, finance, robotics, computer vision, automation, engineering, law, and games. Many packages exist in R to deal with all sorts of models and data, among which gRain and Rstan are very popular.


Detecting fraud on e-commerce orders with Benford's law

Packt
14 Apr 2016
7 min read
In this article, Andrea Cirillo, author of the book RStudio for R Statistical Computing Cookbook, explains how to detect fraud on e-commerce orders. Benford's law is a popular empirical law stating that the first digits of a population of data follow a specific logarithmic distribution. This law was observed by Frank Benford around 1938 and has since gained increasing popularity as a way to detect anomalous alterations of populations of data. Basically, testing a population against Benford's law means verifying that the given population respects this law. If deviations are discovered, further analysis is performed on the items related to those deviations. In this recipe, we will test a population of e-commerce orders against the law, focusing on items deviating from the expected distribution.

Getting ready

This recipe uses functions from the well-documented benford.analysis package by Carlos Cinelli. We therefore need to install and load this package:

install.packages("benford.analysis")
library(benford.analysis)

In our example, we will use a data frame that stores e-commerce orders, provided with the book as an .Rdata file. In order to make it available within your environment, load this file by running the following command (assuming the file is within your current working directory):

load("ecommerce_orders_list.Rdata")

How to do it...

Perform the Benford test on the order amounts:

benford_test <- benford(ecommerce_orders_list$order_amount, 1)

Plot the test analysis:

plot(benford_test)

This will result in the following plot.

Highlight suspect digits:

suspectsTable(benford_test)

This will produce a table showing, for each digit, the absolute difference between expected and observed frequencies. The digits listed first are therefore the most anomalous ones:

> suspectsTable(benford_test)
   digits absolute.diff
1:      5     4860.8974
2:      9     3764.0664
3:      1     2876.4653
4:      2     2870.4985
5:      3     2856.0362
6:      4     2706.3959
7:      7     1567.3235
8:      6     1300.7127
9:      8      200.4623

Define a function to extract the first digit from each amount:

left = function (string, char){
  substr(string, 1, char)
}

Extract the first digit from each amount:

ecommerce_orders_list$first_digit <- left(ecommerce_orders_list$order_amount, 1)

Filter amounts starting with the suspect digit:

suspects_orders <- subset(ecommerce_orders_list, first_digit == 5)

How it works

Step 1 performs the Benford test on the order amounts. In this step, we applied the benford() function to the amounts. Applying this function means evaluating the distribution of the first digits of the amounts against the expected Benford distribution.
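The expected distribution itself is not written out in the article; for reference, Benford's law states that the probability of observing d as the leading digit is

$$ P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d \in \{1, 2, \ldots, 9\}, $$

which gives roughly 30.1% for the digit 1, 17.6% for 2, and only about 4.6% for 9. The benford() call compares the observed first-digit frequencies of the order amounts against these expected proportions.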
The function will result in the production of an object with the following components:

- Info: general information, covering data.name (the name of the data used), n (the number of observations used), n.second.order (the number of observations used for the second-order analysis), and number.of.digits (the number of first digits analyzed).
- Data: a data frame with lines.used (the original lines of the dataset), data.used (the data used), data.mantissa (the log mantissa of the data), and data.digits (the first digits of the data).
- s.o.data: a data frame with data.second.order (the differences of the ordered data) and data.second.order.digits (the first digits of the second-order analysis).
- Bfd: a data frame with digits (the groups of digits analyzed), data.dist (the distribution of the first digits of the data), data.second.order.dist (the distribution of the first digits of the second-order analysis), benford.dist (the theoretical Benford distribution), data.second.order.dist.freq (the frequency distribution of the first digits of the second-order analysis), data.dist.freq (the frequency distribution of the first digits of the data), benford.dist.freq (the theoretical Benford frequency distribution), benford.so.dist.freq (the theoretical Benford frequency distribution of the second-order analysis), data.summation (the summation of the data values grouped by first digits), abs.excess.summation (the absolute excess summation of the data values grouped by first digits), difference (the difference between the data and Benford frequencies), squared.diff (the chi-squared difference between the data and Benford frequencies), and absolute.diff (the absolute difference between the data and Benford frequencies).
- Mantissa: a data frame with mean.mantissa (the mean of the mantissa), var.mantissa (the variance of the mantissa), ek.mantissa (the excess kurtosis of the mantissa), and sk.mantissa (the skewness of the mantissa).
- MAD: the mean absolute deviation.
- distortion.factor: the distortion factor.
- Stats: a list of htest-class statistics, namely chisq (Pearson's chi-squared test) and mantissa.arc.test (the Mantissa Arc test).

Step 2 plots the test results. Running plot on the object returned by the benford() function produces a plot showing the following (from the upper-left corner to the bottom-right corner):

- the first-digit distribution
- the results of the second-order test
- the summation distribution for each digit
- the results of the chi-squared test
- the summation differences

If you look carefully at these plots, you will understand which digits show a distribution significantly different from the one expected under Benford's law. Nevertheless, in order to have a sounder basis for our considerations, we need to look at the suspects table, which shows the absolute differences between expected and observed frequencies. This is what we will do in the next step.

Step 3 highlights the suspect digits. Using suspectsTable(), we can easily discover which digits present the greatest deviation from the expected distribution.
Looking at the so-called suspects table, we can see that the digit 5 shows up first within our table. In the next step, we therefore focus our attention on the orders whose amounts have this digit as the first digit.

Step 4 defines a function to extract the first digit from each amount. This function leverages the substr() function and extracts the first digit from the number passed to it as an argument.

Step 5 adds a new column to the investigated dataset, storing the extracted first digit.

Step 6 filters the amounts starting with the suspect digit. After applying the left function to our sequence of amounts, we can now filter the dataset, retaining only the rows whose amounts have 5 as the first digit. We will now be able to perform analytical testing procedures on those items.

Summary

In this article, you learned how to apply the R language to an e-commerce fraud detection problem.


Cluster Computing Using Scala

Packt
13 Apr 2016
18 min read
In this article, Vytautas Jančauskas, the author of the book Scientific Computing with Scala, explains how to write software to be run on distributed computing clusters. We will learn about the MPJ Express library here.

Very often, when dealing with intense data processing tasks and simulations of physical phenomena, there comes a time when, no matter how many CPU cores and how much memory your workstation has, it is not enough. At times like these, you will want to turn to supercomputing clusters for help. These distributed computing environments consist of many nodes (each node being a separate computer) connected into a computer network using specialized high-bandwidth and low-latency connections (or, if you are on a budget, standard Ethernet hardware is often enough). These computers usually utilize a network filesystem, allowing each node to see the same files. They communicate using messaging libraries, such as MPI. Your program will run on separate computers and utilize the message passing framework to exchange data via the computer network.

Using MPJ Express for distributed computing

MPJ Express is a message passing library for distributed computing. It works in programming languages using the Java Virtual Machine (JVM), so we can use it from Scala. It is similar in functionality and programming interface to MPI. If you know MPI, you will be able to use MPJ Express in pretty much the same way. The differences specific to Scala are explained in this section. We will start with how to install it. For further reference, visit the MPJ Express website: http://mpj-express.org/

Setting up and running MPJ Express

The steps to set up and run MPJ Express are as follows:

1. First, download MPJ Express from http://mpj-express.org/download.php. The version at the time of this writing is 0.44.

2. Unpack the archive and refer to the included README file for installation instructions. Currently, you have to set MPJ_HOME to the folder you unpacked the archive to and add the bin folder in that archive to your path. For example, if you are a Linux user using bash as your shell, you can add the following two lines to your .bashrc file (the file is in your home directory at /home/yourusername/.bashrc):

export MPJ_HOME=/home/yourusername/mpj
export PATH=$MPJ_HOME/bin:$PATH

Here, mpj is the folder you extracted the MPJ Express archive to. If you are using a different system, you will have to do the equivalent for your system.

3. We will want to use MPJ Express with the Scala Build Tool (SBT), which we used previously to build and run all of our programs. Create the following directory structure:

scalacluster/
  lib/
  project/
    plugins.sbt
  build.sbt

I have chosen to name the project folder scalacluster here, but you can call it whatever you want. The .jar files in the lib folder will be accessible to your program. Copy the contents of the lib folder from the mpj directory to this folder. Finally, create empty build.sbt and plugins.sbt files.

Let's now write and run a simple "Hello, World!" program to test our setup:

import mpi._

object MPJTest {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me: Int = MPI.COMM_WORLD.Rank()
    val size: Int = MPI.COMM_WORLD.Size()
    println("Hello, World, I'm <" + me + ">")
    MPI.Finalize()
  }
}

This should be familiar to everyone who has ever used MPI. First, we import everything from the mpj package.
Then, we initialize MPJ Express by calling MPI.Init; the arguments passed to MPJ Express come from the command-line arguments you enter when running the program. The MPI.COMM_WORLD.Rank() function returns the MPJ process's rank. A rank is a unique identifier used to distinguish processes from one another. Ranks are used when you want different processes to do different things. A common pattern is to use the process with rank 0 as the master process and the processes with other ranks as workers. You can then use the process's rank to decide what action to take in the program. We also determine how many MPJ processes were launched by checking MPI.COMM_WORLD.Size(). Our program will simply print its rank for now.

We will now want to run it. If you don't have a distributed computing cluster readily available, don't worry: you can test your programs locally on your desktop or laptop, and the same program will work without changes on clusters as well. To run programs written using MPJ Express, you have to use the mpjrun.sh script. This script will be available to you if you have added the bin folder of the MPJ Express archive to your PATH, as described in the section on installing MPJ Express. The mpjrun.sh script sets up the environment for your MPJ Express processes and starts said processes.

The mpjrun.sh script takes a .jar file, so we need to create one. Unfortunately for us, this cannot easily be done using the sbt package command in the directory containing our program. That worked previously because we used the Scala runtime to execute our programs, whereas MPJ Express uses Java. The problem is that the .jar package created with sbt package does not include the Scala standard library. We need what is called a fat .jar, one that contains all the dependencies within itself. One way of generating it is to use an SBT plugin called sbt-assembly. The website for this plugin is: https://github.com/sbt/sbt-assembly

There is a simple way of adding the plugin for use in our project. Remember the project/plugins.sbt file we created? All you need to do is add the following line to it (the line may differ for different versions of the plugin; consult the website):

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

Now, add the following to the build.sbt file you created:

lazy val root = (project in file(".")).
  settings(
    name := "mpjtest",
    version := "1.0",
    scalaVersion := "2.11.7"
  )

Then, execute the sbt assembly command from the shell to build the .jar file. If you are using the preceding build.sbt and the folder you put the program and build.sbt in is /home/you/cluster, the file will be put at:

/home/you/cluster/target/scala-2.11/mpjtest-assembly-1.0.jar

Now, you can run the mpjtest-assembly-1.0.jar file as follows:

$ mpjrun.sh -np 4 -jar target/scala-2.11/mpjtest-assembly-1.0.jar
MPJ Express (0.44) is started in the multicore configuration
Hello, World, I'm <0>
Hello, World, I'm <2>
Hello, World, I'm <3>
Hello, World, I'm <1>

The argument -np specifies how many processes to run. Since we specified -np 4, four processes will be started by the script. The order of the "Hello, World" messages can differ on your system, since the precise order of execution of the different processes is not determined. If you got output similar to the one shown here, then congratulations: you have done the majority of the work needed to write and deploy applications using MPJ Express.
Using Send and Recv

MPJ Express processes can communicate using Send and Recv. These methods constitute arguably the simplest and easiest-to-understand mode of operation, which is also probably the most error prone. We will look at these two first. The following are the signatures of the Send and Recv methods:

public void Send(java.lang.Object buf, int offset, int count, Datatype datatype, int dest, int tag) throws MPIException

public Status Recv(java.lang.Object buf, int offset, int count, Datatype datatype, int source, int tag) throws MPIException

Both of these calls are blocking. This means that after calling Send, your process will block (will not execute the instructions following it) until a corresponding Recv is called by another process. Likewise, Recv will block the process until a corresponding Send happens. By corresponding, we mean that the dest and source arguments of the calls hold the values of the receiver's and sender's ranks, respectively. These two calls are enough to implement many complicated communication patterns. However, they are prone to various problems, such as deadlocks. They are also quite difficult to debug, since you have to make sure that each Send has a correct corresponding Recv and vice versa.

The parameters of Send and Recv are basically the same. Their meanings are summarized in the following list:

- buf (java.lang.Object): has to be a one-dimensional Java array. When using it from Scala, use a Scala array, which maps one-to-one to a Java array.
- offset (int): the start of the data you want to pass, counted from the start of the array.
- count (int): the number of items of the array you want to pass.
- datatype (Datatype): the type of data in the array. Can be one of the following: MPI.BYTE, MPI.CHAR, MPI.SHORT, MPI.BOOLEAN, MPI.INT, MPI.LONG, MPI.FLOAT, MPI.DOUBLE, MPI.OBJECT, MPI.LB, MPI.UB, and MPI.PACKED.
- dest/source (int): either the destination to send the message to or the source to get the message from. You use the rank of a process to identify sources and destinations.
- tag (int): used to tag the message. Tags can be used to introduce different message types and can be ignored for most common applications.

Let's look at a simple program using these calls for communication. We will implement a simple master/worker communication pattern:

import mpi._
import scala.util.Random

object MPJTest {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me: Int = MPI.COMM_WORLD.Rank()
    val size: Int = MPI.COMM_WORLD.Size()
    if (me == 0) {

Here, we use an if statement to identify who we are based on our rank. Since each process gets a unique rank, this allows us to determine what action should be taken. In our case, we assigned the role of the master to the process with rank 0 and the role of a worker to processes with other ranks:

      for (i <- 1 until size) {
        val buf = Array(Random.nextInt(100))
        MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, i, 0)
        println("MASTER: Dear <" + i + "> please do work on " + buf(0))
      }

We iterate over the workers, which have ranks from 1 up to whatever number of processes you passed to the mpjrun.sh script. Let's say that number is four. This gives us one master process and three worker processes. So, each process with a rank from 1 to 3 will get a randomly generated number. We have to put that number in an array, even though it is a single number, because both the Send and Recv methods expect an array as their first argument. We then use the Send method to send the data.
We specified the array as the buf argument, an offset of 0, a count of 1, the type MPI.INT, the destination as the for loop index, and the tag as 0. This means that each of our three worker processes will receive a (most probably) different number:

      for (i <- 1 until size) {
        val buf = Array(0)
        MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, i, 0)
        println("MASTER: Dear <" + i + "> thanks for the reply, which was " + buf(0))
      }

Finally, we collect the results from the workers. For this, we iterate over the worker ranks and use the Recv method on each one of them. We print the result we got from each worker, and this concludes the master's part. We now move on to the workers:

    } else {
      val buf = Array(0)
      MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, 0, 0)
      println("<" + me + ">: " + "Understood, doing work on " + buf(0))
      buf(0) = buf(0) * buf(0)
      MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, 0, 0)
      println("<" + me + ">: " + "Reporting back")
    }

The workers' code is identical for all of them. They receive a message from the master, calculate its square, and send it back:

    MPI.Finalize()
  }
}

After you run the program, the results should be akin to the following, which I got when running this program on my system:

MASTER: Dear <1> please do work on 71
MASTER: Dear <2> please do work on 12
MASTER: Dear <3> please do work on 55
<1>: Understood, doing work on 71
<1>: Reported back
MASTER: Dear <1> thanks for the reply, which was 5041
<3>: Understood, doing work on 55
<2>: Understood, doing work on 12
<2>: Reported back
MASTER: Dear <2> thanks for the reply, which was 144
MASTER: Dear <3> thanks for the reply, which was 3025
<3>: Reported back

Sending Scala objects in MPJ Express messages

Sometimes, the types provided by MPJ Express for use in the Send and Recv methods are not enough. You may want to send your MPJ Express processes a Scala object. A very realistic example of this would be sending an instance of a Scala case class. Case classes can be used to construct more complicated data types consisting of several different basic types. A simple example is a two-dimensional vector consisting of x and y coordinates. This can be sent as a simple array, but more complicated classes can't. For example, you may want to use a case class like the one shown here, with two attributes of type String and one attribute of type Int. So what do we do with a data type like this? The simplest answer is to serialize it. Serializing converts an object into a stream of characters or a string that can be sent over the network (or stored in a file, among other things) and later deserialized to get the original object back:

scala> case class Person(name: String, surname: String, age: Int)
defined class Person

scala> val a = Person("Name", "Surname", 25)
a: Person = Person(Name,Surname,25)

A simple way of serializing is to use a format such as XML or JSON. This can be done automatically using a pickling library. Pickling is a term that comes from the Python programming language. It is the automatic conversion of an arbitrary object into a string representation that can later be converted back to obtain the original object. The reconstructed object will behave the same way it did before conversion. This allows one to store arbitrary objects to files, for example. There is a pickling library available for Scala as well. You can, of course, do serialization in several different ways (for example, using the powerful support for XML available in Scala).
We will use the pickling library available from the following website for this example: https://github.com/scala/pickling

You can install it by adding the following line to your build.sbt file:

libraryDependencies += "org.scala-lang.modules" %% "scala-pickling" % "0.10.1"

After doing that, use the following import statements to enable easy pickling in your projects:

scala> import scala.pickling.Defaults._
import scala.pickling.Defaults._

scala> import scala.pickling.json._
import scala.pickling.json._

Here, you can see how you can then easily use this library to pickle and unpickle arbitrary objects without annoying boilerplate code:

scala> val pklA = a.pickle
pklA: pickling.json.pickleFormat.PickleType = JSONPickle({
  "$type": "Person",
  "name": "Name",
  "surname": "Surname",
  "age": 25
})

scala> val unpklA = pklA.unpickle[Person]
unpklA: Person = Person(Name,Surname,25)

Let's see how this would work in an application using MPJ Express for message passing. A program using pickling to send a case class instance in a message is given here:

import mpi._
import scala.pickling.Defaults._
import scala.pickling.json._

case class ArbitraryObject(a: Array[Double], b: Array[Int], c: String)

Here, we have chosen to define a fairly complex case class, consisting of two arrays of different types and a string:

object MPJTest {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me: Int = MPI.COMM_WORLD.Rank()
    val size: Int = MPI.COMM_WORLD.Size()
    if (me == 0) {
      val obj = ArbitraryObject(Array(1.0, 2.0, 3.0), Array(1, 2, 3), "Hello")
      val pkl = obj.pickle.value.toCharArray
      MPI.COMM_WORLD.Send(pkl, 0, pkl.size, MPI.CHAR, 1, 0)

In the preceding bit of code, we create an instance of our case class. We then pickle it to JSON and get the string representation of that JSON with the value method. However, to send it in an MPJ message, we need to convert it to a one-dimensional array of one of the supported types. Since it is a string, we convert it to a char array using the toCharArray method:

    } else if (me == 1) {
      val buf = new Array[Char](1000)
      MPI.COMM_WORLD.Recv(buf, 0, 1000, MPI.CHAR, 0, 0)
      val msg = buf.mkString
      val obj = msg.unpickle[ArbitraryObject]

On the receiving end, we get the raw char array, convert it back to a string using the mkString method, and then unpickle it using unpickle[T]. This returns an instance of the case class that we can use like any other instance of a case class. It is, functionally, the same object that was sent to us:

      println(msg)
      println(obj.c)
    }
    MPI.Finalize()
  }
}

The following is the result of running the preceding program. It prints out the JSON representation of our object and also shows that we can access the attributes of that object by printing the c attribute:

MPJ Express (0.44) is started in the multicore configuration
{
  "$type": "ArbitraryObject",
  "a": [ 1.0, 2.0, 3.0 ],
  "b": [ 1, 2, 3 ],
  "c": "Hello"
}
Hello

You can use this method to send arbitrary objects in an MPJ Express message. However, this is just one of many ways of doing it. As mentioned previously, another way is to use an XML representation. XML support is strong in Scala, and you can use it to serialize objects as well, although this will usually require you to add some boilerplate code to your program. The method discussed earlier has the advantage of requiring no boilerplate code.

Non-blocking communication

So far, we have examined only blocking (or synchronous) communication between two processes.
This means that the process is blocked (its execution is halted) until the Send or Recv method has completed successfully. This is simple to understand and enough for most cases. The problem with synchronous communication is that you have to be very careful, otherwise deadlocks may occur. Deadlocks are situations in which processes wait for each other to release a resource first; the dining philosophers problem is a famous illustration of deadlock from operating systems. The point is that, if you are unlucky, you may end up with a program that is seemingly stuck, without knowing why. Using non-blocking communication allows you to avoid these problems most of the time. If you think you may be at risk of deadlocks, you will probably want to use it. The signatures of the primary methods used in asynchronous communication are given here:

Request Isend(java.lang.Object buf, int offset, int count, Datatype datatype, int dest, int tag)

Isend works like its Send counterpart. The main differences are that it does not block (the program continues execution after the call rather than waiting for a corresponding receive) and that it returns a Request object. This object is used to check the status of your send request, block until it is complete if required, and so on:

Request Irecv(java.lang.Object buf, int offset, int count, Datatype datatype, int src, int tag)

Irecv is, again, the same as Recv, only non-blocking, and it returns a Request object used to handle your receive request. The operation of these methods can be seen in action in the following example:

import mpi._

object MPJTest {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me: Int = MPI.COMM_WORLD.Rank()
    val size: Int = MPI.COMM_WORLD.Size()
    if (me == 0) {
      val requests = for (i <- 0 until 10) yield {
        val buf = Array(i * i)
        MPI.COMM_WORLD.Isend(buf, 0, 1, MPI.INT, 1, 0)
      }
    } else if (me == 1) {
      for (i <- 0 until 10) {
        Thread.sleep(1000)
        val buf = Array[Int](0)
        val request = MPI.COMM_WORLD.Irecv(buf, 0, 1, MPI.INT, 0, 0)
        request.Wait()
        println("RECEIVED: " + buf(0))
      }
    }
    MPI.Finalize()
  }
}

This is a very simplistic example, used simply to demonstrate the basics of the asynchronous message passing methods. First, the process with rank 0 sends 10 messages to the process with rank 1 using Isend. Since Isend does not block, the loop finishes quickly, and the messages it sent are buffered until they are retrieved using Irecv. The second process (the one with rank 1) waits for one second before retrieving each message. This is to demonstrate the asynchronous nature of these methods: the messages sit in the buffer waiting to be retrieved, so Irecv can be used at your leisure, whenever convenient. The Wait() method of the Request object it returns has to be used to retrieve the result; Wait() blocks until the message is successfully received from the buffer.

Summary

Extremely computationally intensive programs are usually parallelized and run on supercomputing clusters. These clusters consist of multiple networked computers. Communication between these computers is usually done using messaging libraries such as MPI, which allow you to pass data between processes running on different machines in an efficient manner. In this article, you have learned how to use MPJ Express, an MPI-like library for the JVM. We saw how to carry out process-to-process communication as well as collective communication. The most important MPJ Express primitives were covered, and example programs using them were given.


Market Basket Analysis

Packt
12 Apr 2016
17 min read
In this article by Boštjan Kaluža, author of the book Machine Learning in Java, we will discuss affinity analysis, which is the heart of Market Basket Analysis (MBA). Affinity analysis can discover co-occurrence relationships among activities performed by specific users or groups. In retail, it can help you understand the purchasing behavior of customers. These insights can drive revenue through smart cross-selling and upselling strategies and can assist you in developing loyalty programs, sales promotions, and discount plans. In this article, we will look into the following topics:

- Market basket analysis
- Association rule learning
- Other applications in various domains

First, we will revise the core association rule learning concepts and algorithms, such as support, lift, the Apriori algorithm, and the FP-growth algorithm. Next, we will use Weka to perform our first affinity analysis on a supermarket dataset and study how to interpret the resulting rules. We will conclude the article by analyzing how association rule learning can be applied in other domains, such as IT operations analytics, medicine, and others.

Market basket analysis

Since the introduction of electronic point-of-sale systems, retailers have been collecting an incredible amount of data. To leverage this data in order to produce business value, they first developed a way to consolidate and aggregate the data to understand the basics of the business: What are they selling? How many units are moving? What is the sales amount? Recently, the focus has shifted to the lowest level of granularity: the market basket transaction. At this level of detail, retailers have direct visibility into the market basket of each customer who shopped at their store, understanding not only the quantity of the purchased items in that particular basket, but also how these items were bought in conjunction with each other. This can be used to drive decisions about how to differentiate store assortment and merchandise, as well as to effectively combine offers of multiple products, within and across categories, to drive higher sales and profits. These decisions can be implemented across an entire retail chain, by channel, at the local store level, and even for a specific customer with so-called personalized marketing, where a unique product offering is made for each customer.
MBA covers a wide variety of analyses:

- Item affinity: defines the likelihood of two (or more) items being purchased together.
- Identification of driver items: enables the identification of the items that drive people to the store and always need to be in stock.
- Trip classification: analyzes the content of the basket and classifies the shopping trip into a category: weekly grocery trip, special occasion, and so on.
- Store-to-store comparison: understanding the number of baskets allows any metric to be divided by the total number of baskets, effectively creating a convenient and easy way to compare stores with different characteristics (units sold per customer, revenue per transaction, number of items per basket, and so on).
- Revenue optimization: helps in determining the magic price points for a store, increasing the size and value of the market basket.
- Marketing: helps in identifying more profitable advertising and promotions, targeting offers more precisely in order to improve ROI, generating better loyalty card promotions with longitudinal analysis, and attracting more traffic to the store.
- Operations optimization: helps in matching inventory to requirements by customizing the store and assortment to trade-area demographics and optimizing store layout.

Predictive models help retailers direct the right offer to the right customer segments or profiles, gain an understanding of what is valid for which customer, predict the probability score of customers responding to an offer, and understand the customer value gained from the offer's acceptance.

Affinity analysis

Affinity analysis is used to determine the likelihood that a set of items will be bought together. In retail, there are natural product affinities; for example, it is very typical for people who buy hamburger patties to buy hamburger rolls, along with ketchup, mustard, tomatoes, and other items that make up the burger experience. While some product affinities might seem trivial, others are not very obvious. A classic example is toothpaste and tuna: it seems that people who eat tuna are more prone to brush their teeth right after finishing their meal. So, why is it important for retailers to get a good grasp of product affinities? This information is critical to appropriately planning promotions, as reducing the price of some items may cause a spike in related high-affinity items without the need to further promote those related items. In the following section, we'll look into the algorithms for association rule learning: Apriori and FP-growth.
It immediately became a popular example of how an unexpected association rule might be found from everyday data; however, there are varying opinions as to how much of the story is true. Daniel Powers says (DSS News, 2002): In 1992, Thomas Blischok, manager of a retail consulting group at Teradata, and his staff prepared an analysis of 1.2 million market baskets from about 25 Osco Drug stores. Database queries were developed to identify affinities. The analysis "did discover that between 5:00 and 7:00 p.m. that consumers bought beer and diapers". Osco managers did NOT exploit the beer and diapers relationship by moving the products closer together on the shelves. In addition to the preceding example from MBA, association rules are today employed in many application areas, including web usage mining, intrusion detection, continuous production, and bioinformatics. We'll take a closer look these areas later in this article. Basic concepts Before we dive into algorithms, let's first review the basic concepts. Database of transactions First, there is no class value, as this is not required for learning association rules. Next, the dataset is presented as a transactional table, where each supermarket item corresponds to a binary attribute. Hence, the feature vector could be extremely large. Consider the following example. Suppose we have five receipts as shown in the following image. Each receipt corresponds a purchasing transaction: To write these receipts in the form of transactional database, we first identify all the possible items that appear in the receipts. These items are onions, potatoes, burger, beer, and dippers. Each purchase, that is, transaction, is presented in a row, and there is 1 if an item was purchased within the transaction and 0 otherwise, as shown in the following table: Transaction ID Onions Potatoes Burger Beer Dippers 1 0 1 1 0 0 2 1 1 1 1 0 3 0 0 0 1 1 4 1 0 1 1 0 This example is really small. In practical applications, the dataset often contains thousands or millions of transactions, which allow learning algorithm discovery of statistically significant patterns. Itemset and rule Itemset is simply a set of items, for example, {onions, potatoes, burger}. A rule consists of two itemsets, X and Y, in the following format X -> Y. This indicates a pattern that when the X itemset is observed, Y is also observed. To select interesting rules, various measures of significance can be used. Support Support, for an itemset, is defined as the proportion of transactions that contain the itemset. The {potatoes, burger} itemset in the previous table has the following support as it occurs in 50% of transactions (2 out of 4 transactions) supp({potatoes, burger }) = 2/4 = 0.5. Intuitively, it indicates the share of transactions that support the pattern. Confidence Confidence of a rule indicates its accuracy. It is defined as Conf(X -> Y) = supp(X U Y) / supp(X). For example, the {onions, burger} -> {beer} rule has the confidence 0.5/0.5 = 1.0 in the previous table, which means that 100% of the times when onions and burger are bought together, beer is bought as well. Apriori algorithm Apriori algorithm is a classic algorithm used for frequent pattern mining and association rule learning over transactional. By identifying the frequent individual items in a database and extending them to larger itemsets, Apriori can determine the association rules, which highlight general trends about a database. 
The Apriori algorithm constructs a set of itemsets, for example, itemset1 = {Item A, Item B}, and calculates support, which counts the number of occurrences of the itemset in the database. Apriori then uses a bottom-up approach, where frequent itemsets are extended one item at a time, and it prunes candidates by first looking at the smaller sets and recognizing that a large set cannot be frequent unless all of its subsets are. The algorithm terminates when no further successful extensions are found. Although the Apriori algorithm is an important milestone in machine learning, it suffers from a number of inefficiencies and tradeoffs. In the following section, we'll look into the more recent FP-growth technique.
FP-growth algorithm
FP-growth, where FP stands for frequent pattern, represents the transaction database as a prefix tree. First, the algorithm counts the occurrences of items in the dataset. In the second pass, it builds a prefix tree, an ordered tree data structure commonly used to store strings. An example of a prefix tree based on the previous example is shown in the following diagram. If many transactions share the most frequent items, the prefix tree provides high compression close to the tree root. Large itemsets are grown directly, instead of generating candidate items and testing them against the entire database. Growth starts at the bottom of the tree, by finding all the itemsets matching minimal support and confidence. Once the recursive process has completed, all large itemsets with minimum coverage have been found and association rule creation begins. The FP-growth algorithm has several advantages. First, it constructs an FP-tree, which encodes the original dataset in a substantially compact representation. Second, it efficiently builds frequent itemsets, leveraging the FP-tree structure and a divide-and-conquer strategy.
The supermarket dataset
The supermarket dataset, located in datasets/chap5/supermarket.arff, describes the shopping habits of supermarket customers. Most of the attributes stand for a particular item group, for example, dairy foods, beef, potatoes; or a department, for example, department 79, department 81, and so on. The value is t if the customer bought an item and is missing otherwise. There is one instance per customer. The dataset contains no class attribute, as this is not required to learn association rules. A sample of the data is shown in the following table:
Discover patterns
To discover shopping patterns, we will use the two algorithms that we have looked into before, Apriori and FP-growth.
Apriori
We will use the Apriori algorithm as implemented in Weka.
It iteratively reduces the minimum support until it finds the required number of rules with the given minimum confidence:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.associations.Apriori;

First, we will load the supermarket dataset:

Instances data = new Instances(
    new BufferedReader(
        new FileReader("datasets/chap5/supermarket.arff")));

Next, we will initialize an Apriori instance and call the buildAssociations(Instances) function to start frequent pattern mining, as follows:

Apriori model = new Apriori();
model.buildAssociations(data);

Finally, we can output the discovered itemsets and rules, as shown in the following code:

System.out.println(model);

The output is as follows:

Apriori
=======
Minimum support: 0.15 (694 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
Generated sets of large itemsets:
Size of set of large itemsets L(1): 44
Size of set of large itemsets L(2): 380
Size of set of large itemsets L(3): 910
Size of set of large itemsets L(4): 633
Size of set of large itemsets L(5): 105
Size of set of large itemsets L(6): 1
Best rules found:
1. biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723 <conf:(0.92)> lift:(1.27) lev:(0.03) [155] conv:(3.35)
2. baking needs=t biscuits=t fruit=t total=high 760 ==> bread and cake=t 696 <conf:(0.92)> lift:(1.27) lev:(0.03) [149] conv:(3.28)
3. baking needs=t frozen foods=t fruit=t total=high 770 ==> bread and cake=t 705 <conf:(0.92)> lift:(1.27) lev:(0.03) [150] conv:(3.27)
...

The algorithm outputs the ten best rules according to confidence. Let's look at the first rule and interpret the output, as follows:

biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723 <conf:(0.92)> lift:(1.27) lev:(0.03) [155] conv:(3.35)

It says that when biscuits, frozen foods, and fruits are bought together and the total purchase price is high, it is also very likely that bread and cake are purchased as well. The {biscuits, frozen foods, fruit, total high} itemset appears in 788 transactions, while the {bread, cake} itemset appears in 723 of those transactions. The confidence of this rule is 0.92, meaning that the rule holds true in 92% of the transactions where the {biscuits, frozen foods, fruit, total high} itemset is present. The output also reports additional measures such as lift, leverage, and conviction, which estimate the accuracy against our initial assumptions; for example, the 3.35 conviction value indicates that the rule would be incorrect 3.35 times as often if the association were purely random chance. Lift measures how many times more often X and Y occur together than would be expected if they were statistically independent (a lift of 1 indicates independence). For example, a lift of 2.16 in an X -> Y rule would mean that X and Y occur together 2.16 times more often than expected under independence.
FP-growth
Now, let's try to get the same results with the more efficient FP-growth algorithm. FP-growth is also implemented in the weka.associations package:

import weka.associations.FPGrowth;

FP-growth is initialized similarly to the way we initialized Apriori earlier:

FPGrowth fpgModel = new FPGrowth();
fpgModel.buildAssociations(data);
System.out.println(fpgModel);

The output reveals that FP-growth discovered 16 rules:

FPGrowth found 16 rules (displaying top 10)
1. [fruit=t, frozen foods=t, biscuits=t, total=high]: 788 ==> [bread and cake=t]: 723 <conf:(0.92)> lift:(1.27) lev:(0.03) conv:(3.35)
2. [fruit=t, baking needs=t, biscuits=t, total=high]: 760 ==> [bread and cake=t]: 696 <conf:(0.92)> lift:(1.27) lev:(0.03) conv:(3.28)
...
We can observe that FP-growth found the same set of rules as Apriori; however, the time required to process larger datasets can be significantly shorter.
Other applications in various areas
We looked into affinity analysis to demystify shopping behavior patterns in supermarkets. Although the roots of association rule learning are in analyzing point-of-sale transactions, the technique can be applied outside the retail industry to find relationships among other types of baskets. The notion of a basket can easily be extended to services and products, for example, to analyze items purchased using a credit card, such as rental cars and hotel rooms, and to analyze information on value-added services purchased by telecom customers (call waiting, call forwarding, DSL, speed call, and so on), which can help the operators determine the ways to improve their bundling of service packages. Additionally, we will look into the following examples of potential cross-industry applications:
Medical diagnosis
Protein sequences
Census data
Customer relationship management
IT Operations Analytics
Medical diagnosis
Association rules can be applied in medical diagnosis to assist physicians while treating patients. The general problem of the induction of reliable diagnostic rules is hard as, theoretically, no induction process can guarantee the correctness of induced hypotheses by itself. Practically, diagnosis is not an easy process, as it involves unreliable diagnostic tests and the presence of noise in training examples. Nevertheless, association rules can be used to identify likely symptoms appearing together. A transaction, in this case, corresponds to a medical case, while symptoms correspond to items. When a patient is treated, a list of symptoms is recorded as one transaction.
Protein sequences
A lot of research has gone into understanding the composition and nature of proteins; yet many things remain to be understood satisfactorily. It is now generally believed that amino-acid sequences of proteins are not random. With association rules, it is possible to identify associations between different amino acids that are present in a protein. A protein is a sequence made up of 20 types of amino acids. Each protein has a unique three-dimensional structure, which depends on the amino-acid sequence; a slight change in the sequence may change the functioning of the protein. To apply association rules, a protein corresponds to a transaction, while amino acids, their two-grams, and structure correspond to the items. Such association rules are desirable for enhancing our understanding of protein composition and hold the potential to give clues regarding the global interactions amongst some particular sets of amino acids occurring in proteins. Knowledge of these association rules, or constraints, is highly desirable for the synthesis of artificial proteins.
Census data
Censuses make a huge variety of general statistical information about society available to both researchers and the general public. The information related to population and economic censuses can be used for forecasting when planning public services (education, health, transport, and funds) as well as in public business (for setting up new factories, shopping malls, or banks and even marketing particular products). To discover frequent patterns, each statistical area (for example, municipality, city, and neighborhood) corresponds to a transaction, and the collected indicators correspond to the items.
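Across all of these domains, the pattern is the same: events that occur together are encoded as a binary transactional table and fed to an algorithm such as Apriori. The following is a small, hypothetical Java sketch of that encoding step; the symptom names and the CSV-style output are illustrative assumptions, not taken from any dataset used in this article:

import java.util.*;

public class TransactionEncoder {

    public static void main(String[] args) {
        // Hypothetical "baskets": each medical case lists the symptoms observed together
        List<List<String>> cases = Arrays.asList(
                Arrays.asList("fever", "cough", "headache"),
                Arrays.asList("cough", "sore throat"),
                Arrays.asList("fever", "headache"));

        // Collect the full item vocabulary, keeping a stable column order
        SortedSet<String> items = new TreeSet<>();
        for (List<String> c : cases) {
            items.addAll(c);
        }

        // Emit one binary row per case: 1 if the symptom occurred, 0 otherwise
        System.out.println(String.join(",", items));
        for (List<String> c : cases) {
            StringBuilder row = new StringBuilder();
            for (String item : items) {
                if (row.length() > 0) {
                    row.append(',');
                }
                row.append(c.contains(item) ? 1 : 0);
            }
            System.out.println(row);
        }
    }
}

The same loop works for census indicators, telecom services, or IT alerts; only the item vocabulary and the source of each "basket" change.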
Customer relationship management
Association rules can reinforce the knowledge management process and allow marketing personnel to know their customers well in order to provide better quality services. For example, association rules can be applied to detect changes in customer behavior at different time snapshots from customer profiles and sales data. The basic idea is to discover changes from two datasets and generate rules from each dataset to carry out rule matching.
IT Operations Analytics
Based on records of a large number of transactions, association rule learning is well suited to be applied to the data that is routinely collected in day-to-day IT operations, enabling IT Operations Analytics tools to detect frequent patterns and identify critical changes. IT specialists need to see the big picture and understand, for example, how a problem on a database could impact an application server. For a specific day, IT operations may take in a variety of alerts, presenting them in a transactional database. Using an association rule learning algorithm, IT Operations Analytics tools can correlate and detect the frequent patterns of alerts appearing together. This can lead to a better understanding of how one component impacts another. With identified alert patterns, it is possible to apply predictive analytics. For example, suppose a particular database server hosts a web application and suddenly an alert about the database is triggered. By looking into the frequent patterns identified by an association rule learning algorithm, the IT staff knows that action needs to be taken before the web application is impacted. Association rule learning can also discover alert events originating from the same IT event. For example, every time a new user is added, six changes in the Windows operating system are detected. Next, in Application Portfolio Management (APM), IT may face multiple alerts showing that the transactional time in a database is high. If all these issues originate from the same source (such as getting hundreds of alerts about changes that are all due to a Windows update), this frequent pattern mining can help to quickly cut through the number of alerts, allowing the IT operators to focus on truly critical changes.
Summary
In this article, you learned how to leverage association rule learning on transactional datasets to gain insight about frequent patterns. We performed an affinity analysis in Weka and learned that the hard work lies in the analysis of results—careful attention is required when interpreting rules, as association (that is, correlation) is not the same as causation.
Resources for Article:
Further resources on this subject:
Debugging Java Programs using JDB [article]
Functional Testing with JMeter [article]
Implementing AJAX Grid using jQuery data grid plugin jqGrid [article]

Getting Started with D3, ES2016, and Node.js

Packt
08 Apr 2016
25 min read
In this article by Ændrew Rininsland, author of the book Learning d3.js Data Visualization, Second Edition, we'll lay the foundations of what you'll need to run all the examples in the article. I'll explain how you can start writing ECMAScript 2016 (ES2016) today—which is the latest and most advanced version of JavaScript—and show you how to use Babel to transpile it to ES5, allowing your modern JavaScript to be run on any browser. We'll then cover the basics of using D3 to render a basic chart. (For more resources related to this topic, see here.) What is D3.js? D3 stands for Data-Driven Documents, and it is being developed by Mike Bostock and the D3 community since 2011. The successor to Bostock's earlier Protovis library, it allows pixel-perfect rendering of data by abstracting the calculation of things such as scales and axes into an easy-to-use domain-specific language (DSL). D3's idioms should be immediately familiar to anyone with experience of using the massively popular jQuery JavaScript library. Much like jQuery, in D3, you operate on elements by selecting them and then manipulating via a chain of modifier functions. Especially within the context of data visualization, this declarative approach makes using it easier and more enjoyable than a lot of other tools out there. The official website, https://d3js.org/, features many great examples that show off the power of D3, but understanding them is tricky at best. After finishing this article, you should be able to understand D3 well enough to figure out the examples. If you want to follow the development of D3 more closely, check out the source code hosted on GitHub at https://github.com/mbostock/d3. The fine-grained control and its elegance make D3 one of the most—if not the most—powerful open source visualization libraries out there. This also means that it's not very suitable for simple jobs such as drawing a line chart or two—in that case you might want to use a library designed for charting. Many use D3 internally anyway. One such interface is Axis, an open source app that I've written. It allows users to easily build basic line, pie, area, and bar charts without writing any code. Try it out at use.axisjs.org. As a data manipulation library, D3 is based on the principles of functional programming, which is probably where a lot of confusion stems from. Unfortunately, functional programming goes beyond the scope of this article, but I'll explain all the relevant bits to make sure that everyone's on the same page. What’s ES2016? One of the main changes in this edition is the emphasis on ES2016, the most modern version of JavaScript currently available. Formerly known as ES6 (Harmony), it pushes the JavaScript language's features forward significantly, allowing for new usage patterns that simplify code readability and increase expressiveness. If you've written JavaScript before and the examples in this article look pretty confusing, it means you're probably familiar with the older, more common ES5 syntax. But don't sweat! It really doesn't take too long to get the hang of the new syntax, and I will try to explain the new language features as we encounter them. Although it might seem a somewhat steep learning curve at the start, by the end, you'll have improved your ability to write code quite substantially and will be on the cutting edge of contemporary JavaScript development. 
For a really good rundown of all the new toys you have with ES2016, check out this nice guide by the folks at Babel.js, which we will use extensively throughout this article: https://babeljs.io/docs/learn-es2015/. Before I go any further, let me clear some confusion about what ES2016 actually is. Initially, the ECMAScript (or ES for short) standards were incremented by cardinal numbers, for instance, ES4, ES5, ES6, and ES7. However, with ES6, they changed this so that a new standard is released every year in order to keep pace with modern development trends, and thus we refer to the year (2016) now. The big release was ES2015, which more or less maps to ES6. ES2016 is scheduled for ratification in June 2016, and builds on the previous year's standard, while adding a few fixes and two new features. You don't really need to worry about compatibility because we use Babel.js to transpile everything down to ES5 anyway, so it runs the same in Node.js and in the browser. For the sake of simplicity, I will use the word "ES2016" throughout in a general sense to refer to all modern JavaScript, but I'm not referring to the ECMAScript 2016 specification itself. Getting started with Node and Git on the command line I will try not to be too opinionated in this article about which editor or operating system you should use to work through it (though I am using Atom on Mac OS X), but you are going to need a few prerequisites to start. The first is Node.js. Node is widely used for web development nowadays, and it's actually just JavaScript that can be run on the command line. If you're on Windows or Mac OS X without Homebrew, use the installer at https://nodejs.org/en/. If you're on Mac OS X and are using Homebrew, I would recommend installing "n" instead, which allows you to easily switch between versions of Node: $ brew install n $ n latest Regardless of how you do it, once you finish, verify by running the following lines: $ node --version $ npm --version If it displays the versions of node and npm (I'm using 5.6.0 and 3.6.0, respectively), it means you're good to go. If it says something similar to Command not found, double-check whether you've installed everything correctly, and verify that Node.js is in your $PATH environment variable. Next, you'll want to clone the article's repository from GitHub. Change to your project directory and type this: $ git clone https://github.com/aendrew/learning-d3 $ cd $ learning-d3 This will clone the development environment and all the samples in the learning-d3/ directory as well as switch you into it. Another option is to fork the repository on GitHub and then clone your fork instead of mine as was just shown. This will allow you to easily publish your work on the cloud, enabling you to more easily seek support, display finished projects on GitHub pages, and even submit suggestions and amendments to the parent project. This will help us improve this article for future editions. To do this, fork aendrew/learning-d3 and replace aendrew in the preceding code snippet with your GitHub username. Each chapter of this book is in a separate branch. To switch between them, type the following command: $ git checkout chapter1 Replace 1 with whichever chapter you want the examples for. Stay at master for now though. To get back to it, type this line: $ git stash save && git checkout master The master branch is where you'll do a lot of your coding as you work through this article. 
It includes a prebuilt package.json file (used by npm to manage dependencies), which we'll use to aid our development over the course of this article. There's also a webpack.config.js file, which tells the build system where to put things, and there are a few other sundry config files. We still need to install our dependencies, so let's do that now: $ npm install All of the source code that you'll be working on is in the src/ folder. You'll notice it contains an index.html and an index.js file; almost always, we'll be working in index.js, as index.html is just a minimal container to display our work in: <!DOCTYPE html> <div id="chart"></div> <script src="/assets/bundle.js"></script> To get things rolling, start the development server by typing the following line: $ npm start This starts up the Webpack development server, which will transform our ES2016 JavaScript into backwards-compatible ES5, which can easily be loaded by most browsers. In the preceding HTML, bundle.js is the compiled code produced by Webpack. Now point Chrome to localhost:8080 and fire up the developer console (Ctrl +Shift + J for Linux and Windows and Option + Command + J for Mac). You should see a blank website and a blank JavaScript console with a Command Prompt waiting for some code: A quick Chrome Developer Tools primer Chrome Developer Tools are indispensable to web development. Most modern browsers have something similar, but to keep this article shorter, we'll stick to Chrome here for the sake of simplicity. Feel free to use a different browser. Firefox's Developer Edition is particularly nice. We are mostly going to use the Elements and Console tabs, Elements to inspect the DOM and Console to play with JavaScript code and look for any problems. The other six tabs come in handy for large projects: The Network tab will let you know how long files are taking to load and help you inspect the Ajax requests. The Profiles tab will help you profile JavaScript for performance. The Resources tab is good for inspecting client-side data. Timeline and Audits are useful when you have a global variable that is leaking memory and you're trying to work out exactly why your library is suddenly causing Chrome to use 500 MB of RAM. While I've used these in D3 development, they're probably more useful when building large web applications with frameworks such as React and Angular. One of the favorites from Developer Tools is the CSS inspector at the right-hand side of the Elements tab. It can tell you what CSS rules are affecting the styling of an element, which is very good for hunting rogue rules that are messing things up. You can also edit the CSS and immediately see the results, as follows: The obligatory bar chart example No introductory chapter on D3 would be complete without a basic bar chart example. They are to D3 as "Hello World" is to everything else, and 90 percent of all data storytelling can be done in its simplest form with an intelligent bar or line chart. For a good example of this, look at the kinds of graphics The Economist includes with their articles—they frequently summarize the entire piece with a simple line chart. Coming from a newsroom development background, many of my examples will be related to some degree to current events or possible topics worth visualizing with data. The news development community has been really instrumental in creating the environment for D3 to flourish, and it's increasingly important for aspiring journalists to have proficiency in tools such as D3. 
The first dataset that we'll use is UNHCR's regional population data. The documentation for this endpoint is at data.unhcr.org/wiki/index.php/Get-population-regional.html. We'll create a bar for each population of displaced people. The first step is to get a basic container set up, which we can then populate with all of our delicious new ES2016 code. At the top of index.js, put the following code: export class BasicChart {   constructor(data) {     var d3 = require('d3'); // Require D3 via Webpack     this.data = data;     this.svg = d3.select('div#chart').append('svg');   } } var chart = new BasicChart(); If you open this in your browser, you'll get the following error on your console: Uncaught Error: Cannot find module "d3" This is because we haven't installed it yet. You’ll notice on line 3 of the preceding code that we import D3 by requiring it. If you've used D3 before, you might be more familiar with it attached to the window global object. This is essentially the same as including a script tag that references D3 in your HTML document, the only difference being that Webpack uses the Node version and compiles it into your bundle.js. To install D3, you use npm. In your project directory, type the following line: $ npm install d3 --save This will pull the latest version of D3 from npmjs.org to the node_modules directory and save it in your package.json file. The package.json file is really useful; instead of keeping all your dependencies inside of your Git repository, you can easily redownload them all just by typing this line: $ npm install If you go back to your browser and switch quickly to the Elements tab, you'll notice a new SVG element as a child of #chart. Go back to index.js. Let's add a bit more to the constructor before I explain what's going on here: export class BasicChart {   constructor(data) {     var d3 = require('d3'); // Require D3 via Webpack     this.data = data;     this.svg = d3.select('div#chart').append('svg');     this.margin = {       left: 30,       top: 30,       right: 0,       bottom: 0     };     this.svg.attr('width', window.innerWidth);     this.svg.attr('height', window.innerHeight);     this.width = window.innerWidth - this.margin.left - this.margin.right;     this.height = window.innerHeight - this.margin.top - this.margin.bottom;     this.chart = this.svg.append('g')       .attr('width', this.width)       .attr('height', this.height) .attr('transform', `translate(${this.margin.left}, ${this.margin.top})`);   } } Okay, here we have the most basic container you'll ever make. All it does is attach data to the class: this.data = data; This selects the #chart element on the page, appending an SVG element and assigning it to another class property: this.svg = d3.select('div#chart').append('svg'); Then it creates a third class property, chart, as a group that's offset by the margins: this.width = window.innerWidth - this.margin.left - this.margin.right;   this.height = window.innerHeight - this.margin.top - this.margin.bottom;   this.chart = svg.append('g')     .attr('width', this.width)     .attr('height', this.height)     .attr('transform', `translate(${this.margin.left}, ${this.margin.top})`); Notice the snazzy new ES2016 string interpolation syntax—using `backticks`, you can then echo out a variable by enclosing it in ${ and }. No more concatenating! The preceding code is not really all that interesting, but wouldn't it be awesome if you never had to type that out again? Well! 
Because you're the total boss and are learning ES2016 like all the cool kids, you won't ever have to. Let's create our first child class! We're done with BasicChart for the moment. Now, we want to create our actual bar chart class: export class BasicBarChart extends BasicChart {   constructor(data) {     super(data);   } } This is probably very confusing if you're new to ES6. First off, we're extending BasicChart, which means all the class properties that we just defined a minute ago are now available for our BasicBarChart child class. However, if we instantiate a new instance of this, we get the constructor function in our child class. How do we attach the data object so that it's available for both BasicChart and BasicBarChart? The answer is super(), which merely runs the constructor function of the parent class. In other words, even though we don't assign data to this.data as we did previously, it will still be available there when we need it. This is because it was assigned via the parent constructor through the use of super(). We're almost at the point of getting some bars onto that graph; hold tight! But first, we need to define our scales, which decide how D3 maps data to pixel values. Add this code to the constructor of BasicBarChart: let x = d3.scale.ordinal()   .rangeRoundBands([this.margin.left, this.width - this.margin.right], 0.1); The x scale is now a function that maps inputs from an as-yet-unknown domain (we don't have the data yet) to a range of values between this.margin.left and this.width - this.margin.right, that is, between 30 and the width of your viewport minus the right margin, with some spacing defined by the 0.1 value. Because it's an ordinal scale, the domain will have to be discrete rather than continuous. The rangeRoundBands means the range will be split into bands that are guaranteed to be round numbers. Hoorah! We have fit our first new fancy ES2016 feature! The let is the new var—you can still use var to define variables, but you should use let instead because it's limited in scope to the block, statement, or expression on which it is used. Meanwhile, var is used for more global variables, or variables that you want available regardless of the block scope. For more on this, visit http://mdn.io/let. If you have no idea what I'm talking about here, don't worry. It just means that you should define variables with let because they're more likely to act as you think they should and are less likely to leak into other parts of your code. It will also throw an error if you use it before it's defined, which can help with troubleshooting and preventing sneaky bugs. Still inside the constructor, we define another scale named y: let y = d3.scale.linear().range([this.height, this.margin.bottom]); Similarly, the y scale is going to map a currently unknown linear domain to a range between this.height and this.margin.bottom, that is, your viewport height and 30. Inverting the range is important because D3.js considers the top of a graph to be y=0. If ever you find yourself trying to troubleshoot why a D3 chart is upside down, try switching the range values. Now, we define our axes. Add this just after the preceding line, inside the constructor: let xAxis = d3.svg.axis().scale(x).orient('bottom'); let yAxis = d3.svg.axis().scale(y).orient('left'); We've told each axis what scale to use when placing ticks and which side of the axis to put the labels on. D3 will automatically decide how many ticks to display, where they should go, and how to label them. Now the fun begins! 
We're going to load in our data using Node-style require statements this time around. This works because our sample dataset is in JSON and it's just a file in our repository. For now, this will suffice for our purposes—no callbacks, promises, or observables necessary! Put this at the bottom of the constructor: let data = require('./data/chapter1.json'); Once or maybe twice in your life, the keys in your dataset will match perfectly and you won't need to transform any data. This almost never happens, and today is not one of those times. We're going to use basic JavaScript array operations to filter out invalid data and map that data into a format that's easier for us to work with: let totalNumbers = data.filter((obj) => { return obj.population.length;   })   .map(     (obj) => {       return {         name: obj.name,         population: Number(obj.population[0].value)       };     }   ); This runs the data that we just imported through Array.prototype.filter, whereby any elements without a population array are stripped out. The resultant collection is then passed through Array.prototype.map, which creates an array of objects, each comprised of a name and a population value. We've turned our data into a list of two-value dictionaries. Let's now supply the data to our BasicBarChart class and instantiate it for the first time. Consider the line that says the following: var chart = new BasicChart(); Replace it with this line: var myChart = new BasicBarChart(totalNumbers); The myChart.data will now equal totalNumbers! Go back to the constructor in the BasicBarChart class. Remember the x and y scales from before? We can finally give them a domain and make them useful. Again, a scale is a simply a function that maps an input range to an output domain: x.domain(data.map((d) => { return d.name })); y.domain([0, d3.max(data, (d) => { return d.population; })]); Hey, there's another ES2016 feature! Instead of typing function() {} endlessly, you can now just put () => {} for anonymous functions. Other than being six keystrokes less, the "fat arrow" doesn't bind the value of this to something else, which can make life a lot easier. For more on this, visit http://mdn.io/Arrow_functions. Since most D3 elements are objects and functions at the same time, we can change the internal state of both scales without assigning the result to anything. The domain of x is a list of discrete values. The domain of y is a range from 0 to the d3.max of our dataset—the largest value. Now we're going to draw the axes on our graph: this.chart.append('g')         .attr('class', 'axis')         .attr('transform', `translate(0, ${this.height})`)         .call(xAxis); We've appended an element called g to the graph, given it the axis CSS class, and moved the element to a place in the bottom-left corner of the graph with the transform attribute. Finally, we call the xAxis function and let D3 handle the rest. 
The drawing of the other axis works exactly the same, but with different arguments: this.chart.append('g')         .attr('class', 'axis')         .attr('transform', `translate(${this.margin.left}, 0)`)         .call(yAxis); Now that our graph is labeled, it's finally time to draw some data: this.chart.selectAll('rect')         .data(data)         .enter()         .append('rect')         .attr('class', 'bar')         .attr('x', (d) => { return x(d.name); })         .attr('width', x.rangeBand())         .attr('y', (d) => { return y(d.population); })         .attr('height', (d) => { return this.height - y(d.population); }); Okay, there's plenty going on here, but this code is saying something very simple. This is what is says: For all rectangles (rect) in the graph, load our data Go through it For each item, append a rect Then define some attributes Ignore the fact that there aren't any rectangles initially; what you're doing is creating a selection that is bound to data and then operating on it. I can understand that it feels a bit weird to operate on non-existent elements (this was personally one of my biggest stumbling blocks when I was learning D3), but it's an idiom that shows its usefulness later on when we start adding and removing elements due to changing data. The x scale helps us calculate the horizontal positions, and rangeBand gives the width of the bar. The y scale calculates vertical positions, and we manually get the height of each bar from y to the bottom. Note that whenever we needed a different value for every element, we defined an attribute as a function (x, y, and height); otherwise, we defined it as a value (width). Keep this in mind when you're tinkering. Let's add some flourish and make each bar grow out of the horizontal axis. Time to dip our toes into animations! Modify the code you just added to resemble the following. I've highlighted the lines that are different: this.chart.selectAll('rect')   .data(data)   .enter()   .append('rect')   .attr('class', 'bar')   .attr('x', (d) => { return x(d.name); })   .attr('width', x.rangeBand())   .attr('y', () => { return y(this.margin.bottom); })   .attr('height', 0)   .transition()     .delay((d, i) => { return i*20; })     .duration(800)     .attr('y', (d) => { return y(d.population); })     .attr('height', (d) => {          return this.height - y(d.population);       }); The difference is that we statically put all bars at the bottom (margin.bottom) and then entered a transition with .transition(). From here on, we define the transition that we want. First, we wanted each bar's transition delayed by 20 milliseconds using i*20. Most D3 callbacks will return the datum (or "whatever data has been bound to this element," which is typically set to d) and the index (or the ordinal number of the item currently being evaluated, which is typically i) while setting the this argument to the currently selected DOM element. Because of this last point, we use the fat arrow—so that we can still use the class this.height property. Otherwise, we'd be trying to find the height property on our SVGRect element, which we're midway to trying to define! This gives the histogram a neat effect, gradually appearing from left to right instead of jumping up at once. Next, we say that we want each animation to last just shy of a second, with .duration(800). At the end, we define the final values for the animated attributes—y and height are the same as in the previous code—and D3 will take care of the rest. 
Save your file and the page should auto-refresh in the background. If everything went according to the plan, you should have a chart that looks like the following: According to this UNHCR data from June 2015, by far the largest number of displaced persons are from Syria. Hey, look at this—we kind of just did some data journalism here! Remember that you can look at the entire code on GitHub at http://github.com/aendrew/learning-d3/tree/chapter1 if you didn't get something similar to the preceding screenshot. We still need to do just a bit more, mainly by using CSS to style the SVG elements. We could have just gone to our HTML file and added CSS, but then that means opening that yucky index.html file. And where's the fun in writing HTML when we're learning some newfangled JavaScript?! First, create an index.css file in your src/ directory: html, body {   padding: 0;   margin: 0; }   .axis path, .axis line {   fill: none;   stroke: #eee;   shape-rendering: crispEdges; }   .axis text {   font-size: 11px; }   .bar {   fill: steelblue; } Then just add the following line to index.js: require('./index.css'); I know. Crazy, right?! No <style> tags needed! It's worth noting that anything involving require is the result of a Webpack loader; in this article, we've used both the CSS/Style and JSON loaders. Although the author of this text is a fan of Webpack, all we're doing is compiling the styles into bundle.js with Webpack instead of requiring them globally via a <style> tag. This is cool because instead of uploading a dozen files when deploying your finished code, you effectively deploy one optimized bundle. You can also scope CSS rules to be particular to when they’re being included and all sorts of other nifty stuff; for more information, refer to github.com/webpack/css-loader#local-scope. Looking at the preceding CSS, you can now see why we added all those classes to our shapes—we can now directly reference them when styling with CSS. We made the axes thin, gave them a light gray color, and used a smaller font for the labels. The bars should be light blue. Save and wait for the page to refresh. We've made our first D3 chart! I recommend fiddling with the values for width, height, and margin inside of BasicChart to get a feel of the power of D3. You'll notice that everything scales and adjusts to any size without you having to change other code. Smashing! Summary In this article, you learned what D3 is and took a glance at the core philosophy behind how it works. You also set up your computer for prototyping of ideas and to play with visualizations. This environment will be assumed throughout the article. We went through a simple example and created an animated histogram using some of the basics of D3. You learned about scales and axes, that the vertical axis is inverted, that any property defined as a function is recalculated for every data point, and that we use a combination of CSS and SVG to make things beautiful. We also did a lot of fancy stuff with ES2016, Babel, and Webpack and got Node.js installed. Go us! Most of all, this article has given you the basic tools so that you can start playing with D3.js on your own. Tinkering is your friend! Don't be afraid to break stuff—you can always reset to a chapter's default state by running $ git reset --soft origin/chapter1, replacing 1 with whichever chapter you're on. Next, we'll be looking at all this a bit more in depth, specifically how the DOM, SVG, and CSS interact with each other. 
This article discussed quite a lot, so if some parts got away from you, don't worry. Resources for Article: Further resources on this subject: An Introduction to Node.js Design Patterns [article] Developing Node.js Web Applications [article] Developing a Basic Site with Node.js and Express [article]

Morphology – Getting Our Feet Wet

Packt
04 Apr 2016
20 min read
In this article by Deepti Chopra, Nisheeth Joshi, and Iti Mathur authors of the book Mastering Natural Language Processing with Python, morphology may be defined as the study of the composition of words using morphemes. A morpheme is the smallest unit of the language that has a meaning. In this article, we will discuss stemming and lemmatizing, creating a stemmer and lemmatizer for non-English languages, developing a morphological analyzer and morphological generator using machine learning tools, creating a search engine, and many other concepts. In brief, this article will include the following topics: Introducing morphology Creating a stemmer and lemmatizer Developing a stemmer for non-English languages Creating a morphological analyzer Creating a morphological generator Creating a search engine (For more resources related to this topic, see here.) Introducing morphology Morphology may be defined as the study of the production of tokens with the help of morphemes. A morpheme is the basic unit of language, which carries a meaning. There are two types of morphemes: stems and affixes (suffixes, prefixes, infixes, and circumfixes). Stems are also referred to as free morphemes since they can even exist without adding affixes. Affixes are referred to as bound morphemes since they cannot exist in a free form, and they always exist along with free morphemes. Consider the word "unbelievable". Here, "believe" is a stem or free morpheme. It can even exist on its own. The morphemes "un" and "able" are affixes or bound morphemes. They cannot exist in s free form but exist together with a stem. There are three kinds of languages, namely isolating languages, agglutinative languages, and inflecting languages. Morphology has different meanings in all these languages. Isolating languages are those languages in which words are merely free morphemes, and they do not carry any tense (past, present, and future) or number (singular or plural) information. Mandarin Chinese is an example of an isolating language. Agglutinative languages are those languages in which small words combine together to convey compound information. Turkish is an example of an agglutinative language. Inflecting languages are languages in which words are broken down into simpler units, but all these simpler units exhibit different meanings. Latin is an example of an inflecting language. There are morphological processes such as inflections, derivations, semi-affixes, combining forms, and cliticization. An inflection refers to transforming a word into a form so that it represents a person, number, tense, gender, case, aspect, and mood. Here, the syntactic category of the token remains the same. In derivation, the syntactic category of word is also changed. Semi-affixes are bound morphemes that exhibit a word-like quality, for example, noteworthy, antisocial, anticlockwise, and so on. Understanding stemmers Stemming may be defined as the process of obtaining a stem from a word by eliminating the affixes from it. For example, in the word "raining", a stemmer would return the root word or the stem word "rain" by removing the affix "ing" from "raining". In order to increase the accuracy of information retrieval, search engines mostly use stemming to get a stem and store it as an index word. Search engines call words with the same meaning synonyms, which may be a kind of query expansion known as conflation. Martin Porter has designed a well-known stemming algorithm known as the Porter Stemming Algorithm. 
This algorithm is basically designed to replace and eliminate some well-known suffixes present in English words. To perform stemming in NLTK, we can simply instantiate the PorterStemmer class and then perform stemming by calling the stem method. Let's take a look at the code for stemming using the PorterStemmer class in NLTK:

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmerporter = PorterStemmer()
>>> stemmerporter.stem('working')
'work'
>>> stemmerporter.stem('happiness')
'happi'

The PorterStemmer class is trained and has the knowledge of many stems and word forms in the English language. The process of stemming takes place in a series of steps and transforms a word into a shorter word that carries a similar meaning to the root word. The StemmerI interface defines the stem() method, and all stemmers inherit from this interface. The inheritance diagram is depicted here: Another stemming algorithm, known as the Lancaster Stemming algorithm, was developed at Lancaster University. Similar to the PorterStemmer class, the LancasterStemmer class is used in NLTK to implement Lancaster stemming. Let's consider the following code, which depicts Lancaster stemming in NLTK:

>>> import nltk
>>> from nltk.stem import LancasterStemmer
>>> stemmerlan = LancasterStemmer()
>>> stemmerlan.stem('working')
'work'
>>> stemmerlan.stem('happiness')
'happy'

We can also build our own stemmer in NLTK using RegexpStemmer. This works by accepting a string and eliminating it from the prefix or suffix of a word when a match is found. Let's consider an example of stemming using RegexpStemmer in NLTK:

>>> import nltk
>>> from nltk.stem import RegexpStemmer
>>> stemmerregexp = RegexpStemmer('ing')
>>> stemmerregexp.stem('working')
'work'
>>> stemmerregexp.stem('happiness')
'happiness'
>>> stemmerregexp.stem('pairing')
'pair'

We can use RegexpStemmer in cases where stemming cannot be performed using PorterStemmer and LancasterStemmer. The SnowballStemmer class is used to perform stemming in 13 languages other than English. In order to perform stemming using SnowballStemmer, first an instance is created for the language in which stemming needs to be performed, and then stemming is performed using the stem() method. Consider the following example to perform stemming in Spanish and French in NLTK using SnowballStemmer:

>>> import nltk
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanishstemmer = SnowballStemmer('spanish')
>>> spanishstemmer.stem('comiendo')
'com'
>>> frenchstemmer = SnowballStemmer('french')
>>> frenchstemmer.stem('manger')
'mang'

The nltk.stem.api module contains the StemmerI class, in which the stem() function is defined. Consider the following code present in NLTK, which enables stemming to be performed:

class StemmerI(object):
    """
    It is an interface that helps to eliminate morphological affixes
    from the tokens; the process is known as stemming.
    """
    def stem(self, token):
        """
        Eliminate affixes from the token and return the stem.
""" raise NotImplementedError() Here's the code used to perform stemming using multiple stemmers: >>> import nltk >>> from nltk.stem.porter import PorterStemmer >>> from nltk.stem.lancaster import LancasterStemmer >>> from nltk.stem import SnowballStemmer >>> def obtain_tokens(): With open('/home/p/NLTK/sample1.txt') as stem: tok = nltk.word_tokenize(stem.read()) return tokens >>> def stemming(filtered): stem=[] for x in filtered: stem.append(PorterStemmer().stem(x)) return stem >>> if_name_=="_main_": tok= obtain_tokens() >>> print("tokens is %s")%(tok) >>> stem_tokens= stemming(tok) >>> print("After stemming is %s")%stem_tokens >>> res=dict(zip(tok,stem_tokens)) >>> print("{tok:stemmed}=%s")%(result) Understanding lemmatization Lemmatization is the process in which we transform a word into a form that has a different word category. The word formed after lemmatization is entirely different from what it was initially. Consider an example of lemmatization in NLTK: >>> import nltk >>> from nltk.stem import WordNetLemmatizer >>> lemmatizer_output=WordNetLemmatizer() >>> lemmatizer_output.lemmatize('working') 'working' >>> lemmatizer_output.lemmatize('working',pos='v') 'work' >>> lemmatizer_output.lemmatize('works') 'work' WordNetLemmatizer may be defined as a wrapper around the so-called WordNet corpus, and it makes use of the morphy() function present in WordNetCorpusReader to extract a lemma. If no lemma is extracted, then the word is only returned in its original form. For example, for 'works', the lemma that is returned is in the singular form 'work'. This code snippet illustrates the difference between stemming and lemmatization: >>> import nltk >>> from nltk.stem import PorterStemmer >>> stemmer_output=PorterStemmer() >>> stemmer_output.stem('happiness') 'happi' >>> from nltk.stem import WordNetLemmatizer >>> lemmatizer_output.lemmatize('happiness') 'happiness' In the preceding code, 'happiness' is converted to 'happi' by stemming it. Lemmatization can't find the root word for 'happiness', so it returns the word "happiness". Developing a stemmer for non-English languages Polyglot is a software that is used to provide models called morfessor models, which are used to obtain morphemes from tokens. The Morpho project's goal is to create unsupervised data-driven processes. Its focuses on the creation of morphemes, which are the smallest units of syntax. Morphemes play an important role in natural language processing. They are useful in automatic recognition and the creation of language. With the help of the vocabulary dictionaries of polyglot, morfessor models on 50,000 tokens of different languages was used. Here's the code to obtain a language table using a polyglot: from polyglot.downloader import downloader print(downloader.supported_languages_table("morph2")) The output obtained from the preceding code is in the form of these languages listed as follows: 1. Piedmontese language 2. Lombard language 3. Gan Chinese 4. Sicilian 5. Scots 6. Kirghiz, Kyrgyz 7. Pashto, Pushto 8. Kurdish 9. Portuguese 10. Kannada 11. Korean 12. Khmer 13. Kazakh 14. Ilokano 15. Polish 16. Panjabi, Punjabi 17. Georgian 18. Chuvash 19. Alemannic 20. Czech 21. Welsh 22. Chechen 23. Catalan; Valencian 24. Northern Sami 25. Sanskrit (Saṁskṛta) 26. Slovene 27. Javanese 28. Slovak 29. Bosnian-Croatian-Serbian 30. Bavarian 31. Swedish 32. Swahili 33. Sundanese 34. Serbian 35. Albanian 36. Japanese 37. Western Frisian 38. French 39. Finnish 40. Upper Sorbian 41. Faroese 42. Persian 43. Sinhala, Sinhalese 44. Italian 45. 
Amharic 46. Aragonese 47. Volapük 48. Icelandic 49. Sakha 50. Afrikaans 51. Indonesian 52. Interlingua 53. Azerbaijani 54. Ido 55. Arabic 56. Assamese 57. Yoruba 58. Yiddish 59. Waray-Waray 60. Croatian 61. Hungarian 62. Haitian; Haitian Creole 63. Quechua 64. Armenian 65. Hebrew (modern) 66. Silesian 67. Hindi 68. Divehi; Dhivehi; Mald... 69. German 70. Danish 71. Occitan 72. Tagalog 73. Turkmen 74. Thai 75. Tajik 76. Greek, Modern 77. Telugu 78. Tamil 79. Oriya 80. Ossetian, Ossetic 81. Tatar 82. Turkish 83. Kapampangan 84. Venetian 85. Manx 86. Gujarati 87. Galician 88. Irish 89. Scottish Gaelic; Gaelic 90. Nepali 91. Cebuano 92. Zazaki 93. Walloon 94. Dutch 95. Norwegian 96. Norwegian Nynorsk 97. West Flemish 98. Chinese 99. Bosnian 100. Breton 101. Belarusian 102. Bulgarian 103. Bashkir 104. Egyptian Arabic 105. Tibetan Standard, Tib... 106. Bengali 107. Burmese 108. Romansh 109. Marathi (Marāthī) 110. Malay 111. Maltese 112. Russian 113. Macedonian 114. Malayalam 115. Mongolian 116. Malagasy 117. Vietnamese 118. Spanish; Castilian 119. Estonian 120. Basque 121. Bishnupriya Manipuri 122. Asturian 123. English 124. Esperanto 125. Luxembourgish, Letzeb... 126. Latin 127. Uighur, Uyghur 128. Ukrainian 129. Limburgish, Limburgan... 130. Latvian 131. Urdu 132. Lithuanian 133. Fiji Hindi 134. Uzbek 135. Romanian, Moldavian, ... The necessary models can be downloaded using the following code: %%bash polyglot download morph2.en morph2.ar [polyglot_data] Downloading package morph2.en to [polyglot_data] /home/rmyeid/polyglot_data... [polyglot_data] Package morph2.en is already up-to-date! [polyglot_data] Downloading package morph2.ar to [polyglot_data] /home/rmyeid/polyglot_data... [polyglot_data] Package morph2.ar is already up-to-date! Consider this example that obtains output from a polyglot: from polyglot.text import Text, Word tokens =["unconditional" ,"precooked", "impossible", "painful", "entered"] for s in tokens: s=Word(s, language="en") print("{:<20}{}".format(s,s.morphemes)) unconditional ['un','conditional'] precooked ['pre','cook','ed'] impossible ['im','possible'] painful ['pain','ful'] entered ['enter','ed'] If tokenization is not performed properly, then we can perform morphological analysis for the process of splitting text into its original constituents: sent="Ihopeyoufindthebookinteresting" para=Text(sent) para.language="en" para.morphemes WordList(['I','hope','you','find','the','book','interesting']) A morphological analyzers Morphological analysis may be defined as the process of obtaining grammatical information about a token given its suffix information. Morphological analysis can be performed in three ways: Morpheme-based morphology (or the item and arrangement approach), Lexeme-based morphology (or the item and process approach), and Word-based morphology (or the word and paradigm approach). A morphological analyzer may be defined as a program that is responsible for the analysis of the morphology of a given input token. It analyzes a given token and generates morphological information, such as gender, number, class, and so on, as an output. In order to perform morphological analysis on a given non-whitespace token, pyEnchant dictionary is used. 
Consider the following code that performs morphological analysis:

import enchant

s = enchant.Dict("en_US")
tok = []

def tokenize(st1):
    # Greedily peel off the longest prefix that is a valid dictionary word
    if not st1:
        return
    for j in range(len(st1), -1, -1):
        if s.check(st1[0:j]):
            tok.append(st1[0:j])
            st1 = st1[j:]
            tokenize(st1)
            break

tokenize("itismyfavouritebook")
print(tok)   # ['it', 'is', 'my', 'favourite', 'book']

tok = []
tokenize("ihopeyoufindthebookinteresting")
print(tok)   # ['i', 'hope', 'you', 'find', 'the', 'book', 'interesting']

We can determine the category of a word as follows:
Morphological hints: Suffix information helps us to detect the category of a word. For example, the -ness and -ment suffixes occur with nouns.
Syntactic hints: Contextual information is conducive to determining the category of a word. For example, if we have found a word that has a noun category, then syntactic hints will be useful in determining whether an adjective appears before or after the noun.
Semantic hints: A semantic hint is also useful in determining the category of a word. For example, if we already know that a word represents the name of a location, then it will fall under the noun category.
Open class: This refers to the class of words that is not fixed; its membership keeps increasing as new words are added to the language. Words in an open class are usually nouns. Prepositions are mostly a closed class.
Morphology captured by the part of speech tagset: A Part of Speech tagset captures information that helps us perform morphological analysis. For example, the word 'plays' would appear with a tag indicating the third person singular.
Omorfi (the open morphology of Finnish) is a package licensed under version 3 of the GNU GPL. It is used for the purpose of performing numerous tasks such as language modeling, morphological analysis, rule-based machine translation, information retrieval, statistical machine translation, morphological segmentation, ontologies, and spell checking and correction.
A morphological generator
A morphological generator is a program that performs the task of morphological generation. Morphological generation may be considered the opposite of morphological analysis. Here, given the description of a word in terms of its number, category, stem, and so on, the original word is retrieved. For example, if root = go, Part of Speech = verb, tense = present, and it occurs along with a third person singular subject, then the morphological generator would generate its surface form, that is, goes. There are many Python-based software packages that perform morphological analysis and generation.
Some of them are as follows: ParaMorfo: This is used to perform the morphological generation and analysis of Spanish and Guarani nouns, adjectives, and verbs HornMorpho: This is used for the morphological generation and analysis of Oromo and Amharic nouns and verbs as well as Tigrinya verbs AntiMorfo: This is used for the morphological generation and analysis of Quechua adjectives, verbs, and nouns as well as Spanish verbs MorfoMelayu: This is used for the morphological analysis of Malay words Other examples of software that is used to perform morphological analysis and generation are as follows: Morph is a morphological generator and analyzer for the English language and the RASP system Morphy is a morphological generator, analyzer, and POS tagger for German Morphisto is a morphological generator and analyzer for German Morfette performs supervised learning (inflectional morphology) for Spanish and French Search engines PyStemmer 1.0.1 consists of Snowball stemming algorithms that are conducive for performing information retrieval tasks and the construction of a search engine. It consists of the Porter stemming algorithm and many other stemming algorithms that are useful for the purpose of performing stemming and information retrieval tasks in many languages, including many European languages. We can construct a vector space search engine by converting the texts into vectors. Here are the steps needed to construct a vector space search engine: Stemming and elimination of stop words. A stemmer is a program that accepts words and converts them into stems. Tokens that have same stem almost have the same meanings. Stop words are also eliminated from text. Consider the following code for the removal of stop words and tokenization: def eliminatestopwords(self,list): " " " Eliminate words which occur often and have not much significance from context point of view. " " " return[ word for word in list if word not in self.stopwords ] def tokenize(self,string): " " " Perform the task of splitting text into stop words and tokens " " " Str=self.clean(str) Words=str.split(" ") return [self.stemmer.stem(word,0,len(word)-1) for word in words] Mapping keywords into vector dimensions.Here's the code required to perform the mapping of keywords into vector dimensions: def obtainvectorkeywordindex(self, documentList): " " " In the document vectors, generate the keyword for the given position of element " " " #Perform mapping of text into strings vocabstring = " ".join(documentList) vocablist = self.parser.tokenise(vocabstring) #Eliminate common words that have no search significance vocablist = self.parser.eliminatestopwords(vocablist) uniqueVocablist = util.removeDuplicates(vocablist) vectorIndex={} offset=0 #Attach a position to keywords that performs mapping with dimension that is used to depict this token for word in uniqueVocablist: vectorIndex[word]=offset offset+=1 return vectorIndex #(keyword:position) Mapping of text strings to vectorsHere, a simple term count model is used. 
The code to convert text strings into vectors is as follows:

def constructVector(self, wordString):
    # Initialise the vector with 0's
    vector = [0] * len(self.vectorKeywordIndex)
    tokList = self.parser.tokenize(wordString)
    tokList = self.parser.eliminatestopwords(tokList)
    for word in tokList:
        vector[self.vectorKeywordIndex[word]] += 1  # simple Term Count Model is used
    return vector

Searching similar documents
By finding the cosine of the angle between the vectors of two documents, we can determine whether the documents are similar or not. If the cosine value is 1, then the angle is 0 degrees and the vectors are said to be parallel (this means that the documents are related). If the cosine value is 0 and the angle is 90 degrees, then the vectors are said to be perpendicular (this means that the documents are not related). This is the code to compute the cosine between the text vectors using NumPy's dot and norm:

from numpy import dot
from numpy.linalg import norm

def cosine(vec1, vec2):
    """
    cosine = ( X * Y ) / ||X|| x ||Y||
    """
    return float(dot(vec1, vec2) / (norm(vec1) * norm(vec2)))

Search keywords
We perform the mapping of keywords to a vector space. We construct a temporary text that represents the items to be searched and then compare it with the document vectors with the help of a cosine measurement. Here is the code needed to search the vector space:

def searching(self, searchinglist):
    """
    Search for documents that match the given list of search terms.
    """
    askVector = self.buildQueryVector(searchinglist)
    ratings = [util.cosine(askVector, textVector)
               for textVector in self.documentVectors]
    ratings.sort(reverse=True)
    return ratings

The following code can be used to detect languages from a source text:

>>> import nltk
>>> import sys
>>> try:
        from nltk import wordpunct_tokenize
        from nltk.corpus import stopwords
    except ImportError:
        print('An error has occurred')
#----------------------------------------------------------------------
>>> def _calculate_languages_ratios(text):
        """
        Compute how likely it is that the given document is written in
        different languages and return a dictionary that appears like
        {'german': 2, 'french': 4, 'english': 1}
        """
        languages_ratios = {}
        '''
        nltk.wordpunct_tokenize() splits all punctuation into separate tokens
        wordpunct_tokenize("I hope you like the book interesting .")
        ['I', 'hope', 'you', 'like', 'the', 'book', 'interesting', '.']
        '''
        tok = wordpunct_tokenize(text)
        wor = [word.lower() for word in tok]
        # Compute the occurrence of unique stopwords in the text
        for language in stopwords.fileids():
            stopwords_set = set(stopwords.words(language))
            words_set = set(wor)
            common_elements = words_set.intersection(stopwords_set)
            languages_ratios[language] = len(common_elements)  # language "score"
        return languages_ratios
#----------------------------------------------------------------
>>> def detect_language(text):
        """
        Compute how likely it is that the given text is written in different
        languages and return the language with the highest score. It makes use
        of a stopwords-based approach, finding the unique stopwords present in
        the analyzed text.
        """
        ratios = _calculate_languages_ratios(text)
        most_rated_language = max(ratios, key=ratios.get)
        return most_rated_language

if __name__ == '__main__':
    text = ''' All over this cosmos, most of the people believe that there is an invisible supreme power that is the creator and the runner of this world. Human being is supposed to be the most intelligent and loved creation by that power and that is being searched by human beings in different ways into different things.
As a result people reveal His assumed form as per their own perceptions and beliefs. It has given birth to different religions and people are divided on the name of religion viz. Hindu, Muslim, Sikhs, Christian etc. People do not stop at this. They debate the superiority of one over the other and fight to establish their views. Shrewd people like politicians oppose and support them at their own convenience to divide them and control them. It has intensified to the extent that even parents of a new born baby teach it about religious differences and recommend their own religion superior to that of others and let the child learn to hate other people just because of religion. Jonathan Swift, an eighteenth century novelist, observes that we have just enough religion to make us hate, but not enough to make us love one another. The word 'religion' does not have a derogatory meaning - A literal meaning of religion is 'A personal or institutionalized system grounded in belief in a God or Gods and the activities connected with this'. At its basic level, 'religion is just a set of teachings that tells people how to lead a good life'. It has never been the purpose of religion to divide people into groups of isolated followers that cannot live in harmony together. No religion claims to teach intolerance or even instructs its believers to segregate a certain religious group or even take the fundamental rights of an individual solely based on their religious choices. It is also said that 'Majhab nhi sikhata aaps mai bair krna'. But this very majhab or religion takes a very heinous form when it is misused by the shrewd politicians and the fanatics e.g. in Ayodhya on 6th December, 1992 some right wing political parties and communal organizations incited the Hindus to demolish the 16th century Babri Masjid in the name of religion to polarize Hindus votes. Muslim fanatics in Bangladesh retaliated and destroyed a number of temples, assassinated innocent Hindus and raped Hindu girls who had nothing to do with the demolition of Babri Masjid. This very inhuman act has been presented by Taslima Nasrin, a Banglsdeshi Doctor-cum-Writer in her controversial novel 'Lajja' (1993) in which, she seems to utilizes fiction's mass emotional appeal, rather than its potential for nuance and universality. ''' >>> language = detect_language(text) >>> print(language) The preceding code will search for stop words and detect the language of the text, which is English. Summary In this article, we discussed stemming, lemmatization, and morphological analysis and generation. Resources for Article: Further resources on this subject: How is Python code organized[article] Machine learning and Python – the Dream Team[article] Putting the Fun in Functional Python[article]

Machine Learning Tasks

Packt
01 Apr 2016
16 min read
In this article written by David Julian, author of the book Designing Machine Learning Systems with Python, the author wants to state that, he will first introduce the basic machine learning tasks. Classification is probably the most common task, due in part to the fact that it is relatively easy, well understood, and solves a lot of common problems. Multiclass classification (for instance, handwriting recognition) can sometimes be achieved by chaining binary classification tasks. However, we lose information this way, and we become unable to define a single decision boundary. For this reason, multiclass classification is often treated separately from binary classification. (For more resources related to this topic, see here.) There are cases where we are not interested in discrete classes but rather a real number, for instance, a probability. These type of problems are regression problems. Both classification and regression require a training set of correctly labelled data. They are supervised learning problems. Originating from these basic machine tasks are a number of derived tasks. In many applications, this may simply be applying the learning model to a prediction to establish a causal relationship. We must remember that explaining and predicting are not the same. A model can make a prediction, but unless we know explicitly how it made the prediction, we cannot begin to form a comprehensible explanation. An explanation requires human knowledge of the domain. We can also use a prediction model to find exceptions from a general pattern. Here, we are interested in the individual cases that deviate from the predictions. This is often called anomaly detection and has wide applications in areas such as detecting bank fraud, noise filtering, and even in the search for extraterrestrial life. An important and potentially useful task is subgroup discovery. Our goal here is not, as in clustering, to partition the entire domain but rather to find a subgroup that has a substantially different distribution. In essence, subgroup discovery is trying to find relationships between a dependent target variable and many independent explaining variables. We are not trying to find a complete relationship but rather a group of instances that are different in ways that are important in the domain. For instance, establishing the subgroups, smoker = true and family history =true, for a target variable of heart disease =true. Finally, we consider control type tasks. These act to optimize control setting to maximize a pay off is given different conditions. This can be achieved in several ways. We can clone expert behavior; the machine learns directly from a human and makes predictions of actions given different conditions. The task is to learn a prediction model for the expert's actions. This is similar to reinforcement learning, where the task is to learn about the relationship between conditions and optimal action. Clustering, on the other hand, is the task of grouping items without any information on that group; this is an unsupervised learning task. Clustering is basically making a measurement of similarity. Related to clustering is association, which is an unsupervised task to find a certain type of pattern in the data. This task is behind movie recommender systems, and customers who bought this also bought .. on checkout pages of online stores. 
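To make the distinction between these task types concrete, here is a minimal sketch using scikit-learn on a toy synthetic dataset. The choice of library, dataset, and models is an illustrative assumption, not something prescribed by the article: it fits a supervised classifier on labelled points and an unsupervised clustering model on the same points without labels.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy data: two blobs of points with class labels 0 and 1.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised learning (classification): train on labelled examples,
# then predict the class of unseen points.
classifier = LogisticRegression()
classifier.fit(X[:150], y[:150])
print("Predicted classes:", classifier.predict(X[150:155]))
print("True classes:     ", y[150:155])

# The same model can output probabilities rather than hard classes,
# which is closer to a real-valued, regression-style prediction.
print("Class probabilities:", classifier.predict_proba(X[150:152]).round(2))

# Unsupervised learning (clustering): no labels are given; the algorithm
# groups points purely by similarity (distance to the cluster centres).
clustering = KMeans(n_clusters=2, random_state=0)
print("Cluster assignments:", clustering.fit_predict(X)[:10])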
Data for machine learning When considering raw data for machine learning applications, there are three separate aspects: The volume of the data The velocity of the data The variety of the data Data volume The volume problem can be approached from three different directions: efficiency, scalability, and parallelism. Efficiency is about minimizing the time it takes for an algorithm to process a unit of information. A component of this is the underlying processing power of the hardware. The other component, and one that we have more control over, is ensuring our algorithms are not wasting precious processing cycles on unnecessary tasks. Scalability is really about brute force, and throwing as much hardware at a problem as you can. With Moore's law, which predicts the trend of computer power doubling every two years and reaching its limit, it is clear that scalability is not, by its self, going to be able to keep pace with the ever increasing amounts of data. Simply adding more memory and faster processors is not, in many cases, going to be a cost effective solution. Parallelism is a growing area of machine learning, and it encompasses a number of different approaches from harnessing capabilities of multi core processors, to large scale distributed computing on many different platforms. Probably, the most common method is to simply run the same algorithm on many machines, each with a different set of parameters. Another method is to decompose a learning algorithm into an adaptive sequence of queries, and have these queries processed in parallel. A common implementation of this technique is known as MapReduce, or its open source version, Hadoop. Data velocity The velocity problem is often approached in terms of data producers and data consumers. The data transfer rate between the two is its velocity, and it can be measured in interactive response times. This is the time it takes from a query being made to its response being delivered. Response times are constrained by latencies such as hard disk read and write times, and the time it takes to transmit data across a network. Data is being produced at ever greater rates, and this is largely driven by the rapid expansion of mobile networks and devices. The increasing instrumentation of daily life is revolutionizing the way products and services are delivered. This increasing flow of data has led to the idea of streaming processing. When input data is at a velocity that makes it impossible to store in its entirety, a level of analysis is necessary as the data streams, in essence, deciding what data is useful and should be stored and what data can be thrown away. An extreme example is the Large Hadron Collider at CERN, where the vast majority of data is discarded. A sophisticated algorithm must scan the data as it is being generated, looking at the information needle in the data haystack. Another instance where processing data streams might be important is when an application requires an immediate response. This is becoming increasingly used in applications such as online gaming and stock market trading. It is not just the velocity of incoming data that we are interested in. In many applications, particularly on the web, the velocity of a system's output is also important. Consider applications such as recommender systems, which need to process large amounts of data and present a response in the time it takes for a web page to load. 
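Before moving on to variety, here is a minimal sketch of the streaming idea just described. It is plain Python with a made-up random generator standing in for a real sensor feed; the threshold and distribution are arbitrary choices. Only a running summary and the interesting outliers are ever stored, while the bulk of the stream is discarded as it arrives.

import random

def sensor_stream(n):
    # Stand-in for a high-velocity data producer (for example, instrument readings).
    for _ in range(n):
        yield random.gauss(20.0, 2.0)

stored = []            # only the "needles" in the data haystack are kept
count, total = 0, 0.0

for reading in sensor_stream(100000):
    # Maintain a running summary instead of storing every raw value.
    count += 1
    total += reading
    running_mean = total / count
    # Keep only readings that deviate strongly from the running mean.
    if abs(reading - running_mean) > 6.0:
        stored.append(reading)

print("readings seen:  ", count)
print("readings stored:", len(stored))
print("running mean:   ", round(running_mean, 2))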
Data variety Collecting data from different sources invariably means dealing with misaligned data structures, and incompatible formats. It also often means dealing with different semantics and having to understand a data system that may have been built on a pretty different set of logical principles. We have to remember that, very often, data is repurposed for an entirely different application than the one it was originally intended for. There is a huge variety of data formats and underlying platforms. Significant time can be spent converting data into one consistent format. Even when this is done, the data itself needs to be aligned such that each record consists of the same number of features and is measured in the same units. Models The goal in machine learning is not to just solve an instance of a problem, but to create a model that will solve unique problems from new data. This is the essence of learning. A learning model must have a mechanism for evaluating its output, and in turn, changing its behavior to a state that is closer to a solution. A model is essentially a hypothesis: a proposed explanation for a phenomenon. The goal is to apply a generalization to the problem. In the case of supervised learning, problem knowledge gained from the training set is applied to the unlabeled test. In the case of an unsupervised learning problem, such as clustering, the system does not learn from a training set. It must learn from the characteristics of the dataset itself, such as degree of similarity. In both cases, the process is iterative. It repeats a well-defined set of tasks, that moves the model closer to a correct hypothesis. There are many models and as many variations on these models as there are unique solutions. We can see that the problems that machine learning systems solve (regression, classification, association, and so on) come up in many different settings. They have been used successfully in almost all branches of science, engineering, mathematics, commerce, and also in the social sciences; they are as diverse as the domains they operate in. This diversity of models gives machine learning systems great problem solving powers. However, it can also be a bit daunting for the designer to decide what is the best model, or models, for a particular problem. To complicate things further, there are often several models that may solve your task, or your task may need several models. The most accurate and efficient pathway through an original problem is something you simply cannot know when you embark upon such a project. There are several modeling approaches. These are really different perspectives that we can use to help us understand the problem landscape. A distinction can be made regarding how a model divides up the instance space. The instance space can be considered all possible instances of your data, regardless of whether each instance actually appears in the data. The data is a subset of the instance space. There are two approaches to dividing up this space: grouping and grading. The key difference between the two is that grouping models divide the instance space into fixed discrete units called segments. Each segment has a finite resolution and cannot distinguish between classes beyond this resolution. Grading, on the other hand, forms a global model over the entire instance space, rather than dividing the space into segments. In theory, the resolution of a grading model is infinite, and it can distinguish between instances no matter how similar they are. 
The distinction between grouping and grading is not absolute, and many models contain elements of both. Geometric models One of the most useful approaches to machine learning modeling is through geometry. Geometric models use the concept of instance space. The most obvious example is when all the features are numerical and can become coordinates in a Cartesian coordinate system. When we only have two or three features, they are easy to visualize. Since many machine learning problems have hundreds or thousands of features, and therefore dimensions, visualizing these spaces is impossible. Importantly, many of the geometric concepts, such as linear transformations, still apply in this hyper space. This can help us better understand our models. For instance, we expect many learning algorithms to be translation invariant, which means that it does not matter where we place the origin in the coordinate system. Also, we can use the geometric concept of Euclidean distance to measure similarity between instances; this gives us a method to cluster alike instances and form a decision boundary between them. Probabilistic models Often, we will want our models to output probabilities rather than just binary true or false. When we take a probabilistic approach, we assume that there is an underlying random process that creates a well-defined, but unknown, probability distribution. Probabilistic models are often expressed in the form of a tree. Tree models are ubiquitous in machine learning, and one of their main advantages is that they can inform us about the underlying structure of a problem. Decision trees are naturally easy to visualize and conceptualize. They allow inspection and do not just give an answer. For example, if we have to predict a category, we can also expose the logical steps that gave rise to a particular result. Also, tree models generally require less data preparation than other models and can handle numerical and categorical data. On the down side, tree models can create overly complex models that do not generalize very well to new data. Another potential problem with tree models is that they can become very sensitive to changes in the input data, and as we will see later, this problem can be mitigated by using them as ensemble learners. Linear models A key concept in machine learning is that of the linear model. Linear models form the foundation of many advanced nonlinear techniques such as support vector machines and neural networks. They can be applied to any predictive task such as classification, regression, or probability estimation. When responding to small changes in the input data, and provided that our data consists of entirely uncorrelated features, linear models tend to be more stable than tree models. Tree models can over-respond to small variation in training data. This is because splits at the root of a tree have consequences that are not recoverable further down a branch, potentially making the rest of the tree significantly different. Linear models, on the other hand, are relatively stable, being less sensitive to initial conditions. However, as you would expect, this has the opposite effect of making it less sensitive to nuanced data. This is described by the terms variance (for over fitting models) and bias (for under fitting models). A linear model is typically low variance and high bias. Linear models are generally best approached from a geometric perspective. 
We know we can easily plot two dimensions of space in a Cartesian co-ordinate system, and we can use the illusion of perspective to illustrate a third. We have also been taught to think of time as being a fourth dimension, but when we start speaking of n dimensions, a physical analogy breaks down. Intriguingly, we can still use many of the mathematical tools that we intuitively apply to three dimensions of space. While it becomes difficult to visualize these extra dimensions, we can still use the same geometric concepts (such as lines, planes, angles, and distance) to describe them. With geometric models, we describe each instance as having a set of real-valued features, each of which is a dimension in a space. Model ensembles Ensemble techniques can be divided broadly into two types. The Averaging Method: With this method, several estimators are run independently, and their predictions are averaged. This includes the random forests and bagging methods. The Boosting Methods: With this method, weak learners are built sequentially using weighted distributions of the data, based on the error rates. Ensemble methods use multiple models to obtain better performance than any single constituent model. The aim is to not only build diverse and robust models, but also to work within limitations such as processing speed and return times. When working with large datasets and quick response times, this can be a significant developmental bottleneck. Troubleshooting and diagnostics are important aspects of working with all machine learning models, but they are especially important when dealing with models that might take days to run. The types of machine learning ensembles that can be created are as diverse as the models themselves, and the main considerations revolve around three things: how we divide our data, how we select the models, and the methods we use to combine their results. This simplistic statement actually encompasses a very large and diverse space. Neural nets When we approach the problem of trying to mimic the brain, we are faced with a number of difficulties. Considering all the different things the brain does, we might first think that it consists of a number of different algorithms, each specialized to do a particular task, and each hard wired into different parts of the brain. This approach translates to considering the brain as a number of subsystems, each with its own program and task. For example, the auditory cortex for perceiving sound has its own algorithm that, say, does a Fourier transform on an incoming sound wave to detect the pitch. The visual cortex, on the other hand, has its own distinct algorithm for decoding the signals from the optic nerve and translating them into the sensation of sight. There is, however, growing evidence that the brain does not function like this at all. It appears, from biological studies, that brain tissue in different parts of the brain can relearn how to interpret inputs. So, rather than consisting of specialized subsystems that are programmed to perform specific tasks, the brain uses the same algorithm to learn different tasks. This single algorithm approach has many advantages, not least of which is that it is relatively easy to implement. It also means that we can create generalized models and then train them to perform specialized tasks. 
Like in real brains, using a singular algorithm to describe how each neuron communicates with the other neurons around it allows artificial neural networks to be adaptable and able to carry out multiple higher-level tasks. Much of the most important work being done with neural net models, and indeed machine learning in general, is through the use of very complex neural nets with many layers and features. This approach is often called deep architecture or deep learning. Human and animal learning occurs at a rate and depth that no machine can match. Many of the elements of biological learning still remain a mystery. One of the key areas of research, and one of the most useful in application, is that of object recognition. This is something quite fundamental to living systems, and higher animals have evolved to possessing an extraordinary ability to learn complex relationships between objects. Biological brains have many layers; each synaptic event exists in a long chain of synaptic processes. In order to recognize complex objects, such as people's faces or handwritten digits, a fundamental task is to create a hierarchy of representation from the raw input to higher and higher levels of abstraction. The goal is to transform raw data, such as a set of pixel values, into something that we can describe as, say, a person riding bicycle. Resources for Article: Further resources on this subject: Python Data Structures [article] Exception Handling in MySQL for Python [article] Python Data Analysis Utilities [article]
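To make the single-learning-algorithm idea above concrete, here is a minimal sketch that maps raw pixel values to a higher-level label using scikit-learn's multi-layer perceptron on its small built-in digits dataset. The library, layer sizes, and dataset are illustrative assumptions rather than anything prescribed by this article.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 8x8 grayscale images of handwritten digits, flattened to 64 pixel values each.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Two hidden layers build increasingly abstract representations of the raw pixels.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
net.fit(X_train, y_train)

print("test accuracy:", round(net.score(X_test, y_test), 3))
print("predicted digit for the first test image:", net.predict(X_test[:1])[0])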

Why Mesos?

Packt
31 Mar 2016
8 min read
In this article by Dipa Dubhasi and Akhil Das authors of the book Mastering Mesos, delves into understanding the importance of Mesos. Apache Mesos is an open source, distributed cluster management software that came out of AMPLab, UC Berkeley in 2011. It abstracts CPU, memory, storage, and other computer resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. It is referred to as a metascheduler (scheduler of schedulers) and a "distributed systems kernel/distributed datacenter OS". It improves resource utilization, simplifies system administration, and supports a wide variety of distributed applications that can be deployed by leveraging its pluggable architecture. It is scalable and efficient and provides a host of features, such as resource isolation and high availability, which, along with a strong and vibrant open source community, makes this one of the most exciting projects. (For more resources related to this topic, see here.) Introduction to the datacenter OS and architecture of Mesos Over the past decade, datacenters have graduated from packing multiple applications into a single server box to having large datacenters that aggregate thousands of servers to serve as a massively distributed computing infrastructure. With the advent of virtualization, microservices, cluster computing, and hyper-scale infrastructure, the need of the hour is the creation of an application-centric enterprise that follows a software-defined datacenter strategy. Currently, server clusters are predominantly managed individually, which can be likened to having multiple operating systems on the PC, one each for processor, disk drive, and so on. With an abstraction model that treats these machines as individual entities being managed in isolation, the ability of the datacenter to effectively build and run distributed applications is greatly reduced. Another way of looking at the situation is comparing running applications in a datacenter to running them on a laptop. One major difference is that while launching a text editor or web browser, we are not required to check which memory modules are free and choose ones that suit our need. Herein lies the significance of a platform that acts like a host operating system and allows multiple users to run multiple applications simultaneously by utilizing a shared set of resources. Datacenters now run varied distributed application workloads, such as Spark, Hadoop, and so on, and need the capability to intelligently match resources and applications. The datacenter ecosystem today has to be equipped to manage and monitor resources and efficiently distribute workloads across a unified pool of resources with the agility and ease to cater to a diverse user base (noninfrastructure teams included). A datacenter OS brings to the table a comprehensive and sustainable approach to resource management and monitoring. This not only reduces the cost of ownership but also allows a flexible handling of resource requirements in a manner that isolated datacenter infrastructure cannot support. The idea behind a datacenter OS is that of an intelligent software that sits above all the hardware in a datacenter and ensures efficient and dynamic resource sharing. Added to this is the capability to constantly monitor resource usage and improve workload and infrastructure management in a seamless way that is not tied to specific application requirements. 
In its absence, we have a scenario with silos in a datacenter that force developers to build software catering to machine-specific characteristics and make the moving and resizing of applications a highly cumbersome procedure. The datacenter OS acts as a software layer that aggregates all servers in a datacenter into one giant supercomputer to deliver the benefits of multilatency, isolation, and resource control across all microservice applications. Another major advantage is the elimination of human-induced error during the continual assigning and reassigning of virtual resources. From a developer's perspective, this will allow them to easily and safely build distributed applications without restricting them to a bunch of specialized tools, each catering to a specific set of requirements. For instance, let's consider the case of Data Science teams who develop analytic applications that are highly resource intensive. An operating system that can simplify how the resources are accessed, shared, and distributed successfully alleviates their concern about reallocating hardware every time the workloads change. Of key importance is the relevance of the datacenter OS to DevOps, primarily a software development approach that emphasizes automation, integration, collaboration, and communication between traditional software developers and other IT professionals. With a datacenter OS that effectively transforms individual servers into a pool of resources, DevOps teams can focus on accelerating development and not continuously worry about infrastructure issues. In a world where distributed computing becomes the norm, the datacenter OS is a boon in disguise. With freedom from manually configuring and maintaining individual machines and applications, system engineers need not configure specific machines for specific applications as all applications would be capable of running on any available resources from any machine, even if there are other applications already running on them. Using a datacenter OS results in centralized control and smart utilization of resources that eliminate hardware and software silos to ensure greater accessibility and usability even for noninfrastructural professionals. Examples of some organizations administering their hyperscale datacenters via the datacenter OS are Google with the Borg (and next geneneration Omega) systems. The merits of the datacenter OS are undeniable, with benefits ranging from the scalability of computing resources and flexibility to support data sharing across applications to saving team effort, time, and money while launching and managing interoperable cluster applications. It is this vision of transforming the datacenter into a single supercomputer that Apache Mesos seeks to achieve. Born out of a Berkeley AMPLab research paper in 2011, it has since come a long way with a number of leading companies, such as Apple, Twitter, Netflix, and AirBnB among others, using it in production. Mesosphere is a start-up that is developing a distributed OS product with Mesos at its core. The architecture of Mesos Mesos is an open-source platform for sharing clusters of commodity servers between different distributed applications (or frameworks), such as Hadoop, Spark, and Kafka among others. The idea is to act as a centralized cluster manager by pooling together all the physical resources of the cluster and making it available as a single reservoir of highly available resources for all the different frameworks to utilize. 
For example, if an organization has one 10-node cluster (16 CPUs and 64 GB RAM) and another 5-node cluster (4 CPUs and 16 GB RAM), then Mesos can be leveraged to pool them into one virtual cluster of 720 GB RAM and 180 CPUs, where multiple distributed applications can be run. Sharing resources in this fashion greatly improves cluster utilization and eliminates the need for an expensive data replication process per-framework. Some of the important features of Mesos are: Scalability: It can elastically scale to over 50,000 nodes Resource isolation: This is achieved through Linux/Docker containers Efficiency: This is achieved through CPU and memory-aware resource scheduling across multiple frameworks High availability: This is through Apache ZooKeeper Interface: A web UI for monitoring the cluster state Mesos is based on the same principles as the Linux kernel and aims to provide a highly available, scalable, and fault-tolerant base for enabling various frameworks to share cluster resources effectively and in isolation. Distributed applications are varied and continuously evolving, a fact that leads Mesos' design philosophy towards a thin interface that allows an efficient resource allocation between different frameworks and delegates the task of scheduling and job execution to the frameworks themselves. The two advantages of doing so are: Different frame data replication works can independently devise methods to address their data locality, fault-tolerance, and other such needs. It simplifies the Mesos codebase and allows it to be scalable, flexible, robust, and agile Mesos' architecture hands over the responsibility of scheduling tasks to the respective frameworks by employing a resource offer abstraction that packages a set of resources and makes offers to each framework. The Mesos master node decides the quantity of resources to offer each framework, while each framework decides which resource offers to accept and which tasks to execute on these accepted resources. This method of resource allocation is shown to achieve good degree of data locality for each framework sharing the same cluster. An alternative architecture would implement a global scheduler that took framework requirements, organizational priorities, and resource availability as inputs and provided a task schedule breakdown by framework and resource as output, essentially acting as a matchmaker for jobs and resources with priorities acting as constraints. The challenges with this architecture, such as developing a robust API that could capture all the varied requirements of different frameworks, anticipating new frameworks, and solving a complex scheduling problem for millions of jobs, made the former approach a much more attractive option for the creators. Summary Thus in this article, we introduced Mesos, and then dived deep into its architecture to understand importance of Mesos. Resources for Article:   Further resources on this subject: Understanding Mesos Internals [article] Leveraging Python in the World of Big Data [article] Self-service Business Intelligence, Creating Value from Data [article]
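To make the resource offer abstraction described above more tangible, here is a toy, in-memory simulation in plain Python. It is not the Mesos API; the agent sizes, framework names, and task requirements are invented purely for illustration. The master only decides what to offer; each framework decides for itself whether to accept an offer and what to run on it.

# A toy sketch of Mesos-style two-level scheduling (not the real Mesos API).
class Master(object):
    def __init__(self, agents):
        # agent name -> (free CPUs, free memory in GB)
        self.free = dict(agents)

    def offer_round(self, frameworks):
        # Offer each agent's unused resources to the frameworks in turn;
        # the framework, not the master, decides whether to accept.
        for agent, (cpus, mem) in list(self.free.items()):
            for framework in frameworks:
                accepted = framework.resource_offer(agent, cpus, mem)
                if accepted is not None:
                    used_cpus, used_mem = accepted
                    self.free[agent] = (cpus - used_cpus, mem - used_mem)
                    break  # leftover resources can be re-offered in a later round


class GreedyFramework(object):
    """Accepts any offer that is large enough for its fixed task size."""
    def __init__(self, name, task_cpus, task_mem):
        self.name = name
        self.task_cpus = task_cpus
        self.task_mem = task_mem

    def resource_offer(self, agent, cpus, mem):
        if cpus >= self.task_cpus and mem >= self.task_mem:
            print("%s: launching a task on %s" % (self.name, agent))
            return self.task_cpus, self.task_mem  # resources consumed by the task
        return None  # decline the offer; the master offers it elsewhere


master = Master({'agent-1': (16, 64), 'agent-2': (4, 16)})
frameworks = [GreedyFramework('spark', 8, 32), GreedyFramework('kafka', 2, 8)]
master.offer_round(frameworks)
print("Unused resources:", master.free)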

Support Vector Machines as a Classification Engine

Packt
17 Mar 2016
9 min read
In this article by Tomasz Drabas, author of the book, Practical Data Analysis Cookbook, we will discuss on how Support Vector Machine models can be used as a classification engine. (For more resources related to this topic, see here.) Support Vector Machines Support Vector Machines (SVMs) are a family of extremely powerful models that can be used in classification and regression problems. They aim at finding decision boundaries that separate observations with differing class memberships. While many classifiers exist that can classify linearly separable data (for example, logistic regression), SVMs can handle highly non-linear problems using a kernel trick that implicitly maps the input vectors to higher-dimensional feature spaces. The transformation rearranges the dataset in such a way that it is then linearly solvable. The mechanics of the machine Given a set of n points of a form (x1,y1)...(xn,yn), where xi is a z-dimensional input vector and  yi is a class label, the SVM aims at finding the maximum margin hyperplane that separates the data points: In a two-dimensional dataset, with linearly separable data points (as shown in the preceding figure), the maximum margin hyperplane would be a line that would maximize the distance between each of the classes. The hyperplane could be expressed as a dot product of the set of input vectors  x and a vector normal to the hyperplane W:W.X=b, where b is the offset from the origin of the coordinate system. To find the hyperplane, we solve the following optimization problem: The constraint of our optimization problem effectively states that no point can cross the hyperplane if it does not belong to the class on that side of the hyperplane. Linear SVM Building a linear SVM classifier in Python is easy. There are multiple Python packages that can estimate a linear SVM but here, we decided to use MLPY (http://mlpy.sourceforge.net): import pandas as pd import numpy as np import mlpy as ml First, we load the necessary modules that we will use later, namely pandas (http://pandas.pydata.org), NumPy (http://www.numpy.org), and the aforementioned MLPY. We use pandas to read the data (https://github.com/drabastomek/practicalDataAnalysisCookbook repository to download the data): # the file name of the dataset r_filename = 'Data/Chapter03/bank_contacts.csv' # read the data csv_read = pd.read_csv(r_filename) The dataset that we use was described in S. Moro, P. Cortez, and P. Rita. A data-driven approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014 and found here http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. It consists of over 41.1k outbound marketing calls of a bank. Our aim is to classify these calls into two buckets: those that resulted in a credit application and those that did not. Once the file was loaded, we split the data into training and testing datasets; we also keep the input and class indicator data separately. To this end, we use the split_dataset(...) 
method: def split_data(data, y, x = 'All', test_size = 0.33): ''' Method to split the data into training and testing ''' import sys # dependent variable variables = {'y': y} # and all the independent if x == 'All': allColumns = list(data.columns) allColumns.remove(y) variables['x'] = allColumns else: if type(x) != list: print('The x parameter has to be a list...') sys.exit(1) else: variables['x'] = x # create a variable to flag the training sample data['train'] = np.random.rand(len(data)) < (1 - test_size) # split the data into training and testing train_x = data[data.train] [variables['x']] train_y = data[data.train] [variables['y']] test_x = data[~data.train][variables['x']] test_y = data[~data.train][variables['y']] return train_x, train_y, test_x, test_y, variables['x'] We randomly set 1/3 of the dataset aside for testing purposes and use the remaining 2/3 for the training of the model: # split the data into training and testing train_x, train_y, test_x, test_y, labels = hlp.split_data( csv_read, y = 'credit_application' ) Once we read the data and split it into training and testing datasets, we can estimate the model: # create the classifier object svm = ml.LibSvm(svm_type='c_svc', kernel_type='linear', C=100.0) # fit the data svm.learn(train_x,train_y) The svm_type parameter of the .LibSvm(...) method controls what algorithm to use to estimate the SVM. Here, we use the c_svc method—a C-support Vector Classifier. The method specifies how much you want to avoid misclassifying observations: the larger values of C parameter will shrink the margin for the hyperplane (theb) so that more of the observations are correctly classified. You can also specify nu_svc with a nu parameter that controls how much of your sample (at most) can be misclassified and how many of your observations (at least) can become support vectors. Here, we estimate an SVM with a linear kernel, so let's talk about kernels. Kernels A kernel function K is effectively a function that computes a dot product between two n-dimensional vectors, K: Rn.Rn --> R. In other words, the kernel function takes two vectors and produces a scalar: The linear kernel does not effectively transform the data into a higher dimensional space. This is not true for polynomial or Radial Basis Function (RBF) kernels that transform the input feature space into higher dimensions. In case of the polynomial kernel of degree d, the obtained feature space has (n+d/d) dimensions for the Rn dimensional input feature space. As you can see, the number of additional dimensions can grow very quickly and this would pose significant problems in estimating the model if we would explicitly transform the data into higher dimensions. Thankfully, we do not have to do this as that's where the kernel trick comes into play. The truth is that SVMs do not have to work explicitly in higher dimensions but can rather implicitly map the data to higher dimensions using pairwise inner products (instead of an explicit dot product) and then use it to find the maximum margin hyperplane. You can find a really good explanation of the kernel trick at http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html. Back to our example The .learn(...) method of the .LibSvm(...) object estimates the model. Once the model is estimated, we can test how well it performs. 
First, we use the estimated model to predict the classes for the observations in the testing dataset:

predicted_l = svm.pred(test_x)

Next, we will use some of the scikit-learn methods to print the basic statistics for our model:

def printModelSummary(actual, predicted):
    '''
    Method to print out model summaries
    '''
    import sklearn.metrics as mt
    print('Overall accuracy of the model is {0:.2f} percent'
        .format((actual == predicted).sum() / len(actual) * 100))
    print('Classification report: \n',
        mt.classification_report(actual, predicted))
    print('Confusion matrix: \n',
        mt.confusion_matrix(actual, predicted))
    print('ROC: ', mt.roc_auc_score(actual, predicted))

First, we calculate the overall accuracy of the model expressed as a ratio of properly classified observations to the total number of observations in the testing sample. Next, we print the classification report: The precision is the model's ability to avoid classifying an observation as positive when it is not. It is a ratio of true positives to the overall number of positively classified records. The overall precision score is a weighted average of the individual precision scores, where the weight is the support. The support is the total number of actual observations in each class. The total precision for our model is not too bad: 89 out of 100. However, when we look at the precision for the true positives, the situation is not as good: only 63 out of 100 were properly classified. Recall can be viewed as the model's capacity to find all the positive samples. It is a ratio of true positives to the sum of true positives and false negatives. The recall for the class 0.0 is almost perfect but for class 1.0, it looks really bad. This might be a problem with the fact that our sample is not balanced, but it is more likely that the features we use to classify the data do not really capture the differences between the two groups. The f1-score is effectively a weighted amalgam of the precision and recall: it is a ratio of twice the product of precision and recall to their sum. In one measure, it shows whether the model performs well or not. At the general level, the model does not perform badly, but when we look at the model's ability to classify the true signal, it fails gravely. It is a perfect example of why judging the model at the general level can be misleading when dealing with samples that are heavily unbalanced.
RBF kernel SVM
Given that the linear kernel performed poorly, our dataset might not be linearly separable. Thus, let's try the RBF kernel. The RBF kernel is given as K(x, y) = exp(-||x - y||^2 / (2σ^2)), where ||x - y||^2 is the squared Euclidean distance between the two vectors, x and y, and σ is a free parameter. The value of the RBF kernel equals 1 when x = y and gradually falls to 0 as the distance approaches infinity. To fit an RBF version of our model, we can specify our svm object as follows:

svm = ml.LibSvm(svm_type='c_svc', kernel_type='rbf', gamma=0.1, C=1.0)

The gamma parameter here specifies how far the influence of a single support vector reaches. Visually, you can investigate the relationship between the gamma and C parameters at http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html. The rest of the code for the model estimation follows in a similar fashion as with the linear kernel, and we obtain the following results: the results are even worse than with the linear kernel, as precision and recall were lost across the board.
The SVM with the RBF kernel performed worse when classifying calls that resulted in applying for the credit card and those that did not. Summary In this article, we saw that the problem is not with the model but rather, the dataset that we use does not explain the variance sufficiently. This requires going back to the drawing board and selecting other features. Resources for Article: Further resources on this subject: Push your data to the Web [article] Transferring Data from MS Access 2003 to SQL Server 2008 [article] Exporting data from MS Access 2003 to MySQL [article]
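To connect the RBF formula given earlier to code, here is a small, self-contained NumPy sketch (independent of MLPY) that evaluates K(x, y) for a pair of vectors; the sigma value is an arbitrary choice for illustration.

import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    squared_distance = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return np.exp(-squared_distance / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(rbf_kernel(x, x))               # identical vectors -> 1.0
print(rbf_kernel(x, x + [0.5, 0.5]))  # nearby vectors -> close to 1
print(rbf_kernel(x, x + [5.0, 5.0]))  # distant vectors -> close to 0

In libsvm-style parameterizations, the gamma parameter used above corresponds to 1 / (2 * sigma^2) in this formulation, so a larger gamma means a narrower kernel and a shorter reach for each support vector.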

Integrating Imagery with Creating and Styling Features in OpenLayers 3

Packt
17 Mar 2016
21 min read
This article by Peter J Langley, author of the book OpenLayers 3.x Cookbook, sheds some light on three of the most important and talked about features of the OpenLayers library. (For more resources related to this topic, see here.) Introduction This article shows us the basics and the important things that we need to know when we start creating our first web-mapping application with OpenLayers. As we will see in this and the following recipes, OpenLayers is a big and complex framework, but at the same time, it is also very powerful and flexible. Although we're now spoilt for choice when it comes to picking a JavaScript mapping library (as we are with most JavaScript libraries and frameworks), OpenLayers is a mature, fully-featured, and well-supported library. In contrast to other libraries, such as Leaflet (http://leafletjs.com), which focus on a smaller download size in order to provide only the most common functionality as standard, OpenLayers tries to implement all the required things that a developer could need to create a web Geographic Information System (GIS) application. One aspect of OpenLayers 3 that immediately differentiates itself from OpenLayers 2, is that it's been built with the Google Closure library (https://developers.google.com/closure). Google Closure provides an extensive range of modular cross-browser JavaScript utility methods that OpenLayers 3 selectively includes. In GIS, a real-world phenomenon is represented by the concept of a feature. It can be a place, such as a city or a village, it can be a road or a railway, it can be a region, a lake, the border of a country, or something entirely arbitrary. Features can have a set of attributes, such as population, length, and so on. These can be represented visually through the use of points, lines, polygons, and so on, using some visual style: color, radius, width, and so on. OpenLayers offers us a great degree of flexibility when styling features. We can use static styles or dynamic styles influenced by feature attributes. Styles can be created through various methods, such as from style functions (ol.style.StyleFunction), or by applying new style instances (ol.style.Style) directly to a feature or layer. Let's take a look at all of this in the following recipes. Adding WMS layers Web Map Service (WMS) is a standard developed by the Open Geospatial Consortium (OGC), which is implemented by many geospatial servers, among which we can find the free and open source projects, GeoServer (http://geoserver.org) and MapServer (http://mapserver.org). More information on WMS can be found at http://en.wikipedia.org/wiki/Web_Map_Service. As a basic summary, a WMS server is a normal HTTP web server that accepts requests with some GIS-related parameters (such as projection, bounding box, and so on) and returns map tiles forming a mosaic that covers the requested bounding box. Here's the finished recipe's outcome using a WMS layer that covers the extent of the USA: We are going to work with remote WMS servers, so it is not necessary that you have one installed yourself. Note that we are not responsible for these servers, and that they may have problems, or they may not be available any longer when you read this section. Any other WMS server can be used, but the URL and layer name must be known. How to do it… We will add two WMS layers to work with. To do this, perform the following steps: Create an HTML file and add the OpenLayers dependencies. 
In particular, create the HTML to hold the map and the layer panel: <div id="js-map" class="map"></div> <div class="pane"> <h1>WMS layers</h1> <p>Select the WMS layer you wish to view:</p> <select id="js-layers" class="layers"> <option value="-10527519,3160212,4">Temperature " (USA)</option> <option value="-408479,7213209,6">Bedrock (UK)</option> </select> </div> Create the map instance with the default OpenStreetMap layer: var map = new ol.Map({ view: new ol.View({ zoom: 4, center: [-10527519, 3160212] }), target: 'js-map', layers: [ new ol.layer.Tile({ source: new ol.source.OSM() }) ] }); Add the first WMS layer to the map: map.addLayer(new ol.layer.Tile({ source: new ol.source.TileWMS({ url: 'http://gis.srh.noaa.gov/arcgis/services/' + 'NDFDTemps/MapServer/WMSServer', params: { LAYERS: 16, FORMAT: 'image/png', TRANSPARENT: true }, attributions: [ new ol.Attribution({ html: 'Data provided by the ' + '<a href="http://noaa.gov">NOAA</a>.' }) ] }), opacity: 0.50 })); Add the second WMS layer to the map: map.addLayer(new ol.layer.Tile({ source: new ol.source.TileWMS({ url: 'http://ogc.bgs.ac.uk/cgi-bin/' + 'BGS_Bedrock_and_Superficial_Geology/wms', params: { LAYERS: 'BGS_EN_Bedrock_and_Superficial_Geology' }, attributions: [ new ol.Attribution({ html: 'Contains <a href="http://bgs.ac.uk">' + 'British Geological Survey</a> ' + 'materials &copy; NERC 2015' }) ] }), opacity: 0.85 })); Finally, add the layer-switching logic: document.getElementById('js-layers') .addEventListener('change', function() { var values = this.value.split(','); var view = map.getView(); view.setCenter([ parseFloat(values[0]), parseFloat(values[1]) ]); view.setZoom(values[2]); }); How it works… The HTML and CSS divide the page into two sections: one for the map, and the other for the layer-switching panel. The top part of our custom JavaScript file creates a new map instance with a single OpenStreetMap layer; this layer will become the background for the WMS layers in order to provide some context. Let's spend the rest of our time concentrating on how the WMS layers are created. WMS layers are encapsulated within the ol.layer.Tile layer type. The source is an instance of ol.source.TileWMS, which is a subclass of ol.source.TileImage. The ol.source.TileImage class is behind many source types, such as Bing Maps, and custom OpenStreetMap layers that are based on XYZ format. When using ol.source.TileWMS, we must at least pass in the URL of the WMS server and a layers parameter. Let's breakdown the first WMS layer as follows: map.addLayer(new ol.layer.Tile({ source: new ol.source.TileWMS({ url: 'http://gis.srh.noaa.gov/arcgis/services/NDFDTemps/' + 'MapServer/WMSServer', params: { LAYERS: 16, FORMAT: 'image/png', TRANSPARENT: true }, attributions: [ new ol.Attribution({ html: 'Data provided by the ' + '<a href="http://noaa.gov">NOAA</a>.' }) ] }), opacity: 0.50 })); For the url property of the source, we provide the URL of the WMS server from NOAA (http://www.noaa.gov). The params property expects an object of key/value pairs. The content of this is appended to the previous URL as query string parameters, for example, http://gis.srh.noaa.gov/arcgis/services/NDFDTemps/MapServer/WMSServer?LAYERS=16. As mentioned earlier, at minimum, this object requires the LAYERS property with a value. We request for the layer by the name of 16. 
Along with this parameter, we also explicitly ask for the tile images to be in the .PNG format (FORMAT: 'image/png') and that the background of the tiles be transparent (TRANSPARENT: true) rather than white, which would undesirably block out the background map layer. The default values for format and transparency are already image or PNG and false, respectively. This means you don't need to pass them in as parameters, OpenLayers will do it for you. We've shown you this for learning purposes, but this isn't strictly necessary. There are also other parameters that OpenLayers fills in for you if not specified, such as service (WMS), version (1.3.0), request (GetMap), and so on. For the attributions property, we created a new attribution instance to cover our usage of the WMS service, which simply contains a string of HTML linking back to the NOAA website. Lastly, we set the opacity property of the layer to 50% (0.50), which suitably overlays the OpenStreetMap layer underneath: map.addLayer(new ol.layer.Tile({ source: new ol.source.TileWMS({ url: 'http://ogc.bgs.ac.uk/cgi-bin/' + 'BGS_Bedrock_and_Superficial_Geology/wms', params: { LAYERS: 'BGS_EN_Bedrock_and_Superficial_Geology' }, attributions: [ new ol.Attribution({ html: 'Contains <a href="http://bgs.ac.uk">' + 'British Geological Survey</a> ' + 'materials &copy; NERC 2015' }) ] }), opacity: 0.85 })); Check the WMS standard to know which parameters you can use within the params property. The use of layers is mandatory, so you always need to specify this value. This layer from the British Geological Survey (http://bgs.ac.uk) follows the same structure as the previous WMS layer. Similarly, we provided a source URL and a layers parameter for the HTTP request. The layer name is a string rather than a number this time, which is delimited by underscores. The naming convention is at the discretion of the WMS service itself. Like earlier, an attribution instance has been added to the layer, which contains a string of HTML linking back to the BGS website, covering our usage of the WMS service. The opacity property of this layer is a little less transparent than the last one, at 85% (0.85): document.getElementById('js-layers') .addEventListener('change', function() { var values = this.value.split(','); var view = map.getView(); view.setCenter([ parseFloat(values[0]), parseFloat(values[1]) ]); view.setZoom(values[2]); }); Finally, we added a change-event listener and handler to the select menu containing both the WMS layers. If you recall from the HTML, an option's value contains a comma-delimited string. For example, the Bedrock WMS layer option looks like this: <option value="-408479,7213209,6">Bedrock (UK)</option> This translates to x coordinate, y coordinate, and zoom level. With this in mind when the change event fires, we store the value of the newly-selected option in a variable named values. The split JavaScript method creates a three-item array from the string. The array now contains the xy coordinates and the zoom level, respectively. We store a reference to the view into a variable, namely view, as it's accessed more than once within the event handler. The map view is then centered to the new location with the setCenter method. We've made sure to convert the string values into float types for OpenLayers, via the parseFloat JavaScript method. The zoom level is then set via the setZoom method. Continuing with the Bedrock example, it will recenter at -408479, 7213209 with zoom level 6. 
Integrating with custom WMS services plays an essential role in many web-mapping applications. Learning how we did this in this recipe should give you a good idea of how to integrate with any other WMS services that you may use. There's more… It's worth mentioning that WMS services do not necessarily cover a global extent, and they will more likely cover only subset extents of the world. Case in point, the NOAA WMS layer covers only USA, and the BGS WMS layer only covers the UK. During this topic, we only looked at the request type of GetMap, but there's also a request type called GetCapabilities. Using the GetCapabilities request parameter on the same URL endpoint returns the capabilities (such as extent) that a WMS server supports. This is discussed in much more detail later in this book. If you don't specify the type of projection, the view default projection will be used. In our case, this will be EPSG:3857, which is passed up in a parameter named CRS (it's named SRS for the GetMap version requests less than 1.3.0). If you want to retrieve WMS tiles in different projections, you need to ensure that the WMS server supports that particular format. WMS servers return images no matter whether there is information in the bounding box that we are requesting or not. Taking this recipe as an example, if the viewable extent of the map is only the UK, blank images will get returned for WMS layer requests made for USA (via the NOAA tile requests). You can prevent these unnecessary HTTP requests by setting the visibility of any layers that do not cover the extent of the area being viewed to false. There are some useful methods of the ol.source.TileWMS class that are worth being aware of, such as updateParams, which can be used to set parameters for the WMS request, and getUrls, which return the URLs used for the WMS source. Creating features programmatically Loading data from an external source is not the only way to work with vector layers. Imagine a web-mapping application where users can create new features on the fly: landing zones, perimeters, areas of interest, and so on, and add them to a vector layer with some style. This scenario requires the ability to create and add the features programmatically. In this recipe, we will take a look at some of the ways to create a selection of features programmatically. How to do it… Here, we'll create some features programmatically without any file importing. The following instructions show you how this is done: Start by creating a new HTML file with the required OpenLayers dependencies. 
In particular, add the div element to hold the map: <div id="js-map"></div> Create an empty JavaScript file and instantiate a map with a background raster layer: var map = new ol.Map({ view: new ol.View({ zoom: 3, center: [-2719935, 3385243] }), target: 'js-map', layers: [ new ol.layer.Tile({ source: new ol.source.MapQuest({layer: 'osm'}) }) ] }); Create the point and circle features: var point = new ol.Feature({ geometry: new ol.geom.Point([-606604, 3228700]) }); var circle = new ol.Feature( new ol.geom.Circle([-391357, 4774562], 9e5) ); Create the line and polygon features: var line = new ol.Feature( new ol.geom.LineString([ [-371789, 6711782], [1624133, 4539747] ]) ); var polygon = new ol.Feature( new ol.geom.Polygon([[ [606604, 4285365], [1506726, 3933143], [1252344, 3248267], [195678, 3248267] ]]) ); Create the vector layer and add features to the layer: map.addLayer(new ol.layer.Vector({ source: new ol.source.Vector({ features: [point, circle, line, polygon] }) })); How it works… Although we've created some random features for this recipe, features in mapping applications would normally represent some phenomenon of the real world with an appropriate geometry and a style associated with it. Let's go over the programmatic feature creation and how the features are added to a vector layer: var point = new ol.Feature({ geometry: new ol.geom.Point([-606604, 3228700]) }); Features are instances of ol.Feature. This constructor contains many useful methods, such as clone, setGeometry, getStyle, and others. When creating an instance of ol.Feature, we must either pass in a geometry of type ol.geom.Geometry or an object containing properties. We demonstrate both variations throughout this recipe. For the point feature, we pass in a configuration object. The only property that we supply is geometry. There are other properties available, such as style, as well as custom properties to set the feature attributes ourselves, which come with getters and setters. The geometry is an instance of ol.geom.Point. The ol.geom namespace provides a variety of other geometry types that we don't get to see in this recipe, such as MultiLineString and MultiPoint. The point geometry type simply requires an ol.Coordinate type array (xy coordinates): var circle = new ol.Feature( new ol.geom.Circle([-391357, 4774562], 9e5) ); Remember to express the coordinates in the appropriate projection, such as the one used by the view, or translate the coordinates yourself. Also, since we haven't attached any styles yet, all features will be rendered with the default OpenLayers styling. The circle feature follows almost the same structure as the point feature. This time, however, we don't pass in a configuration object to ol.Feature, but instead, we directly instantiate an ol.geom.Geometry type of Circle. The circle geometry takes an array of coordinates and a second parameter for the radius. 9e5 or 9e+5 is exponential notation for 900,000. The circle geometry also has useful methods, such as getCenter and setRadius: var line = new ol.Feature( new ol.geom.LineString([ [-371789, 6711782], [1624133, 4539747] ]) ); The only noticeable difference with the LineString feature is that ol.geom.LineString expects an array of coordinate arrays.
For more advanced line strings, use the ol.geom.MultiLineString geometry type (more information about them can be found on the OpenLayers API documentation: http://openlayers.org/en/v3.13.0/apidoc/): The LineString feature also has useful methods, such as getLength: var polygon = new ol.Feature( new ol.geom.Polygon([[ [606604, 4285365], [1506726, 3933143], [1252344, 3248267], [195678, 3248267] ]]) ); The final feature, a Polygon geometry type differs slightly from the LineString feature as it expects an ol.Coordinate type array within an array within another wrapping array. This is because the constructor (ol.geom.Polygon) expects an array of rings with each ring representing an array of coordinates. Ideally, each ring should be closed. The polygon feature also has useful methods, such as getArea and getLinearRing: map.addLayer(new ol.layer.Vector({ source: new ol.source.Vector({ features: [point, circle, line, polygon] }) })); The OGC's Simple Feature Access specification (http://www.opengeospatial.org/standards/sfa) contains an in-depth description of the standard. It also contains a UML class diagram where you can see all the geometry classes and hierarchy. Finally, we create the vector layer, with a vector source instance and then add all four features into an array and pass it to the features property. All the features we've created are subclasses of ol.geom.SimpleGeometry. This class provides useful base methods, such as getExtent and getFirstCoordinate. All features have a getType method that can be used to identify the type of feature, for example, 'Point' or 'LineString'. There's more… Sometimes, the polygon features may represent a region with a hole in it. To create the hollow part of a polygon, we use the LinearRing geometry. The outcome is best explained with the following screenshot: You can see that the polygon has a section cut out of it. To achieve this geometry, we must create the polygon in a slightly different way. Here are the steps: Create the polygon geometry: var polygon = new ol.geom.Polygon([[ [606604, 4285365], [1506726, 3933143], [1252344, 3248267], [195678, 3248267] ]]); Create and add the linear ring to the polygon geometry: polygon.appendLinearRing( new ol.geom.LinearRing([ [645740, 3766816], [1017529, 3786384], [1017529, 3532002], [626172, 3532002] ]) ); Create the completed feature: var polygonFeature = new ol.Feature(polygon); Finish off by adding the polygon feature to the vector layer: vectorLayer.getSource().addFeature(polygonFeature); We won't break this logic down any further, as it's quite self explanatory. Now, we're comfortable with geometry creation. The ol.geom.LinearRing feature can only be used in conjunction with a polygon geometry, not as a standalone feature. Styling features based on geometry type We can summarize that there are two ways to style a feature. The first is by applying the style to the layer so that every feature inherits the styling. The second is to apply the styling options directly to the feature, which we'll see with this recipe. This recipe shows you how we can choose which flavor of styling to apply to a feature depending on the geometry type. We will apply the style directly to the feature using the ol.Feature method, setStyle. When a point geometry type is detected, we will actually style the representing geometry as a star, rather than the default circle shape. 
Other styling will be applied when a LineString geometry type is detected. How to do it… To customize the feature styling based on the geometry type, follow these steps: Create the HTML file with OpenLayers dependencies, the jQuery library, and a div element that will hold the map instance. Create a custom JavaScript file and initialize a new map instance: var map = new ol.Map({ view: new ol.View({ zoom: 4, center: [-10732981, 4676723] }), target: 'js-map', layers: [ new ol.layer.Tile({ source: new ol.source.MapQuest({layer: 'osm'}) }) ] }); Create a new vector layer and add it to the map. Have the source loader function retrieve the GeoJSON file, format the response, then pass it through our custom modifyFeatures method (which we'll implement next) before adding the features to the vector source: var vectorLayer = new ol.layer.Vector({ source: new ol.source.Vector({ loader: function() { $.ajax({ type: 'GET', url: 'features.geojson', context: this }).done(function(data) { var format = new ol.format.GeoJSON(); var features = format.readFeatures(data); this.addFeatures(modifyFeatures(features)); }); } }) }); map.addLayer(vectorLayer); Finish off by implementing the modifyFeatures function so that it transforms the projection of the geometry and styles the features based on their geometry type: function modifyFeatures(features) { features.forEach(function(feature) { var geometry = feature.getGeometry(); geometry.transform('EPSG:4326', 'EPSG:3857'); if (geometry.getType() === 'Point') { feature.setStyle( new ol.style.Style({ image: new ol.style.RegularShape({ fill: new ol.style.Fill({ color: [255, 0, 0, 0.6] }), stroke: new ol.style.Stroke({ width: 2, color: 'blue' }), points: 5, radius1: 25, radius2: 12.5 }) }) ); } if (geometry.getType() === 'LineString') { feature.setStyle( new ol.style.Style({ stroke: new ol.style.Stroke({ color: [255, 255, 255, 1], width: 3, lineDash: [8, 6] }) }) ); } }); return features; } How it works… Let's briefly look over the loader function of the vector source before we take a closer look at the logic behind the styling: loader: function() { $.ajax({ type: 'GET', url: 'features.geojson', context: this }).done(function(data) { var format = new ol.format.GeoJSON(); var features = format.readFeatures(data); this.addFeatures(modifyFeatures(features)); }); } Our external resource contains points and line strings in the format of GeoJSON. So we must create a new instance of ol.format.GeoJSON so that we can read in the data (format.readFeatures(data)) of the AJAX response to build out a collection of OpenLayers features. Before adding the group of features straight into the vector source (this refers to the vector source here), we pass the array of features through our modifyFeatures method. This method applies all the necessary styling to each feature and then returns the modified features, which are fed into the addFeatures method. Let's break down the contents of our modifyFeatures method: function modifyFeatures(features) { features.forEach(function(feature) { var geometry = feature.getGeometry(); geometry.transform('EPSG:4326', 'EPSG:3857'); The logic begins by looping over each feature in the array using the JavaScript array method, forEach. The first argument passed into the anonymous iterator function is the feature itself. Within the loop iteration, we store the feature's geometry into a variable, namely geometry, as it's accessed more than once during the loop iteration.
Unbeknown to you, the projection of coordinates within the GeoJSON file are in longitude/latitude, the EPSG:4326 projection code. The map's view, however, is in the EPSG:3857 projection. To ensure they appear where intended on the map, we use the transform geometry method, which takes the source and the destination projections as arguments and converts the coordinates of the geometry in place: if (geometry.getType() === 'Point') { feature.setStyle( new ol.style.Style({ image: new ol.style.RegularShape({ fill: new ol.style.Fill({ color: [255, 0, 0, 0.6] }), stroke: new ol.style.Stroke({ width: 2, color: 'blue' }), points: 5, radius1: 25, radius2: 12.5 }) }) ); } Next up is a conditional check on whether or not the geometry is a type of Point. The geometry instance has the getType method for this kind of purpose. Inline of the setStyle method of the feature instance, we create a new style object from the ol.style.Style constructor. The only direct property that we're interested in is the image property. By default, point geometries are styled as a circle. Instead, we want to style the point as a star. We can achieve this through the use of the ol.style.RegularShape constructor. We set up a fill style with color and a stroke style with width and color. The points property specifies the number of points for the star. In the case of a polygon shape, it represents the number of sides. The radius1 and radius2 properties are specifically to design star shapes for the configuration of the inner and outer radius, respectively: if (geometry.getType() === 'LineString') { feature.setStyle( new ol.style.Style({ stroke: new ol.style.Stroke({ color: [255, 255, 255, 1], width: 3, lineDash: [8, 6] }) }) ); } The final piece of the method has a conditional check on the geometry type of LineString. If this is the case, we style this geometry type differently to the point geometry type. We provide a stroke style with a color, width,property and a custom lineDash. The lineDash array declares a line length of 8 followed by a gap length of 6. Summary In this article we looked at how to integrate WMS layers to our map from a basic HTTP web server by passing in some GIS related parameters. We also saw how to create and add features to our vector layer with some styling the idea behind this particular recipe was to enable the user to create the features programmatically without any file importing. We also saw how to style these features by applying styling option to the features individually based on their geometry type rather than styling the layer. Resources for Article: Further resources on this subject: What is OpenLayers?[article] Getting Started with OpenLayers[article] Creating Simple Maps with OpenLayers 3[article]
Welcome to Machine Learning using the .NET Framework

Oli Huggins
16 Mar 2016
26 min read
This article by, Jamie Dixon, the author of the book, Mastering .NET Machine Learning, will focus on some of the larger questions you might have about machine learning using the .NET Framework, namely: What is machine learning? Why should we consider it in the .NET Framework? How can I get started with coding? (For more resources related to this topic, see here.) What is machine learning? If you check out on Wikipedia, you will find a fairly abstract definition of machine learning: "Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions." I like to think of machine learning as computer programs that produce different results as they are exposed to more information without changing their source code (and consequently needed to be redeployed). For example, consider a game that I play with the computer. I show the computer this picture  and tell it "Blue Circle". I then show it this picture  and tell it "Red Circle". Next I show it this picture  and say "Green Triangle." Finally, I show it this picture  and ask it "What is this?". Ideally the computer would respond, "Green Circle." This is one example of machine learning. Although I did not change my code or recompile and redeploy, the computer program can respond accurately to data it has never seen before. Also, the computer code does not have to explicitly write each possible data permutation. Instead, we create models that the computer applies to new data. Sometimes the computer is right, sometimes it is wrong. We then feed the new data to the computer to retrain the model so the computer gets more and more accurate over time—or, at least, that is the goal. Once you decide to implement some machine learning into your code base, another decision has to be made fairly early in the process. How often do you want the computer to learn? For example, if you create a model by hand, how often do you update it? With every new data row? Every month? Every year? Depending on what you are trying to accomplish, you might create a real-time ML model, a near-time model, or a periodic model. Why .NET? If you are a Windows developer, using .NET is something you do without thinking. Indeed, a vast majority of Windows business applications written in the last 15 years use managed code—most of it written in C#. Although it is difficult to categorize millions of software developers, it is fair to say that .NET developers often come from nontraditional backgrounds. Perhaps a developer came to .NET from a BCSC degree but it is equally likely s/he started writing VBA scripts in Excel, moving up to Access applications, and then into VB.NET/C# applications. Therefore, most .NET developers are likely to be familiar with C#/VB.NET and write in an imperative and perhaps OO style. The problem with this rather narrow exposure is that most machine learning classes, books, and code examples are in R or Python and very much use a functional style of writing code. Therefore, the .NET developer is at a disadvantage when acquiring machine learning skills because of the need to learn a new development environment, a new language, and a new style of coding before learning how to write the first line of machine learning code. 
If, however, that same developer could use their familiar IDE (Visual Studio) and the same base libraries (the .NET Framework), they can concentrate on learning machine learning much sooner. Also, when creating machine learning models in .NET, they have immediate impact as you can slide the code right into an existing C#/VB.NET solution. On the other hand, .NET is under-represented in the data science community. There are a couple of different reasons floating around for that fact. The first is that historically Microsoft was a proprietary closed system and the academic community embraced open source systems such as Linux and Java. The second reason is that much academic research uses domain-specific languages such as R, whereas Microsoft concentrated .NET on general purpose programming languages. Research that moved to industry took their language with them. However, as the researcher's role is shifted from data science to building programs that can work at real time that customers touch, the researcher is getting more and more exposure to Windows and Windows development. Whether you like it or not, all companies which create software that face customers must have a Windows strategy, an iOS strategy, and an Android strategy. One real advantage to writing and then deploying your machine learning code in .NET is that you can get everything with one stop shopping. I know several large companies who write their models in R and then have another team rewrite them in Python or C++ to deploy them. Also, they might write their model in Python and then rewrite it in C# to deploy on Windows devices. Clearly, if you could write and deploy in one language stack, there is a tremendous opportunity for efficiency and speed to market. What version of the .NET Framework are we using? The .NET Framework has been around for general release since 2002. The base of the framework is the Common Language Runtime or CLR. The CLR is a virtual machine that abstracts much of the OS specific functionality like memory management and exception handling. The CLR is loosely based on the Java Virtual Machine (JVM). Sitting on top of the CLR is the Framework Class Library (FCL) that allows different languages to interoperate with the CLR and each other: the FCL is what allows VB.Net, C#, F#, and Iron Python code to work side-by-side with each other. Since its first release, the .NET framework has included more and more features. The first release saw support for the major platform libraries like WinForms, ASP.NET, and ADO.NET. Subsequent releases brought in things like Windows Communication Foundation (WCF), Language Integrated Query (LINQ), and Task Parallel Library (TPL). At the time of writing, the latest version is of the .Net Framework is 4.6.2. In addition to the full-Monty .NET Framework, over the years Microsoft has released slimmed down versions of the .NET Framework intended to run on machines that have limited hardware and OS support. The most famous of these releases was the Portable Class Library (PCL) that targeted Windows RT applications running Windows 8. The most recent incantation of this is Universal Windows Applications (UWA), targeting Windows 10. At Connect(); in November 2015, Microsoft announced GA of the latest edition of the .NET Framework. This release introduced the .Net Core 5. In January, they decided to rename it to .Net Core 1.0. .NET Core 1.0 is intended to be a slimmed down version of the full .NET Framework that runs on multiple operating systems (specifically targeting OS X and Linux). 
The next release of ASP.NET (ASP.NET Core 1.0) sits on top of .NET Core 1.0. ASP.NET Core 1.0 applications that run on Windows can still run the full .NET Framework. (https://blogs.msdn.microsoft.com/webdev/2016/01/19/asp-net-5-is-dead-int...) In this book, we will be using a mixture of ASP.NET 4.0, ASP.NET 5.0, and Universal Windows Applications. As you can guess, machine learning models (and the theory behind the models) change with a lot less frequency than framework releases so the most of the code you write on .NET 4.6 will work equally well with PCL and .NET Core 1.0. Saying that, the external libraries that we will use need some time to catch up—so they might work with PCL but not with .NET Core 1.0 yet. To make things realistic, the demonstration projects will use .NET 4.6 on ASP.NET 4.x for existing (Brownfield) applications. New (Greenfield) applications will be a mixture of a UWA using PCL and ASP.NET 5.0 applications. Why write your own? It seems like all of the major software companies are pitching machine learning services such as Google Analytics, Amazon Machine Learning Services, IBM Watson, Microsoft Cortana Analytics, to name a few. In addition, major software companies often try to sell products that have a machine learning component, such as Microsoft SQL Server Analysis Service, Oracle Database Add-In, IBM SPSS, or SAS JMP. I have not included some common analytical software packages such as PowerBI or Tableau because they are more data aggregation and report writing applications. Although they do analytics, they do not have a machine learning component (not yet at least). With all these options, why would you want to learn how to implement machine learning inside your applications, or in effect, write some code that you can purchase elsewhere? It is the classic build versus buy decision that every department or company has to make. You might want to build because: You really understand what you are doing and you can be a much more informed consumer and critic of any given machine learning package. In effect, you are building your internal skill set that your company will most likely prize. Another way to look at it, companies are not one tool away from purchasing competitive advantage because if they were, their competitors could also buy the same tool and cancel any advantage. However, companies can be one hire away or more likely one team away to truly have the ability to differentiate themselves in their market. You can get better performance by executing locally, which is especially important for real-time machine learning and can be implemented in disconnected or slow connection scenarios. This becomes particularly important when we start implementing machine learning with Internet of Things (IoT) devices in scenarios where the device has a lot more RAM than network bandwidth. Consider the Raspberry Pi running Windows 10 on a pipeline. Network communication might be spotty, but the machine has plenty of power to implement ML models. You are not beholden to any one vendor or company, for example, every time you implement an application with a specific vendor and are not thinking about how to move away from the vendor, you make yourself more dependent on the vendor and their inevitable recurring licensing costs. The next time you are talking to the CTO of a shop that has a lot of Oracle, ask him/her if they regret any decision to implement any of their business logic in Oracle databases. The answer will not surprise you. 
A majority of this book's code is written in F#—an open source language that runs great on Windows, Linux, and OS X. You can be much more agile and have much more flexibility in what you implement. For example, we will often re-train our models on the fly and when you write your own code, it is fairly easy to do this. If you use a third-party service, they may not even have API hooks to do model training and evaluation, so near-time model changes are impossible. Once you decide to go native, you have a choice of rolling your own code or using some of the open source assemblies out there. This book will introduce both the techniques to you, highlight some of the pros and cons of each technique, and let you decide how you want to implement them. For example, you can easily write your own basic classifier that is very effective in production but certain models, such as a neural network, will take a considerable amount of time and energy and probably will not give you the results that the open source libraries do. As a final note, since the libraries that we will look at are open source, you are free to customize pieces of it—the owners might even accept your changes. However, we will not be customizing these libraries in this book. Why open data? Many books on machine learning use datasets that come with the language install (such as R or Hadoop) or point to public repositories that have considerable visibility in the data science community. The most common ones are Kaggle (especially the Titanic competition) and the UC Irvine's datasets. While these are great datasets and give a common denominator, this book will expose you to datasets that come from government entities. The notion of getting data from government and hacking for social good is typically called open data. I believe that open data will transform how the government interacts with its citizens and will make government entities more efficient and transparent. Therefore, we will use open datasets in this book and hopefully you will consider helping out with the open data movement. Why F#? As we will be on the .NET Framework, we could use either C#, VB.NET, or F#. All three languages have strong support within Microsoft and all three will be around for many years. F# is the best choice for this book because it is unique in the .NET Framework for thinking in the scientific method and machine learning model creation. Data scientists will feel right at home with the syntax and IDE (languages such as R are also functional first languages). It is the best choice for .NET business developers because it is built right into Visual Studio and plays well with your existing C#/VB.NET code. The obvious alternative is C#. Can I do this all in C#? Yes, kind of. In fact, many of the .NET libraries we will use are written in C#. However, using C# in our code base will make it larger and have a higher chance of introducing bugs into the code. At certain points, I will show some examples in C#, but the majority of the book is in F#. Another alternative is to forgo .NET altogether and develop the machine learning models in R and Python. You could spin up a web service (such as AzureML), which might be good in some scenarios, but in disconnected or slow network environments, you will get stuck. Also, assuming comparable machines, executing locally will perform better than going over the wire. When we implement our models to do real-time analytics, anything we can do to minimize the performance hit is something to consider. 
A third alternative that the .NET developers will consider is to write the models in T-SQL. Indeed, many of our initial models have been implemented in T-SQL and are part of the SQL Server Analysis Server. The advantage of doing it on the data server is that the computation is as close as you can get to the data, so you will not suffer the latency of moving large amount of data over the wire. The downsides of using T-SQL are that you can't implement unit tests easily, your domain logic is moving away from the application and to the data server (which is considered bad form with most modern application architecture), and you are now reliant on a specific implementation of the database. F# is open source and runs on a variety of operating systems, so you can port your code much more easily. Getting ready for Machine Learning In this section, we will install Visual Studio, take a quick lap around F#, and install the major open source libraries that we will be using. Setting up Visual Studio To get going, you will need to download Visual Studio on a Microsoft Windows machine. As of this writing, the latest (free) version is Visual Studio 2015 Community. If you have a higher version already installed on your machine, you can skip this step. If you need a copy, head on over to the Visual Studio home page at https://www.visualstudio.com. Download the Visual Studio Community 2015 installer and execute it. Now, you will get the following screen: Select Custom installation and you will be taken to the following screen: Make sure Visual F# has a check mark next to it. Once it is installed, you should see Visual Studio in your Windows Start menu. Learning F# One of the great features about F# is that you can accomplish a whole lot with very little code. It is a very terse language compared to C# and VB.NET, so picking up the syntax is a bit easier. Although this is not a comprehensive introduction, this is going to introduce you to the major language features that we will use in this book. I encourage you to check out http://www.tryfsharp.org/ or the tutorials at http://fsharpforfunandprofit.com/ if you want to get a deeper understanding of the language. With that in mind, let's create our 1st F# project: Start Visual Studio. Navigate to File | New | Project as shown in the following screenshot: When the New Project dialog box appears, navigate the tree view to Visual F# | Windows | Console Application. Have a look at the following screenshot: Give your project a name, hit OK, and the Visual Studio Template generator will create the following boilerplate: Although Visual Studio created a Program.fs file that creates a basic console .exe application for us, we will start learning about F# in a different way, so we are going to ignore it for now. Right-click in the Solution Explorer and navigate to Add | New Item. When the Add New Item dialog box appears, select Script File. The Script1.fsx file is then added to the project. Once Script1.fsx is created, open it up, and enter the following into the file: let x = "Hello World" Highlight that entire row of code, right-click and select Execute In Interactive (or press Alt + Enter). And the F# Interactive console will pop up and you will see this: The F# Interactive is a type of REPL, which stands for Read-Evaluate-Print-Loop. If you are a .NET developer who has spent any time in SQL Server Management Studio, the F# Interactive will look very familiar to the Query Analyzer where you enter your code at the top and see how it executes at the bottom. 
Also, if you are a data scientist using R Studio, you are very familiar with the concept of a REPL. I have used the words REPL and FSI interchangeably in this book. There are a couple of things to notice about this first line of F# code you wrote. First, it looks very similar to C#. In fact, consider changing the code to this: It would be perfectly valid C#. Note that the red squiggly line, showing you that the F# compiler certainly does not think this is valid. Going back to the correct code, notice that type of x is not explicitly defined. F# uses the concept of inferred typing so that you don't have to write the type of the values that you create. I used the term value deliberately because unlike variables, which can be assigned in C# and VB.NET, values are immutable; once bound, they can never change. Here, we are permanently binding the name x to its value, Hello World. This notion of immutability might seem constraining at first, but it has profound and positive implications, especially when writing machine learning models. With our basic program idea proven out, let's move it over to a compliable assembly; in this case, an .exe that targets the console. Highlight the line that you just wrote, press Ctrl + C, and then open up Program.fs. Go into the code that was generated and paste it in: [<EntryPoint>] let main argv = printfn "%A" argv let x = "Hello World" 0 // return an integer exit code Then, add the following lines of code around what you just added: // Learn more about F# at http://fsharp.org // See the 'F# Tutorial' project for more help. open System [<EntryPoint>] let main argv = printfn "%A" argv let x = "Hello World" Console.WriteLine(x) let y = Console.ReadKey() 0 // return an integer exit code Press the Start button (or hit F5) and you should see your program run: You will notice that I had to bind the return value from Console.ReadKey() to y. In C# or VB.NET, you can get away with not handling the return value explicitly. In F#, you are not allowed to ignore the returned values. Although some might think this is a limitation, it is actually a strength of the language. It is much harder to make a mistake in F# because the language forces you to address execution paths explicitly versus accidentally sweeping them under the rug (or into a null, but we'll get to that later). In any event, let's go back to our script file and enter in another line of code: let ints = [|1;2;3;4;5;6|] If you send that line of code to the REPL, you should see this: val ints : int [] = [|1; 2; 3; 4; 5; 6|] This is an array, as if you did this in C#: var ints = new[] {1,2,3,4,5,6}; Notice that the separator is a semicolon in F# and not a comma. This differs from many other languages, including C#. The comma in F# is reserved for tuples, not for separating items in an array. We'll discuss tuples later. Now, let's sum up the values in our array: let summedValue = ints |> Array.sum While sending that line to the REPL, you should see this: val summedValue : int = 21 There are two things going on. We have the |> operator, which is a pipe forward operator. If you have experience with Linux or PowerShell, this should be familiar. However, if you have a background in C#, it might look unfamiliar. The pipe forward operator takes the result of the value on the left-hand side of the operator (in this case, ints) and pushes it into the function on the right-hand side (in this case, sum). The other new language construct is Array.sum. 
Array is a module in the core F# libraries, which has a series of functions that you can apply to your data. The function sum, well, sums the values in the array, as you can probably guess by inspecting the result. So, now, let's add a different function from the Array type: let multiplied = ints |> Array.map (fun i -> i * 2) If you send it to the REPL, you should see this: val multiplied : int [] = [|2; 4; 6; 8; 10; 12|] Array.map is an example of a high ordered function that is part of the Array type. Its parameter is another function. Effectively, we are passing a function into another function. In this case, we are creating an anonymous function that takes a parameter i and returns i * 2. You know it is an anonymous function because it starts with the keyword fun and the IDE makes it easy for us to understand that by making it blue. This anonymous function is also called a lambda expression, which has been in C# and VB.NET since .Net 3.5, so you might have run across it before. If you have a data science background using R, you are already quite familiar with lambdas. Getting back to the higher-ordered function Array.map, you can see that it applies the lambda function against each item of the array and returns a new array with the new values. We will be using Array.map (and its more generic kin Seq.map) a lot when we start implementing machine learning models as it is the best way to transform an array of data. Also, if you have been paying attention to the buzz words of map/reduce when describing big data applications such as Hadoop, the word map means exactly the same thing in this context. One final note is that because of immutability in F#, the original array is not altered, instead, multiplied is bound to a new array. Let's stay in the script and add in another couple more lines of code: let multiplyByTwo x = x * 2 If you send it to the REPL, you should see this: val multiplyByTwo : x:int -> int These two lines created a named function called multiplyByTwo. The function that takes a single parameter x and then returns the value of the parameter multiplied by 2. This is exactly the same as our anonymous function we created earlier in-line that we passed into the map function. The syntax might seem a bit strange because of the -> operator. You can read this as, "the function multiplyByTwo takes in a parameter called x of type int and returns an int." Note three things here. Parameter x is inferred to be an int because it is used in the body of the function as multiplied to another int. If the function reads x * 2.0, the x would have been inferred as a float. This is a significant departure from C# and VB.NET but pretty familiar for people who use R. Also, there is no return statement for the function, instead, the final expression of any function is always returned as the result. The last thing to note is that whitespace is important so that the indentation is required. If the code was written like this: let multiplyByTwo(x) = x * 2 The compiler would complain: Script1.fsx(8,1): warning FS0058: Possible incorrect indentation: this token is offside of context started at position (7:1). Since F# does not use curly braces and semicolons (or the end keyword), such as C# or VB.NET, it needs to use something to separate code. That separation is whitespace. Since it is good coding practice to use whitespace judiciously, this should not be very alarming to people having a C# or VB.NET background. If you have a background in R or Python, this should seem natural to you. 
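As a brief aside, if you ever want to make the inferred signature explicit, perhaps for documentation or to pin the inference down while you are learning, F# also lets you annotate the parameter and return types. The following line is just an illustrative sketch and not code from the book:
let multiplyByTwoAnnotated (x: int) : int = x * 2
Sending it to the REPL produces the same int -> int signature as before; the annotations simply make visible what the compiler was already inferring.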
Since multiplyByTwo is the functional equivalent of the lambda created in Array.map (fun i -> i * 2), we can do this if we want: let multiplied' = ints |> Array.map (fun i -> multiplyByTwo i) If you send it to the REPL, you should see this: val multiplied' : int [] = [|2; 4; 6; 8; 10; 12|] Typically, we will use named functions when we need to use that function in several places in our code and we use a lambda expression when we only need that function for a specific line of code. There is another minor thing to note. I used the tick notation for the value multiplied when I wanted to create another value that was representing the same idea. This kind of notation is used frequently in the scientific community, but can get unwieldy if you attempt to use it for a third or even fourth (multiplied'''') representation. Next, let's add another named function to the REPL: let isEven x = match x % 2 = 0 with | true -> "even" | false -> "odd" isEven 2 isEven 3 If you send it to the REPL, you should see this: val isEven : x:int -> string This is a function named isEven that takes a single parameter x. The body of the function uses a pattern-matching statement to determine whether the parameter is odd or even. When it is odd, then it returns the string odd. When it is even, it returns the string even. There is one really interesting thing going on here. The match statement is a basic example of pattern matching and it is one of the coolest features of F#. For now, you can consider the match statement much like the switch statement that you may be familiar within R, Python, C#, or VB.NET. I would have written the conditional logic like this: let isEven' x = if x % 2 = 0 then "even" else "odd" But I prefer to use pattern matching for this kind of conditional logic. In fact, I will attempt to go through this entire book without using an if…then statement. With isEven written, I can now chain my functions together like this: let multipliedAndIsEven = ints |> Array.map (fun i -> multiplyByTwo i) |> Array.map (fun i -> isEven i) If you send it to REPL, you should see this: val multipliedAndIsEven : string [] = [|"even"; "even"; "even"; "even"; "even"; "even"|] In this case, the resulting array from the first pipe Array.map (fun i -> multiplyByTwo i))gets sent to the next function Array.map (fun i -> isEven i). This means we might have three arrays floating around in memory: ints which is passed into the first pipe, the result from the first pipe that is passed into the second pipe, and the result from the second pipe. From your mental model point of view, you can think about each array being passed from one function into the next. In this book, I will be chaining pipe forwards frequently as it is such a powerful construct and it perfectly matches the thought process when we are creating and using machine learning models. You now know enough F# to get you up and running with the first machine learning models in this book. I will be introducing other F# language features as the book goes along, but this is a good start. As you will see, F# is truly a powerful language where a simple syntax can lead to very complex work. Third-party libraries The following are a few third-party libraries that we will cover in our book later on: Math.NET Math.NET is an open source project that was created to augment (and sometimes replace) the functions that are available in System.Math. Its home page is http://www.mathdotnet.com/. 
We will be using Math.Net's Numerics and Symbolics namespaces in some of the machine learning algorithms that we will write by hand. A nice feature about Math.Net is that it has strong support for F#. Accord.NET Accord.NET is an open source project that was created to implement many common machine learning models. Its home page is http://accord-framework.net/. Although the focus of Accord.NET was for computer vision and signal processing, we will be using Accord.Net extensively in this book as it makes it very easy to implement algorithms in our problem domain. Numl Numl is an open source project that implements several common machine learning models as experiments. Its home page is http://numl.net/. Numl is newer than any of the other third-party libraries that we will use in the book, so it may not be as extensive as the other ones, but it can be very powerful and helpful in certain situations. Summary We covered a lot of ground in this article. We discussed what machine learning is, why you want to learn about it in the .NET stack, how to get up and running using F#, and had a brief introduction to the major open source libraries that we will be using in this book. With all this preparation out of the way, we are ready to start exploring machine learning. Further resources on this subject: ASP.Net Site Performance: Improving JavaScript Loading [article] Displaying MySQL data on an ASP.NET Web Page [article] Creating a NHibernate session to access database within ASP.NET [article]

Exploring HDFS

Packt
10 Mar 2016
17 min read
In this article by Tanmay Deshpande, the author of the book Hadoop Real World Solutions Cookbook - Second Edition, we'll cover the following recipes: Loading data from a local machine to HDFS Exporting HDFS data to a local machine Changing the replication factor of an existing file in HDFS Setting the HDFS block size for all the files in a cluster Setting the HDFS block size for a specific file in a cluster Enabling transparent encryption for HDFS Importing data from another Hadoop cluster Recycling deleted data from trash to HDFS Saving compressed data in HDFS Hadoop has two important components: Storage: This includes HDFS Processing: This includes Map Reduce HDFS takes care of the storage part of Hadoop. So, let's explore the internals of HDFS through various recipes. (For more resources related to this topic, see here.) Loading data from a local machine to HDFS In this recipe, we are going to load data from a local machine's disk to HDFS. Getting ready To perform this recipe, you should have an already running Hadoop cluster. How to do it... Performing this recipe is as simple as copying data from one folder to another. There are a couple of ways to copy data from the local machine to HDFS. Using the copyFromLocal command: To copy a file to HDFS, let's first create a directory on HDFS and then copy the file. Here are the commands to do this: hadoop fs -mkdir /mydir1 hadoop fs -copyFromLocal /usr/local/hadoop/LICENSE.txt /mydir1 Using the put command: We will first create the directory, and then put the local file in HDFS: hadoop fs -mkdir /mydir2 hadoop fs -put /usr/local/hadoop/LICENSE.txt /mydir2 You can validate that the files have been copied to the correct folders by listing the files: hadoop fs -ls /mydir1 hadoop fs -ls /mydir2 How it works... When you use HDFS copyFromLocal or the put command, the following things will occur: First of all, the HDFS client (the command prompt, in this case) contacts NameNode because it needs to copy the file to HDFS. NameNode then asks the client to break the file into chunks of the configured cluster block size. In Hadoop 2.X, the default block size is 128MB. Based on the capacity and availability of space in DataNodes, NameNode will decide where these blocks should be copied. Then, the client starts copying data to the specified DataNodes for a specific block. The blocks are copied sequentially one after another. When a single block is being copied, it is sent to the DataNode in packets that are 4MB in size. With each packet, a checksum is sent; once the packet copying is done, it is verified against the checksum to check whether it matches. The packets are then sent to the next DataNode where the block will be replicated. The HDFS client's responsibility is to copy the data only to the first node; the replication is taken care of by the respective DataNodes. Thus, the data block is pipelined from one DataNode to the next. While the block copying and replication are taking place, metadata on the file is updated in NameNode by the DataNodes. Exporting data from HDFS to a local machine In this recipe, we are going to export/copy data from HDFS to the local machine. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... Performing this recipe is as simple as copying data from one folder to the other. There are a couple of ways in which you can export data from HDFS to the local machine.
Using the copyToLocal command, you'll get this code: hadoop fs -copyToLocal /mydir1/LICENSE.txt /home/ubuntu Using the get command, you'll get this code: hadoop fs -get /mydir1/LICENSE.txt /home/ubuntu How it works... When you use HDFS copyToLocal or the get command, the following things occur: First of all, the client contacts NameNode because it needs a specific file in HDFS. NameNode then checks whether such a file exists in its FSImage. If the file is not present, an error code is returned to the client. If the file exists, NameNode checks the metadata for blocks and replica placements in DataNodes. NameNode then points the client directly to the DataNodes from which the blocks can be read one by one. The data is copied directly from the DataNodes to the client machine, and it never goes through NameNode, which avoids bottlenecks. Thus, the file is exported to the local machine from HDFS. Changing the replication factor of an existing file in HDFS In this recipe, we are going to take a look at how to change the replication factor of a file in HDFS. The default replication factor is 3. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... Sometimes, there might be a need to increase or decrease the replication factor of a specific file in HDFS. In this case, we'll use the setrep command. This is how you can use the command: hadoop fs -setrep [-R] [-w] <noOfReplicas><path> ... In this command, a path can either be a file or a directory; if it's a directory, the replication factor is set recursively for all the files under it. The -w option makes the command wait until the replication is complete. The -R option is accepted for backward compatibility. First, let's check the replication factor of the file we copied to HDFS in the previous recipe: hadoop fs -ls /mydir1/LICENSE.txt -rw-r--r-- 3 ubuntu supergroup 15429 2015-10-29 03:04 /mydir1/LICENSE.txt Once you list the file, it will show you the read/write permissions on this file, and the very next parameter is the replication factor. We have the replication factor set to 3 for our cluster, hence you see the number 3. Let's change it to 2 using this command: hadoop fs -setrep -w 2 /mydir1/LICENSE.txt It will wait till the replication is adjusted. Once done, you can verify this again by running the ls command: hadoop fs -ls /mydir1/LICENSE.txt -rw-r--r-- 2 ubuntu supergroup 15429 2015-10-29 03:04 /mydir1/LICENSE.txt How it works... Once the setrep command is executed, NameNode will be notified, and then NameNode decides whether the replicas need to be increased or decreased on certain DataNodes. When you are using the -w option, this process may sometimes take too long if the file size is too big. Setting the HDFS block size for all the files in a cluster In this recipe, we are going to take a look at how to set a block size at the cluster level. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... The HDFS block size is configurable for all files in the cluster or for a single file as well. To change the block size at the cluster level itself, we need to modify the hdfs-site.xml file. By default, the HDFS block size is 128MB. In case we want to modify this, we need to update this property, as shown in the following code.
This property changes the default block size to 64MB: <property> <name>dfs.block.size</name> <value>67108864</value> <description>HDFS Block size</description> </property> If you have a multi-node Hadoop cluster, you should update this file in the nodes, that is, NameNode and DataNode. Make sure you save these changes and restart the HDFS daemons: /usr/local/hadoop/sbin/stop-dfs.sh /usr/local/hadoop/sbin/start-dfs.sh This will set the block size for files that will now get added to the HDFS cluster. Make sure that this does not change the block size of the files that are already present in HDFS. There is no way to change the block sizes of existing files. How it works... By default, the HDFS block size is 128MB for Hadoop 2.X. Sometimes, we may want to change this default block size for optimization purposes. When this configuration is successfully updated, all the new files will be saved into blocks of this size. Ensure that these changes do not affect the files that are already present in HDFS; their block size will be defined at the time being copied. Setting the HDFS block size for a specific file in a cluster In this recipe, we are going to take a look at how to set the block size for a specific file only. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... In the previous recipe, we learned how to change the block size at the cluster level. But this is not always required. HDFS provides us with the facility to set the block size for a single file as well. The following command copies a file called myfile to HDFS, setting the block size to 1MB: hadoop fs -Ddfs.block.size=1048576 -put /home/ubuntu/myfile / Once the file is copied, you can verify whether the block size is set to 1MB and has been broken into exact chunks: hdfs fsck -blocks /myfile Connecting to namenode via http://localhost:50070/fsck?ugi=ubuntu&blocks=1&path=%2Fmyfile FSCK started by ubuntu (auth:SIMPLE) from /127.0.0.1 for path /myfile at Thu Oct 29 14:58:00 UTC 2015 .Status: HEALTHY Total size: 17276808 B Total dirs: 0 Total files: 1 Total symlinks: 0 Total blocks (validated): 17 (avg. block size 1016282 B) Minimally replicated blocks: 17 (100.0 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 0 (0.0 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 1 Average block replication: 1.0 Corrupt blocks: 0 Missing replicas: 0 (0.0 %) Number of data-nodes: 3 Number of racks: 1 FSCK ended at Thu Oct 29 14:58:00 UTC 2015 in 2 milliseconds The filesystem under path '/myfile' is HEALTHY How it works... When we specify the block size at the time of copying a file, it overwrites the default block size and copies the file to HDFS by breaking the file into chunks of a given size. Generally, these modifications are made in order to perform other optimizations. Make sure you make these changes, and you are aware of their consequences. If the block size isn't adequate enough, it will increase the parallelization, but it will also increase the load on NameNode as it would have more entries in FSImage. On the other hand, if the block size is too big, then it will reduce the parallelization and degrade the processing performance. Enabling transparent encryption for HDFS When handling sensitive data, it is always important to consider the security measures. Hadoop allows us to encrypt sensitive data that's present in HDFS. In this recipe, we are going to see how to encrypt data in HDFS. 
Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... For many applications that hold sensitive data, it is very important to adhere to standards such as PCI, HIPPA, FISMA, and so on. To enable this, HDFS provides a utility called encryption zone where we can create a directory in it so that data is encrypted on writes and decrypted on read. To use this encryption facility, we first need to enable Hadoop Key Management Server (KMS): /usr/local/hadoop/sbin/kms.sh start This would start KMS in the Tomcat web server. Next, we need to append the following properties in core-site.xml and hdfs-site.xml. In core-site.xml, add the following property: <property> <name>hadoop.security.key.provider.path</name> <value>kms://http@localhost:16000/kms</value> </property> In hds-site.xml, add the following property: <property> <name>dfs.encryption.key.provider.uri</name> <value>kms://http@localhost:16000/kms</value> </property> Restart the HDFS daemons: /usr/local/hadoop/sbin/stop-dfs.sh /usr/local/hadoop/sbin/start-dfs.sh Now, we are all set to use KMS. Next, we need to create a key that will be used for the encryption: hadoop key create mykey This will create a key, and then, save it on KMS. Next, we have to create an encryption zone, which is a directory in HDFS where all the encrypted data is saved: hadoop fs -mkdir /zone hdfs crypto -createZone -keyName mykey -path /zone We will change the ownership to the current user: hadoop fs -chown ubuntu:ubuntu /zone If we put any file into this directory, it will encrypt and would decrypt at the time of reading: hadoop fs -put myfile /zone hadoop fs -cat /zone/myfile How it works... There can be various types of encryptions one can do in order to comply with security standards, for example, application-level encryption, database level, file level, and disk-level encryption. The HDFS transparent encryption sits between the database and file-level encryptions. KMS acts like proxy between HDFS clients and HDFS's encryption provider via HTTP REST APIs. There are two types of keys used for encryption: Encryption Zone Key( EZK) and Data Encryption Key (DEK). EZK is used to encrypt DEK, which is also called Encrypted Data Encryption Key(EDEK). This is then saved on NameNode. When a file needs to be written to the HDFS encryption zone, the client gets EDEK from NameNode and EZK from KMS to form DEK, which is used to encrypt data and store it in HDFS (the encrypted zone). When an encrypted file needs to be read, the client needs DEK, which is formed by combining EZK and EDEK. These are obtained from KMS and NameNode, respectively. Thus, encryption and decryption is automatically handled by HDFS. and the end user does not need to worry about executing this on their own. You can read more on this topic at http://blog.cloudera.com/blog/2015/01/new-in-cdh-5-3-transparent-encryption-in-hdfs/. Importing data from another Hadoop cluster Sometimes, we may want to copy data from one HDFS to another either for development, testing, or production migration. In this recipe, we will learn how to copy data from one HDFS cluster to another. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... Hadoop provides a utility called DistCp, which helps us copy data from one cluster to another. 
Using this utility is as simple as copying from one folder to another: hadoop distcp hdfs://hadoopCluster1:9000/source hdfs://hadoopCluster2:9000/target This would use a Map Reduce job to copy data from one cluster to another. You can also specify multiple source files to be copied to the target. There are a couple of other options that we can also use: -update: When we use DistCp with the update option, it will copy only those files from the source that are not part of the target or differ from the target. -overwrite: When we use DistCp with the overwrite option, it overwrites the target directory with the source. How it works... When DistCp is executed, it uses map reduce to copy the data and also assists in error handling and reporting. It expands the list of source files and directories and inputs them to map tasks. When copying from multiple sources, collisions are resolved in the destination based on the option (update/overwrite) that's provided. By default, it skips the copy if the file is already present at the target. Once the copying is complete, the count of skipped files is presented. You can read more on DistCp at https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html. Recycling deleted data from trash to HDFS In this recipe, we are going to see how to recover deleted data from the trash to HDFS. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... To recover accidentally deleted data from HDFS, we first need to enable the trash folder, which is not enabled by default in HDFS. This can be achieved by adding the following property to core-site.xml: <property> <name>fs.trash.interval</name> <value>120</value> </property> Then, restart the HDFS daemons: /usr/local/hadoop/sbin/stop-dfs.sh /usr/local/hadoop/sbin/start-dfs.sh This will set the deleted file retention to 120 minutes. Now, let's try to delete a file from HDFS: hadoop fs -rmr /LICENSE.txt 15/10/30 10:26:26 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 120 minutes, Emptier interval = 0 minutes. Moved: 'hdfs://localhost:9000/LICENSE.txt' to trash at: hdfs://localhost:9000/user/ubuntu/.Trash/Current We have 120 minutes to recover this file before it is permanently deleted from HDFS. To restore the file to its original location, we can execute the following commands. First, let's confirm whether the file exists: hadoop fs -ls /user/ubuntu/.Trash/Current Found 1 items -rw-r--r-- 1 ubuntu supergroup 15429 2015-10-30 10:26 /user/ubuntu/.Trash/Current/LICENSE.txt Now, restore the deleted file or folder; it's better to use the distcp command instead of copying each file one by one: hadoop distcp hdfs://localhost:9000/user/ubuntu/.Trash/Current/LICENSE.txt hdfs://localhost:9000/ This will start a map reduce job to restore data from the trash to the original HDFS folder. Check the HDFS path; the deleted file should be back in its original form. How it works... Enabling trash enforces a file retention policy for a specified amount of time. So, when trash is enabled, HDFS does not delete or move any blocks immediately but only updates the metadata of the file and its location. This way, files that are deleted accidentally can be recovered; make sure that trash is enabled before experimenting with this recipe. Saving compressed data on HDFS In this recipe, we are going to take a look at how to store and process compressed data in HDFS. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... 
Saving compressed data on HDFS
In this recipe, we are going to take a look at how to store and process compressed data in HDFS.

Getting ready
To perform this recipe, you should already have a running Hadoop cluster.

How to do it...
It's always good to use compression while storing data in HDFS. HDFS supports various compression algorithms such as LZO, BZip2, Snappy, GZIP, and so on. Every algorithm has its own pros and cons when you consider the time taken to compress and decompress and the space efficiency. These days, people tend to prefer Snappy compression as it aims for very high speed with a reasonable amount of compression.

We can easily store and process any number of compressed files in HDFS. To store compressed data, we don't need to make any specific changes to the Hadoop cluster; you can simply copy the compressed data into HDFS in the same way as any other file. Here is an example of this:

hadoop fs -mkdir /compressed
hadoop fs -put file.bz2 /compressed

Now, we'll run a sample program to take a look at how Hadoop automatically uncompresses the file and processes it:

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar wordcount /compressed /compressed_out

Once the job is complete, you can verify the output.

How it works...
Hadoop uses native libraries to find the support needed for the various codecs and their implementations. Native libraries are specific to the platform that you run Hadoop on. You don't need to make any configuration changes to enable the compression algorithms. As mentioned earlier, Hadoop supports various compression algorithms that are already familiar to the computer world. Based on your needs and requirements (more space or more time), you can choose your compression algorithm. Take a look at http://comphadoop.weebly.com/ for more information on this.

Summary
In this article, we covered the major HDFS operations, with recipes that help us load, extract, import, export, and save data in HDFS. It also covered enabling transparent encryption for HDFS, as well as adjusting the block size of an HDFS cluster.

Resources for Article:
Further resources on this subject:
Hadoop and MapReduce [article]
Advanced Hadoop MapReduce Administration [article]
Integration with Hadoop [article]
Getting Started with Deep Learning

Packt
07 Mar 2016
12 min read
In this article by Joshua F. Wiley, author of the book, R Deep Learning Essentials, we will discuss deep learning, a powerful multilayered architecture for pattern recognition, signal detection, classification, and prediction. Although deep learning is not new, it has gained popularity in the past decade due to the advances in the computational capacity and new ways of efficient training models, as well as the availability of ever growing amount of data. In this article, you will learn what deep learning is. What is deep learning? To understand what deep learning is, perhaps it is easiest to start with what is meant by regular machine learning. In general terms, machine learning is devoted to developing and using algorithms that learn from raw data in order to make predictions. Prediction is a very general term. For example, predictions from machine learning may include predicting how much money a customer will spend at a given company, or whether a particular credit card purchase is fraudulent. Predictions also encompass more general pattern recognition, such as what letters are present in a given image, or whether a picture is of a horse, dog, person, face, building, and so on. Deep learning is a branch of machine learning where a multi-layered (deep) architecture is used to map the relations between inputs or observed features and the outcome. This deep architecture makes deep learning particularly suitable for handling a large number of variables and allows deep learning to generate features as part of the overall learning algorithm, rather than feature creation being a separate step. Deep learning has proven particularly effective in the fields of image recognition (including handwriting as well as photo or object classification) and natural language processing, such as recognizing speech. There are many types of machine learning algorithms. In this article, we are primarily going to focus on neural networks as these have been particularly popular in deep learning. However, this focus does not mean that it is the only technique available in machine learning or even deep learning, nor that other techniques are not valuable or even better suited, depending on the specific task. Conceptual overview of neural networks As their name suggests, neural networks draw their inspiration from neural processes and neurons in the body. Neural networks contain a series of neurons, or nodes, which are interconnected and process input. The connections between neurons are weighted, with these weights based on the function being used and learned from the data. Activation in one set of neurons and the weights (adaptively learned from the data) may then feed into other neurons, and the activation of some final neuron(s) is the prediction. To make this process more concrete, an example from human visual perception may be helpful. The term grandmother cell is used to refer to the concept that somewhere in the brain there is a cell or neuron that responds specifically to a complex and specific object, such as your grandmother. Such specificity would require thousands of cells to represent every unique entity or object we encounter. Instead, it is thought that visual perception occurs by building up more basic pieces into complex representations. 
For example, the following is a picture of a square:

Figure 1

Rather than our visual system having cells or neurons that are activated only upon seeing the gestalt, or entirety, of a square, we can have cells that recognize horizontal and vertical lines, as shown in the following:

Figure 2

In this hypothetical case, there may be two neurons, one that is activated when it senses horizontal lines and another that is activated when it senses vertical lines. Finally, a higher-order process recognizes that it is seeing a square when both the lower-order neurons are activated simultaneously.

Neural networks share some of these same concepts, with inputs being processed by a first layer of neurons that may go on to trigger another layer. Neural networks are sometimes shown as graphical models. In Figure 3, the inputs are data represented as squares. These may be pixels in an image, different aspects of sounds, or something else. The next layer of hidden neurons consists of neurons that recognize basic features, such as horizontal lines, vertical lines, or curved lines. Finally, the output may be a neuron that is activated by the simultaneous activation of two of the hidden neurons. In this article, observed data or features are depicted as squares, and unobserved or hidden layers as circles:

Figure 3

The term neural network refers to a broad class of models and algorithms. Hidden neurons are generated based on some combination of the observed data, similar to a basis expansion in other statistical techniques; however, rather than choosing the form of the expansion, the weights used to create the hidden neurons are learned from the data. Neural networks can involve a variety of activation functions, which are transformations of the weighted raw data inputs used to create the hidden neurons. A common choice of activation function is the sigmoid function, f(x) = 1 / (1 + e^(-x)); another is the hyperbolic tangent function, tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)). Finally, radial basis functions are sometimes used as they are efficient function approximators. Although there are a variety of these, the Gaussian form is common: f(x) = e^(-x^2 / (2 * sigma^2)).

In a shallow neural network such as the one shown in Figure 3, with only a single hidden layer, going from the hidden units to the outputs is essentially a standard regression or classification problem. The hidden units can be denoted by h and the outputs by Y. Different outputs can be denoted by subscripts i = 1, …, k and may represent different possible classifications, such as (in our case) a circle or a square. The paths from each hidden unit to each output are the weights, and those for the ith output are denoted by wi. These weights are also learned from the data, just like the weights used to create the hidden layer. For classification, it is common to use a final transformation, the softmax function, p_i = e^(z_i) / (e^(z_1) + … + e^(z_k)), where z_i is the weighted combination of hidden units for the ith output; this ensures that the estimates are positive (using the exponential function) and that the probabilities across classes sum to one. For linear regression, the identity function, which returns its input, is commonly used.

Confusion may arise as to why there are paths between every hidden unit and output as well as between every input and hidden unit. These are commonly drawn to represent that, a priori, any of these relations are allowed to exist. The weights must then be learned from the data, with zero or near-zero weights essentially equating to dropping unnecessary relations. This only scratches the surface of the conceptual and practical aspects of neural networks.
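To make the shallow network of Figure 3 concrete, here is a minimal sketch of a single forward pass written in Python with NumPy, included purely for illustration. The input values, weights, and class labels are made up for the example; in practice, the weights would be learned from data rather than set by hand.

import numpy as np

def sigmoid(x):
    # Squashes the weighted inputs into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    # Exponentiate and normalize so the class probabilities sum to one.
    e = np.exp(z - z.max())
    return e / e.sum()

# Four "pixel" inputs (a tiny, made-up image).
x = np.array([1.0, 0.0, 1.0, 0.0])

# Weights from the inputs to two hidden neurons (for example, a horizontal-line
# detector and a vertical-line detector), and from the hidden neurons to two
# output classes ("square" and "circle"). Values chosen arbitrarily for the sketch.
W_hidden = np.array([[ 0.8, -0.2],
                     [ 0.1,  0.9],
                     [ 0.7,  0.0],
                     [-0.3,  0.6]])
W_output = np.array([[ 1.2, -0.8],
                     [ 0.9, -0.5]])

h = sigmoid(x @ W_hidden)              # hidden-layer activations
probabilities = softmax(h @ W_output)  # class probabilities via softmax

print(dict(zip(["square", "circle"], probabilities.round(3))))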
For a slightly more in-depth introduction to neural networks, see Chapter 11 of The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009), which is also freely available at http://statweb.stanford.edu/~tibs/ElemStatLearn/. Next, we will turn to a brief introduction to deep neural networks.

Deep neural networks
Perhaps the simplest, if not the most informative, definition of a deep neural network (DNN) is that it is a neural network with multiple hidden layers. Although a relatively simple conceptual extension of neural networks, such a deep architecture provides valuable advances in terms of the capability of the models and new challenges in training them. Using multiple hidden layers allows a more sophisticated build-up from simple elements to more complex ones. When discussing neural networks, we considered the outputs to be whether the object was a circle or a square. In a deep neural network, many circles and squares could be combined to form other, more advanced shapes.

One can consider two complexity aspects of a model's architecture. One is how wide or narrow it is, that is, how many neurons there are in a given layer. The second is how deep it is, or how many layers of neurons there are. For data that truly has such deep architectures, a DNN can fit it more accurately with fewer parameters than a shallow neural network (NN), because more layers (each with fewer neurons) can be a more efficient and accurate representation; for example, because the shallow NN cannot build more advanced shapes from basic pieces, in order to provide accuracy equal to the DNN it must represent each unique object.

Again considering pattern recognition in images, if we are trying to train a model for text recognition, the raw data may be pixels from an image. The first layer of neurons could be trained to capture different letters of the alphabet, and then another layer could recognize sets of these letters as words. The advantage is that the second layer does not have to learn directly from the pixels, which are noisy and complex. In contrast, a shallow architecture may require far more parameters, as each hidden neuron would have to be capable of going directly from pixels in an image to a complete word, and many words may overlap, creating redundancy in the model.

One of the challenges in training deep neural networks is how to efficiently learn the weights. The models are often complex, and local minima abound, making the optimization problem a challenging one. One of the major advancements came in 2006, when it was shown that Deep Belief Networks (DBNs) could be trained one layer at a time (refer to A Fast Learning Algorithm for Deep Belief Nets by Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh (2006), available at http://www.cs.toronto.edu/~fritz/absps/ncfast.pdf). A DBN is a type of DNN with multiple hidden layers and connections between (but not within) layers; that is, a neuron in layer 1 may be connected to a neuron in layer 2, but may not be connected to another neuron in layer 1. This is essentially the same definition as that of a Restricted Boltzmann Machine (RBM), except that an RBM typically has one input layer and one hidden layer; an example is diagrammed in Figure 4:

Figure 4

The restriction of no connections within a layer is valuable as it allows much faster training algorithms to be used, such as the contrastive divergence algorithm. If several RBMs are stacked together, they can form a DBN. Essentially, the DBN can then be trained as a series of RBMs.
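Before looking at how the layers are stacked, it may help to see what training a single RBM looks like. The following is a toy Python/NumPy sketch of one RBM trained with a single step of contrastive divergence (CD-1) on random binary data; the layer sizes, learning rate, and number of epochs are arbitrary choices for the example, and in practice you would use a tested library and then stack several such layers.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: 100 binary "images" with 16 visible units.
data = (rng.random((100, 16)) > 0.5).astype(float)

n_visible, n_hidden = 16, 8
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_visible = np.zeros(n_visible)
b_hidden = np.zeros(n_hidden)
learning_rate = 0.1

for epoch in range(50):
    # Positive phase: drive the hidden units from the data.
    p_h = sigmoid(data @ W + b_hidden)
    h_sample = (rng.random(p_h.shape) < p_h).astype(float)

    # Negative phase: reconstruct the visible units, then recompute the hidden units.
    p_v = sigmoid(h_sample @ W.T + b_visible)
    p_h_recon = sigmoid(p_v @ W + b_hidden)

    # CD-1 update: difference between data-driven and reconstruction-driven statistics.
    positive = data.T @ p_h
    negative = p_v.T @ p_h_recon
    W += learning_rate * (positive - negative) / len(data)
    b_visible += learning_rate * (data - p_v).mean(axis=0)
    b_hidden += learning_rate * (p_h - p_h_recon).mean(axis=0)

# The hidden activations can now be used as the inputs to the next RBM in the stack.
hidden_features = sigmoid(data @ W + b_hidden)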
The first RBM layer is trained and used to transform raw data into hidden neurons, which are then treated as a new set of inputs in a second RBM, and the process is repeated until all layers have been trained. The benefits of the realization that DBNs could be trained one layer at a time extend beyond just DBNs, however. DBNs are sometimes used as a pre-training stage for a deep neural network. This allows the comparatively fast, greedy layer-by-layer training to be used to provide good initial estimates, which are then refined in the deep neural network using other, slower, training algorithms such as back propagation. So far we have been primarily focused on feed-forward neural networks, where the results from one layer and neuron feed forward to the next. Before closing this section, two specific kinds of deep neural networks that have grown in popularity are worth mentioning. The first is a Recurrent Neural Network (RNN) where neurons send feedback signals to each other. These feedback loops allow RNNs to work well with sequences. A recent example of an application of RNNs was to automatically generate click-bait such as One trick to great hair salons don't want you to know or Top 10 reasons to visit Los Angeles: #6 will shock you!. RNNs work well for such jobs as they can be seeded from a large initial pool of a few words (even just trending search terms or names) and then predict/generate what the next word should be. This process can be repeated a few times until a short phrase is generated, the click-bait. This example is drawn from a blog post by Lars Eidnes, available at http://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/. The second type is a Convolutional Neural Network (CNN). CNNs are most commonly used in image recognition. CNNs work by having each neuron respond to overlapping subregions of an image. The benefits of CNNs are that they require comparatively minimal pre-processing yet still do not require too many parameters through weight sharing (for example, across subregions of an image). This is particularly valuable for images as they are often not consistent. For example, imagine ten different people taking a picture of the same desk. Some may be closer or farther away or at positions resulting in essentially the same image having different heights, widths, and the amount of image captured around the focal object. As for neural networks, this description only provides the briefest of overviews as to what DNNs are and some of the use cases to which they can be applied. Summary This article presented a brief introduction to NNs and DNNs. Using multiple hidden layers, DNNs have been a revolution in machine learning by providing a powerful unsupervised learning and feature-extraction component that can be standalone or integrated as part of a supervised model. There are many applications of such models and they are increasingly used by large-scale companies such as Google, Microsoft, and Facebook. Examples of tasks for deep learning are image recognition (for example, automatically tagging faces or identifying keywords for an image), voice recognition, and text translation (for example, to go from English to Spanish, or vice versa). Work is being done on text recognition, such as sentiment analysis to try to identify whether a sentence or paragraph is generally positive or negative, which is particularly useful to evaluate perceptions about a product or service. 
Imagine being able to scrape reviews and social media for any mention of your product and analyze whether it was being discussed more favorably than the previous month or year or not! Resources for Article:   Further resources on this subject: Dealing with a Mess [article] Design with Spring AOP [article] Probability of R? [article]
Introducing OpenStack Trove

Packt
02 Mar 2016
17 min read
In this article, Alok Shrivastwa and Sunil Sarat, authors of the book OpenStack Trove Essentials, explain how OpenStack Trove truly and remarkably is a treasure or collection of valuable things, especially for open source lovers like us and, of course, it is an apt name for the Database as a Service (DBaaS) component of OpenStack. In this article, we shall see why this component shows the potential and is on its way to becoming one of the crucial components in the OpenStack world. In this article, we will cover the following: DBaaS and its advantages An introduction to OpenStack's Trove project and its components Database as a Service Data is a key component in today's world, and what would applications do without data? Data is very critical, especially in the case of businesses such as the financial sector, social media, e-commerce, healthcare, and streaming media. Storing and retrieving data in a manageable way is absolutely key. Databases, as we all know, have been helping us manage data for quite some time now. Databases form an integral part of any application. Also, the data-handling needs of different type of applications are different, which has given rise to an increase in the number of database types. As the overall complexity increases, it becomes increasingly challenging and difficult for the database administrators (DBAs) to manage them. DBaaS is a cloud-based service-oriented approach to offering databases on demand for storing and managing data. DBaaS offers a flexible and scalable platform that is oriented towards self-service and easy management, particularly in terms of provisioning a business' environment using a database of choice in a matter of a few clicks and in minutes rather than waiting on it for days or even, in some cases, weeks. The fundamental building block of any DBaaS is that it will be deployed over a cloud platform, be it public (AWS, Azure, and so on) or private (VMware, OpenStack, and so on). In our case, we are looking at a private cloud running OpenStack. So, to the extent necessary, you might come across references to OpenStack and its other services, on which Trove depends. XaaS (short for Anything/Everything as a Service, of which DBaaS is one such service) is fast gaining momentum. In the cloud world, everything is offered as a service, be it infrastructure, software, or, in this case, databases. Amazon Web Services (AWS) offers various services around this: the Relational Database Service (RDS) for the RDBMS (short for relational database management system) kind of system; SimpleDB and DynamoDB for NoSQL databases; and Redshift for data warehousing needs. The OpenStack world was also not untouched by the growing demand for DBaaS, not just by users but also by DBAs, and as a result, Trove made its debut with the OpenStack release Icehouse in April 2014 and since then is one of the most popular advanced services of OpenStack. It supports several SQL and NoSQL databases and provides the full life cycle management of the databases. Advantages Now, you must be wondering why we must even consider DBaaS over traditional database management strategies. Here are a few points you might want to consider that might make it worth your time. Reduced database management costs In any organization, most of their DBAs' time is wasted in mundane tasks such as creating databases, creating instances, and so on. 
They are not able to concentrate on tasks such as fine-tuning SQL queries so that applications run faster, not to mention the time taken to do it all manually (or with a bunch of scripts that need to be fired manually), so this in effect is wasting resources in terms of both developers' and DBAs' time. This can be significantly reduced using a DBaaS. Faster provisioning and standardization With DBaaS, databases that are provisioned by the system will be compliant with standards as there is very little human intervention involved. This is especially helpful in the case of heavily regulated industries. As an example, let's look at members of the healthcare industry. They are bound by regulations such as HIPAA (short for Health Insurance Portability and Accountability Act of 1996), which enforces certain controls on how data is to be stored and managed. Given this scenario, DBaaS makes the database provisioning process easy and compliant as they only need to qualify the process once, and then every other database coming out of the automated provisioning system is then compliant with the standards or controls set. Easier administration Since DBaaS is cloud based, which means there will be a lot of automation, administration becomes that much more automated and easier. Some important administration tasks are backup/recovery and software upgrade/downgrade management. As an example, with most databases, we should be able to push configuration modifications within minutes to all the database instances that have been spun out by the DBaaS system. This ensures that any new standards being thought of can easily be implemented. Scaling and efficiency Scaling (up or down) becomes immensely easy, and this reduces resource hogging, which developers used as part of their planning for a rainy day, and in most cases, it never came. In the case of DBaaS, since you don't commit resources upfront and only scale up or down as and when necessary, resource utilization will be highly efficient. These are some of the advantages available to organizations that use DBaaS. Some of the concerns and roadblocks for organizations in adopting DBaaS, especially in a public cloud model, are as follows: Companies don't want to have sensitive data leave their premises. Database access and speed are key to application performance. Not being able to manage the underlying infrastructure inhibits some organizations from going to a DBaaS model. In contrast to public cloud-based DBaaS, concerns regarding data security, performance, and visibility reduce significantly in the case of private DBaaS systems such as Trove. In addition, the benefits of a cloud environment are not lost either. Trove OpenStack Trove, which was originally called Red Dwarf, is a project that was initiated by HP, and many others contributed to it later on, including Rackspace. The project was in incubation till the Havana release of OpenStack. It was formally introduced in the Icehouse release in April 2014, and its mission is to provide scalable and reliable cloud DBaaS provisioning functionality for relational and non-relational database engines. As of the Liberty release, Trove is considered as a big-tent service. Big-tent is a new approach that allows projects to enter the OpenStack code namespace. In order for a service to be a big-tent service, it only needs to follow some basic rules, which are listed here. 
This allows the projects to have access to the shared teams in OpenStack, such as the infrastructure teams, release management teams, and documentation teams. The project should: Align with the OpenStack mission Subject itself to the rulings of the OpenStack Technical Committee Support Keystone authentication Be completely open source and open community based At the time of writing the article, the adoption and maturity levels are as shown here: The previous diagram shows that the Age of the project is just 2 YRS and it has a 27% Adoption rate, meaning 27 of 100 people running OpenStack also run Trove. The maturity index is 1 on a scale of 1 to 5. It is derived from the following five aspects: The presence of an installation guide Whether the Adoption percentage is greater or lesser than 75 Stable branches of the project Whether it supports seven or more SDKs Corporate diversity in the team working on the project Without further ado, let's take a look at the architecture that Trove implements in order to provide DBaaS. Architecture The trove project uses some shared components and some dedicated project-related components as mentioned in the following subsections. Shared components The Trove system shares two components with the other OpenStack projects, the backend database (MySQL/MariaDB), and the message bus. The message bus The AMQP (short for Advanced Message Queuing Protocol) message bus brokers the interactions between the task manager, API, guest agent, and conductor. This component ensures that Trove can be installed and configured as a distributed system. MySQL/MariaDB MySQL or MariaDB is used by Trove to store the state of the system. API This component is responsible for providing the RESTful API with JSON and XML support. This component can be called the face of Trove to the external world since all the other components talk to Trove using this. It talks to the task manager for complex tasks, but it can also talk to the guest agent directly to perform simple tasks, such as retrieving users. The task manager The task manager is the engine responsible for doing the majority of the work. It is responsible for provisioning instances, managing the life cycle, and performing different operations. The task manager normally sends common commands, which are of an abstract nature; it is the responsibility of the guest agent to read them and issue database-specific commands in order to execute them. The guest agent The guest agent runs inside the Nova instances that are used to run the database engines. The agent listens to the messaging bus for the topic and is responsible for actually translating and executing the commands that are sent to it by the task manager component for the particular datastore. Let's also look at the different types of guest agents that are required depending on the database engine that needs to be supported. The different guest agents (for example, the MySQL and PostgreSQL guest agents) may even have different capabilities depending on what is supported on the particular database. This way, different datastores with different capabilities can be supported, and the system is kept extensible. The conductor The conductor component is responsible for updating the Trove backend database with the information that the guest agent sends regarding the instances. It eliminates the need for direct database access by all the guest agents for updating information. 
This is like the way the guest agent also listens to the topic on the messaging bus and performs its functions based on it. The following diagram can be used to illustrate the different components of Trove and also their interaction with the dependent services: Terminology Let's take a look at some of the terminology that Trove uses. Datastore Datastore is the term used for the RDBMS or NoSQL database that Trove can manage; it is nothing more than an abstraction of the underlying database engine, for example, MySQL, MongoDB, Percona, Couchbase, and so on. Datastore version This is linked to the datastore and defines a set of packages to be installed or already installed on an image. As an example, let's take MySQL 5.5. The datastore version will also link to a base image (operating system) that is stored in Glance. The configuration parameters that can be modified are also dependent on the datastore and the datastore version. Instance An instance is an instantiation of a datastore version. It runs on OpenStack Nova and uses Cinder for persistent storage. It has a full OS and additionally has the guest agent of Trove. Configuration group A configuration group is a bunch of options that you can set. As an example, we can create a group and associate a number of instances to one configuration group, thereby maintaining the configurations in sync. Flavor The flavor is similar to the Nova machine flavor, but it is just a definition of memory and CPU requirements for the instance that will run and host the databases. Normally, it's a good idea to have a high memory-to-CPU ratio as a flavor for running database instances. Database This is the actual database that the users consume. Several databases can run in a single Trove instance. This is where the actual users or applications connect with their database clients. The following diagram shows these different terminologies, as a quick summary. Users or applications connect to databases, which reside in instances. The instances run in Nova but are instantiations of the Datastore version belonging to a Datastore. Just to explain this a little further, say we have two versions of MySQL that are being serviced. We will have one datastore but two datastore versions, and any instantiation of that will be called an instance, and the actual MySQL database that will be used by the application will be called the database (shown as DB in the diagram). A multi-datastore scenario One of the important features of the Trove system is that it supports multiple databases to various degrees. In this subsection, we will see how Trove works with multiple Trove datastores. In the following diagram, we have represented all the components of Trove (the API, task manager, and conductor) except the Guest Agent databases as Trove Controller. The Guest Agent code is different for every datastore that needs to be supported and the Guest Agent for that particular datastore is installed on the corresponding image of the datastore version. The guest agents by default have to implement some of the basic actions for the datastore, namely, create, resize, and delete, and individual guest agents have extensions that enable them to support additional features just for that datastore. The following diagram should help us understand the command proxy function of the guest agent. Please note that the commands shown are only indicative, and the actual commands will vary. 
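The command proxy role of the guest agent, where the task manager sends an abstract request and each datastore's agent translates it into engine-specific commands, can be sketched in a few lines of Python. This is not Trove's actual code; the class and method names here are invented purely to illustrate the pattern.

from abc import ABC, abstractmethod

class GuestAgent(ABC):
    """Abstract guest agent: receives generic commands from the task manager."""

    @abstractmethod
    def create_database(self, name: str) -> str:
        ...

class MySQLGuestAgent(GuestAgent):
    def create_database(self, name: str) -> str:
        # Translate the abstract request into a MySQL-specific statement.
        return f"CREATE DATABASE `{name}`;"

class MongoDBGuestAgent(GuestAgent):
    def create_database(self, name: str) -> str:
        # MongoDB creates a database implicitly when a collection is first used.
        return f"use {name}; db.createCollection('init');"

def handle_prepare_message(agent: GuestAgent, db_name: str) -> None:
    # The task manager only knows the abstract operation; the agent knows the engine.
    print(agent.create_database(db_name))

handle_prepare_message(MySQLGuestAgent(), "sales")
handle_prepare_message(MongoDBGuestAgent(), "sales")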
At the time of writing this article, Trove's guest agents are installable only on Linux; hence, only databases on Linux systems are supported. Feature requests (https://blueprints.launchpad.net/trove/+spec/mssql-server-db-support) were created for the ability to build a guest agent for Windows and support Microsoft SQL Server databases, but they had not been approved at the time of writing and might remain a remote possibility.

Database software distribution support
Trove supports various databases; the following list shows the databases supported by this service at the time of writing. Automated installation is available for all the different databases, but there is some level of difference when it comes to the configuration capabilities of Trove with respect to different databases. This has a lot to do with the lack of a common configuration base among the different databases. At the time of writing this article, MySQL and MariaDB have the most configuration options available. The supported databases and versions are as follows:

MySQL: 5.5, 5.6
Percona: 5.5, 5.6
MariaDB: 5.5, 10.0
Couchbase: 2.2, 3.0
Cassandra: 2.1
Redis: 2.8
PostgreSQL: 9.3, 9.4
MongoDB: 2.6, 3.0
DB2 Express: 10.5
CouchDB: 1.6

So, as you can see, almost all the major database applications that can run on Linux are already supported on Trove.

Putting it all together
Now that we have understood the architecture and terminologies, we will take a look at the general steps that are followed:

1. Horizon or the Trove CLI requests a new database instance, passing the datastore name and version, along with the flavor ID and volume size, as mandatory parameters. Optional parameters such as the configuration group, AZ, replica-of, and so on can also be passed.
2. The Trove API requests Nova for an instance with the particular image and for a Cinder volume of a specific size to be added to the instance.
3. The Nova instance boots and follows these steps:
The cloud-init scripts are run (like all other Nova instances)
The configuration files (for example, trove-guestagent.conf) are copied down to the instance
The guest agent is installed
4. The Trove API will also have sent the request to the task manager, which then sends the prepare call to the message bus topic.
5. After booting, the guest agent listens to the message bus for any activities for it to do, and once it finds a message for itself, it processes the prepare command and performs the following functions:
Installing the database distribution (if not already installed on the image)
Creating the configuration file with the default configuration for the database engine (with any configuration from the associated configuration groups overriding the defaults)
Starting the database engine and enabling auto-start
Polling the database engine for availability (until the database engine is available or the timeout is reached)
Reporting the status back to the Trove backend using the Trove conductor
6. The task manager reports back to the API and the status of the instance is changed.

Use cases
So, if you are wondering where we can use Trove, it fits in rather nicely with the following use cases.

Dev/test databases
Dev/test databases are an absolute killer feature, and almost all companies that start using Trove will definitely use it for their dev/test environments. This provides developers with the ability to freely create and dispose of database instances at will. This ability helps them be more productive and removes any lag between when they want an instance and when they get it.
The capability of being able to take a backup, run a database, and restore the backup to another server is especially key when it comes to these kinds of workloads. Web application databases Trove is used in production for any database that supports low-risk applications, such as some web applications. With the introduction of different redundancy mechanisms, such as master-slave in MySQL, this is becoming more suited to many production environments. Features Trove is moving fast in terms of the features being added in the various releases. In this section, we will take a look at the features of three releases: the current release and the past two. The Juno release The Juno release saw a lot of features being added to the Trove system. Here is a non-exhaustive list: Support for Neutron: Now we can use both nova-network and Neutron for networking purposes Replication: MySQL master/slave replication was added. The API also allowed us to detach a slave for it to be promoted Clustering: Mongo DB cluster support was added Configuration group improvements: The functionality of using a default configuration group for a datastore version was added. This allows us to build the datastore version with a base configuration of your company standards Basic error checking was added to configuration groups The Kilo release The Kilo release majorly worked on introducing a new datastore. The following is the list of major features that were introduced: Support for the GTID (short for global transaction identifier) replication strategy New datastores, namely Vertica, DB2, and CouchDB, are supported The Liberty release The Liberty release introduced the following features to Trove. This is a non-exhaustive list. Configuration groups for Redis and MongoDB Cluster support for Redis and MongoDB Percona XtraDB cluster support Backup and restore for a single instance of MongoDB User and database management for Mongo DB Horizon support for database clusters A management API for datastores and versions The ability to deploy Trove instances in a single admin tenant so that the Nova instances are hidden from the user In order to see all the features introduced in the releases, please look at the release notes of the system, which can be found at these URLs: Juno : https://wiki.openstack.org/wiki/ReleaseNotes/Juno Kilo : https://wiki.openstack.org/wiki/ReleaseNotes/Kilo Liberty : https://wiki.openstack.org/wiki/ReleaseNotes/Liberty Summary In this article, we were introduced to the basic concepts of DBaaS and how Trove can help with this. With several changes being introduced and a score of one on five with respect to maturity, it might seem as if it is too early to adopt Trove. However, a lot of companies are giving Trove a go in their dev/test environments as well as for some web databases in production, which is why the adoption percentage is steadily on the rise. A few companies that are using Trove today are giants such as eBay, who run their dev/test Test databases on Trove; HP Helion Cloud, Rackspace Cloud, and Tesora (which is also one of the biggest contributors to the project) have DBaaS offerings based on the Trove component. Trove is increasingly being used in various companies, and it is helping in reducing DBAs' mundane work and improving standardization. Resources for Article: Further resources on this subject: OpenStack Performance, Availability [article] Concepts for OpenStack [article] Implementing OpenStack Networking and Security [article]
Breaking the Bank

Packt
01 Mar 2016
32 min read
In this article by Jon Jenkins, author of the book Learning Xero, covers the Xero core bank functionalities, including one of the most innovative tools of our time: automated bank feeds. We will walk through how to set up the different types of bank account you may have and the most efficient way to reconcile your bank accounts. If they don't reconcile, you will be shown how you can spot and correct any errors. Automated bank feeds have revolutionized the way in which a bank reconciliation is carried out and the speed at which you can do it. Thanks to Xero, there is no longer an excuse to not keep on top of your bank accounts and therefore maintain accurate and up-to-date information for your business's key decision makers. These are the topics we'll be covering in this article: Setting up a bank feed Using rules to speed up the process Completing a bank reconciliation Dealing with common bank reconciliation errors (For more resources related to this topic, see here.) Bank overview Reconciling bank accounts has never been as easy or as quick, and we are only just at the beginning of this journey as Xero continues to push the envelope. Xero is working with banks to not only bring bank data into your accounting system but to also push it back again. That's right; you could mark a supplier invoice paid in Xero and it could actually send the payment from your bank account. Dashboard When you log in to Xero, you are presented with the dashboard, which gives a great overview of what is going on within the business. It is also an excellent place to navigate to the main parts of Xero that you will need. If you have several bank accounts, the summary pane that shows the cash in and out during a month, as shown below, is very useful as you can hover over the bar chart to get a quick snapshot of the total cash in and out for the month with no effort at all. If you want the chart to sit at the top of your dashboard, click on Edit Dashboard at the bottom of the page, and drag and drop the chart. When finished, click on Done at the bottom of the page to lock them in place. Reconciling bank accounts is fundamental to good bookkeeping, and only once the accounts have been reconciled do you know that your records are up-to-date. It isn't worth spending lots of time looking at reports if the bank accounts haven't been reconciled, as there may be important items missing from your records. By default, all bank accounts added will be shown on your dashboard, which shows the main account information, the number of unreconciled items, and the balance per Xero and per statement. You may wish to just see a few key bank accounts, in which case you can turn some of them off by going to Accounts | Bank Accounts, where you will see a list of all bank accounts. Here, you can choose to remove the bank accounts showing on the dashboard by unchecking the Show account on Dashboard option. You can also choose the Change order option, which allows you to decide the order in which you see the bank accounts on the dashboard. Click on the up and down arrows to move the accounts as required. Bank feeds If you did not set up a bank account when you were setting up Xero, then we recommend you do that now, as you cannot set up a bank feed without one. You can do this from the dashboard by clicking on the Add Bank Account button or by navigating to Accounts | Bank Accounts and Add Bank Account. Then, you are presented with of list of options including Bank Account, Credit Card, or PayPal. 
Enter your account details as requested and click on Save. It is very important at this stage to note that you may be presented with several options for your bank as they offer different accounts. If you choose the wrong one, your feed will not work. Some banks charge for a direct bank feed; you do not have to adhere to this, so ignore the feeds ending with Direct Bank Feed and select the alternative one. The difference between a Direct Bank Feed and the Yodlee service that Xero uses is that the data comes directly from the bank and not via a third party, so is deemed to be more reliable. Now that you have a bank account, you can add the feed by clicking on the Get bank feeds button, as shown in the following screenshot: On the Activate Bank Feed page, the fields will vary depending on the bank you use and the type of account you selected earlier. Enter the User Credentials as requested. You will then see a screen with the text Connecting to Bank, which states it might take a few minutes, so please bear with it; you are almost there. When prompted, select the bank account from the dropdown called Select the matching bank feed... that matches the account you are adding and choose whether you wish to import from a certain date or collect as much data as possible. How far it goes back varies by bank. If you are converting from another system, it would be wise to use the conversion date as the date from which you wish to start transactions, in order to avoid bringing in transactions processed in your old system (that is if your conversion date was May 31, 2015, you would use June 1, 2015. Once you are happy with the information provided, click on OK. If you have several accounts, such as a savings account, then simply follow the process again for each account. Refresh feed Each bank feed is different, and some will automatically update; others, however, require refreshing. You can usually tell which bank accounts require a manual refresh, as they will show the following message at the bottom of the bank account tile on the dashboard: To refresh the feed from the dashboard, find the account to update and click the Manage button in the top right-hand corner and then Refresh Bank Feed. You can also do this from within Accounts | Bank Accounts | Manage Accounts | Refresh Bank Feed. To update your bank feed, you will need to refresh the feed each time you want to reconcile the bank account. Import statements You get over most disappointments in life, unlike when you find out that the bank account you have does not have a bank feed available. Your options here are simple: go and change banks. But if that is too much hassle for you, then you could always just import a file. Xero accepts OFX, QIF, and CSV. Should your bank offer a selection, go with them in this order. OFX and QIF files are formatted and should import without too many problems. CSV, on the other hand, is a different matter. Each bank CSV download will come in a different format and will need some work before it is ready for importing. This takes some time, so I would recommend using the Xero Help Guide and searching Import a Bank Statement to get the file formatted correctly. If you only do things on a monthly basis, uploading a statement is not too much of a chore. We would say at this point that the automated bank feed is one of the most revolutionary things to come out of accounting software, so not using it is a crime. You simply are not enjoying the full benefits of using Xero and cloud software without it. 
Petty cash It probably costs the business more to find missing petty cash discrepancies than the discrepancy totals. Our advice is simple: try not to maintain a petty cash account if you can; it is just one more thing to reconcile and manage. We would advocate using a company debit card where possible, as the transactions will then go through your main bank account and you will know what is missing. Get staff to submit an expense claim, and if that is too much hassle, treat payments in and out as if they have gone through the director's loan account, as that is what happens in most small businesses. Should you wish to operate a petty cash account, you will need to mark the transactions as reconciled manually, as there is no feed to match bank payments and receipts against. In order to do this, you must first turn on the ability to mark transactions as reconciled manually. This can be found hiding in the Help section. When you click on Help, you should then see an option to Enable Mark as Reconciled, which you will need to click to turn on this functionality. Now that you have the ability to Mark as Reconciled, you can begin to reconcile the petty cash. Go to Manage | Reconcile Account. You will be presented with the four tabs below (if Cash Coding is not turned on for your user role, you will not see that tab). The Reconcile tab should be blank, as there is no feed or import in place. You will want to go to Account transactions, which is where the items you have marked as paid from petty cash will live. Underneath this section, you will also find a button called New Transactions, where you can create transactions on the fly that you may have missed or are not going to raise a sales invoice or supplier bill for. You can see from the following screenshot that we have added an example of a Spend Money transaction, but you can also create a Receive Money transaction when clicking on the New Transaction button. Click on Save when you have finished entering the details of your transaction. If you have outstanding sales invoices or purchase invoices that have been paid through the petty cash account, then you will need to mark them as paid using the petty cash bank account in order for them to appear in the Account Transactions section. To do this, navigate to those items and then complete the Make a Payment or Receive a Payment section in the bottom left-hand corner. From the main Account transactions screen, you can then mark your transactions as reconciled. Check off the items you wish to reconcile using the checkboxes on the left, then click on More | Mark as Reconciled. When you have completed the process, your bank account balance in Xero should match that of your petty cash sheet. The status of transactions marked as Reconciled Manually will change to Reconciled in black. When a transaction is reconciled, it has come in from either an import or a bank feed, and it will be green. If it is unreconciled, it will be orange. Loan account A loan account works in the same way as a bank account, and we would recommend that you set it up if a feed is available from your bank. Managing loans in Xero is easy, as you can set up a bank rule for the interest and use the Transfer facility to reconcile any loan repayments. Credit cards Like adding bank accounts, you can add credit cards from the dashboard by clicking on the Add Bank Account button or by navigating to Accounts |Bank Accounts |Add Bank Account | Credit Card. 
Add a card You may see several options for the bank you use, so double-check you are using the right option or the feed will not work. If there is no feed set up for your particular bank or credit card account, you will be notified as follows: In this case, you will need to either import a statement in either the OFX, QIF, or CSV format, or manually reconcile your credit card account. The process is the same as that detailed above for reconciling petty cash and will be matched against your credit card statement. If a feed is available, enter the login credentials requested to set up the feed in the same fashion as when adding a new bank account feed. Common issues Credit cards act in a different way than a bank account, in that each card is an account on its own, separate from the main account on which interest and balance payments are allocated. This means that even if you have just one business credit card, you will, in effect, have two accounts with the bank. You can add a feed for each account if you wish, but for the main account, the only transactions that will go through it are any interest accrued and balance payments to clear the card. We would suggest you set up the credit card account as a feed, as this is where you will see most transactions and therefore save the most time in processing. Each time interest is accrued, you will need to post it as a Spend Money transaction, and each time you make a payment, the amount will be a Transfer from one of your other accounts. Both these transactions will need to be marked as Reconciled Manually, as they will not appear on your credit card feed setup. This is done in the same way as oultlined in the Petty cash section earlier PayPal Just like a bank account, you can sync Xero with your PayPal account, even in multiple currencies. The ability to do this, coupled with using bank rules, can help supercharge your processing ability and cut down on posting errors. Add a feed There is a little bit of configuration required to set up a PayPal account. Go to Accounts | Bank Accounts | Add Bank Account | PayPal. Add the account name as you wish for it to appear in Xero and the currency for the account. To use the feed (why wouldn't you?), check  the Set up automatic PayPal import option, which will then bring up the other options shown in the following screenshot: As previously suggested, if converting from another system, then import transactions from the conversion dates, as with all previous transactions, should be dealt with in your old accounting system. Click on Save, and you will receive an e-mail from Xero to confirm your PayPal e-mail address. Click on the activation link in the e-mail. To complete the setup process, you need to update your PayPal settings in order for Xero to turn on the automatic feeds. In PayPal, go to My Account | Profile | My Selling Tools. Next to the API access, click on Update | Option 1. The box should then be Grant API Permission. In the Third Party Permission field, enter paypal_api1.xero.com, then click on Lookup. Under Available Permissions, make sure you check the following options: Click on Add, and you have finished the setup process. If you have multiple currency accounts, then complete this process for each currency. Bank rules Bank rules give you the ability to automate the processing of recurring payments and receipts based on your chosen criteria, and they can be a massive time saver if you have many transactions to deal with. An example would be the processing of PayPal fees on your PayPal account. 
Rather than having to process each one, you could set up a bank rule to deal with the transaction. Bank rules cannot be used for bank transfers or allocating supplier payment or customer receipts. Add bank rules You can add a bank rule directly from the bank statement line by clicking on Create rule above the bank line details. This means waiting for something to come through the bank first, which we think makes sense, as that detail is taken into consideration when setting up the bank rule, making it simpler to set up. You can also enter bank rules you know will need adding by going to the relevant bank account and clicking Manage Account | Bank Rules | Create Rule. We have broken the bank rule down into different sections. Section 1 allows you to set the conditions that must be present in order for the bank rule to trigger. Using equals means the payee or description in this example must match exactly. If you were to change it to contains, then only part of the description need be present. This can be very useful when the description contains a reference number that changes each month. You do not want the bank rule to fail, so you might choose to remove the reference number and change the condition to contains instead. You must set at least one condition. Section 2 allows you to set a contact, which we suggest you do; otherwise, you will have to do this on the Bank Account screen each time before being able to reconcile that bank statement line. Section 3 allows you to fix a value to an account code. This can be useful if the bank rule you are setting up contains an element of a fixed amount and variable amount. An example might be a telephone bill where the line rental is fixed and the balance is for call charges that will vary month by month. Section 4 allows you to allocate a percentage to an account code. If there is not a fixed value amount in section 3, you can just use section 4 and post 100% of the cost to the account code of your choice. Likewise, if you had a bank statement line that you wanted to split between account codes, then you could do so by entering a second line and using the percentage column. Section 5 allows you to set a reference to be used when the bank rule runs and there are five options. We would suggest not using the by me during bank rec option, as this again creates extra work, since you will have to fill it in each time before you can reconcile that bank statement line. Section 6 allows you to choose which bank account you want the bank rule to run on. This is useful if you start paying for an item out of a different bank account, as you can edit the rule and change the bank account rather than having to create the rule all over again. Section 7 allows you to set a title for the bank rule. Use something that will make it easy for you to remember when on the Bank Rules screen. Edit bank rules If your bank rules are not firing the way you expected or at all, then you will want to edit them to get them right. It is worth spending time in this area, as once you have mastered setting up bank rules, they will save you time. To edit a bank rule, you will need to navigate to Accounts | Bank Accounts | Manage Accounts | Bank Rules. Click on the bank rule you wish to edit, make the necessary adjustments, and then click on Save. You will know if the bank rule is working, as it appears like the following when reconciling your bank account. 
If you do not wish to apply the rule, you can click on Don't apply rule in the bottom-left corner, or if you wish to check what it is going to do, click on View details first to verify the bank rule is going to post where you prefer. Re-order bank rules The order in which your bank rules sit is the order in which Xero runs them. This is important to remember if you have bank rules set up that may conflict with each other and not return the result you were expecting. An example might be you purchasing different items from a supermarket, such as petrol and stationery. In this example, we will call it Xeroco. In most instances, the bank statement line will show two different descriptions or references, in this case Xeroco for the main store and Xeroco Fuel for the gas station. You will need to set up your rules carefully, as using only contains for Xeroco will mean that your postings could end up going to the wrong account code. You would want the Xeroco Fuel bank rule to sit above the Xeroco rule. Because they both contain the same description, if Xeroco was first, it would always trigger and everything would get posted to stationery, including the petrol. If you set Xeroco Fuel as the first bank rule to run if the bank statement line does not contain both words, it will continue and then run the Xeroco rule, which would prove successful. Gas will get posted to fuel and stationery will get posted to stationery. You can drag and drop bank rules to change the order in which they run. Hover over the two dots to the left of the bank rule number and you can drag them to the appropriate position. Bank reconciliation Bank reconciliation is one of the main drivers in knowing when your books and records are up-to-date. The introduction of automated bank feeds has revolutionized the way in which we can complete a bank reconciliation, which is the process of matching what has gone through the bank account and what has been posted in your accounting system. Below are some ways to utilize all the functionality in the Xero toolkit. Auto Suggest Xero is an intuitive system; it learns how you process transactions and is also able to make suggestions based on what you have posted. As shown below, Xero has found a match for the bank statement line on the left, which is why the right-hand panel is now green and you can see the OK button to reconcile the transaction, provided you are happy it is the correct selection. You can choose to turn Auto Suggest off. At the bottom of each page in the bank screen, you will find a checkbox, as shown in the following screenshot. Simply uncheck the Suggest previous entries box to turn it off. The more you use Xero, the better it learns, so we would advise sticking with it. It is not a substitute for checking, however, so please check before hitting the OK button. Find & Match When Xero cannot find a match and you know the bank statement line in question probably has an associated invoice or bill in the system, you can use Find & Match in the upper right-hand corner, as shown in the following screenshot: You can look through the list of unreconciled bank transactions shown in the panel or you can opt to use the Search facility and search by name, reference, or amount. In this example, you can see that we have now found two transactions from SMART Agency that total the £4,500 spent. As you can see in the following screenshot, there is also an option next to the monetary amounts that will allow you to split the transaction. 
In the example shown in the screenshot, you can see that we have now found two transactions from SMART Agency that total the £4,500 spent. As you can also see, there is an option next to the monetary amounts that will allow you to split the transaction. This is useful if someone has not paid the invoice in full. If the amount received was only £2,500 in total, for example, you could use Split to allocate £1,000 against the first transaction and £1,500 against the second transaction. When checked off, these turn green, and you can click on OK to reconcile that bank statement line. If you cannot find a match, you will need to investigate what it relates to and whether you are missing some paperwork.

If we had been clever when making the original supplier payment, we could have used the Batch Payment option in Accounts | Purchases | Awaiting Payment, checking off the items that make up the amount paid. Batch Payment would have enabled us to tell Xero that there was a single payment made totaling £4,500, and Auto Suggest would have picked this up, making the reconciliation easier and quicker. You have already done the hard bit by working out how much to pay suppliers; you don't want to have to do it again when an amount comes through the bank and you can't remember what it was for. It is also good practice, as it means you will not inadvertently pay the same supplier again, since the bill will be marked as paid.

The same can be done for customer invoices using the Deposit button. This is very helpful when receiving check deposits or remittance advice well in advance of the actual receipt. By marking the invoices as paid, you will not chase customers unnecessarily for money they have already sent, causing bad feelings along the way.

Create

There will be occasions when you will not have a bill or invoice in Xero against which to reconcile the bank statement line. In these situations, you will need to create a transaction to clear the bank statement line. In the example below, you can see that we have entered who the contact is, what the cost relates to, and why we spent the money. Xero will now allow us to clear the bank statement line, as the OK button is visible. We would suggest that this option be used sparingly, as you should have paperwork posted into Xero in the form of a bill or invoice to deal with the majority of your bank statement lines.

Bank transfers

When you receive money in from, or transfer money out to, another bank account you have set up within Xero, there is a very simple way to deal with those transactions. Click on the Transfer tab and choose the bank account from the dropdown. You will then be able to reconcile that bank statement line. In the account that you have made the transfer to, you will find that Xero makes the auto suggestion for you when you reconcile that bank account.

Discuss and comments

When you are performing the bank reconciliation, you may find that you get stuck and genuinely do not know what to do with a statement line. This is where the Discuss tab, shown in the following screenshot, can help. You can simply enter a note to yourself, for someone else in the business, or for your advisor to take a look at. Don't forget to click on Save when you are done. If someone can answer your query, they can then enter their comment in the Discuss tab and save it. Note that at present there is no notification process when you save a comment in the Discuss tab, so you are reliant on someone regularly checking it. You will see something similar to the note underneath the business name when you log in to Xero, so you can see there is a comment that needs action.
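Before moving on to the Reconciliation Report, here is a tiny sketch of the arithmetic behind the Split option described under Find & Match above: a part payment is allocated against the matched invoices in turn until it runs out. The invoice amounts are hypothetical and this is only an illustration, not Xero's implementation.

```python
# Illustrative sketch of splitting a part payment across matched invoices:
# allocate the amount received against each invoice in turn until it runs out.
# The invoice amounts are hypothetical.
def split_payment(amount_received, invoices):
    allocations = []
    remaining = amount_received
    for name, amount_due in invoices:
        allocated = min(amount_due, remaining)
        if allocated > 0:
            allocations.append((name, round(allocated, 2)))
        remaining -= allocated
    return allocations

invoices = [("SMART Agency INV-001", 1000.00), ("SMART Agency INV-002", 3500.00)]
print(split_payment(2500.00, invoices))
# [('SMART Agency INV-001', 1000.0), ('SMART Agency INV-002', 1500.0)]
```

In this sketch, the unallocated £2,000 on the second invoice simply remains outstanding until the balance is received.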
Reconciliation Report

This is the major tool in your armory for checking whether your bank reconciles. There is no greater feeling in life than your bank account reconciling and there being no unpresented items left hanging around. To run the report from within a bank account, click on the Reconciliation Report button next to the Manage Account button. From here, you can choose which bank account you wish to run the report for, so you do not need to keep moving between accounts, and there is also a date option, as you will probably want to run the report to various dates, especially if you encounter a problem.

On the reconciliation report, you will see the balance in Xero, which is the cashbook balance, that is, what would be in the bank if all the outstanding payments and receipts posted in Xero cleared and all the bank statement lines that have come from the bank feed were processed. The outstanding payments and receipts are invoices and bills you have marked as paid in Xero but that have not been matched to a bank statement line yet. You need to keep an eye on these, as older unreconciled items would normally indicate a misallocation or an uncleared item. Plus Un-Reconciled Bank Statement Lines are those items that have come through on a feed but have not yet been allocated to something in Xero. This might mean that there are missing bills or sales invoices in Xero, for example. Statement Balance is the number that should match what is on your online or paper statement, whether it is a bank account, credit card, or PayPal account. If the figures do not match, then it will need investigating. In the Common errors and corrections section below, we have highlighted some of the things that may have caused the imbalance and some ideas on how to rectify the situation.

Manual reconciliation

If you are unable to set up a bank feed or import a statement, then you can still reconcile bank accounts in Xero; it just feels a bit like going back in time. To complete a manual reconciliation, you will need to follow the same process as used to process petty cash, discussed earlier in this article.

Common errors and corrections

Despite all the innovation and technological advances Xero has made in this area, there are still things that can go wrong, some human, some machine. The main thing is to recognize this and know how to deal with it in the event that it happens. We have highlighted some of the more common issues and their resolutions in the following subsections.

Account not reconciling

There is no greater feeling than when you get that big green checkmark telling you that you have reconciled all your transactions. Fantastic, job done, you think! But not quite. You need to check your bank statement, credit card statement, loan statement, or PayPal statement to make sure it definitely matches, as per the preceding Reconciliation Report section. The job's not done until you have completed the manual check.
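To see how the numbers on the reconciliation report hang together, here is a simplified worked example with made-up figures. The sign conventions are our own summary of the report's layout, so always check against the report itself.

```python
# Simplified, illustrative view of the reconciliation report arithmetic.
# All figures are made up.
balance_in_xero        = 10_000.00  # the cashbook balance shown on the report
outstanding_payments   =  1_200.00  # paid in Xero, not yet matched to a statement line
outstanding_receipts   =    500.00  # receipted in Xero, not yet matched to a statement line
unreconciled_lines_net =    300.00  # on the bank feed, not yet processed in Xero (receipts minus payments)

statement_balance = (balance_in_xero
                     + outstanding_payments
                     - outstanding_receipts
                     + unreconciled_lines_net)

print(statement_balance)  # 11000.0, which should agree with your online or paper statement
```

If the computed figure and the actual statement disagree, the size of the difference is itself a useful clue, as it will often equal a missing or duplicated statement line.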
Duplicated statement lines

As with all things technology, there is a chance that things can go wrong, and every now and again you may find that your bank feed has duplicated a line item. This is why it is so important to check the actual statement against what is in Xero; it is the only way to truly know whether the accounts reconcile. A direct bank feed that costs money is deemed to be more robust, and some feeds through Yodlee are better than others. It all depends on your bank, so it is worth checking with the Xero community to get some guidance from fellow Xeroes.

If you are convinced that you have a duplicated bank statement line, you can choose to delete it by finding the offending item in your account and clicking on the cross in the top left-hand corner of the bank statement line. When you hover over the cross, the item will turn red. Use this sparingly and only when you know you have a duplicated line.

Missing statement lines

As with duplicated statement lines, there is also the possibility of a bank statement line not being synced, and this can be picked up when checking that the Xero bank account figure matches that of your online or paper bank statement. If they do not match, then we recommend using the reconciliation report and working backwards month by month, and then week by week, to isolate the date at which the bank last reconciled. Once you have narrowed it down to a week, you can then work day by day until you find the date, and then check off the items in Xero against those on your bank statement until you find the missing items (a small sketch of this narrowing-down idea appears just before the Understanding reports section). If there are several missing items, we would suggest doing an import via OFX, QIF, or CSV, but if there are only a few, then it is probably best to enter them manually and then mark them as reconciled so that the bank will reconcile.

Remove & Redo

We know you are great at what you do, but everyone has an off day. If you have allocated something incorrectly, you can easily amend it. Auto Suggest is fantastic, but you may get carried away and just keep hitting the OK button without paying enough attention. This is particularly problematic for businesses dealing with lots of invoices for similar amounts. If you do find that you have made an error, then you can remove the original allocation and redo it. You can do this by going to Accounts | Bank Accounts | Manage Account | Reconcile Account | Account Transactions. As you can see in the following screenshot, once you have found the offending item, you can check it off and then click on Remove & Redo. You will also find a Search button on the right-hand side, which will allow you to search for particular transactions rather than having to scroll through endless pages.

If you happen to be in the actual bank transaction when you identify a problem, then you can click on Options | Remove & Redo. This will push the transaction back into the bank account screen to reconcile again. Note that if you Remove & Redo a manually entered bank transaction, it will not reappear in the bank account for you to reconcile, as it was never there in the first place. What you will need to do is post the payment or receipt against the correct invoice or bill, and then manually mark it as reconciled again or create another spend or receive money transaction.

Manually marked as reconciled

A telltale sign that someone has inadvertently caused a problem is when you look at the Account Transactions tab in the bank account and there is a sea of green Reconciled statuses, and then you spot the odd black Reconciled status. This is an indication that something has been posted to that account and marked as reconciled manually. This will need investigating, as it may be genuine, such as some missing bank statement lines, or it could be that someone has made a mistake and the entry needs to be removed.
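If you can export both sides, the narrowing-down approach described under Missing statement lines is easy to mimic outside Xero. The sketch below is purely illustrative (hypothetical data, not a Xero feature): it walks through dated running balances to find the last date on which the bank statement and Xero agreed, and it flags possible duplicate statement lines along the way.

```python
from collections import Counter
from datetime import date

# Purely illustrative: given dated amounts exported from the bank statement and
# from Xero, find the last date on which the running balances still agreed, and
# flag possible duplicate statement lines. Hypothetical data, not a Xero feature.

bank_lines = [  # (date, payee, amount) as they appear on the bank statement
    (date(2015, 6, 1), "Xeroco",       -50.00),
    (date(2015, 6, 3), "Xeroco Fuel",  -40.00),
    (date(2015, 6, 3), "Xeroco Fuel",  -40.00),   # a suspicious duplicate
    (date(2015, 6, 5), "SMART Agency", -4500.00),
]
xero_lines = [
    (date(2015, 6, 1), "Xeroco",       -50.00),
    (date(2015, 6, 3), "Xeroco Fuel",  -40.00),
    (date(2015, 6, 5), "SMART Agency", -4500.00),
]

def running_balance(lines, as_at, opening=0.0):
    return round(opening + sum(a for d, _, a in lines if d <= as_at), 2)

def last_agreed_date(bank, xero, check_dates):
    agreed = None
    for d in sorted(check_dates):
        if running_balance(bank, d) == running_balance(xero, d):
            agreed = d
    return agreed

def possible_duplicates(lines):
    counts = Counter((d, p, a) for d, p, a in lines)
    return [item for item, n in counts.items() if n > 1]

dates_to_check = [date(2015, 6, 1), date(2015, 6, 3), date(2015, 6, 5)]
print(last_agreed_date(bank_lines, xero_lines, dates_to_check))  # 2015-06-01
print(possible_duplicates(bank_lines))
# [(datetime.date(2015, 6, 3), 'Xeroco Fuel', -40.0)]
```

Inside Xero itself, the same narrowing is done simply by running the reconciliation report to successive dates, as described above.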
Understanding reports

If the bank account does not reconcile and the cause is not obvious, then we would suggest looking at the bank statements imported into Xero to see whether anything stands out. Go to Accounts | Bank Accounts | Manage Account | Bank Statements. From this screen, have a look to see if there is any overlap in the dates imported, and then drill into the statements to check for anything that doesn't look right. If you do come across duplicated lines, you can remove them by checking the box on the left and then clicking on the Delete button. You can see below that the bank statement line has been grayed out and has a status of Deleted next to it. If you later discover that you have made a mistake, you can restore the bank statement line by checking the box again but clicking on Restore this time.

If you have incorrectly imported a duplicate statement, or the bank feed has done so, then rather than deleting the transactions one by one, you can choose to delete the entire statement. This can be achieved by clicking on the Delete Entire Statement button at the bottom-left of the Bank Statements screen. Make sure you have checked the Also delete reconciled transactions for this statement option before clicking Delete. If you are deleting the statement because it is incorrect, it only makes sense that you also clear any transactions associated with it to avoid further issues.

Summary

We have successfully added bank feeds that are now ready for automating the bank reconciliation process. In this article, we ran through the major bank functions, set up bank feeds, and explored how to set up bank rules to make the bank reconciliation task even easier and quicker. On top of that, we also looked at what could go wrong and, more importantly, how to identify errors and put them right. One of the biggest bookkeeping tasks you will undertake should now seem a lot easier.