
How-To Tutorials - Data


Machine Learning Algorithms: Implementing Naive Bayes with Spark MLlib

Wilson D'souza
07 Nov 2017
7 min read
[box type="note" align="" class="" width=""]In this article by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei from their book Apache Spark 2.x Machine Learning Cookbook, we look at how to implement Naïve Bayes classification algorithm with Spark 2.0 MLlib. The associated code and exercise are available at the end of the article.[/box] How to implement Naive Bayes with Spark MLlib Naïve Bayes is one of the most widely used classification algorithms which can be trained and optimized quite efficiently. Spark’s machine learning library, MLlib, primarily focuses on simplifying machine learning and has great support for multinomial naïve Bayes and Bernoulli naïve Bayes. Here we use the famous Iris dataset and use Apache Spark API NaiveBayes() to classify/predict which of the three classes of flower a given set of observations belongs to. This is an example of a multi-class classifier and requires multi-class metrics for measurements of fit. Let’s have a look at the steps to achieve this: For the Naive Bayes exercise, we use a famous dataset called iris.data, which can be obtained from UCI. The dataset was originally introduced in the 1930s by R. Fisher. The set is a multivariate dataset with flower attribute measurements classified into three groups. In short, by measuring four columns, we attempt to classify a species into one of the three classes of Iris flower (that is, Iris Setosa, Iris Versicolour, Iris Virginica).We can download the data from here: https://archive.ics.uci.edu/ml/datasets/Iris/  The column definition is as follows: Sepal length in cm Sepal width in cm Petal length in cm Petal width in cm  Class: -- Iris Setosa => Replace it with 0 -- Iris Versicolour => Replace it with 1 -- Iris Virginica => Replace it with 2 The steps/actions we need to perform on the data are as follows: Download and then replace column five (that is, the label or classification classes) with a numerical value, thus producing the iris.data.prepared data file. The Naïve Bayes call requires numerical labels and not text, which is very common with most tools. Remove the extra lines at the end of the file. Remove duplicates within the program by using the distinct() call. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included. Set up the package location where the program will reside: package spark.ml.cookbook.chapter6 Import the necessary packages for SparkSession to gain access to the cluster and Log4j.Logger to reduce the amount of output produced by Spark:  import org.apache.spark.mllib.linalg.{Vector, Vectors} import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel} import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics, MultilabelMetrics, binary} import org.apache.spark.sql.{SQLContext, SparkSession} import org.apache.log4j.Logger import org.apache.log4j.Level Initialize a SparkSession specifying configurations with the builder pattern, thus making an entry point available for the Spark cluster: val spark = SparkSession .builder .master("local[4]") .appName("myNaiveBayes08") .config("spark.sql.warehouse.dir", ".") .getOrCreate() val data = sc.textFile("../data/sparkml2/chapter6/iris.data.prepared.txt") Parse the data using map() and then build a LabeledPoint data structure. In this case, the last column is the Label and the first four columns are the features. 
Again, we replace the text in the last column (that is, the class of Iris) with numeric values (that is, 0, 1, 2) accordingly:

    val NaiveBayesDataSet = data.map { line =>
      val columns = line.split(',')
      LabeledPoint(columns(4).toDouble,
        Vectors.dense(columns(0).toDouble, columns(1).toDouble, columns(2).toDouble, columns(3).toDouble))
    }

Then make sure that the file does not contain any redundant rows. In this case, it has three redundant rows. We will use the distinct dataset going forward:

    println(" Total number of data vectors =", NaiveBayesDataSet.count())
    val distinctNaiveBayesData = NaiveBayesDataSet.distinct()
    println("Distinct number of data vectors = ", distinctNaiveBayesData.count())

Output:

    (Total number of data vectors =,150)
    (Distinct number of data vectors = ,147)

We inspect the data by examining the output:

    distinctNaiveBayesData.collect().take(10).foreach(println(_))

Output:

    (2.0,[6.3,2.9,5.6,1.8])
    (2.0,[7.6,3.0,6.6,2.1])
    (1.0,[4.9,2.4,3.3,1.0])
    (0.0,[5.1,3.7,1.5,0.4])
    (0.0,[5.5,3.5,1.3,0.2])
    (0.0,[4.8,3.1,1.6,0.2])
    (0.0,[5.0,3.6,1.4,0.2])
    (2.0,[7.2,3.6,6.1,2.5])
    ...

Split the data into training and test sets using a 30% and 70% ratio. The 13L in this case is simply a seeding number (L stands for the long data type) to make sure the result does not change from run to run when using the randomSplit() method:

    val allDistinctData = distinctNaiveBayesData.randomSplit(Array(.30,.70),13L)
    val trainingDataSet = allDistinctData(0)
    val testingDataSet = allDistinctData(1)

Print the count for each set:

    println("number of training data =",trainingDataSet.count())
    println("number of test data =",testingDataSet.count())

Output:

    (number of training data =,44)
    (number of test data =,103)

Build the model using train() and the training dataset:

    val myNaiveBayesModel = NaiveBayes.train(trainingDataSet)

Use the testing dataset plus the map() and predict() methods to classify the flowers based on their features:

    val predictedClassification = testingDataSet.map(x =>
      (myNaiveBayesModel.predict(x.features), x.label))

Examine the predictions via the output:

    predictedClassification.collect().foreach(println(_))

    (2.0,2.0)
    (1.0,1.0)
    (0.0,0.0)
    (0.0,0.0)
    (0.0,0.0)
    (2.0,2.0)
    ...

Use MulticlassMetrics() to create metrics for the multi-class classifier. As a reminder, this is different from the previous recipe, in which we used BinaryClassificationMetrics():

    val metrics = new MulticlassMetrics(predictedClassification)

Use the commonly used confusion matrix to evaluate the model:

    val confusionMatrix = metrics.confusionMatrix
    println("Confusion Matrix= \n", confusionMatrix)

Output:

    (Confusion Matrix= ,
    35.0  0.0   0.0
    0.0   34.0  0.0
    0.0   14.0  20.0 )

We examine other properties to evaluate the model:

    val myModelStat = Seq(metrics.precision, metrics.fMeasure, metrics.recall)
    myModelStat.foreach(println(_))

Output:

    0.8640776699029126
    0.8640776699029126
    0.8640776699029126

How it works...

We used the Iris dataset for this recipe, but we prepared the data ahead of time and then selected the distinct rows by using the NaiveBayesDataSet.distinct() API. We then proceeded to train the model using the NaiveBayes.train() API. In the last step, we predicted using .predict() and then evaluated the model performance via MulticlassMetrics() by outputting the confusion matrix, precision, and F-measure. The idea here was to classify the observations, based on a selected feature set (that is, feature engineering), into classes that correspond to the left-hand label.
The difference here is that we apply joint probability, given conditional probability, to the classification. This concept is known as Bayes' theorem, which was originally proposed by Thomas Bayes in the 18th century. There is a strong assumption of independence that must hold true for the underlying features in order for a Bayes classifier to work properly.

At a high level, the way we achieved this method of classification was simply to apply Bayes' rule to our dataset. As a refresher from basic statistics, Bayes' rule can be written as follows:

    P(A | B) = P(B | A) * P(A) / P(B)

The formula states that the probability of A given that B is true is equal to the probability of B given that A is true, times the probability of A being true, divided by the probability of B being true. It is a mouthful, but if we step back and think about it, it makes sense.

The Bayes classifier is a simple yet powerful classifier that allows the user to take the entire probability feature space into consideration. To appreciate its simplicity, one must remember that probability and frequency are two sides of the same coin. The Bayes classifier belongs to the class of incremental learners, which update themselves upon encountering a new sample. This allows the model to update itself on the fly as new observations arrive, rather than operating only in batch mode.

We evaluated the model with different metrics. Since this is a multi-class classifier, we have to use MulticlassMetrics() to examine model accuracy.

[box type="download" align="" class="" width=""]Download the exercise and code files here. Exercise Files_Implementing Naive Bayes algorithm with Spark MLlib[/box]

For more information on MulticlassMetrics, please see the following link:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.evaluation.MulticlassMetrics

Documentation for the constructor can be found here:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.classification.NaiveBayes

If you enjoyed this article, you should have a look at Apache Spark 2.x Machine Learning Cookbook, which contains this excerpt.
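To make the Bayes scoring idea above a little more concrete, here is a minimal, self-contained Scala sketch. It is not taken from the book: the class priors and per-feature likelihoods are invented numbers purely for illustration, and the object and value names are hypothetical. It simply shows how a naive Bayes score, prior times the product of conditional feature likelihoods, picks the most probable class; dividing by P(B) would not change the winner.

    // Hypothetical illustration of naive Bayes scoring (invented numbers, not from the book).
    object NaiveBayesRuleSketch extends App {
      // Assumed class priors P(class).
      val priors = Map("setosa" -> 0.33, "versicolour" -> 0.33, "virginica" -> 0.34)

      // Assumed likelihoods P(feature value | class) for two discretized features.
      val likelihoods = Map(
        "setosa"      -> Seq(0.70, 0.60),
        "versicolour" -> Seq(0.20, 0.30),
        "virginica"   -> Seq(0.10, 0.10))

      // Unnormalized posterior score per class: P(class) * product of feature likelihoods.
      val scores = priors.map { case (cls, prior) => cls -> prior * likelihoods(cls).product }

      // The predicted class is simply the one with the highest score.
      val predicted = scores.maxBy(_._2)._1
      println(s"scores = $scores, predicted class = $predicted")
    }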

Pattern mining using Spark MLlib - Part 2

Aarthi Kumaraswamy
06 Nov 2017
15 min read
[box type="note" align="" class="" width=""]The following is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. [/box] In part 1 of the tutorial, we motivated and introduced three pattern mining problems along with the necessary notation to properly talk about them. In part 2, we will now discuss how each of these problems can be solved with an algorithm available in Spark MLlib. As is often the case, actually applying the algorithms themselves is fairly simple due to Spark MLlib's convenient run method available for most algorithms. What is more challenging is to understand the algorithms and the intricacies that come with them. To this end, we will explain the three pattern mining algorithms one by one, and study how they are implemented and how to use them on toy examples. Only after having done all this will we apply these algorithms to a real-life data set of click events retrieved from http:/ / MSNBC. com. The documentation for the pattern mining algorithms in Spark can be found at https:/ / spark. apache. org/ docs/ 2. 1. 0/ mllib- frequent- pattern- mining. html. It provides a good entry point with examples for users who want to dive right in. Frequent pattern mining with FP-growth When we introduced the frequent pattern mining problem, we also quickly discussed a strategy to address it based on the apriori principle. The approach was based on scanning the whole transaction database again and again to expensively generate pattern candidates of growing length and checking their support. We indicated that this strategy may not be feasible for very large data. The so-called FP-growth algorithm, where FP stands for frequent pattern, provides an interesting solution to this data mining problem. The algorithm was originally described in Mining Frequent Patterns without Candidate Generation, available at https:/ /www. cs. sfu. ca/~jpei/ publications/sigmod00. pdf. We will start by explaining the basics of this algorithm and then move on to discussing its distributed version, parallel FP-growth, which has been introduced in PFP: Parallel FP-Growth for Query Recommendation, found at https:/ /static.googleusercontent.com/ media/research. google. com/en/ /pubs/ archive/ 34668. pdf. While Spark's implementation is based on the latter paper, it is best to first understand the baseline algorithm and extend from there. The core idea of FP-growth is to scan the transaction database D of interest precisely once in the beginning, find all the frequent patterns of length 1, and build a special tree structure called FP-tree from these patterns. Once this step is done, instead of working with D, we only do recursive computations on the usually much smaller FP-tree. This step is called the FP-growth step of the algorithm, since it recursively constructs trees from the subtrees of the original tree to identify patterns. We will call this procedure fragment pattern growth, which does not require us to generate candidates but is rather built on a divide-and-conquer strategy that heavily reduces the workload in each recursion step. To be more precise, let's first define what an FP-tree is and what it looks like in an example. Recall the example database we used in the last section, shown in Table 1. Our item set consisted of the following 15 grocery items, represented by their first letter: b, c, a, e, d, f, p, m, i, l, o, h, j, k, s. 
We also discussed the frequent items, that is, the patterns of length 1: for a minimum support threshold of t = 0.6, they were given by {f, c, b, a, m, p}. In FP-growth, we first use the fact that the ordering of items does not matter for the frequent pattern mining problem; that is, we can choose the order in which to present the frequent items. We do so by ordering them by decreasing frequency. To summarize the situation, let's have a look at the following table:

Transaction ID | Transaction | Ordered frequent items
1 | a, c, d, f, g, i, m, p | f, c, a, m, p
2 | a, b, c, f, l, m, o | f, c, a, b, m
3 | b, f, h, j, o | f, b
4 | b, c, k, s, p | c, b, p
5 | a, c, e, f, l, m, n, p | f, c, a, m, p

Table 3: Continuation of the example started with Table 1, augmenting the table by ordered frequent items.

As we can see, ordering the frequent items like this already helps us identify some structure. For instance, we see that the item set {f, c, a, m, p} occurs twice and is slightly altered once as {f, c, a, b, m}. The key idea of FP-growth is to use this representation to build a tree from the ordered frequent items that reflects the structure and interdependencies of the items in the third column of Table 3. Every FP-tree has a so-called root node that is used as a base for connecting the ordered frequent items as constructed. On the right of the following diagram, we see what is meant by this:

Figure 1: FP-tree and header table for our frequent pattern mining running example.

The left-hand side of Figure 1 shows a header table that we will explain and formalize in just a bit, while the right-hand side shows the actual FP-tree. For each of the ordered frequent item sets in our example, there is a directed path starting from the root, thereby representing it. Each node of the tree keeps track of not only the frequent item itself but also of the number of paths traversed through this node. For instance, four of the five ordered frequent item sets start with the letter f and one with c. Thus, in the FP-tree, we see f: 4 and c: 1 at the top level. Another interpretation of this fact is that f is a prefix for four item sets and c for one. For another example of this sort of reasoning, let's turn our attention to the lower left of the tree, that is, to the leaf node p: 2. Two occurrences of p tell us that precisely two identical paths end here, which we already know: {f, c, a, m, p} is represented twice. This observation is interesting, as it already hints at a technique used in FP-growth: starting at the leaf nodes of the tree, or the suffixes of the item sets, we can trace back each frequent item set, and the union of all these distinct root node paths yields all the paths, an important idea for parallelization.

The header table you see on the left of Figure 1 is a smart way of storing items. Note that by the construction of the tree, a node is not the same as a frequent item; rather, items can and usually do occur multiple times, namely once for each distinct path they are part of. To keep track of items and how they relate, the header table is essentially a linked list of items, that is, each item occurrence is linked to the next by means of this table. We indicated the links for each frequent item by horizontal dashed lines in Figure 1 for illustration purposes.

With this example in mind, let's now give a formal definition of an FP-tree. An FP-tree T is a tree that consists of a root node together with frequent item prefix subtrees starting at the root and a frequent item header table.
Each node of the tree consists of a triple, namely the item name, its occurrence count, and a node link referring to the next node of the same name, or null if there is no such next node.

To quickly recap: to build T, we start by computing the frequent items for the given minimum support threshold t, and then, starting from the root, insert each path represented by the sorted frequent pattern list of a transaction into the tree. Now, what do we gain from this? The most important property to consider is that all the information needed to solve the frequent pattern mining problem is encoded in the FP-tree T, because we effectively encode all co-occurrences of frequent items with repetition. Since T can also have at most as many nodes as there are occurrences of frequent items, T is usually much smaller than our original database D. This means that we have mapped the mining problem to a problem on a smaller data set, which in itself reduces the computational complexity compared with the naive approach sketched earlier.

Next, we'll discuss how to grow patterns recursively from fragments obtained from the constructed FP-tree. To do so, let's make the following observation. For any given frequent item x, we can obtain all the patterns involving x by following the node links for x, starting from the header table entry for x, and analyzing the respective subtrees.

To explain how exactly, we further study our example and, starting at the bottom of the header table, analyze the patterns containing p. From our FP-tree T, it is clear that p occurs in two paths: (f:4, c:3, a:3, m:3, p:2) and (c:1, b:1, p:1), following the node links for p. Now, in the first path, p occurs only twice, that is, there can be at most two total occurrences of the pattern {f, c, a, m, p} in the original database D. So, conditional on p being present, the paths involving p actually read as follows: (f:2, c:2, a:2, m:2, p:2) and (c:1, b:1, p:1). In fact, since we know we want to analyze patterns given p, we can shorten the notation a little and simply write (f:2, c:2, a:2, m:2) and (c:1, b:1). This is what we call the conditional pattern base for p. Going one step further, we can construct a new FP-tree from this conditional database. Conditioning on three occurrences of p, this new tree consists of only a single node, namely (c:3). This means that we end up with {c, p} as the single pattern involving p, apart from p itself. To have a better means of talking about this situation, we introduce the following notation: the conditional FP-tree for p is denoted by {(c:3)}|p.

To gain more intuition, let's consider one more frequent item and discuss its conditional pattern base. Continuing bottom to top and analyzing m, we again see two relevant paths: (f:4, c:3, a:3, m:2) and (f:4, c:3, a:3, b:1, m:1). Note that in the first path, we discard the p:2 at the end, since we have already covered the case of p. Following the same logic of reducing all other counts to the count of the item in question and conditioning on m, we end up with the conditional pattern base {(f:2, c:2, a:2), (f:1, c:1, a:1, b:1)}. The conditional FP-tree in this situation is thus given by {f:3, c:3, a:3}|m. It is now easy to see that every possible combination of m with each of f, c, and a forms a frequent pattern. The full set of patterns, given m, is thus {m}, {am}, {cm}, {fm}, {cam}, {fam}, {fcm}, and {fcam}.
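To make the notion of a conditional pattern base a little more tangible, here is a minimal, plain-Scala sketch (not from the book; the value names are hypothetical) that derives the conditional pattern base for m directly from the ordered frequent item lists in Table 3: for each list containing m, it keeps only the prefix that precedes m.

    // Hypothetical sketch: conditional pattern base for "m" from the Table 3 lists.
    val orderedFrequentItems = Seq(
      Seq("f", "c", "a", "m", "p"),
      Seq("f", "c", "a", "b", "m"),
      Seq("f", "b"),
      Seq("c", "b", "p"),
      Seq("f", "c", "a", "m", "p"))

    // Keep only lists containing "m", and only the items preceding "m" in each.
    val conditionalPatternBaseForM = orderedFrequentItems
      .filter(_.contains("m"))
      .map(_.takeWhile(_ != "m"))

    conditionalPatternBaseForM.foreach(println)
    // List(f, c, a)
    // List(f, c, a, b)
    // List(f, c, a)

The printed prefixes, two copies of (f, c, a) and one (f, c, a, b), correspond exactly to the conditional pattern base {(f:2, c:2, a:2), (f:1, c:1, a:1, b:1)} described in the text.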
By now, it should be clear how to continue, and we will not carry out this exercise in full but rather summarize its outcome in the following table:

Frequent pattern | Conditional pattern base | Conditional FP-tree
p | {(f:2, c:2, a:2, m:2), (c:1, b:1)} | {(c:3)}|p
m | {(f:2, c:2, a:2), (f:1, c:1, a:1, b:1)} | {f:3, c:3, a:3}|m
b | {(f:1, c:1, a:1), (f:1), (c:1)} | null
a | {(f:3, c:3)} | {(f:3, c:3)}|a
c | {(f:3)} | {(f:3)}|c
f | null | null

Table 4: The complete list of conditional FP-trees and conditional pattern bases for our running example.

As this derivation required a lot of attention to detail, let's take a step back and summarize the situation so far. Starting from the original FP-tree T, we iterated through all the items using node links. For each item x, we constructed its conditional pattern base and its conditional FP-tree. Doing so, we used the following two properties: we discarded all the items following x in each potential pattern, that is, we only kept the prefix of x; and we modified the item counts in the conditional pattern base to match the count of x. A path modified using these two properties is called the transformed prefix path of x.

To finally state the FP-growth step of the algorithm, we need two more fundamental observations that we have already implicitly used in the example. Firstly, the support of an item in a conditional pattern base is the same as that of its representation in the original database. Secondly, starting from a frequent pattern x in the original database and an arbitrary set of items y, we know that xy is a frequent pattern if and only if y is. These two facts can easily be derived in general, and should be clear from the preceding example. What this means is that we can completely focus on finding patterns in conditional pattern bases, as joining them with frequent patterns again yields patterns, and in this way we can find all the patterns. This mechanism of recursively growing patterns by computing conditional pattern bases is called pattern growth, which is why FP-growth bears its name. With all this in mind, we can now summarize the FP-growth procedure in pseudocode, as follows:

    def fpGrowth(tree: FPTree, i: Item):
      if (tree consists of a single path P) {
        compute transformed prefix path P' of P
        return all combinations p in P' joined with i
      } else {
        for each item in tree {
          newI = i joined with item
          construct conditional pattern base and conditional FP-tree newTree
          call fpGrowth(newTree, newI)
        }
      }

With this procedure, we can summarize our description of the complete FP-growth algorithm as follows: compute the frequent items from D and compute the original FP-tree T from them (FP-tree computation), then run fpGrowth(T, null) (FP-growth computation).

Having understood the base construction, we can now proceed to discuss a parallel extension of base FP-growth, that is, the basis of Spark's implementation. Parallel FP-growth, or PFP for short, is a natural evolution of FP-growth for parallel computing engines such as Spark. It addresses the following problems with the baseline algorithm:

Distributed storage: For frequent pattern mining, our database D may not fit into memory, which can already render FP-growth in its original form inapplicable. Spark does help in this regard for obvious reasons.

Distributed computing: With distributed storage in place, we will have to take care of parallelizing all the steps of the algorithm suitably as well, and PFP does precisely this.
Adequate support values: When dealing with finding frequent patterns, we usually do not want to set the minimum support threshold t too high, so as to find interesting patterns in the long tail. However, a small t might prevent the FP-tree from fitting into memory for a sufficiently large D, which would force us to increase t. PFP successfully addresses this problem as well, as we will see.

The basic outline of PFP, with a Spark implementation in mind, is as follows:

Sharding: Instead of storing our database D on a single machine, we distribute it to multiple partitions. Regardless of the particular storage layer, using Spark we can, for instance, create an RDD to load D.

Parallel frequent item count: The first step of computing the frequent items of D can be naturally performed as a map-reduce operation on an RDD.

Building groups of frequent items: The set of frequent items is divided into a number of groups, each with a unique group ID.

Parallel FP-growth: The FP-growth step is split into two phases to leverage parallelism. Map phase: the output of a mapper is a pair comprising the group ID and the corresponding transaction. Reduce phase: reducers collect data according to the group ID and carry out FP-growth on these group-dependent transactions.

Aggregation: The final step in the algorithm is the aggregation of results over group IDs.

Having already spent a lot of time on FP-growth on its own, instead of going into too many implementation details of PFP in Spark, let's see how to use the actual algorithm on the toy example that we have used throughout:

    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.rdd.RDD

    val transactions: RDD[Array[String]] = sc.parallelize(Array(
      Array("a", "c", "d", "f", "g", "i", "m", "p"),
      Array("a", "b", "c", "f", "l", "m", "o"),
      Array("b", "f", "h", "j", "o"),
      Array("b", "c", "k", "s", "p"),
      Array("a", "c", "e", "f", "l", "m", "n", "p")
    ))

    val fpGrowth = new FPGrowth()
      .setMinSupport(0.6)
      .setNumPartitions(5)
    val model = fpGrowth.run(transactions)

    model.freqItemsets.collect().foreach { itemset =>
      println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    }

The code is straightforward. We load the data into transactions and initialize Spark's FPGrowth implementation with a minimum support value of 0.6 and 5 partitions. This returns a model that we can run on the transactions constructed earlier. Doing so gives us access to the patterns or frequent item sets for the specified minimum support, by calling freqItemsets, which, printed in a formatted way, yields 18 patterns in total.

[box type="info" align="" class="" width=""]Recall that we have defined transactions as sets, and we often call them item sets. This means that within such an item set, a particular item can only occur once, and FPGrowth depends on this. If we were to replace, for instance, the third transaction in the preceding example with Array("b", "b", "h", "j", "o"), calling run on these transactions would throw an error message. We will see later on how to deal with such situations.[/box]

The above is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. To learn how to fully implement and deploy pattern mining applications in Spark, among other machine learning tasks using Spark, check out the book.
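As a small follow-up to the FP-growth example above, and as a preview of the association rules discussed in part 1, the sketch below shows one way the fitted model could also be used to derive rules. It assumes the same transactions and model values defined in the excerpt; the 0.8 minimum confidence threshold is an arbitrary illustrative choice, not a value from the book.

    // Hedged continuation of the example above: derive association rules from the
    // frequent item sets found by FP-growth. The 0.8 threshold is illustrative only.
    val minConfidence = 0.8
    model.generateAssociationRules(minConfidence).collect().foreach { rule =>
      println(
        rule.antecedent.mkString("[", ",", "]") + " => " +
        rule.consequent.mkString("[", ",", "]") + ", " + rule.confidence)
    }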

Pattern Mining using Spark MLlib - Part 1

Aarthi Kumaraswamy
03 Nov 2017
15 min read
[box type="note" align="" class="" width=""]The following two-part tutorial is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. [/box] When collecting real-world data between individual measures or events, there are usually very intricate and highly complex relationships to observe. The guiding example for this tutorial is the observation of click events that users generate on a website and its subdomains. Such data is both interesting and challenging to investigate. It is interesting, as there are usually many patterns that groups of users show in their browsing behavior and certain rules they might follow. Gaining insights about user groups, in general, is of interest, at least for the company running the website and might be the focus of their data science team. Methodology aside, putting a production system in place that can detect patterns in real time, for instance, to find malicious behavior, can be very challenging technically. It is immensely valuable to be able to understand and implement both the algorithmic and technical sides. In this tutorial, we will look into doing pattern mining in Spark. The tutorial is split up into two main sections. In the first, we will first introduce the three available pattern mining algorithms that Spark currently comes with and then apply them to an interesting dataset. In particular, you will learn the following from this two-part tutorial: The basic principles of frequent pattern mining. Useful and relevant data formats for applications. Understanding and comparing three pattern mining algorithms available in Spark, namely FP-growth, association rules, and prefix span. Frequent pattern mining When presented with a new data set, a natural sequence of questions is: What kind of data do we look at; that is, what structure does it have? Which observations in the data can be found frequently; that is, which patterns or rules can we identify within the data? How do we assess what is frequent; that is, what are the good measures of relevance and how do we test for it? On a very high level, frequent pattern mining addresses precisely these questions. While it's very easy to dive head first into more advanced machine learning techniques, these pattern mining algorithms can be quite informative and help build an intuition about the data. To introduce some of the key notions of frequent pattern mining, let's first consider a somewhat prototypical example for such cases, namely shopping carts. The study of customers being interested in and buying certain products has been of prime interest to marketers around the globe for a very long time. While online shops certainly do help in further analyzing customer behavior, for instance, by tracking the browsing data within a shopping session, the question of what items have been bought and what patterns in buying behavior can be found applies to purely offline scenarios as well. We will see a more involved example of clickstream data accumulated on a website soon; for now, we will work under the assumption that only the events we can track are the actual payment transactions of an item. Just this given data, for instance, for groceries shopping carts in supermarkets or online, leads to quite a few interesting questions, and we will focus mainly on the following three: Which items are frequently bought together? For instance, there is anecdotal evidence suggesting that beer and diapers are often brought together in one shopping session. 
Finding patterns of products that often go together may, for instance, allow a shop to physically place these products closer to each other for an improved shopping experience or promotional value, even if they don't belong together at first sight. In the case of an online shop, this sort of analysis might be the basis for a simple recommender system.

The second, building on the previous question: are there any interesting implications or rules to observe in shopping behavior? Continuing with the shopping cart example, can we establish associations such as: if bread and butter have been bought, we also often find cheese in the shopping cart? Finding such association rules can be of great interest, but this also needs more clarification of what we consider to be often, that is, what frequent means.

Note that, so far, our shopping carts were simply considered a bag of items without additional structure. At least in the online shopping scenario, we can endow the data with more information. One aspect we will focus on is the sequentiality of items; that is, we will take note of the order in which the products have been placed into the cart. With this in mind, similar to the first question, one might ask the third question: which sequences of items can often be found in our transaction data? For instance, larger electronic devices bought might be followed up by additional utility items.

The reason we focus on these three questions in particular is that Spark MLlib comes with precisely three pattern mining algorithms that roughly correspond to the aforementioned questions in their ability to answer them. Specifically, we will carefully introduce FP-growth, association rules, and prefix span, in that order, to address these problems and show how to solve them using Spark. Before doing so, let's take a step back and formally introduce the concepts we have motivated so far, alongside a running example. We will refer to the preceding three questions throughout the following subsection.

Pattern mining terminology

We will start with a set of items I = {a1, ..., an}, which serves as the base for all the following concepts. A transaction T is just a set of items in I, and we say that T is a transaction of length l if it contains l items. A transaction database D is a database of transaction IDs and their corresponding transactions.

To give a concrete example of this, consider the following situation. Assume that the full item set to shop from is given by I = {bread, cheese, ananas, eggs, donuts, fish, pork, milk, garlic, ice cream, lemon, oil, honey, jam, kale, salt}. Since we will look at a lot of item subsets, to make things more readable later on, we will simply abbreviate these items by their first letter, that is, we'll write I = {b, c, a, e, d, f, p, m, g, i, l, o, h, j, k, s}. Given these items, a small transaction database D could look as follows:

Transaction ID | Transaction
1 | a, c, d, f, g, i, m, p
2 | a, b, c, f, l, m, o
3 | b, f, h, j, o
4 | b, c, k, s, p
5 | a, c, e, f, l, m, n, p

Table 1: A small shopping cart database with five transactions

Frequent pattern mining problem

Given the definition of a transaction database, a pattern P is a transaction contained in the transactions in D, and the support, supp(P), of the pattern is the number of transactions for which this is true, divided (normalized) by the number of transactions in D:

    supp(s) = suppD(s) = |{ s' in D | s < s' }| / |D|

We use the < symbol to denote s as a subpattern of s' or, conversely, call s' a superpattern of s.
Note that in the literature you will sometimes also find a slightly different version of support that does not normalize the value. For example, the pattern {a, c, f} can be found in transactions 1, 2, and 5. This means that {a, c, f} is a pattern of support 0.6 in our database D of five transactions. Support is an important notion, as it gives us a first example of measuring the frequency of a pattern, which, in the end, is what we are after. In this context, for a given minimum support threshold t, we say P is a frequent pattern if and only if supp(P) is at least t. In our running example, the frequent patterns of length 1 and minimum support 0.6 are {a}, {b}, {c}, {p}, and {m} with support 0.6, and {f} with support 0.8. In what follows, we will often drop the brackets for items or patterns and write f instead of {f}, for instance.

Given a minimum support threshold, the problem of finding all the frequent patterns is called the frequent pattern mining problem, and it is, in fact, the formalized version of the first question above. Continuing with our example, we have already found all the frequent patterns of length 1 for t = 0.6. How do we find longer patterns? On a theoretical level, given unlimited resources, this is not much of a problem, since all we need to do is count the occurrences of items. On a practical level, however, we need to be smart about how we do so to keep the computation efficient. Especially for databases large enough for Spark to come in handy, it can be very computationally intensive to address the frequent pattern mining problem. One intuitive way to go about this is as follows:

Find all the frequent patterns of length 1, which requires one full database scan. This is how we started in our preceding example.

For patterns of length 2, generate all the combinations of frequent 1-patterns, the so-called candidates, and test whether they exceed the minimum support by doing another scan of D. Importantly, we do not have to consider the combinations of infrequent patterns, since patterns containing infrequent patterns cannot become frequent. This rationale is called the apriori principle.

For longer patterns, continue this procedure iteratively until there are no more patterns left to combine.

This algorithm, using a generate-and-test approach to pattern mining and utilizing the apriori principle to bound combinations, is called the apriori algorithm. There are many variations of this baseline algorithm, all of which share similar drawbacks in terms of scalability. For instance, multiple full database scans are necessary to carry out the iterations, which might already be prohibitively expensive for huge datasets. On top of that, generating the candidates themselves is already expensive, but computing their combinations might simply be infeasible. In the next section, we will see how a parallel version of an algorithm called FP-growth, available in Spark, can overcome most of the problems just discussed.

The association rule mining problem

To advance our general introduction of concepts, let's next turn to association rules, as first introduced in Mining Association Rules between Sets of Items in Large Databases, available at http://arbor.ee.ntu.edu.tw/~chyun/dmpaper/agrama93.pdf. In contrast to solely counting the occurrences of items in our database, we now want to understand the rules or implications of patterns.
That is, given a pattern P1 and another pattern P2, we want to know whether P2 is frequently present whenever P1 can be found in D, and we denote this by writing P1 ⇒ P2. To make this more precise, we need a concept of rule frequency similar to that of support for patterns, namely confidence. For a rule P1 ⇒ P2, confidence is defined as follows:

    conf(P1 ⇒ P2) = supp(P1 ∪ P2) / supp(P1)

This can be interpreted as the conditional support of P2 given P1; that is, if we were to restrict D to all the transactions supporting P1, the support of P2 in this restricted database would be equal to conf(P1 ⇒ P2). We call P1 ⇒ P2 a rule in D if it exceeds a minimum confidence threshold t, just as in the case of frequent patterns. Finding all the rules for a given confidence threshold represents the formal answer to the second question, association rule mining. Moreover, in this situation, we call P1 the antecedent and P2 the consequent of the rule. In general, there is no restriction imposed on the structure of either the antecedent or the consequent. However, in what follows, we will assume that the consequent has length 1, for simplicity.

In our running example, the pattern {f, m} occurs three times, while {f, m, p} is present in just two cases, which means that the rule {f, m} ⇒ {p} has confidence 2/3. If we set the minimum confidence threshold to t = 0.6, we can easily check that the following association rules with an antecedent and consequent of length 1 are valid for our case:

{a} ⇒ {c}, {a} ⇒ {f}, {a} ⇒ {m}, {a} ⇒ {p}
{c} ⇒ {a}, {c} ⇒ {f}, {c} ⇒ {m}, {c} ⇒ {p}
{f} ⇒ {a}, {f} ⇒ {c}, {f} ⇒ {m}
{m} ⇒ {a}, {m} ⇒ {c}, {m} ⇒ {f}, {m} ⇒ {p}
{p} ⇒ {a}, {p} ⇒ {c}, {p} ⇒ {f}, {p} ⇒ {m}

From the preceding definition of confidence, it should now be clear that it is relatively straightforward to compute the association rules once we have the support values of all the frequent patterns. In fact, as we will soon see, Spark's implementation of association rules is based on calculating frequent patterns upfront.

[box type="info" align="" class="" width=""]At this point, it should be noted that while we will restrict ourselves to the measures of support and confidence, there are many other interesting criteria available that we can't discuss in this book; for instance, the concepts of conviction, leverage, or lift. For an in-depth comparison of the other measures, refer to http://www.cse.msu.edu/~ptan/papers/IS.pdf.[/box]

The sequential pattern mining problem

Let's move on to formalizing the third and last pattern mining question we tackle in this chapter, and look at sequences in more detail. A sequence is different from the transactions we looked at before in that the order now matters. For a given item set I, a sequence s in I of length l is defined as follows:

    s = <s1, s2, ..., sl>

Here, each individual si is a concatenation of items, that is, si = (ai1 ... aim), where aij is an item in I. Note that we do care about the order of the sequence items si but not about the internal ordering of the individual aij within si. A sequence database S consists of pairs of sequence IDs and sequences, analogous to what we had before. An example of such a database can be found in the following table, in which the letters represent the same items as in our previous shopping cart example:

Sequence ID | Sequence
1 | <a(abc)(ac)d(cf)>
2 | <(ad)c(bc)(ae)>
3 | <(ef)(ab)(df)cb>
4 | <eg(af)cbc>

Table 2: A small sequence database with four short sequences.
In the example sequences, note the round brackets that group individual items into a sequence item. Also note that we drop these redundant brackets if the sequence item consists of a single item. Importantly, the notion of a subsequence requires a little more care than for unordered structures. We call u = <u1, ..., un> a subsequence of s = <s1, ..., sl> and write u < s if there are indices 1 ≤ i1 < i2 < ... < in ≤ l so that we have the following:

    u1 < si1, ..., un < sin

Here, the < signs in the last line mean that uj is a subpattern of sij. Roughly speaking, u is a subsequence of s if all the elements of u are subpatterns of s in their given order. Equivalently, we call s a supersequence of u. In the preceding example, we see that <a(ab)ac> and <a(cb)(ac)dc> are examples of subsequences of <a(abc)(ac)d(cf)> and that <(fa)c> is an example of a subsequence of <eg(af)cbc>.

With the help of the notion of supersequences, we can now define the support of a sequence s in a given sequence database S as follows:

    suppS(s) = supp(s) = |{ s' in S | s < s' }| / |S|

Note that, structurally, this is the same definition as for plain unordered patterns, but the < symbol means something else here, namely a subsequence. As before, we drop the database subscript in the notation of support if the information is clear from the context. Equipped with a notion of support, the definition of sequential patterns follows the previous definition completely analogously: given a minimum support threshold t, a sequence s in S is said to be a sequential pattern if supp(s) is greater than or equal to t. The formalization of the third question is called the sequential pattern mining problem, that is, to find the full set of sequences that are sequential patterns in S for a given threshold t.

Even in our little example with just four sequences, it can already be challenging to manually inspect all the sequential patterns. To give just one example of a sequential pattern of support 1.0, a subsequence of length 2 of all four sequences is <ac>. Finding all the sequential patterns is an interesting problem, and we will learn about the so-called prefix span algorithm that Spark employs to address it in the following section.

Next time, in part 2 of the tutorial, we will see how to use Spark to solve the above three pattern mining problems using the algorithms introduced. If you enjoyed this tutorial, an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava, check out the book for more.
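To make the definitions of support and confidence concrete, here is a minimal, self-contained Scala sketch (not from the book; object and value names are hypothetical) that computes both measures for the toy transaction database of Table 1 with plain collections, no Spark involved. It reproduces the numbers quoted above: supp({a, c, f}) = 0.6 and conf({f, m} ⇒ {p}) = 2/3.

    // Plain-Scala illustration of the support and confidence definitions for Table 1.
    object SupportConfidenceSketch extends App {
      // The five transactions from Table 1, each encoded as a set of single-letter items.
      val transactions: Seq[Set[Char]] = Seq(
        "acdfgimp", "abcflmo", "bfhjo", "bcksp", "aceflmnp"
      ).map(_.toSet)

      // supp(P): fraction of transactions that contain every item of the pattern P.
      def support(pattern: Set[Char]): Double =
        transactions.count(t => pattern.subsetOf(t)).toDouble / transactions.size

      // conf(P1 => P2) = supp(P1 union P2) / supp(P1).
      def confidence(antecedent: Set[Char], consequent: Set[Char]): Double =
        support(antecedent ++ consequent) / support(antecedent)

      println(support(Set('a', 'c', 'f')))          // 0.6
      println(confidence(Set('f', 'm'), Set('p')))  // 0.666... = 2/3
    }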

Building a classification system with Decision Trees in Apache Spark 2.0

Wilson D'souza
02 Nov 2017
9 min read
[box type="note" align="" class="" width=""]In this article by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei from their book Apache Spark 2.x Machine Learning Cookbook we shall explore how to build a classification system with decision trees using Spark MLlib library. The code and data files are available at the end of the article.[/box] A decision tree in Spark is a parallel algorithm designed to fit and grow a single tree into a dataset that can be categorical (classification) or continuous (regression). It is a greedy algorithm based on stumping (binary split, and so on) that partitions the solution space recursively while attempting to select the best split among all possible splits using Information Gain Maximization (entropy based). Apache Spark provides a good mix of decision tree based algorithms fully capable of taking advantage of parallelism in Spark. The implementation ranges from the straightforward Single Decision Tree (the CART type algorithm) to Ensemble Trees, such as Random Forest Trees and GBT (Gradient Boosted Tree). They all have both the variant flavors to facilitate classification (for example, categorical, such as height = short/tall) or regression (for example, continuous, such as height = 2.5 meters). Getting and preparing real-world medical data for exploring Decision Trees in Spark 2.0 To explore the real power of decision trees, we use a medical dataset that exhibits real life non-linearity with a complex error surface. The Wisconsin Breast Cancer dataset was obtained from the University of Wisconsin Hospital from Dr. William H Wolberg. The dataset was gained periodically as Dr. Wolberg reported his clinical cases. The dataset can be retrieved from multiple sources, and is available directly from the University of California Irvine's webserver http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wi sconsin/breast-cancer-wisconsin.data The data is also available from the University of Wisconsin's web Server: ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/cancer1/ datacum The dataset currently contains clinical cases from 1989 to 1991. It has 699 instances, with 458 classified as benign tumors and 241 as malignant cases. Each instance is described by nine attributes with an integer value in the range of 1 to 10 and a binary class label. Out of the 699 instances, there are 16 instances that are missing some attributes. We will remove these 16 instances from the memory and process the rest (in total, 683 instances) for the model calculations. The sample raw data looks like the following: 1000025,5,1,1,1,2,1,3,1,1,2 1002945,5,4,4,5,7,10,3,2,1,2 1015425,3,1,1,1,2,2,3,1,1,2 1016277,6,8,8,1,3,4,3,7,1,2 1017023,4,1,1,3,2,1,3,1,1,2 1017122,8,10,10,8,7,10,9,7,1,4 ... 
The attribute information is as follows:

# | Attribute | Domain
1 | Sample code number | ID number
2 | Clump Thickness | 1 - 10
3 | Uniformity of Cell Size | 1 - 10
4 | Uniformity of Cell Shape | 1 - 10
5 | Marginal Adhesion | 1 - 10
6 | Single Epithelial Cell Size | 1 - 10
7 | Bare Nuclei | 1 - 10
8 | Bland Chromatin | 1 - 10
9 | Normal Nucleoli | 1 - 10
10 | Mitoses | 1 - 10
11 | Class | 2 for benign, 4 for malignant

Presented in the correct columns (ID Number, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nucleoli, Bland Chromatin, Normal Nucleoli, Mitoses, Class), the data looks like the following:

    1000025 5 1 1 1 2 1 3 1 1 2
    1002945 5 4 4 5 7 10 3 2 1 2
    1015425 3 1 1 1 2 2 3 1 1 2
    1016277 6 8 8 1 3 4 3 7 1 2
    1017023 4 1 1 3 2 1 3 1 1 2
    1017122 8 10 10 8 7 10 9 7 1 4
    1018099 1 1 1 1 2 10 3 1 1 2
    1018561 2 1 2 1 2 1 3 1 1 2
    1033078 2 1 1 1 2 1 1 1 5 2
    1033078 4 2 1 1 2 1 2 1 1 2
    1035283 1 1 1 1 1 1 3 1 1 2
    1036172 2 1 1 1 2 1 2 1 1 2
    1041801 5 3 3 3 2 3 4 4 1 4
    1043999 1 1 1 1 2 3 3 1 1 2
    1044572 8 7 5 10 7 9 5 5 4 4
    ...

We will now use the breast cancer data and classification to demonstrate the decision tree implementation in Spark. We will use information gain and Gini to show how to use the facilities already provided by Spark to avoid redundant coding. This exercise attempts to fit a single tree, using binary classification, to train and predict the label (benign (0.0) or malignant (1.0)) for the dataset.

Implementing Decision Trees in Apache Spark 2.0

Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.

Set up the package location where the program will reside:

    package spark.ml.cookbook.chapter10

Import the necessary packages for the Spark context to get access to the cluster and Log4j.Logger to reduce the amount of output produced by Spark:

    import org.apache.spark.mllib.evaluation.MulticlassMetrics
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.model.DecisionTreeModel
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession
    import org.apache.log4j.{Level, Logger}

Create Spark's configuration and the Spark session so we can have access to the cluster:

    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("MyDecisionTreeClassification")
      .config("spark.sql.warehouse.dir", ".")
      .getOrCreate()

We read in the original raw data file:

    val rawData = spark.sparkContext.textFile("../data/sparkml2/chapter10/breast-cancer-wisconsin.data")

We pre-process the dataset:

    val data = rawData.map(_.trim)
      .filter(text => !(text.isEmpty || text.startsWith("#") || text.indexOf("?") > -1))
      .map { line =>
        val values = line.split(',').map(_.toDouble)
        val slicedValues = values.slice(1, values.size)
        val featureVector = Vectors.dense(slicedValues.init)
        val label = values.last / 2 - 1
        LabeledPoint(label, featureVector)
      }

First, we trim the line and remove any empty spaces. Once the line is ready for the next step, we remove the line if it is empty or if it contains missing values ("?"). After this step, the 16 rows with missing data will be removed from the in-memory dataset. We then read the comma-separated values into an RDD. Since the first column in the dataset only contains the instance's ID number, it is better to remove this column from the real calculation.
We slice it out with the following command, which removes the first column from the RDD:

    val slicedValues = values.slice(1, values.size)

We then put the rest of the numbers into a dense vector. Since the Wisconsin Breast Cancer dataset's classifier is either a benign case (last column value = 2) or a malignant case (last column value = 4), we convert the preceding value using the following command:

    val label = values.last / 2 - 1

So the benign case value 2 is converted to 0, and the malignant case value 4 is converted to 1, which makes the later calculations much easier. We then put the preceding row into a labeled point:

    Raw data: 1000025,5,1,1,1,2,1,3,1,1,2
    Processed data: 5,1,1,1,2,1,3,1,1,0
    Labeled point: (0.0, [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0])

We verify the raw data count and the processed data count:

    println(rawData.count())
    println(data.count())

And you will see the following on the console:

    699
    683

We split the whole dataset into training data (70%) and test data (30%) randomly. Please note that the random split will generate around 211 test data points. It is approximately, but NOT exactly, 30% of the dataset:

    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

We define a metrics calculation function, which utilizes Spark's MulticlassMetrics:

    def getMetrics(model: DecisionTreeModel, data: RDD[LabeledPoint]): MulticlassMetrics = {
      val predictionsAndLabels = data.map(example =>
        (model.predict(example.features), example.label)
      )
      new MulticlassMetrics(predictionsAndLabels)
    }

This function reads in the model and the test dataset, and creates a metrics object which contains the confusion matrix mentioned earlier. It also contains the model accuracy, which is one of the indicators for a classification model.

We define an evaluate function, which takes some tunable parameters for the decision tree model and performs the training on the dataset:

    def evaluate(
      trainingData: RDD[LabeledPoint],
      testData: RDD[LabeledPoint],
      numClasses: Int,
      categoricalFeaturesInfo: Map[Int, Int],
      impurity: String,
      maxDepth: Int,
      maxBins: Int): Unit = {
      val model = DecisionTree.trainClassifier(trainingData, numClasses,
        categoricalFeaturesInfo, impurity, maxDepth, maxBins)
      val metrics = getMetrics(model, testData)
      println("Using Impurity :" + impurity)
      println("Confusion Matrix :")
      println(metrics.confusionMatrix)
      println("Decision Tree Accuracy: " + metrics.precision)
      println("Decision Tree Error: " + (1 - metrics.precision))
    }

The evaluate function reads in several parameters, including the impurity type (Gini or entropy for the model), and generates the metrics for evaluation.

We set the following parameters:

    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val maxDepth = 5
    val maxBins = 32

Since we only have benign (0.0) and malignant (1.0), we set numClasses to 2. The other parameters are tunable, and some of them are algorithm stopping criteria.
We evaluate the Gini impurity first:

    evaluate(trainingData, testData, numClasses, categoricalFeaturesInfo, "gini", maxDepth, maxBins)

From the console output:

    Using Impurity :gini
    Confusion Matrix :
    115.0  5.0
    0      88.0
    Decision Tree Accuracy: 0.9620853080568721
    Decision Tree Error: 0.03791469194312791

To interpret the above confusion matrix: accuracy is equal to (115 + 88) / 211 over all test cases, and error is equal to 1 - accuracy.

We evaluate the entropy impurity next:

    evaluate(trainingData, testData, numClasses, categoricalFeaturesInfo, "entropy", maxDepth, maxBins)

From the console output:

    Using Impurity :entropy
    Confusion Matrix :
    116.0  4.0
    9.0    82.0
    Decision Tree Accuracy: 0.9383886255924171
    Decision Tree Error: 0.06161137440758291

To interpret the preceding confusion matrix: accuracy is equal to (116 + 82) / 211 over all test cases, and error is equal to 1 - accuracy.

We then close the program by stopping the session:

    spark.stop()

How it works...

The dataset is a bit more complex than usual, but apart from some extra steps, parsing it remains the same as in other recipes presented in previous chapters. The parsing takes the data in its raw form and turns it into an intermediate format which ends up as a LabeledPoint data structure, which is common in Spark ML schemes:

    Raw data: 1000025,5,1,1,1,2,1,3,1,1,2
    Processed data: 5,1,1,1,2,1,3,1,1,0
    Labeled point: (0.0, [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0])

We use DecisionTree.trainClassifier() to train the classifier tree on the training set. We follow that by examining the various impurity and confusion matrix measurements to demonstrate how to measure the effectiveness of a tree model. The reader is encouraged to look at the output and consult additional machine learning books to understand the concepts of the confusion matrix and impurity measurement, and to master decision trees and their variations in Spark.

There's more...

To visualize it better, we included a sample decision tree workflow in Spark which reads the data into Spark first. In our case, we create the RDD from the file. We then split the dataset into training data and test data using a random sampling function. After the dataset is split, we use the training dataset to train the model, followed by the test data to test the accuracy of the model. A good model should have a meaningful accuracy value (close to 1). The following figure depicts the workflow:

A sample tree was generated based on the Wisconsin Breast Cancer dataset. The red spots represent malignant cases, and the blue ones the benign cases. We can examine the tree visually in the following figure:

[box type="download" align="" class="" width=""]Download the code and data files here: classification system with Decision Trees in Apache Spark_excercise files[/box]

If you liked this article, please be sure to check out Apache Spark 2.x Machine Learning Cookbook, which consists of this article and many more useful techniques on implementing machine learning solutions with the MLlib library in Apache Spark 2.0.
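As a small follow-up to the evaluation above, the hedged sketch below (not part of the book recipe) shows two additional, commonly used ways to inspect an MLlib decision tree model: printing per-class precision and recall from a MulticlassMetrics object, and dumping the learned split structure with toDebugString. It assumes the trainingData, testData, and getMetrics() values defined in the recipe are in scope, and it simply reuses the parameter values already chosen there.

    // Hedged sketch: train one more tree with the recipe's parameters, then inspect it.
    // Assumes trainingData, testData, and getMetrics() from the recipe are in scope.
    val inspectedModel = DecisionTree.trainClassifier(
      trainingData, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
      impurity = "gini", maxDepth = 5, maxBins = 32)

    val inspectedMetrics = getMetrics(inspectedModel, testData)

    // Precision and recall broken out per label (0.0 = benign, 1.0 = malignant).
    inspectedMetrics.labels.foreach { label =>
      println(s"label $label precision = ${inspectedMetrics.precision(label)}")
      println(s"label $label recall    = ${inspectedMetrics.recall(label)}")
    }

    // Human-readable view of the tree's split decisions.
    println(inspectedModel.toDebugString)

Per-class recall is particularly relevant for this dataset, since a false negative (a malignant case predicted as benign) is far more costly than a false positive, and the overall accuracy number alone hides that distinction.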

Building Motion Charts with Tableau

Ashwin Nair
31 Oct 2017
4 min read
[box type="info" align="" class="" width=""]The following is an excerpt from the book Tableau 10 Bootcamp, Chapter 2, Interactivity – written by Joshua N. Milligan and Donabel Santos. It offers intensive training on Data Visualization and Dashboarding with Tableau 10. In this article, we will learn how to build motion charts with Tableau.[/box] Tableau is an amazing platform for achieving incredible data discovery, analysis, and Storytelling. It allows you to build fully interactive dashboards and stories with your visualizations and insights so that you can share the data story with others. Creating Motion Charts with Tableau Let`s learn how to build motion charts with Tableau. A motion chart, as its name suggests, is a chart that displays the entire trail of changes in data over time by showing movement using the X and Y-axes. It is very much similar to the doodles in our notebooks which seem to come to life after flipping through the pages. It is amazing to see the same kind of movement in action in Tableau using the Pagesshelf. It is work that feels like play. On the Pages shelf, when you drop a field, Tableau creates a sequence of pages that filters the view for each value in that field. Tableau's page control allows us to flip pages, enabling us to see our view come to life. With three predefined speed settings, we can control the speed of the flip. The three settings include one that relates to the slowest speed, the others to the fastest speed. We can also format the marks and show the marks or trails, or both, using page control. In our viz, we have used a circle for marking each year. The circle that moves to a new position each year represents the specific country's new population value. These circles are all connected by trail lines that enable us to simulate a moving time series graph by setting the  mark and trail histories both to show in page control: Let's create an animated motion chart showing the population change over the years for a selected few countries: Open the Motion Chart worksheet and connect to the CO2 (Worldbank) data Source: Open Dimensions and drag Year to the Columns shelf. Open Measures and drag CO2 Emission to the Rows shelf. Right-click on the CO2 Emission axis, and change the title to CO2 Emission (metric tons per capita): In the Marks card, click on the dropdown to change the mark from Automatic to Circle. Open Dimensions and drag Country Name to Color in the Marks card. Also, drag Country Name to the Filter shelf from Dimensions Under the General tab of the Filter window, while the Select from list radio button is selected, select None. Select the Custom value list radio button, still under the General tab, and add China, Trinidad and Tobago, and United States: Click OK when done. This should close the Filter window. Open Dimensions and drag Year to Pages for adding a page control to the view. Click on the Show history checkbox to select it. Click on the drop-down beside Show history and perform the following steps: Select All for Marks to show history for Select Both for Show Using the Year page control, click on the forward arrow to play. This shows the change in the population of the three selected countries over the years. 
[box type="info" align="" class="" width=""]Tip -  In case you ever want to loopback the animation, you can click on the dropdown on the top-right of your page control card, and select Loop Playback:[/box] Note that Tableau Server does not support the animation effect that you see when working on motion charts with Tableau Desktop. Tableau strives for zero footprints when serving the charts and dashboards on the server so that there is no additional download to enable the functionalities. So, the play control does not work the same. No need to fret though. You can click manually on the slider and have a similar effect.  If you liked the above excerpt from the book Tableau 10 Bootcamp, check out the book to learn more data visualization techniques.
(13*3)+ Halloween costume ideas for Data science nerds

Packt Editorial Staff
31 Oct 2017
14 min read
Are you a data scientist, a machine learning engineer, an AI researcher or simply a data enthusiast? Channel the inner data science nerd within you with these geeky ideas for your Halloween costumes! The Data Science Spectrum Don't know what to go as to this evening's party because you've been busy cleaning that terrifying data? Don’t worry, here are some easy-to-put-together Halloween costume ideas just for you. [dropcap]1[/dropcap] Big Data Go as Baymax, the healthcare robot, (who can also turn into battle mode when required). Grab all white clothes that you have. Stuff your tummy with some pillows and wear a white mask with cutouts for eyes. You are all ready to save the world. In fact, convince a friend or your brother to go as Hiro! [dropcap]2[/dropcap] A.I. agent Enter as Agent Smith, the AI antagonist, this Halloween. Lure everyone with your bold black suit paired with a white shirt and a black tie. A pair of polarized sunglasses would replicate you as the AI agent. Capture the crowd by being the most intelligent and cold-hearted personality of all. [dropcap]3[/dropcap] Data Miner Put on your dungaree with a tee. Fix a flashlight atop your cap. Grab a pickaxe from the gardening toolkit, if you have one. Stripe some mud onto your face. Enter the party wheeling with loads of data boxes that you have freshly mined. You’ll definitely grab some traffic for data. Unstructured data anyone? [dropcap]4[/dropcap] Data Lake Go as a Data lake this Halloween. Simply grab any blue item from your closet. Draw some fishes, crabs, and weeds. (Use a child’s marker for that). After all, it represents the data you have. And you’re all set. [dropcap]5[/dropcap] Dark Data Unleash the darkness within your soul! Just kidding. You don’t actually have to turn to the evil side. Just coming up with your favorite black-costume character would do. Looking for inspiration? Maybe, a witch, The dark knight, or The Darth Vader. [dropcap]6[/dropcap] Cloud A fluffy, white cloud is what you need to be this Halloween. Raid your nearby drug store for loads of cotton balls. Better still, tear up that old pillow you have been meaning to throw away for a while. Use the fiber inside to glue onto an unused tee. You will be the cutest cloud ever seen. Don’t forget to carry an umbrella in case you turn grey! [dropcap]7[/dropcap] Predictive Analytics Make your own paper wizard hat with silver stars and moons pasted on it. If you can arrange for an advocate gown, it would be great. Else you could use a long black bed sheet as a cape. And most importantly, a crystal ball to show off some prediction stunts at the Halloween. [dropcap]8[/dropcap] Gradient boosting Enter Halloween as the energy booster. Wear what you want. Grab loads of empty energy drink tetra packs and stick it all over you. Place one on your head too. Wear a nameplate that says “ G-booster Energy drink”. Fuel up some weak models this Halloween. [dropcap]9[/dropcap] Cryptocurrency Wear head to toe black. In fact, paint your face black as well, like the Grim reaper. Then grab a cardboard piece. Cut out a circle, paint it orange, and then draw a gold B symbol, just like you see in a bitcoin. This Halloween costume will definitely grab you the much-needed attention just as this popular cryptocurrency. [dropcap]10[/dropcap] IoT Are you a fan of IoT and the massive popularity it has gained? Then you should definitely dress up as your web-slinging, friendly neighborhood Spiderman. Just grab a spiderman costume from any costume store and attach some handmade web slings. 
Remember to connect with people by displaying your IoT knowledge. [dropcap]11[/dropcap] Self-driving car Choose a mono-color outfit of your choice (P.S. The color you would choose for your car). Cut out four wheels and paste two on your lower calves and two on your arms. Cut out headlights too. Put on a wiper goggle. And yes you do not need a steering wheel or the brakes, clutch and the accelerator. Enter the Halloween at your own pace, go self-driving this Halloween. Bonus point: You can call yourself Bumblebee or Optimus Prime. Machine Learning and Deep learning Frameworks If machine learning or deep learning is your forte, here are some fresh Halloween costume ideas based on some of the popular frameworks in that space. [dropcap]12[/dropcap] Torch Flame up the party with a costume inspired by the fantastic four superhero, Johnny Storm a.k.a The Human Torch. Wear a yellow tee and orange slacks. Draw some orange flames on your tee. And finally, wear a flame-inspired headband. Someone is a hot machine learning library! [dropcap]13[/dropcap] TensorFlow No efforts for this one. Just arrange for a pumpkin costume, paste a paper cut-out of the TensorFlow logo and wear it as a crown. Go as the most powerful and widely popular deep learning library. You will be the star of the Halloween as you are a Google Kid. [dropcap]14[/dropcap] Caffe Go as your favorite Starbucks coffee this Halloween. Wear any of your brown dress/ tee. Draw or stick a Starbucks logo. And then add frothing to the top by bunching up a cream-colored sheet. Mamma Mia! [dropcap]15[/dropcap] Pandas Go as a Panda this Halloween! Better still go as a group of Pandas. The best option is to buy a panda costume. But if you don’t want that, wear a white tee, black slacks, black goggles and some cardboard cutouts for ears. This will make you not only the cutest animal in the party but also a top data manipulation library. Good luck finding your python in the party by the way. [dropcap]16[/dropcap] Jupyter Notebook Go as a top trending open-source web application by dressing up as the largest planet in our solar system. People would surely be intimidated by your mass and also by your computing power. [dropcap]17[/dropcap] H2O Go to Halloween as a world famous open source deep learning platform. No, no, you don’t have to go as the platform itself. Instead go as the chemical alter-ego, water. Wear all blue and then grab some leftover asymmetric, blue cloth pieces to stick at your sides. Thirsty anyone? Data Viz & Analytics Tools If you’re all about analytics and visualization, grab the attention of every data geek in your party by dressing up as your favorite data insight tools. [dropcap]18[/dropcap] Excel Grab an old white tee and paint some green horizontal stripes. You’re all ready to go as the most widely used spreadsheet. The simplest of costumes, yet the most useful - a timeless classic that never goes out of fashion. [dropcap]19[/dropcap] MatLab If you have seriously run out of all costume ideas, going out as MatLab is your only solution. Just grab a blue tablecloth. Stick or sew it with some orange curtain and throw it over your head. You’re all ready to go as the multi-paradigm numerical computing environment. [dropcap]20[/dropcap] Weka Wear a brown overall, a brown wig, and paint your face brown. Make an orange beak out of a chart paper, and wear a pair orange stockings/ socks with your trousers tucked in. You are all set to enter as a data mining bird with ML algorithms and Java under your wings. 
[dropcap]21[/dropcap] Shiny Go all Shimmery!! Get some glitter powder and put it all over you. (You’ll have a tough time removing it though). Else choose a glittery outfit, with glittery shoes, and touch-up with some glitter on your face. Let the party see the bling of R that you bring. You will be the attractive storyteller out there. [dropcap]22[/dropcap] Bokeh A colorful polka-dotted outfit and some dim lights to do the magic. You are all ready to grab the show with such a dazzle. Make sure you enter the party gates with Python. An eye-catching beauty with the beast pair. [dropcap]23[/dropcap] Tableau Enter the Halloween as one of your favorite characters from history. But there is a term and condition for this: You cannot talk or move. Enjoy your Halloween by being still. Weird, but you’ll definitely grab everyone’s eye. [dropcap]24[/dropcap] Microsoft Power BI Power up your Halloween party by entering as a data insights superhero. Wear a yellow turtleneck, a stylish black leather jacket, black pants, some mid-thigh high boots and a slick attitude. You’re ready to save your party! Data Science oriented Programming languages These hand-picked Halloween costume ideas are for you if you consider yourself a top coder. By a top coder we mean you’re all about learning new programming languages in your spare and, well, your not so spare time.   [dropcap]25[/dropcap] Python Easy peasy as the language looks, the reptile is not that easy to handle. A pair of python-printed shirt and trousers would do the job. You could be getting more people giving you candies some out of fear, other out of the ease. Definitely, go as a top trending and a go-to language which everyone loves! And yes, don’t forget the fangs. [dropcap]26[/dropcap] R Grab an eye patch and your favorite leather pants. Wear a loose white shirt with some rugged waistcoat and a sword. Here you are all decked up as a pirate for your next loot. You’ll surely thank me for giving you a brilliant Halloween idea. But yes! Don’t forget to make that Arrrr (R) noise! [dropcap]27[/dropcap] Java Go as a freshly roasted coffee bean! People in your Halloween party would be allured by your aroma. They would definitely compliment your unique idea and also the fact that you’re the most popular programming language. [dropcap]28[/dropcap] SAS March in your Halloween party up as a Special Airforce Service (SAS) agent. You would be disciplined, accurate, precise and smart. Just like the advanced software suite that goes by the same name. You would need a full black military costume, with a gas mask, some fake ammunition from a nearby toy store, and some attitude of course! [dropcap]29[/dropcap] SQL If you pride yourself on being very organized or are a stickler for the rules, you should go as SQL this Halloween. Prep-up yourself with an overall blue outfit. Spike up your hair and spray some temporary green hair color. Cut out bold letters S, Q, and L from a plain white paper and stick them on your chest. You are now ready to enter the Halloween party as the most popular database of all times. Sink in all the data that you collect this Halloween. [dropcap]30[/dropcap] Scala If Scala is your favorite programming language, add a spring to your Halloween by going as, well, a spring! Wear the brightest red that you have. Using a marker, draw some swirls around your body (You can ask your mom to help). Just remember to elucidate a 3D picture. And you’re all set. 
[dropcap]31[/dropcap] Julia If you want to make a red carpet entrance to your Halloween party, go as the Academy award-winning actress, Julia Roberts. You can even take up inspiration from her character in the 90s hit film Pretty Woman. For extra oomph, wear a pink, red, and purple necklace to highlight the Julia programming language [dropcap]32[/dropcap] Ruby Act pricey this Halloween. Be the elegant, dynamic yet simple programming language. Go blood red, wear on your brightest red lipstick, red pumps, dazzle up with all the red accessories that you have. You’ll definitely gather some secret admirers around the hall. [dropcap]33[/dropcap] Go Go as the mascot of Go, the top trending programming language. All you need is a blue mouse costume. Fear not if you don’t have one. Just wear a powder blue jumpsuit, grab a baby pink nose, and clip on a fake single, large front tooth. Ready for the party! [dropcap]34[/dropcap] Octave Go as a numerically competent programming language. And if that doesn’t sound very trendy, go as piano keys depicting an octave. You simply need to wear all white and divide your space into 8 sections. Then draw 5 horizontal black stripes. You won’t be able to do that vertically, well, because they are a big number. Here you go, you’re all set to fill the party with your melody. Fancy an AI system inspired Halloween costume? This is for you if you love the way AI works and the enigma that it has thrown around the world. This is for you if you are spellbound with AI magic. You should go dressed as one of these at your Halloween party this season. Just pick up the AI you want to look like and follow as advised. [dropcap]35[/dropcap] IBM Watson Wear a dark blue hat, a matching long overcoat, a vest and a pale blue shirt with a dark tie tucked into the vest. Complement it with a mustache and a brooding look. You are now ready to be IBM Watson at your Halloween party. [dropcap]36[/dropcap] Apple Siri If you want to be all cool and sophisticated like the Apple’s Siri, wear an alluring black turtleneck dress. Don’t forget to carry your latest iPhone and air pods. Be sure you don’t have a sore throat, in case someone needs your assistance. [dropcap]37[/dropcap] Microsoft Cortana If Microsoft Cortana is your choice of voice assistant, dress up as Cortana, the fictional synthetic intelligence character in the Halo video game series. Wear a blue bodysuit. Get a bob if you’re daring. (A wig would also do). Paint some dark blue robot like designs over your body and well, your face. And you’re all set. [dropcap]38[/dropcap] Salesforce Einstein Dress up as the world’s most famous physicist and also an AI-powered CRM. How? Just grab a white shirt, a blue pullover and a blue tie (Salesforce colors). Finish your look with a brown tweed coat, brown pants and shoes, a rugged white wig and mustache, and a deep thought on your face. [dropcap]39[/dropcap] Facebook Jarvis Get inspired by the Iron man’s Jarvis, the coolest A.I. in the Marvel universe. Just grab a plexiglass, draw some holograms and technological symbols over it with a neon marker. (Try to keep the color palette in shades of blues and reds). And fix this plexiglass in a curved fashion in front of your face by a headband. Do practice saying “Hello Mr. Stark.”  [dropcap]40[/dropcap] Amazon Echo This is also an easy one. Grab a long, black chart paper. Roll it around in a tube form around your body. Draw the Amazon symbol at the bottom with some glittery, silver sketch pen, color your hair blue, and there you go. 
If you have a girlfriend, convince her to go as Amazon Alexa. [dropcap]41[/dropcap] SAP Leonardo Put on a hat, wear a long cloak, some fake overgrown mustache, and beard. Accessorize with a color palette and a paintbrush. You will be the Leonardo da Vinci of the Halloween party. Wait a minute, don’t forget to cut out SAP initials and stick them on your cap. After all, you are entering as SAP’s very own digital revolution system. [dropcap]42[/dropcap] Intel Neon Deck the Halloween hall with a Harley Quinn costume. For some extra dramatization, roll up some neon blue lights around your head. Create an Intel logo out of some blue neon lights and wear it as your neckpiece. [dropcap]43[/dropcap] Microsoft Brainwave This one will require a DIY task. Arrange for a red and green t-shirt, cut them into a vertical half. Stitch it in such a way that the green is on the left and the red on the right. Similarly, do that with your blue and yellow pants; with yellow on the left and blue on the right. You will look like the most powerful Microsoft’s logo. Wear a skullcap with wires protruding out and a Hololens like eyewear to go with. And so, you are all ready to enter the Halloween party as Microsoft’s deep learning acceleration platform for real-time AI. [dropcap]44[/dropcap] Sophia, the humanoid Enter with all the confidence and a top-to-toe professional attire. Be ready to answer any question thrown at you with grace and without a stroke of skepticism. And to top it off, sport a clean shaved head. And there, you are all ready to blow off everyone’s mind with a mix of beauty with super intelligent brains.   Happy Halloween folks!
Halloween costume ideas inspired from Apache Big Data Projects

Packt Editorial Staff
30 Oct 2017
3 min read
If you are a busy person who is finding it difficult to decide a Halloween costume for your office party tomorrow or for your kid's trick-or-treating madness, here are some geeky Halloween costume ideas that will make the inner data nerd in you proud! Apache Hadoop Be the cute little yellow baby elephant everyone wants to cuddle. Just grab all the yellow clothes you have. If you don’t, borrow them. Don’t forget to stuff in mini cushions in you. Pop in loads of candy in your mouth. And there, you’re all set to be as the dominant but the cutest framework! Cuteness overloaded. Apache Hive Be the buzz of your Halloween party by going as a top Apache data warehouse. What to wear you ask? Hum around wearing a yellow and white striped dress or a shirt. Compliment your outfit with a pair of black wings, headband with antennae and a small pot of honey.   Apache Storm An X-Men fan are you? Go as Storm, the popular fictional superhero. Wear a black bodysuit (leather if possible). Drape a long cape. Put on a grey wig. And channel your inner power. Perhaps people would be able to see the powerful weather-controlling mutant in you and also recognize your ability to process streaming data in real time. Apache Kafka Go all out gothic with an Apache Kafka costume. Dress in a serious black dress and gothic makeup. Don’t forget your black butterfly wings and a choker necklace with linked circles. Keep asking existential questions to random people at the party to throw them off balance. Apache Giraph Put on a yellow tee and brown trousers, cut out some brown imperfect circles and paste them on your tee. Put on a brown cap, and paint your ears brown. Draw some graph representations using a marker all over your hands and palms. You are now Apache Giraph. Apache Singa Be the blend of a flexible Apache Singa with the ferocity of a lion this Halloween! All you need is a yellow tee paired with light brown trousers. Wear a lion’s wig. Grab a mascara and draw some strokes on your cheeks. Paint the tip of your nose using a brown watercolour or some melted chocolate. Apache Spark If you have obsessed over Pokémon Go and equally love the lightning blaze data processing speed of Apache Spark, you should definitely go as the leader of Pokémon Go's Team Instinct. Spark wears an orange hoodie, a black and yellow leather jacket, black jeans and orange gloves. Do remember to carry your Pokemon balls in case you are challenged for a battle. Apache Pig A dark blue dungaree paired with a baby pink tee, a pair of white gloves, purple shoes and yes, a baby pink chart paper cut out of the pig’s face. Wear all of this on and you will look like an Apache Pig. Complement the look with a wide grin when you make an entrance. [caption id="attachment_1414" align="aligncenter" width="708"] Two baby boys dressed in animal costumes in autumn park, focus on baby in elephant costume[/caption] Happy Haloween folks! Watch this space for more data science themed Haloween costume ideas tomorrow.  
Implementing Autoencoders using H2O

Amey Varangaonkar
27 Oct 2017
4 min read
[box type="note" align="" class="" width=""]This excerpt is taken from the book Neural Networks with R, Chapter 7, Use Cases of Neural Networks - Advanced Topics, written by Giuseppe Ciaburro and Balaji Venkateswaran. In this article, we see how R is an effective tool for neural network modelling, by implementing autoencoders using the popular H2O library.[/box] An autoencoder is an ANN used for learning without efficient coding control. The purpose of an autoencoder is to learn coding for a set of data, typically to reduce dimensionality. Architecturally, the simplest form of autoencoder is an advanced and non-recurring neural network very similar to the MLP, with an input level, an output layer, and one or more hidden layers that connect them, but with the layer outputs having the same number of input level nodes for rebuilding their inputs. In this section, we present an example of implementing Autoencoders using H2O on a movie dataset. The dataset used in this example is a set of movies and genre taken from https://grouplens.org/datasets/movielens We use the movies.csv file, which has three columns: movieId title genres There are 164,979 rows of data for clustering. We will use h2o.deeplearning to have the autoencoder parameter fix the clusters. The objective of the exercise is to cluster the movies based on genre, which can then be used to recommend similar movies or same genre movies to the users. The program uses h20.deeplearning, with the autoencoder parameter set to T: library("h2o") setwd ("c://R") #Load the training dataset of movies movies=read.csv ( "movies.csv", header=TRUE) head(movies) model=h2o.deeplearning(2:3, training_frame=as.h2o(movies), hidden=c(2), autoencoder = T, activation="Tanh") summary(model) features=h2o.deepfeatures(model, as.h2o(movies), layer=1) d=as.matrix(features[1:10,]) labels=as.vector(movies[1:10,2]) plot(d,pch=17) text(d,labels,pos=3) Now, let's go through the code: library("h2o") setwd ("c://R") These commands load the library in the R environment and set the working directory where we will have inserted the dataset for the next reading. Then we load the data: movies=read.csv( "movies.csv", header=TRUE) To visualize the type of data contained in the dataset, we analyze a preview of one of these variables: head(movies) The following figure shows the first 20 rows of the movie dataset: Now we build and train model: model=h2o.deeplearning(2:3, training_frame=as.h2o(movies), hidden=c(2), autoencoder = T, activation="Tanh") Let's analyze some of the information contained in model: summary(model) This is an extract from the results of the summary() function: In the next command, we use the h2o.deepfeatures() function to extract the nonlinear feature from an h2o dataset using an H2O deep learning model: features=h2o.deepfeatures(model, as.h2o(movies), layer=1) In the following code, the first six rows of the features extracted from the model are shown: > features DF.L1.C1 DF.L1.C2 1 0.2569208 -0.2837829 2 0.3437048 -0.2670669 3 0.2969089 -0.4235294 4 0.3214868 -0.3093819 5 0.5586608 0.5829145 6 0.2479671 -0.2757966 [9125 rows x 2 columns] Finally, we plot a diagram where we want to see how the model grouped the movies through the results obtained from the analysis: d=as.matrix(features[1:10,]) labels=as.vector(movies[1:10,2]) plot(d,pch=17) text(d,labels,pos=3) The plot of the movies, once clustering is done, is shown next. We have plotted only 100 movie titles due to space issues. 
We can see some movies placed close together, meaning they are of the same genre. The titles are clustered based on the distances between them, which reflect genre. Given the large number of titles, the movie names cannot all be distinguished, but what appears clear is that the model has grouped the movies into three distinct groups. If you found this excerpt useful, make sure you check out the book Neural Networks with R, which contains interesting coverage of many more such useful and insightful topics.
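For readers who prefer Python over R, a rough equivalent of the training and feature-extraction steps above can be sketched with H2O's Python API. This is a hedged sketch rather than code from the book; the file path and the zero-based layer index are assumptions.

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()

# assumes the same MovieLens movies.csv with movieId, title, and genres columns
movies = h2o.import_file("movies.csv")

# two-node hidden layer used as a bottleneck, mirroring hidden=c(2) in the R example
model = H2ODeepLearningEstimator(hidden=[2], autoencoder=True, activation="tanh")
model.train(x=["title", "genres"], training_frame=movies)

# extract the learned two-dimensional codes (the layer index is zero-based here)
features = model.deepfeatures(movies, 0)
print(features.head())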
Top 5 Machine Learning Movies

Chris Key
17 Oct 2017
3 min read
Sitting in Mumbai airport at 2am can lead to some truly random conversations. Discussing the plot of Short Circuit 2 led us to thinking about this article. Here's my list of the top 5 movies featuring advanced machine learning. Short Circuit 2 [imdb] "Hey laser-lips, your momma was a snow blower!" A plucky robot who has named himself Johnny 5 returns to the screens to help build toy robots in a big city. By this point he is considered to have actual intelligence rather than artificial intelligence, however the plot of the film centres around his naivety and lack of ability to see the dark motives behind his new buddy, Oscar. We learn that intelligence can be applied anywhere, but sometimes it is the wrong place. Or right if you like stealing car stereos for "Los Locos". The Matrix Revolutions [imdb] The robots learn to balance an equation. Bet you wish you had them in your math high-school class. Also kudos to the Wachowski brothers who learnt from the machines the ability to balance the equation and released this monstrosity to even out the universe in light of the amazing first film in the trilogy. Blade Runner [imdb] “I've seen things you people wouldn't believe.” In the ultimate example of machines (see footnote) learning to emulate humanity, we struggled for 30 years to understand if Deckard was really human or a Nexus (spoilers: he is almost certainly a replicant!). It is interesting to note that when Pris and Roy are teamed up with JF Sebastian, their behaviours, aside from the occasional murder, show them to be more socially aware than their genius inventor friend. Wall-E [imdb] Disney and Pixar made a movie with no dialog for the entire first half, yet it was enthralling to watch. Without saying a single word, we see a small utility robot display a full range of emotions that we can relate to. He also demonstrates other signs of life – his need for energy and rest, and his sense of purpose is divided between his prime directive of cleaning the planet, and his passion for collecting interesting objects. Terminator 2 [imdb] “I know now why you cry, but it is something I can never do” Sarah Connor tells us that “Watching John with the machine, it was suddenly so clear. The terminator, would never stop. It would never leave him, and it would never hurt him, never shout at him, or get drunk and hit him, or say it was too busy to spend time with him. It would always be there. And it would die, to protect him.” Yet John Connor teaches the deadly robot, played by the invincible ex-Governator Arnold Schwarzenegger, how to be normal in society. No Problemo. Gimme five. Hasta La Vista, baby. Footnote - replicants aren't really machines. The replicants are genetic engineered and created by the Tyrell corporation with limited lifespans and specific abilities. For all intents and purposes, they are really organic robots.
Machine Learning Models

Packt
16 Aug 2017
8 min read
In this article by Pratap Dangeti, the author of the book Statistics for Machine Learning, we will take a look at ridge regression and lasso regression in machine learning. (For more resources related to this topic, see here.)
Ridge regression and lasso regression
In linear regression, only the residual sum of squares (RSS) is minimized, whereas in ridge and lasso regression a penalty (also known as a shrinkage penalty) is applied to the coefficient values in order to regularize the coefficients with the tuning parameter λ. When λ = 0, the penalty has no impact and ridge/lasso produce the same result as linear regression, whereas λ → ∞ shrinks the coefficients towards zero. Before we go deeper into ridge and lasso, it is worth understanding some concepts about Lagrange multipliers. One can rewrite the preceding objective function in the following format, where the objective is just the RSS subject to a cost constraint (budget) s. For every value of λ, there is some s that provides equivalent equations, as shown below for the overall objective function with a penalty factor:
The following graph shows the two different Lagrangian formats:
Ridge regression works well in situations where the least squares estimates have high variance. Ridge regression has computational advantages over best subset selection, which requires 2^P models. In contrast, for any fixed value of λ, ridge regression fits only a single model, and the model-fitting procedure can be performed very quickly. One disadvantage of ridge regression is that it includes all the predictors and shrinks their weights according to their importance, but it does not set the values exactly to zero in order to eliminate unnecessary predictors from the model; this issue is overcome in lasso regression. When the number of predictors is significantly large, using ridge may provide good accuracy, but it includes all the variables, which is not desired in a compact representation of the model; this issue is not present in lasso, as it sets the weights of unnecessary variables to zero. Models generated from lasso are very much like subset selection, hence they are much easier to interpret than those produced by ridge regression.
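For reference, the penalized (Lagrangian) and constrained (budget) forms of the two estimators discussed above can be written in standard textbook notation as follows; this is a sketch in LaTeX rather than the book's own figures:

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
\quad \Longleftrightarrow \quad \min_{\beta} \ \mathrm{RSS}(\beta) \ \ \text{subject to} \ \ \sum_{j=1}^{p} \beta_j^2 \le s

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
\quad \Longleftrightarrow \quad \min_{\beta} \ \mathrm{RSS}(\beta) \ \ \text{subject to} \ \ \sum_{j=1}^{p} |\beta_j| \le s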
Example of ridge regression machine learning model
Ridge regression is a machine learning model in which we do not perform any statistical diagnostics on the independent variables; we just fit the model on the training data and check the accuracy of fit on the test data. Here, we have used the scikit-learn package:
>>> import pandas as pd
>>> from sklearn.linear_model import Ridge
>>> from sklearn.model_selection import train_test_split
>>> wine_quality = pd.read_csv("winequality-red.csv", sep=';')
>>> wine_quality.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
>>> all_colnms = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol']
>>> pdx = wine_quality[all_colnms]
>>> pdy = wine_quality["quality"]
>>> x_train,x_test,y_train,y_test = train_test_split(pdx,pdy,train_size = 0.7,random_state=42)
A simple version of grid search from scratch is described as follows, in which various values of alpha are tried in order to test the model fitness (a compact alternative using scikit-learn's GridSearchCV is sketched at the end of this ridge example):
>>> alphas = [1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0]
The initial value of R-squared is set to zero in order to keep track of its updated value and to print whenever the new value is greater than the existing value:
>>> initrsq = 0
>>> print ("\nRidge Regression: Best Parameters\n")
>>> for alph in alphas:
...     ridge_reg = Ridge(alpha=alph)
...     ridge_reg.fit(x_train,y_train)
...     tr_rsqrd = ridge_reg.score(x_train,y_train)
...     ts_rsqrd = ridge_reg.score(x_test,y_test)
The following code always keeps track of the test R-squared value and prints it if the new value is greater than the existing best value:
...     if ts_rsqrd > initrsq:
...         print ("Lambda: ",alph,"Train R-Squared value:",round(tr_rsqrd,5),"Test R-squared value:",round(ts_rsqrd,5))
...         initrsq = ts_rsqrd
It is shown in the following screenshot:
By looking at the test R-squared value (0.3513), we can conclude that there is no significant relationship between the independent and dependent variables. Also, please note that the test R-squared value generated from ridge regression is similar to the value obtained from multiple linear regression (0.3519), but with no stress on the diagnostics of variables, and so on. Hence, machine learning models are relatively compact and can be utilized for learning automatically without manual intervention to retrain the model; this is one of the biggest advantages of using ML models for deployment purposes. The R code for ridge regression on the wine quality data is shown as follows:
# Ridge regression
library(glmnet)
wine_quality = read.csv("winequality-red.csv",header=TRUE,sep = ";",check.names = FALSE)
names(wine_quality) <- gsub(" ", "_", names(wine_quality))
set.seed(123)
numrow = nrow(wine_quality)
trnind = sample(1:numrow,size = as.integer(0.7*numrow))
train_data = wine_quality[trnind,]; test_data = wine_quality[-trnind,]
xvars = c("fixed_acidity","volatile_acidity","citric_acid","residual_sugar","chlorides","free_sulfur_dioxide",
"total_sulfur_dioxide","density","pH","sulphates","alcohol")
yvar = "quality"
x_train = as.matrix(train_data[,xvars]); y_train = as.double(as.matrix(train_data[,yvar]))
x_test = as.matrix(test_data[,xvars])
print(paste("Ridge Regression"))
lambdas = c(1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0)
initrsq = 0
for (lmbd in lambdas){
  ridge_fit = glmnet(x_train,y_train,alpha = 0,lambda = lmbd)
  pred_y = predict(ridge_fit,x_test)
  R2 <- 1 - (sum((test_data[,yvar]-pred_y)^2)/sum((test_data[,yvar]-mean(test_data[,yvar]))^2))
  if (R2 > initrsq){
    print(paste("Lambda:",lmbd,"Test Adjusted R-squared :",round(R2,4)))
    initrsq = R2
  }
}
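As mentioned above, the manual alpha loop can also be expressed with scikit-learn's built-in GridSearchCV. The following is a minimal sketch, assuming the same x_train and y_train created earlier; it is an alternative illustration, not part of the book's code.

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": [1e-4, 1e-3, 1e-2, 0.1, 0.5, 1.0, 5.0, 10.0]}

# 5-fold cross-validated search over the same alpha values, scored by R-squared
grid = GridSearchCV(Ridge(), param_grid, scoring="r2", cv=5)
grid.fit(x_train, y_train)

print("Best alpha:", grid.best_params_["alpha"])
print("Best cross-validated R-squared:", round(grid.best_score_, 5))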
Example of lasso regression model
Lasso regression is a close cousin of ridge regression, in which the absolute values of the coefficients are penalized rather than the squares of the values. By doing so, we eliminate some insignificant variables, which gives a much more compact representation, similar to OLS methods. The following implementation is almost the same as ridge regression, apart from the penalty being applied to the mod/absolute value of the coefficients:
>>> from sklearn.linear_model import Lasso
>>> alphas = [1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0]
>>> initrsq = 0
>>> print ("\nLasso Regression: Best Parameters\n")
>>> for alph in alphas:
...     lasso_reg = Lasso(alpha=alph)
...     lasso_reg.fit(x_train,y_train)
...     tr_rsqrd = lasso_reg.score(x_train,y_train)
...     ts_rsqrd = lasso_reg.score(x_test,y_test)
...     if ts_rsqrd > initrsq:
...         print ("Lambda: ",alph,"Train R-Squared value:",round(tr_rsqrd,5),"Test R-squared value:",round(ts_rsqrd,5))
...         initrsq = ts_rsqrd
It is shown in the following screenshot:
Lasso regression produces results almost similar to ridge, but if we check the test R-squared values a bit carefully, lasso produces slightly lower values. The reason behind this could be its robustness in reducing coefficients to zero and eliminating them from the analysis:
>>> ridge_reg = Ridge(alpha=0.001)
>>> ridge_reg.fit(x_train,y_train)
>>> print ("\nRidge Regression coefficient values of Alpha = 0.001\n")
>>> for i in range(11):
...     print (all_colnms[i],": ",ridge_reg.coef_[i])
>>> lasso_reg = Lasso(alpha=0.001)
>>> lasso_reg.fit(x_train,y_train)
>>> print ("\nLasso Regression coefficient values of Alpha = 0.001\n")
>>> for i in range(11):
...     print (all_colnms[i],": ",lasso_reg.coef_[i])
The following results show the coefficient values of both methods. The coefficient of density has been set to 0 in lasso regression, whereas the density coefficient is -5.5672 in ridge regression; also, none of the coefficients in ridge regression are zero:
R Code – Lasso Regression on Wine Quality Data
# The above data processing steps are the same as for ridge regression; only the following section of the code changes
# Lasso Regression
print(paste("Lasso Regression"))
lambdas = c(1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0)
initrsq = 0
for (lmbd in lambdas){
  lasso_fit = glmnet(x_train,y_train,alpha = 1,lambda = lmbd)
  pred_y = predict(lasso_fit,x_test)
  R2 <- 1 - (sum((test_data[,yvar]-pred_y)^2)/sum((test_data[,yvar]-mean(test_data[,yvar]))^2))
  if (R2 > initrsq){
    print(paste("Lambda:",lmbd,"Test Adjusted R-squared :",round(R2,4)))
    initrsq = R2
  }
}
Regularization parameters in linear regression and ridge/lasso regression
Adjusted R-squared in linear regression, which always penalizes the addition of extra variables with less significance, is one way of regularizing the data in linear regression, but it adjusts to a unique fit of the model. In machine learning, by contrast, many parameters are adjusted to regularize the overfitting problem; in the example of lasso/ridge regression, infinitely many values of the penalty parameter (λ) can be applied to regularize the model in infinitely many ways:
Overall, there are many similarities between the statistical and machine learning ways of predicting patterns.
Summary
We have seen ridge regression and lasso regression with their examples, and we have also seen their regularization parameters.
Resources for Article: Further resources on this subject: Machine Learning Review [article] Getting Started with Python and Machine Learning [article] Machine learning in practice [article]
Introduction to the Latest Social Media Landscape and Importance

Packt
14 Aug 2017
10 min read
In this article by Siddhartha Chatterjee and Michal Krystyanczuk, authors of the book Python Social Media Analytics, we start with a question to you: Have you seen the movie The Social Network? If you have not, it could be a good idea to see it before you read this. If you have, you may have seen the success story around Mark Zuckerberg and his company Facebook. This was possible due to the power of the platform in connecting, enabling, sharing, and impacting the lives of almost two billion people on this planet. The earliest social networks existed as far back as 1995, such as Yahoo (Geocities), theglobe.com, and tripod.com. These platforms were mainly there to facilitate interaction among people through chat rooms. It was only at the end of the 90s that user profiles became the in thing in social networking platforms, allowing information about people to be discoverable and, therefore, providing a choice to make friends or not. Those embracing this new methodology were Makeoutclub, Friendster, SixDegrees.com, and so on. MySpace, LinkedIn, and Orkut were thereafter created, and social networks were on the verge of becoming mainstream. However, the biggest impact happened with the creation of Facebook in 2004; it was a total game changer for people's lives, business, and the world. The sophistication and ease of using the platform made it a mainstream medium for individuals and companies to advertise and sell their ideas and products. Hence, we are in the age of social media, which has changed the way the world functions. Over the last few years, there have been new entrants in social media with essentially different interaction models compared to Facebook, LinkedIn, or Twitter. These are Pinterest, Instagram, Tinder, and others. An interesting example is Pinterest, which, unlike Facebook, is centered not around people but around interests and/or topics. It is essentially able to structure people based on their interests around these topics. The CEO of Pinterest describes it as a catalog of ideas. Forums, which are not considered regular social networks such as Facebook or Twitter, are also very important social platforms. Unlike on Twitter or Facebook, forum users are often anonymous, which enables them to have in-depth conversations with communities. Other non-typical social networks are video sharing platforms, such as YouTube and Dailymotion. They are non-typical because they are centered around user-generated content, and the social nature is generated by the sharing of this content on various social networks and also by the discussion it generates around the user commentaries. Social media is gradually changing from being platform centric to being about experiences and features. In the future, we'll see more and more traditional content providers and services becoming social in nature through sharing and conversations. The term social media today includes not just social networks but every service that's social in nature with a wide audience.
Delving into Social Data
The data acquired from social media is called social data. Social data exists in many forms. Some types of social media data are information around the users of social networks, like name, city, interests, and so on. Types of data that are numeric or quantifiable are known as structured data. However, since social media are platforms for expression, a lot of the data is in the form of texts, images, videos, and such.
These sources are rich in information, but not as direct to analyze as the structured data described earlier. These types of data are known as unstructured data. The process of applying rigorous methods to make sense of social data is called social data analytics. We will go into great depth in social data analytics to demonstrate how we can extract valuable sense and information from these really interesting sources of social data. Since there are almost no restrictions on social media, there are a lot of meaningless accounts, content, and interactions. So, the data coming out of these streams is quite noisy and polluted. Hence, a lot of effort is required to separate the information from the noise. Once the data is cleaned and we are focused on the most important and interesting aspects, we then require various statistical and algorithmic methods to make sense out of the filtered data and draw meaningful conclusions.
Understanding the process
Once you are familiar with the topic of social media data, let us proceed to the next phase. The first step is to understand the process involved in the exploitation of data present on social networks. A proper execution of the process, with attention to small details, is the key to good results. In many computer science domains, a small error in code will lead to a visible or at least correctable dysfunction, but in data science, it will produce entirely wrong results, which in turn will lead to incorrect conclusions. The very first step of data analysis is always problem definition. Understanding the problem is crucial for choosing the right data sources and the methods of analysis. It also helps to realize what kind of information and conclusions we can infer from the data and what is impossible to derive. This part is very often underestimated, while it is key to successful data analysis. Any question that we try to answer in a data science project has to be very precise. Some people tend to ask very generic questions, such as "I want to find trends on Twitter." This is not a correct problem definition, and an analysis based on such a statement can fail to find relevant trends. With a naïve analysis, we can get repeating Twitter ads and content generated by bots. Moreover, it raises more questions than it answers. In order to approach the problem correctly, we have to ask in the first step: What is a trend? What is an interesting trend for us? And what is the time scope? Once we answer these questions, we can break up the problem into multiple subproblems: "I'm looking for the most frequent consumer reactions about my brand on Twitter in English over the last week, and I want to know if they were positive or negative." Such a problem definition will lead to a relevant, valuable analysis with insightful conclusions. The next part of the process consists of getting the right data according to the defined problem. Many social media platforms allow users to collect a lot of information in an automated way via APIs (Application Programming Interfaces), which is the easiest way to complete the task. Once the data is stored in a database, we perform the cleaning. This step requires a precise understanding of the project's goals. In many cases, it will involve very basic tasks such as duplicate removal, for example, retweets on Twitter, or more sophisticated ones such as spam detection to remove irrelevant comments, language detection to perform linguistic analysis, or other statistical or machine learning approaches that can help to produce a clean dataset.
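To make the cleaning step concrete, here is a small, hedged Python sketch. The DataFrame columns are hypothetical and the third-party langdetect package is an assumption, not code from the book; it simply removes duplicate comments and keeps only the ones detected as English.

import pandas as pd
from langdetect import detect

# hypothetical frame of collected comments: one row per comment
comments = pd.DataFrame({
    "user": ["a", "b", "b", "c"],
    "text": ["Great phone!", "RT great phone!", "RT great phone!", "Très bon téléphone"],
})

# drop exact duplicates (for example, the same retweet collected twice)
comments = comments.drop_duplicates(subset=["user", "text"])

def is_english(text):
    try:
        return detect(text) == "en"
    except Exception:  # langdetect raises on very short or empty strings
        return False

clean = comments[comments["text"].apply(is_english)]
print(clean)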
When the data is ready to be analyzed, we have to choose the kind of analysis and structure the data accordingly. If our goal is to understand the sense of the conversations, then it only requires a simple list of verbatims (textual data), but if we aim to perform analysis on different variables, like the number of likes, dates, number of shares, and so on, the data should be combined in a structure such as a data frame, where each row corresponds to an observation and each column to a variable. The choice of the analysis method depends on the objectives of the study and the type of data. It may require a statistical or machine learning approach, or a specific approach to time series. Different approaches will be explained using examples of Facebook, Twitter, YouTube, GitHub, Pinterest, and forum data. Once the analysis is done, it's time to infer conclusions. We can derive conclusions based on the outputs from the models, but one of the most useful tools is visualization. Data and output can be presented in many different ways, starting from charts, plots, and diagrams through more complex 2D charts, to multidimensional visualizations.
Project planning
Analysis of content on social media can get very confusing due to the difficulty of working with a large amount of data and also trying to make sense out of it. For this reason, it's extremely important to ask the right questions in the beginning to get the right answers. Even though this is an exploratory approach, and getting exact answers may be difficult, the right questions allow you to define the scope, process, and time. The main questions that we will be working on are the following:
What does Google post on Facebook?
How do people react to Google posts? (Likes, shares, and comments)
What does Google's audience say about Google and its ecosystem?
What are the emotions expressed by Google's audience?
With the preceding questions in mind, we will proceed to the next steps.
Scope and process
The analysis will consist of analyzing the feed of posts and comments on the official Facebook page of Google. The process of information extraction is organized in a data flow. It starts with data extraction from the API, followed by data preprocessing and wrangling, and then a series of different analyses. The analysis becomes actionable only after the last step of results interpretation. In order to retrieve the above information, we need to do the following:
Extract all the posts of Google permitted by the Facebook API
Extract the metadata for each post: timestamp, number of likes, number of shares, number of comments
Extract the user comments under each post and their metadata
Process the posts to retrieve the most common keywords, bi-grams, and hashtags
Process the user comments using the Alchemy API to retrieve the emotions
Analyse the above information to derive conclusions
Data type
The main part of information extraction comes from an analysis of textual data (posts and comments). However, in order to add a quantitative and temporal dimension, we process numbers (likes, shares) and dates (date of creation).
Summary
The avalanche of social network data is a result of the communication platforms that have been developed over the last two decades. These are platforms that evolved from chat rooms to personal information sharing and, finally, social and professional networks. Among many, Facebook, Twitter, Instagram, Pinterest, and LinkedIn have emerged as the modern day social media.
These platforms collectively have a reach of more than a billion individuals across the world, who share their activities and interact with each other. The sharing of their data by these media through APIs and other technologies has given rise to a new field called Social Media Analytics. This has multiple applications, such as marketing, personalized recommendations, research, and societal applications. Modern data science techniques such as machine learning and text mining are widely used for these applications. Python is one of the most widely used programming languages for these techniques. However, manipulating the unstructured data from social networks requires a lot of precise processing and preparation before coming to the most interesting bits. Resources for Article: Further resources on this subject: How to integrate social media with your WordPress website [article] Social Media for Wordpress: VIP Memberships [article] Social Media Insight Using Naive Bayes [article]
Understanding SAP Analytics Cloud

Packt
10 Aug 2017
14 min read
In this article, Riaz Ahmed, the author of the book Learning SAP Analytics Cloud, provides an overview of this unique cloud-based business intelligence platform. You will learn about the following SAP Analytics Cloud segments: models and data sources, visualization, collaboration, presentation, and administration and security. (For more resources related to this topic, see here.)
What is SAP Analytics Cloud?
SAP Analytics Cloud is a new-generation cloud-based application, which helps you explore your data, perform visualization and analysis, create financial plans, and produce predictive forecasts. It is a one-stop-shop solution to cope with your analytic needs and comprises business intelligence, planning, predictive analytics, and governance and risk. The application is built on the SAP Cloud Platform and delivers great performance and scalability. In addition to the on-premise SAP HANA, SAP BW, and S/4HANA sources, you can work with data from a wide range of non-SAP sources including Google Drive, Salesforce, SQL Server, Concur, and CSV, to name a few. SAP Analytics Cloud allows you to make secure connections to these cloud and on-premise data sources.
Anatomy of SAP Analytics Cloud
The following figure depicts the anatomy of SAP Analytics Cloud:
Data sources and models
Before commencing your analytical tasks in SAP Analytics Cloud, you need to create models. Models are the basis for all of your analysis in SAP Analytics Cloud to evaluate the performance of your organization. A model is a high-level design that exposes the analytic requirements of end users. You can create planning and analytics models based on cloud or on-premise data sources. Analytics models are simpler and more flexible, while planning models are full-featured models in which business analysts and finance professionals can quickly and easily build connected models to analyze data and then collaborate with each other to attain better business performance. Preconfigured with dimensions for Time and Categories, planning models support multi-currency and security features at both model and dimension levels. After creating these models, you can share them with other users in your organization. Before sharing, you can set up model access privileges for users according to their level of authorization and can also enable data auditing. With the help of SAP Analytics Cloud's analytical capabilities, users can discover hidden traits in the data and can predict likely outcomes. It equips them with the ability to uncover potential risks and hidden opportunities. To determine what content to include in your model, you must first identify the columns from the source data on which users need to query. The columns you need in your model reside in some sort of data source. SAP Analytics Cloud supports three types of data sources: files (such as CSV or Excel files) that usually reside on your computer, live data connections from a connected remote system, and cloud apps. In addition to the files on your computer, you can use on-premise data sources such as SAP Business Warehouse, SAP ERP, SAP Universe, SQL database, and more to acquire data for your models. In the cloud, you can get data from apps like Concur, Google Drive, SAP Business ByDesign, SAP Hybris Cloud, OData Services, and SuccessFactors. The following figure depicts these data sources.
The cloud app data sources you can use with SAP Analytics Cloud are displayed above the firewall mark, while those in your local network are shown under the firewall. As you can see in this figure, there are over twenty data sources currently supported by SAP Analytics Cloud. The methods of connecting to these data sources also vary from each other.
Create a direct live connection to SAP HANA
You can connect to an on-premise SAP HANA system to use live data in SAP Analytics Cloud. Live data means that you can get up-to-the-minute data when you open a story in SAP Analytics Cloud. In this case, any changes made to the data in the source system are reflected immediately. Usually, there are two ways to establish a connection to a data source – use the Connection option from the main menu, or specify the data source during the process of creating a model. However, live data connections must be established via the Connection menu option prior to creating the corresponding model.
Connect remote systems to import data
In addition to creating live connections, you can also create connections that allow you to import data into SAP Analytics Cloud. In these types of connections that you make to access remote systems, data is imported (copied) to SAP Analytics Cloud. Any changes users make in the source data do not affect the imported data. To establish connections with these remote systems, you need to install some additional components. For example, you must install SAP HANA Cloud Connector to access SAP Business Planning and Consolidation (BPC) for Netweaver. Similarly, SAP Analytics Cloud Agent should be installed for SAP Business Warehouse (BW), SQL Server, SAP ERP, and others.
Connect to a cloud app to import data
In addition to creating a live connection to create a model on live data and importing data from remote systems, you can set up connections to acquire data from cloud apps, such as Google Drive, SuccessFactors, OData, Concur, Fieldglass, Google BigQuery, and more.
Refreshing imported data
SAP Analytics Cloud allows you to refresh your imported data. With this option, you can reimport the data on demand to get the latest values. You can perform this refresh operation either manually, or create an import schedule to refresh the data at a specific date and time or on a recurring basis.
Visualization
Once you have created a model and set up appropriate security for it, you can create stories in which the underlying model data can be explored and visualized with the help of different types of charts, geo maps, and tables. There is a wide range of charts you can add to your story pages to address different scenarios. You can create multiple pages in your story to present your model data using charts, geo maps, tables, text, shapes, and images. On your story pages, you can link dimensions to present data from multiple sources. Adding reference lines and thresholds, applying filters, and drilling down into data can be done on the fly. The ability to interactively drag and drop page objects is the most useful feature of this application. Charts: SAP Analytics Cloud comes with a variety of charts to present your analysis according to your specific needs. You can add multiple types of charts to a single story page. Geo Map: The models that include latitude and longitude information can be used in stories to visualize data in geo maps.
By adding multiple layers of different types of data in geo maps, you can show different geographic features and points of interest, enabling you to perform sophisticated geographic analysis. Table: A table is a spreadsheet-like object that can be used to view and analyze text data. You can add this object to either canvas or grid pages in stories. Static and Dynamic Text: You can add static and dynamic text to your story pages. Static text is normally used to display page titles, while dynamic text automatically updates page headings based on the values from the source input control or filter. Images and Shapes: You can add images (such as your company logo) to your story page by uploading them from your computer. In addition to images, you can also add shapes such as a line, square, or circle to your page.
Collaboration
The collaboration, alert, and notification features of SAP Analytics Cloud keep business users in touch with each other while executing their tasks. During the process of creating models and stories, you need input from other people. For example, you might ask a colleague to update a model by providing first-quarter sales data, or request Sheela to enter her comments on one of your story pages. In SAP Analytics Cloud, these interactions come under the collaboration features of the application. Using these features, users can discuss business content and share information, which consequently smooths the decision-making process. Here is a list of the available collaboration features in the application that allow group members to discuss stories and other business content.
Create a workflow using events and tasks: The events and tasks features in SAP Analytics Cloud are the two major sources that help you collaborate with other group members and manage your planning and analytic activities. After creating an event and assigning tasks to relevant group members, you can monitor the task progress in the Events interface. Here is the workflow to utilize these two features:
Create events based on categories and processes within categories
Create a task, assign it to users, and set a due date for its submission
Monitor the task progress
Commenting on a chart's data point: Using this feature, you can add annotations or additional information to individual data points in a chart. Commenting on a story page: In addition to adding comments to an individual chart, you have the option to add comments on an entire story page to provide some vital information to other users. When you add a comment, other users can see and reply to it. Produce a story as a PDF: You can save your story in a PDF file to share it with other users and for offline access. You can save all story pages or a specific page as a PDF file. Sharing a story with colleagues: Once you complete a story, you can share it with other members in your organization. You are provided with three options (Public, Teams, and Private) when you save a story. When you save your story in the Public folder, it can be accessed by anyone. The Teams option lets you select specific teams to share your story with. In the Private option, you have to manually select users with whom you want to share your story. Collaborate via discussions: You can collaborate with colleagues using the discussions feature of SAP Analytics Cloud. The discussions feature enables you to connect with other members in real time.
Sharing files and other objects: The Files option under the Browse menu allows you to access SAP Analytics Cloud repository, where the stories you created and the files you uploaded are stored. After accessing the Files page, you can share its objects with other users. On the Files page, you can manage files and folders, upload files, share files, and change share settings. Presentation In the past, board meetings were held in which the participants used to deliver their reports through their laptops and a bunch of papers. With different versions of reality, it was very difficult for decision makers to arrive at a good decision. With the advent of SAP Digital Boardroom the board meetings have revolutionized. It is a next-generation presentation platform, which helps you visualize your data and plan ahead in a real-time environment. It runs on multiple screens simultaneously displaying information from difference sources. Due to this capability, more people can work together using a single dataset that consequently creates one version of reality to make the best decision. SAP Digital Boardroom is changing the landscape of board meetings. In addition to supporting traditional presentation methods, it goes beyond the corporate boardroom and allows remote members to join and participate the meeting online. These remote participants can actively engage in the meeting and can play with live data using their own devices. SAP Digital Boardroom is a visualization and presentation platform that enormously assists in the decision-making process. It transforms executive meetings by replacing static and stale presentations with interactive discussions based on live data, which allows them to make fact-based decisions to drive their business. Here are the main benefits of SAP Digital Boardroom: Collaborate with others in remote locations and on other devices in an interactive meeting room. Answer ad hoc questions on the fly. Visualize, recognize, experiment, and decide by jumping on and off script at any point. Find answers to the questions that matter to you by exploring directly on live data and focusing on relevant aspects by drilling into details. Discover opportunities or reveal hidden threats. Simulate various decisions and project the results. Weigh and share the pros and cons of your findings. The Digital Boardroom is interactive so you can retrieve real-time data, make changes to the schedule and even run through what-if scenarios. It presents a live picture of your organization across three interlinked touch screens to make faster, better executive decisions. There are two aspects of Digital Boardroom with which you can share existing stories with executives and decision-makers to reveal business performance. First, you have to design your agenda for your boardroom presentation by adding meeting information, agenda items, and linking stories as pages in a navigation structure. Once you have created a digital boardroom agenda, you can schedule a meeting to discuss the agenda. In Digital Boardroom interface, you can organize a meeting in which members can interact with live data during a boardroom presentation. Administration andsecurity An application that is accessed by multiple users is useless without a proper system administration module. The existence of this module ensures that things are under control and secured. It also includes upkeep, configuration, and reliable operation of the application. 
This module is usually assigned to a person who is called system administrator and whose responsibility is to watch uptime, performance, resources, and security of the application. Being a multi-user application, SAP Analytics Cloud also comes with this vital module that allows a system administrator to take care of the following segments: Creating users and setting their passwords Importing users from another data source Exporting user's profiles for other apps Deleting unwanted user accounts from the system Creating roles and assigning them to users Setting permissions for roles Forming teams Setting security for models Monitoring users activities Monitoring data changes Monitoring system performance System deployment via export and import Signing up for trial version If you want to get your feet wet by exploring this exciting cloud application, then sign up for a free 30 days trial version. Note that the trial version doesn't allow you to access all the features of SAP Analytics Cloud. For example, you cannot create a planning model in the trial version nor can you access its security and administration features. Execute the following steps to get access to the free SAP Analytics Cloud trial: Put the following URL in you browser's address bar and press enter: http://discover.sapanalytics.cloud/trialrequest-auto/ Enter and confirm your business e-mail address in relevant boxes. Select No from the Is your company an SAP Partner? list. Click the Submit button. After a short while, you will get an e-mail with a link to connect to SAP Analytics Cloud. Click the Activate Account button in the e-mail. This will open the Activate Your Account page, where you will have to set a strong password. The password must be at least 8 characters long and should also include uppercase and lowercase letters, numbers, and symbols. After entering and confirming your password, click the Save button to complete the activation process. The confirmation page appears telling you that your account is successfully activated. Click Continue. You will be taken to SAP Analytics Cloud site. The e-mail you receive carries a link under SAP Analytics Cloud System section that you can use to access the application any time. Your username (your e-mail address) is also mentioned in the same e-mail along with a log on button to access the application. Summary SAP Analytics Cloud is the next generation cloud-based analytic application, which provides an end-to-end cloud analytics experience. SAP Analytics Cloud can help transform how you discover, plan, predict, collaborate, visualize, and extend all in one solution. In addition to on-premise data sources, you can fetch data from a variety of other cloud apps and even from Excel and text files to build your data models and then create stories based on these models. The ultimate purpose of this amazing and easy-to-use application is to enable you to make the right decision. SAP Analytics Cloud is more than visualization of data it is insight to action, it is realization of success. Resources for Article: Further resources on this subject: Working with User Defined Values in SAP Business One [article] Understanding Text Search and Hierarchies in SAP HANA [article] Meeting SAP Lumira [article]
Read more
  • 0
  • 0
  • 9861

article-image-machine-learning-review
Packt
18 Jul 2017
20 min read
Save for later

Machine Learning Review

Packt
18 Jul 2017
20 min read
In this article by Uday Kamath and Krishna Choppella, authors for the book Mastering Java Machine Learning, will discuss how in recent years a revival of interest is seen in the area ofartificial intelligence (AI)and machine learning, in particular, both in academic circles and industry. In the last decade, AI has seen dramatic successes that eluded practitioners in the intervening years since the original promise of the field gave way to relative decline until its re-emergence in the last few years. (For more resources related to this topic, see here.) What made these successes possible, in large part, was the availability of prodigious amounts of data and the inexorable increase in raw computational power. Among the areas of AI leading the resurgence, machine learning has seen spectacular developments and continues to find the widest applicability in an array of domains. The use of machine learning to help in complex decision making at the highest levels of business, and at the same time, its enormous success in improving the accuracy of what are now everyday applications, such assearch, speech recognition, and personal assistants on mobile phones,has made its effects commonplace in the family room and the boardroom alike. Articles breathlessly extolling the power of "deep learning" can be found today not only in the popular science and technology press, but also in mainstream outlets such as The New York Times and The Huffington Post. Machine learning has indeed become ubiquitous in a relatively short time. An ordinary user encounters machine learning in many ways in his day-to-day activities. Interacting with well-known e-mail providers such as Gmail gives the user automated sorting and categorization of e-mails into categories, such as spam, junk, promotions, and so on,which is made possible using text mining, a branch of machine learning. When shopping online for products on ecommerce websites such as https://www.amazon.com/ or watching movies from content providers such as http://netflix.com/, one is offered recommendations for other products and content by so-called recommender systems, another branch of machine learning. Forecasting the weather, estimating real estate prices, predicting voter turnout and even election results—all use some form of machine learningto see into the future as it were. The ever-growing availability of data and the promise of systems that can enrich our lives by learning from that data place a growing demand on skills from a limited workforce of professionals in the field of data science. This demand is particularly acute for well-trained experts who know their way around the landscape of machine learning techniques in the more popular languages, including Java, Python, R, and increasingly, Scala. By far, the number and availability of machine learning libraries, tools, APIs, and frameworks in Java outstrip those in other languages. Consequently, mastery of these skills will put any aspiring professional with a desire to enter the field at a distinct advantage in the marketplace. Perhaps you already apply machine learning techniques in your professional work, or maybe you simply have a hobbyist's interest in the subject.Clearly, you can bend Java to your will, but now you feel you're ready to dig deeper and learn how to use thebest of breed open-source ML Java frameworks in your next data science project. 
Mastery of a subject, especially one that has such obvious applicability as machine learning, requires more than an understanding of its core concepts and familiarity with its mathematical underpinnings. Unlike an introductory treatment of the subject, a project that purports to help you master the subject must be heavily focused on practical aspects in addition to introducing more advanced topics that would have stretched the scope of the introductory material.To warm up before we embark on sharpening our instrument, we will devote this article to a quick review of what we already know.For the ambitious novice with little or no prior exposure to the subject (who is nevertheless determined to get the fullest benefit from this article), here's our advice: make sure you do not skip the rest of this article instead, use it as a springboard to explore unfamiliar concepts in more depth. Seek out external resources as necessary.Wikipedia it. Then jump right back in. For the rest of this article, we will review the following: History and definitions What is not machine learning? Concepts and terminology Important branches of machine learning Different data types in machine learning Applications of machine learning Issues faced in machine learning The meta-process used in most machine learning projects Information on some well-known tools, APIs,and resources that we will employ in this article Machine learning –history and definition It is difficult to give an exact history, but the definition of machine learning we use today finds its usage as early as in the 1860s.In Rene Descartes' Discourse on the Method, he refers to Automata and saysthe following: For we can easily understand a machine's being constituted so that it can utter words, and even emit some responses to action on it of a corporeal kind, which brings about a change in its organs; for instance, if touched in a particular part it may ask what we wish to say to it; if in another part it may exclaim that it is being hurt, and so on http://www.earlymoderntexts.com/assets/pdfs/descartes1637.pd https://www.marxists.org/reference/archive/descartes/1635/discourse-method.htm Alan Turing, in his famous publication Computing Machinery and Intelligence, gives basic insights into the goals of machine learning by asking the question "Can machines think?". http://csmt.uchicago.edu/annotations/turing.htm http://www.csee.umbc.edu/courses/471/papers/turing.pdf Arthur Samuel, in 1959,wrote,"Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.". Tom Mitchell, in recent times, gave a more exact definition of machine learning:"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." 
Machine Learning has a relationship with several areas: Statistics: This uses the elements of data sampling, estimation, hypothesis testing, learning theory, and statistical based modeling, to name a few Algorithms and computation: This uses basics of search, traversal, parallelization, distributed computing, and so on from basic computer science Database and knowledge discovery: This has the ability to store, retrieve, access information in various formats Pattern recognition: This has the ability to find interesting patterns from the data either to explore, visualize, or predict Artificial Intelligence: Though it is considered a branch of artificial intelligence, it also has relationships with other branches, such as heuristics, optimization, evolutionary computing, and so on. What is not machine learning? It is important to recognize areas that share a connection with machine learning but cannot themselves be considered as being part of machine learning. Some disciplines may overlap to a smaller or larger extent, yet the principles underlying machine learning are quite distinct: Business intelligence and reporting: Reporting Key Performance Indicators (KPIs), querying OLAP for slicing, dicing, and drilling into the data, dashboards,and so on. that form central components of BI are not machine learning. Storage and ETL: Data storage and ETL are key elements needed in any machine learning process, but by themselves, they don't qualify as machine learning. Information retrieval, search, and queries: The ability to retrieve the data or documents based on search criteria or indexes, which form the basis of information retrieval, are not really machine learning. Many forms of machine learning, such as semi-supervised learning, can rely on search of similar data for modeling but that doesn't qualify search as machine learning. Knowledge representation and reasoning: Representing knowledge for performing complex tasks such as Ontology, Expert Systems, and Semantic Web do not qualify as machine learning. Machine learning –concepts and terminology In this article, we will describe different concepts and terms normally used in machine learning: Data or dataset: The basics of machine learning rely on understanding the data. The data or dataset normally refers to content available in structured or unstructured format for using in machine learning. Structured datasets have specific formats, and an unstructured dataset is normally in the form of some free flowing text. Data can be available in various storage types or formats. In structured data, every element known as an instance or an example or row follows a predefined structure. Data can be also be categorized by size; small or medium data have a few hundreds to thousands of instances, whereas big data refers to large volume, mostly in the millions or billions, which cannot be stored or accessed using common devices or fit in the memory of such devices. Features, attributes, variables or dimensions: In structured datasets, as mentioned earlier, there are predefined elements with their own semantic and data type, which are known variously as features, attributes, variables, or dimensions. Data types: The preceding features defined need some form of typing in many machine learning algorithms or techniques. The most commonly used data types are as follows: Categorical or nominal: This indicates well-defined categories or values present in the dataset. 
For example, eye color, such as black, blue, brown, green, or grey; document content type, such as text, image, or video. Continuous or numeric: This indicates the numeric nature of the data field. For example, a person's weight measured by a bathroom scale, temperature from a sensor, the monthly balance in dollars on a credit card account. Ordinal: This denotes the data that can be ordered in some way. For example, garment size, such as small, medium, or large; boxing weight classes, such as heavyweight, light heavyweight, middleweight, lightweight,and bantamweight. Target or label: A feature or set of features in the dataset, which is used for learning from training data and predicting in unseen dataset, is known as a target or a label. A label can have any form as specified earlier, that is, categorical, continuous, or ordinal. Machine learning model: Each machine learning algorithm, based on what it learned from the dataset, maintains the state of its learning for predicting or giving insights into future or unseen data. This is referred to as the machine learning model. Sampling: Data sampling is an essential step in machine learning. Sampling means choosing a subset of examples from a population with the intent of treating the behavior seen in the (smaller) sample as being representative of the behavior of the (larger) population. In order for the sample to be representative of the population, care must be taken in the way the sample is chosen. Generally, a population consists of every object sharing the properties of interest in the problem domain,for example,all people eligible to vote in the general election, all potential automobile owners in the next four years.Since it is usually prohibitive (or impossible) to collect data for all the objects in a population, a well-chosen subset is selected for the purposes of analysis.A crucial consideration in the sampling process is that the sample be unbiased with respect to the population. The following are types of probability based sampling: Uniform random sampling: A sampling method when the sampling is done over a uniformly distributed population, that is, each object has an equal probability of being chosen. Stratified random sampling: A sampling method when the data can be categorized into multiple classes.In such cases, in order to ensure all categories are represented in the sample, the population is divided into distinct strata based on these classifications, and each stratum is sampled in proportion to the fraction of its class in the overall population. Stratified sampling is common when the population density varies across categories, and it is important to compare these categories with the same statistical power. Cluster sampling: Sometimes there are natural groups among the population being studied, and each group is representative of the whole population.An example is data that spans many geographical regions. In cluster sampling you take a random subset of the groups followed by a random sample from within each of those groups to construct the full data sample.This kind of sampling can reduce costs of data collection without compromising the fidelity of distribution in the population. Systematic sampling: Systematic or interval sampling is used when there is a certain ordering present in the sampling frame (a finite set of objects treated as the population and taken to be the source of data for sampling, for example, the corpus of Wikipedia articles arranged lexicographically by title). 
If the sample is then selected by starting at a random object and skipping a constant k number of object before selecting the next one, that is called systematic sampling.K is calculated as the ratio of the population and the sample size. Model evaluation metrics: Evaluating models for performance is generally based on different evaluation metrics for different types of learning. In classification, it is generally based on accuracy, receiver operating characteristics (ROC) curves, training speed, memory requirements, false positive ratio,and so on. In clustering, the number of clusters found, cohesion, separation, and so on form the general metrics. In stream-based learning apart from preceding standard metrics mentioned, adaptability, speed of learning, and robustness to sudden changes are some of the conventional metrics for evaluating the performance of the learner. To illustrate these concepts, a concrete example in the form of a well-known weather dataset is given.The data gives a set of weather conditions and a label that indicates whether the subject decided to play a game of tennis on the day or not: @relation weather @attribute outlook {sunny, overcast, rainy} @attribute temperature numeric @attribute humidity numeric @attribute windy {TRUE, FALSE} @attribute play {yes, no} @data sunny,85,85,FALSE,no sunny,80,90,TRUE,no overcast,83,86,FALSE,yes rainy,70,96,FALSE,yes rainy,68,80,FALSE,yes rainy,65,70,TRUE,no overcast,64,65,TRUE,yes sunny,72,95,FALSE,no sunny,69,70,FALSE,yes rainy,75,80,FALSE,yes sunny,75,70,TRUE,yes overcast,72,90,TRUE,yes overcast,81,75,FALSE,yes rainy,71,91,TRUE,no The dataset is in the format of an ARFF (Attribute-Relation File Format) file. It consists of a header giving the information about features or attributes with their data types and actual comma separated data following the data tag. The dataset has five features: outlook, temperature, humidity, windy, and play. The features outlook and windy are categorical features, while humidity and temperature are continuous. The feature play is the target and is categorical. Machine learning –types and subtypes We will now explore different subtypes or branches of machine learning. Though the following list is not comprehensive, it covers the most well-known types: Supervised learning: This is the most popular branch of machine learning, which is about learning from labeled data. If the data type of the label is categorical, it becomes a classification problem, and if numeric, it becomes a regression problem. For example, if the target of the dataset is detection of fraud, which has categorical values of either true or false, we are dealing with a classification problem. If, on the other hand, the target is to predict thebest price to list the sale of a home at, which is a numeric dollar value, the problem is one of regression. The following diagram illustrates labeled data that is conducive to classification techniques that are suitable for linearly separable data, such as logistic regression: Linearly separable data An example of dataset that is not linearly separable. This type of problem calls for classification techniques such asSupport Vector Machines. Unsupervised learning: Understanding the data and exploring it in order to buildmachine learning models when the labels are not given is called unsupervised learning. Clustering, manifold learning, and outlier detection are techniques that are covered in this topic. 
Examples of problems that require unsupervised learning are many; grouping customers according to their purchasing behavior is one example.In the case of biological data, tissues samples can be clustered based on similar gene expression values using unsupervised learning techniques The following diagram represents data with inherent structure that can be revealed as distinct clusters using an unsupervised learning technique such as K-Means: Clusters in data Different techniques are used to detect global outliers—examples that are anomalous with respect to the entire data set, and local outliers—examples that are misfits in their neighborhood. In the following diagam, the notion of local and global outliers is illustrated for a two-feature dataset: Local and Global outliers Semi-supervised learning: When the dataset has some labeled data and large data, which is not labeled, learning from such dataset is called semi-supervised learning. When dealing with financial data with the goal to detect fraud, for example, there may be a large amount of unlabeled data and only a small number of known fraud and non-fraud transactions.In such cases, semi-supervised learning may be applied. Graph mining: Mining data represented as graph structures is known as graph mining. It is the basis of social network analysis and structure analysis in different bioinformatics, web mining, and community mining applications. Probabilistic graphmodeling and inferencing: Learning and exploiting structures present between features to model the data comes under the branch of probabilistic graph modeling. Time-series forecasting: This is a form of learning where data has distinct temporal behavior and the relationship with time is modeled.A common example is in financial forecasting, where the performance of stocks in a certain sector may be the target of the predictive model. Association analysis:This is a form of learning where data is in the form of an item set or market basket and association rules are modeled to explore and predict the relationships between the items. A common example in association analysis is to learn relationships between the most common items bought by the customers when they visit the grocery store. Reinforcement learning: This is a form of learning where machines learn to maximize performance based on feedback in the form of rewards or penalties received from the environment. A recent example that famously used reinforcement learning was AlphaGo, the machine developed by Google that beat the world Go champion Lee Sedoldecisively, in March 2016.Using a reward and penalty scheme, the model first trained on millions of board positions in the supervised learning stage, then played itselfin the reinforcement learning stage to ultimately become good enough to triumph over the best human player. http://www.theatlantic.com/technology/archive/2016/03/the-invisible-opponent/475611/ https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf Stream learning or incremental learning: Learning in supervised, unsupervised, or semi-supervised manner from stream data in real time or pseudo-real time is called stream or incremental learning. Learning the behaviors of sensors from different types of industrial systems for categorizing into normal and abnormal needs real time feed and detection. 
Datasets used in machine learning To learn from data, we must be able to understand and manage data in all forms.Data originates from many different sources, and consequently, datasets may differ widely in structure or have little or no structure at all.In this section, we present a high level classification of datasets with commonly occurring examples. Based on structure, dataset may be classified as containing the following: Structured or record data: Structured data is the most common form of dataset available for machine learning. The data is in theform of records or rows following a well-known format with features that are either columns in a table or fields delimited by separators or tokens. There is no explicit relationship between the records or instances. The dataset is available mostly in flat files or relational databases. The records of financial transactions at a bank shown in the following screenshotare an example of structured data: Financial Card Transactional Data with labels of Fraud. Transaction or market data: This is a special form of structured data whereeach corresponds to acollection of items. Examples of market dataset are the list of grocery item purchased by different customers, or movies viewed by customers as shown in the following screenshot: Market Dataset for Items bought from grocery store. Unstructured data: Unstructured data is normally not available in well-known formats such as structured data. Text data, image, and video data are different formats of unstructured data. Normally, a transformation of some form is needed to extract features from these forms of data to the aforementioned structured datasets so that traditional machine learning algorithms can be applied: Sample Text Data from SMS with labels of spam and ham from by Tiago A. Almeida from the Federal University of Sao Carlos. Sequential data: Sequential data have an explicit notion of order to them. The order can be some relationship between features and time variable in time series data, or symbols repeating in some form in genomic datasets. Two examples are weather data and genomic sequence data. The following diagram shows the relationship between time and the sensor level for weather: Time Series from Sensor Data Three genomic sequences are taken into consideration to show the repetition of the sequences CGGGT and TTGAAAGTGGTG in all the three genomic sequences: Genomic Sequences of DNA as sequence of symbols. Graph data: Graph data is characterized by the presence of relationships between entities in the data to form a graph structure. Graph datasets may be in structured record format or unstructured format. Typically, the graph relationship has to be mined from the dataset. Claims in the insurance domain can be considered structured records containingrelevant claims details withclaimants related through addresses/phonenumbers,and so on.This can be viewed in graph structure. Using theWorld Wide Web as an example, we have web pages available as unstructured datacontaininglinks,and graphs of relationships between web pages that can be built using web links, producing some of the most mined graph datasets today: Insurance Claim Data, converted into graph structure with relationship between vehicles, drivers, policies and addresses Machine learning applications Given the rapidly growing use of machine learning in diverse areas of human endeavor, any attempt to list typical applications in the different industries, where some form of machine learning is in use,must necessarily be incomplete. 
Nevertheless, in this section we list a broad set of machine learning applications by domain, uses and the type of learning used: Domain/Industry Applications Machine Learning Type Financial Credit Risk Scoring, Fraud Detection, Anti-Money Laundering Supervised, Unsupervised, Graph Models, Time Series, and Stream Learning Web Online Campaigns, Health Monitoring, Ad Targeting Supervised, Unsupervised, Semi-Supervised Healthcare Evidence-based Medicine, Epidemiological Surveillance, Drug Events Prediction, Claim Fraud Detection Supervised, Unsupervised, Graph Models, Time Series, and Stream Learning Internet of Thing (IoT) Cyber Security, Smart Roads, Sensor Health Monitoring Supervised, Unsupervised, Semi-Supervised, and Stream Learning Environment Weather forecasting, Pollution modeling, Water quality measurement Time Series, Supervised, Unsupervised, Semi-Supervised, and Stream Learning Retail Inventory, Customer Management and Recommendations, Layout and Forecasting Time Series, Supervised, Unsupervised, Semi-Supervised, and Stream Learning Summary: A revival of interest is seen in the area of artificial intelligence (AI)and machine learning, in particular, both in academic circles and industry. The use of machine learning  is to help in complex decision making at the highest levels of business. It has also achieved enormous success in improving the accuracy of everyday applications, such as search, speech recognition, and personal assistants on mobile phones. The basics of machine learning rely on understanding of data.Structured datasets have specific formats, and an unstructured dataset is normally in the form of some free flowing text. Machine learning is of two types: Supervised learning is the popular branch of machine learning, which is about learning from labeled data and Unsupervised learning is understanding the data and exploring it in order to build machine learning models when the labels are not given.  Resources for Article: Further resources on this subject: Specialized Machine Learning Topics [article] Machine learning in practice [article] Introduction to Machine Learning with R [article]
Read more
  • 0
  • 0
  • 17953
article-image-and-running-pandas
Packt
18 Jul 2017
15 min read
Save for later

Up and Running with pandas

Packt
18 Jul 2017
15 min read
In this article by Michael Heydt, author of the book Learning Pandas - Second Edition, we will cover how to install pandas and start using its basic functionality.  The content is provided as IPython and Jupyter notebooks, and hence we will also take a quick look at using both of those tools. We will utilize the Anaconda Scientific Python distribution from Continuum. Anaconda is a popular Python distribution with both free and paid components. Anaconda provides cross-platform support—including Windows, Mac, and Linux. The base distribution of Anaconda installs pandas, IPython and Jupyter notebooks, thereby making it almost trivial to get started. (For more resources related to this topic, see here.) IPython and Jupyter Notebook So far we have executed Python from the command line / terminal. This is the default Read-Eval-Print-Loop (REPL)that comes with Python. Let’s take a brief look at both. IPython IPython is an alternate shell for interactively working with Python. It provides several enhancements to the default REPL provided by Python. If you want to learn about IPython in more detail check out the documentation at https://ipython.org/ipython-doc/3/interactive/tutorial.html To start IPython, simply execute the ipython command from the command line/terminal. When started you will see something like the following: The input prompt which shows In [1]:. Each time you enter a statement in the IPythonREPL the number in the prompt will increase. Likewise, output from any particular entry you make will be prefaced with Out [x]:where x matches the number of the corresponding In [x]:.The following demonstrates: This numbering of in and out statements will be important to the examples as all examples will be prefaced with In [x]:and Out [x]: so that you can follow along. Note that these numberings are purely sequential. If you are following through the code in the text and occur errors in input or enter additional statements, the numbering may get out of sequence (they can be reset by exiting and restarting IPython).  Please use them purely as reference. Jupyter Notebook Jupyter Notebook is the evolution of IPython notebook. It is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and markdown. The original IPython notebook was constrained to Python as the only language. Jupyter Notebook has evolved to allow many programming languages to be used including Python, R, Julia, Scala, and F#. If you want to take a deeper look at Jupyter Notebook head to the following URL http://jupyter.org/: where you will be presented with a page similar to the following: Jupyter Notebookcan be downloaded and used independently of Python. Anaconda does installs it by default. To start a Jupyter Notebookissue the following command at the command line/Terminal: $ jupyter notebook To demonstrate, let's look at how to run the example code that comes with the text. Download the code from the Packt website and and unzip the file to a directory of your choosing. In the directory, you will see the following contents: Now issue the jupyter notebook command. You should see something similar to the following: A browser page will open displaying the Jupyter Notebook homepage which is http://localhost:8888/tree. This will open a web browser window showing this page, which will be a directory listing: Clicking on a .ipynb link opens a notebook page like the following: The notebook that is displayed is HTML that was generated by Jupyter and IPython.  
It consists of a number of cells that can be one of 4 types: code, markdown, raw nbconvert or heading. Jupyter runs an IPython kernel for each notebook.  Cells that contain Python code are executed within that kernel and the results added to the notebook as HTML. Double-clicking on any of the cells will make the cell editable.  When done editing the contents of a cell, press shift-return,  at which point Jupyter/IPython will evaluate the contents and display the results. If you wan to learn more about the notebook format that underlies the pages, see https://ipython.org/ipython-doc/3/notebook/nbformat.html. The toolbar at the top of a notebook gives you a number of abilities to manipulate the notebook. These include adding, removing, and moving cells up and down in the notebook. Also available are commands to run cells, rerun cells, and restart the underlying IPython kernel. To create a new notebook, go to File > New Notebook > Python 3 menu item. A new notebook page will be created in a new browser tab. Its name will be Untitled: The notebook consists of a single code cell that is ready to have Python entered.  EnterIPython1+1 in the cell and press Shift + Enter to execute. The cell has been executed and the result shown as Out [2]:.  Jupyter also opened a new cellfor you to enter more code or markdown. Jupyter Notebook automatically saves your changes every minute, but it's still a good thing to save manually every once and a while. One final point before we look at a little bit of pandas. Code in the text will be in the format of  command -line IPython. As an example, the cell we just created in our notebook will be shown as follows: In [1]: 1+1 Out [1]: 2 Introducing the pandas Series and DataFrame Let’s jump into using some pandas with a brief introduction to pandas two main data structures, the Series and the DataFrame. We will examine the following: Importing pandas into your application Creating and manipulating a pandas Series Creating and manipulating a pandas DataFrame Loading data from a file into a DataFrame The pandas Series The pandas Series is the base data structure of pandas. A Series is similar to a NumPy array, but it differs by having an index which allows for much richer lookup of items instead of just a zero-based array index value. The following creates a Series from a Python list.: In [2]: # create a four item Series s = Series([1, 2, 3, 4]) s Out [2]:0 1 1 2 2 3 3 4 dtype: int64 The output consists of two columns of information.  The first is the index and the second is the data in the Series. Each row of the output represents the index label (in the first column) and then the value associated with that label. Because this Series was created without specifying an index (something we will do next), pandas automatically creates an integer index with labels starting at 0 and increasing by one for each data item. The values of a Series object can then be accessed through the using the [] operator, paspassing the label for the value you require. The following gets the value for the label 1: In [3]:s[1] Out [3]: 2 This looks very much like normal array access in many programming languages. But as we will see, the index does not have to start at 0, nor increment by one, and can be many other data types than just an integer. This ability to associate flexible indexes in this manner is one of the great superpowers of pandas. Multiple items can be retrieved by specifying their labels in a python list. 
The following retrieves the values at labels 1 and 3: In [4]: # return a Series with the row with labels 1 and 3 s[[1, 3]] Out [4]: 1 2 3 4 dtype: int64 A Series object can be created with a user-defined index by using the index parameter and specifying the index labels. The following creates a Series with the same valuesbut with an index consisting of string values: In [5]:# create a series using an explicit index s = pd.Series([1, 2, 3, 4], index = ['a', 'b', 'c', 'd']) s Out [5}:a 1 b 2 c 3 d 4 dtype: int64 Data in the Series object can now be accessed by those alphanumeric index labels. The following retrieves the values at index labels 'a' and 'd': In [6]:# look up items the series having index 'a' and 'd' s[['a', 'd']] Out [6]:a 1 d 4 dtype: int64 It is still possible to refer to the elements of this Series object by their numerical 0-based position. : In [7]:# passing a list of integers to a Series that has # non-integer index labels will look up based upon # 0-based index like an array s[[1, 2]] Out [7]:b 2 c 3 dtype: int64 We can examine the index of a Series using the .index property: In [8]:# get only the index of the Series s.index Out [8]: Index(['a', 'b', 'c', 'd'], dtype='object') The index is itself actually a pandas objectand this output shows us the values of the index and the data type used for the index. In this case note that the type of the data in the index (referred to as the dtype) is object and not string. A common usage of a Series in pandas is to represent a time-series that associates date/time index labels withvalues. The following demonstrates by creating a date range can using the pandas function pd.date_range(): In [9]:# create a Series who's index is a series of dates # between the two specified dates (inclusive) dates = pd.date_range('2016-04-01', '2016-04-06') dates Out [9]:DatetimeIndex(['2016-04-01', '2016-04-02', '2016- 04-03', '2016-04-04', '2016-04-05', '2016-04-06'], dtype='datetime64[ns]', freq='D') This has created a special index in pandas referred to as a DatetimeIndex which is a specialized type of pandas index that is optimized to index data with dates and times. Now let's create a Series using this index. The data values represent high-temperatures on those specific days: In [10]:# create a Series with values (representing temperatures) # for each date in the index temps1 = Series([80, 82, 85, 90, 83, 87], index = dates) temps1 Out [10]:2016-04-01 80 2016-04-02 82 2016-04-03 85 2016-04-04 90 2016-04-05 83 2016-04-06 87 Freq: D, dtype: int64 This type of Series with a DateTimeIndexis referred to as a time-series. We can look up a temperature on a specific data by using the date as a string: In [11]:temps1['2016-04-04'] Out [11]:90 Two Series objects can be applied to each other with an arithmetic operation. The following code creates a second Series and calculates the difference in temperature between the two: In [12]:# create a second series of values using the same index temps2 = Series([70, 75, 69, 83, 79, 77], index = dates) # the following aligns the two by their index values # and calculates the difference at those matching labels temp_diffs = temps1 - temps2 temp_diffs Out [12]:2016-04-01 10 2016-04-02 7 2016-04-03 16 2016-04-04 7 2016-04-05 4 2016-04-06 10 Freq: D, dtype: int64 The result of an arithmetic operation (+, -, /, *, …) on two Series objects that are non-scalar values returns another Series object. 
Since the index is not integerwe can also then look up values by 0-based value: In [13]:# and also possible by integer position as if the # series was an array temp_diffs[2] Out [13]: 16 Finally, pandas provides many descriptive statistical methods. As an example, the following returns the mean of the temperature differences: In [14]: # calculate the mean of the values in the Serie temp_diffs Out [14]: 9.0 The pandas DataFrame A pandas Series can only have a single value associated with each index label. To have multiple values per index label we can use a DataFrame. A DataFrame represents one or more Series objects aligned by index label.  Each Series will be a column in the DataFrame, and each column can have an associated name.— In a way, a DataFrame is analogous to a relational database table in that it contains one or more columns of data of heterogeneous types (but a single type for all items in each respective column). The following creates a DataFrame object with two columns and uses the temperature Series objects: In [15]:# create a DataFrame from the two series objects temp1 # and temp2 and give them column names temps_df = DataFrame( {'Missoula': temps1, 'Philadelphia': temps2}) temps_df Out [15]: Missoula Philadelphia 2016-04-01 80 70 2016-04-02 82 75 2016-04-03 85 69 2016-04-04 90 83 2016-04-05 83 79 2016-04-06 87 77 The resulting DataFrame has two columns named Missoula and Philadelphia. These columns are new Series objects contained within the DataFrame with the values copied from the original Series objects. Columns in a DataFrame object can be accessed using an array indexer [] with the name of the column or a list of column names. The following code retrieves the Missoula column: In [16]:# get the column with the name Missoula temps_df['Missoula'] Out [16]:2016-04-01 80 2016-04-02 82 2016-04-03 85 2016-04-04 90 2016-04-05 83 2016-04-06 87 Freq: D, Name: Missoula, dtype: int64 And the following code retrieves the Philadelphia column: In [17]:# likewise we can get just the Philadelphia column temps_df['Philadelphia'] Out [17]:2016-04-01 70 2016-04-02 75 2016-04-03 69 2016-04-04 83 2016-04-05 79 2016-04-06 77 Freq: D, Name: Philadelphia, dtype: int64 A Python list of column names can also be used to return multiple columns.: In [18]:# return both columns in a different order temps_df[['Philadelphia', 'Missoula']] Out [18]: Philadelphia Missoula 2016-04-01 70 80 2016-04-02 75 82 2016-04-03 69 85 2016-04-04 83 90 2016-04-05 79 83 2016-04-06 77 87 There is a subtle difference in a DataFrame object as compared to a Series object. Passing a list to the [] operator of DataFrame retrieves the specified columns whereas aSerieswould return rows. If the name of a column does not have spaces it can be accessed using property-style: In [19]:# retrieve the Missoula column through property syntax temps_df.Missoula Out [19]:2016-04-01 80 2016-04-02 82 2016-04-03 85 2016-04-04 90 2016-04-05 83 2016-04-06 87 Freq: D, Name: Missoula, dtype: int64 Arithmetic operations between columns within a DataFrame are identical in operation to those on multiple Series. 
To demonstrate, the following code calculates the difference between temperatures using property notation: In [20]:# calculate the temperature difference between the two #cities temps_df.Missoula - temps_df.Philadelphia Out [20]:2016-04-01 10 2016-04-02 7 2016-04-03 16 2016-04-04 7 2016-04-05 4 2016-04-06 10 Freq: D, dtype: int64 A new column can be added to DataFrame simply by assigning another Series to a column using the array indexer [] notation. The following adds a new column in the DataFrame with the temperature differences: In [21]:# add a column to temp_df which contains the difference # in temps temps_df['Difference'] = temp_diffs temps_df Out [21]: Missoula Philadelphia Difference 2016-04-01 80 70 10 2016-04-02 82 75 7 2016-04-03 85 69 16 2016-04-04 90 83 7 2016-04-05 83 79 4 2016-04-06 87 77 10 The names of the columns in a DataFrame are accessible via the.columns property. : In [22]:# get the columns, which is also an Index object temps_df.columns Out [22]:Index(['Missoula', 'Philadelphia', 'Difference'], dtype='object') The DataFrame (and Series) objects can be sliced to retrieve specific rows. The following slices the second through fourth rows of temperature difference values: In [23]:# slice the temp differences column for the rows at # location 1 through 4 (as though it is an array) temps_df.Difference[1:4] Out [23]: 2016-04-02 7 2016-04-03 16 2016-04-04 7 Freq: D, Name: Difference, dtype: int64 Entire rows from a DataFrame can be retrieved using the .loc and .iloc properties. .loc ensures that the lookup is by index label, where .iloc uses the 0-based position. - The following retrieves the second row of the DataFrame. In[24]:# get the row at array position 1 temps_df.iloc[1] Out [24]:Missoula 82 Philadelphia 75 Difference 7 Name: 2016-04-02 00:00:00, dtype: int64 Notice that this result has converted the row into a Series with the column names of the DataFrame pivoted into the index labels of the resulting Series.The following shows the resulting index of the result. In [25]:# the names of the columns have become the index # they have been 'pivoted' temps_df.iloc[1].index Out [25]:Index(['Missoula', 'Philadelphia', 'Difference'], dtype='object') Rows can be explicitly accessed via index label using the .loc property. The following code retrieves a row by the index label: In [26]:# retrieve row by index label using .loc temps_df.loc['2016-04-05'] Out [26]:Missoula 83 Philadelphia 79 Difference 4 Name: 2016-04-05 00:00:00, dtype: int64 Specific rows in a DataFrame object can be selected using a list of integer positions. The following selects the values from the Differencecolumn in rows at integer-locations 1, 3, and 5: In [27]:# get the values in the Differences column in rows 1, 3 # and 5using 0-based location temps_df.iloc[[1, 3, 5]].Difference Out [27]:2016-04-02 7 2016-04-04 7 2016-04-06 10 Freq: 2D, Name: Difference, dtype: int64 Rows of a DataFrame can be selected based upon a logical expression that is applied to the data in each row. The following shows values in the Missoulacolumn that are greater than 82 degrees: In [28]:# which values in the Missoula column are > 82? 
temps_df.Missoula > 82 Out [28]:2016-04-01 False 2016-04-02 False 2016-04-03 True 2016-04-04 True 2016-04-05 True 2016-04-06 True Freq: D, Name: Missoula, dtype: bool The results from an expression can then be applied to the[] operator of a DataFrame (and a Series) which results in only the rows where the expression evaluated to Truebeing returned: In [29]:# return the rows where the temps for Missoula > 82 temps_df[temps_df.Missoula > 82] Out [29]: Missoula Philadelphia Difference 2016-04-03 85 69 16 2016-04-04 90 83 7 2016-04-05 83 79 4 2016-04-06 87 77 10 This technique is referred to as boolean election in pandas terminologyand will form the basis of selecting rows based upon values in specific columns (like a query in —SQL using a WHERE clause - but as we will see also being much more powerful). Visualization We will dive into visualization in quite some depth in Chapter 14, Visualization, but prior to then we will occasionally perform a quick visualization of data in pandas. Creating a visualization of data is quite simple with pandas. All that needs to be done is to call the .plot() method. The following demonstrates by plotting the Close value of the stock data. In [40]: df[['Close']].plot(); Summary In this article, took an introductory look at the pandas Series and DataFrame objects, demonstrating some of the fundamental capabilities. This exposition showed you how to perform a few basic operations that you can use to get up and running with pandas prior to diving in and learning all the details. Resources for Article: Further resources on this subject: Using indexes to manipulate pandas objects [article] Predicting Sports Winners with Decision Trees and pandas [article] The pandas Data Structures [article]
Read more
  • 0
  • 0
  • 3151

article-image-massive-graphs-big-data
Packt
11 Jul 2017
19 min read
Save for later

Massive Graphs on Big Data

Packt
11 Jul 2017
19 min read
In this article by Rajat Mehta, author of the book Big Data Analytics with Java, we will learn about graphs. Graphs theory is one of the most important and interesting concepts of computer science. Graphs have been implemented in real life in a lot of use cases. If you use a GPS on your phone or a GPS device and it shows you a driving direction to a place, behind the scene there is an efficient graph that is working for you to give you the best possible direction. In a social network you are connected to your friends and your friends are connected to other friends and so on. This is a massive graph running in production in all the social networks that you use. You can send messages to your friends or follow them or get followed all in this graph. Social networks or a database storing driving directions all involve massive amounts of data and this is not data that can be stored on a single machine, instead this is distributed across a cluster of thousands of nodes or machines. This massive data is nothing but big data and in this article we will learn how data can be represented in the form of a graph so that we make analysis or deductions on top of these massive graphs. In this article, we will cover: A small refresher on the graphs and its basic concepts A small introduction on graph analytics, its advantages and how Apache Spark fits in Introduction on the GraphFrames library that is used on top of Apache Spark Before we dive deeply into each individual section, let's look at the basic graph concepts in brief. (For more resources related to this topic, see here.) Refresher on graphs In this section, we will cover some of the basic concepts of graphs, and this is supposed to be a refresher section on graphs. This is a basic section, hence if some users already know this information they can skip this section. Graphs are used in many important concepts in our day-to-day lives. Before we dive into the ways of representing a graph, let's look at some of the popular use cases of graphs (though this is not a complete list) Graphs are used heavily in social networks In finding driving directions via GPS In many recommendation engines In fraud detection in many financial companies In search engines and in network traffic flows In biological analysis As you must have noted earlier, graphs are used in many applications that we might be using on a daily basis. Graphs is a form of a data structure in computer science that helps in depicting entities and the connection between them. So, if there are two entities such as Airport A and Airport B and they are connected by a flight that takes, for example, say a few hours then Airport A and Airport B are the two entities and the flight connecting between them that takes those specific hours depict the weightage between them or their connection. In formal terms, these entities are called as vertexes and the relationship between them are called as edges. So, in mathematical terms, graph G = {V, E},that is, graph is a function of vertexes and edges. Let's look at the following diagram for the simple example of a graph: As you can see, the preceding graph is a set of six vertexes and eight edges,as shown next: Vertexes = {A, B, C, D, E, F} Edges = { AB, AF, BC, CD, CF, DF, DE, EF} These vertexes can represent any entities, for example, they can be places with the edges being 'distances' between the places or they could be people in a social network with the edges being the type of relationship, for example, friends or followers. 
Thus, graphs can represent real-world entities like this. The preceding graph is also called a bidirected graph because in this graph the edges go in either direction that is the edge from A to B can be traversed both ways from A to B as well as from B to A. Thus, the edges in the preceding diagrams that is AB can be BA or AF can be FA too. There are other types of graphs called as directed graphs and in these graphs the direction of the edges go in one way only and does not retrace back. A simple example of a directed graph is shown as follows:. As seen in the preceding graph, the edge A to B goes only in one direction as well as the edge B to C. Hence, this is a directed graph. A simple linked list data structure or a tree datastructure are also forms of graph only. In a tree, nodes can have children only and there are no loops, while there is no such rule in a general graph. Representing graphs Visualizing a graph makes it easily comprehensible but depicting it using a program requires two different approaches Adjacency matrix: Representing a graph as a matrix is easy and it has its own advantages and disadvantages. Let's look at the bidirected graph that we showed in the preceding diagram. If you would represent this graph as a matrix, it would like this: The preceding diagram is a simple representation of our graph in matrix form. The concept of matrix representation of graph is simple—if there is an edge to a node we mark the value as 1, else, if the edge is not present, we mark it as 0. As this is a bi-directed graph, it has edges flowing in one direction only. Thus, from the matrix, the rows and columns depict the vertices. There if you look at the vertex A, it has an edge to vertex B and the corresponding matrix value is 1. As you can see, it takes just one step or O[1]to figure out an edge between two nodes. We just need the index (rows and columns) in the matrix and we can extract that value from it. Also, if you would have looked at the matrix closely, you would have seen that most of the entries are zero, hence this is a sparse matrix. Thus, this approach eats a lot of space in computer memory in marking even those elements that do not have an edge to each other, and this is its main disadvantage. Adjacency list: Adjacency list solves the problem of space wastage of adjacency matrix. To solve this problem, it stores the node and its neighbors in a list (linked list) as shown in the following diagram: For maintaining brevity, we have not shown all the vertices but you can make out from the diagram that each vertex is storing its neighbors in a linked list. So when you want to figure out the neighbors of a particular vertex, you can directly iterate over the list. Of course this has the disadvantage of iterating when you have to figure out whether an edge exists between two nodes or not. This approach is also widely used in many algorithms in computer science. We have briefly seen how graphs can be represented, let's now see some important terms that are used heavily on graphs. Common terminology on graphs We will now introduce you to some common terms and concepts in graphs that you can use in your analytics on top of graphs: Vertices:As we mentioned earlier, vertices are the mathematical terms for the nodes in a graph. For analytic purposes, thevertices count shows the number of nodes in the system, for example, the number of people involved in a social graph. Edges: As we mentioned earlier, edges are the connection between vertices and edges can carry weights. 
Let's now look at some important terms that are used heavily when working with graphs.

Common terminology on graphs

We will now introduce some common terms and concepts that you can use in your analytics on top of graphs:

Vertices: As we mentioned earlier, vertices are the mathematical term for the nodes in a graph. For analytic purposes, the vertex count shows the number of nodes in the system, for example, the number of people involved in a social graph.

Edges: As we mentioned earlier, edges are the connections between vertices, and edges can carry weights. The number of edges represents the number of relationships in the graph. The weight of an edge represents the intensity of the relationship between the nodes involved; for example, in a social network, the relationship of being friends is stronger than that of merely being followed.

Degrees: The degree is the total number of connections flowing into as well as out of a node. For example, in the previous diagram the degree of node F is four. The degree count is useful; for example, in a social network graph a very high degree count can indicate how well connected a person is.

Indegrees: This is the number of connections flowing into a node. For example, in the previous diagram, node F has an indegree of three (treating its edges as directed). In a social network graph, this might represent how many people can send messages to this person.

Outdegrees: This is the number of connections flowing out of a node. For example, in the previous diagram, node F has an outdegree of one (again treating its edges as directed). In a social network graph, this might represent how many people this person can send messages to.

Common algorithms on graphs

Let's look at four common algorithms that are frequently run on graphs and some of their uses:

Breadth first search: Breadth first search is an algorithm for graph traversal or searching. As the name suggests, the traversal proceeds across the breadth of the graph, which is to say the neighbours of the node from which the traversal starts are searched first, before exploring further in the same manner. We will refer to the same graph we used earlier: if we start at vertex A, then according to breadth first search we next visit the neighbours of A, that is B and F. After that we visit the neighbour of B, which is C, and then the remaining neighbours of F, which are D and E. We go through each node only once, which mimics real-life travel as well: to get from one point to another we seldom cover the same road or path twice. Thus, our breadth first traversal starting from A will be {A, B, F, C, D, E}. Breadth first search is very useful in graph analytics; it can tell us things such as the people who are not your immediate friends but sit just one level beyond them in a social network, or, in a graph of a flight network, the flights with just one or two stops to the destination. (A small traversal sketch in plain Java follows this list.)

Depth first search: This is another way of searching, where we start from the source vertex and keep searching until we reach the end node or a leaf node, and then backtrack. For a simple connectivity question this can be less efficient than breadth first search, because if you want to know whether a node A is connected to a node B, you might end up wandering through a lot of nodes that have nothing to do with A and B before arriving at the answer.

Dijkstra's shortest path: This is a greedy algorithm for finding the shortest path in a graph. In a weighted graph, if you need to find the shortest path between two nodes, you start from the source node and repeatedly pick as the next node the unvisited one with the least accumulated weight (the weights being, for example, the distances between nodes in a graph of interconnected cities and roads). So, in a road network, you can find the shortest path between two cities using this algorithm.

PageRank algorithm: This is a very popular algorithm that came out of Google, and it is essentially used to find the importance of a web page by figuring out how connected it is to other important pages. It gives a PageRank score to each website based on this approach, and the search results are ultimately ranked using this score. The best part about this algorithm is that it can be applied to other areas too, for example, figuring out the important airports in a flight graph, or the most influential people in a social network.
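The following is a minimal breadth first search sketch of ours (not the book's code; the class and variable names are our own) that runs over the adjacency list of the example graph and reproduces the traversal order discussed above:

import java.util.*;

// Breadth first search over the example graph; starting at A it visits
// the vertices in the order A, B, F, C, D, E.
public class BreadthFirstSearch {

    public static void main(String[] args) {
        // Adjacency list of the bidirected example graph.
        Map<String, List<String>> adjacency = new LinkedHashMap<>();
        adjacency.put("A", Arrays.asList("B", "F"));
        adjacency.put("B", Arrays.asList("A", "C"));
        adjacency.put("C", Arrays.asList("B", "D", "F"));
        adjacency.put("D", Arrays.asList("C", "E", "F"));
        adjacency.put("E", Arrays.asList("D", "F"));
        adjacency.put("F", Arrays.asList("A", "C", "D", "E"));

        // Standard BFS: a queue of nodes still to process and a set of nodes
        // already discovered, so that each node is visited only once.
        Deque<String> queue = new ArrayDeque<>();
        Set<String> visited = new LinkedHashSet<>();
        queue.add("A");
        visited.add("A");

        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String neighbour : adjacency.get(current)) {
                if (visited.add(neighbour)) { // true only the first time we see this node
                    queue.add(neighbour);
                }
            }
        }
        System.out.println("BFS order from A: " + visited); // [A, B, F, C, D, E]
    }
}

Swapping the queue for a stack (or for recursion) turns this into the depth first search described above, and keeping a map of tentative distances per vertex is the starting point for Dijkstra-style shortest-path bookkeeping.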
So much for the basics and our refresher on graphs. In the following sections we will see how graphs can be used in the real world on massive datasets, such as social network data or data from the field of biology, and we will also study how graph analytics can be used on top of these graphs to derive useful deductions.

Plotting graphs

There is a handy open source Java library called GraphStream that can be used to plot graphs, which is very useful if you want to view the structure of your graphs. While viewing a plot, you can also spot whether some vertices are very close to each other (clustered) or, more generally, how they are placed. Using the GraphStream library is easy: just download the jar from http://graphstream-project.org and put it on the classpath of your project. Next, we show a simple example demonstrating how easy it is to plot a graph using this library.

First, create an instance of a graph. For our example we will create a simple DefaultGraph and name it SimpleGraph. Next, we add the nodes, or vertices, of the graph, and on each one we also set the label attribute that is displayed on the vertex (the imports shown assume the GraphStream 1.x package layout):

import org.graphstream.graph.Graph;
import org.graphstream.graph.implementations.DefaultGraph;

Graph graph = new DefaultGraph("SimpleGraph");
graph.addNode("A").setAttribute("ui.label", "A");
graph.addNode("B").setAttribute("ui.label", "B");
graph.addNode("C").setAttribute("ui.label", "C");

After building the nodes, it is time to connect them using edges. The API is simple to use: on the graph instance we can define the edges, provided an ID is given to each edge along with its starting and ending nodes.

graph.addEdge("AB", "A", "B");
graph.addEdge("BC", "B", "C");
graph.addEdge("CA", "C", "A");

All the information about the nodes and edges is now present on the graph instance. To plot the graph on the UI, we simply invoke the display method on the graph instance, as shown next.

graph.display();

This plots the graph on the UI as follows:

This library is extensive, and exploring it further is a good learning experience, so we urge readers to explore it on their own.

Massive graphs on big data

Big data comprises huge amounts of data distributed across a cluster of thousands (if not more) of machines. Building graphs on top of this massive data poses different challenges, as follows:

Due to the vast amount of data involved, the data for the graph is distributed across a cluster of machines. Hence it is not, in actuality, a single-node graph; we have to build a graph that spans a cluster of machines.

A graph that spans a cluster of machines has its vertices and edges spread across different machines, and this graph data will not fit into the memory of one single machine. Consider your friends list on Facebook: some of your friends' data in that friends-list graph might lie on different machines, and this data might simply be tremendous in size.
Look at the following example diagram of a graph of 10 Facebook friends and their network:

As you can see in the preceding diagram, even for just 10 friends the data can be huge; since the graph is drawn by hand, we have not shown many of the connections, in order to keep the image comprehensible, yet in real life each person can easily have thousands of connections or more. Imagine, then, what happens to a graph with thousands, if not more, people on the list.

For the reasons we just saw, building massive graphs on big data is a different ball game altogether, and there are a few main approaches for building such graphs. From a big data perspective, building a massive graph involves running computations and storing data in parallel on many nodes. The two main approaches are bulk synchronous parallel (BSP) and the Pregel approach; Apache Spark follows the Pregel approach. Covering these approaches in detail is out of the scope of this book, and readers who are interested in these topics should refer to other books and to Wikipedia.

Graph analytics

The biggest advantage of using graphs is that you can analyze them and use them to make sense of complex datasets. You might ask what is so special about graph analytics that we cannot do with relational databases. Let's try to understand this using an example. Suppose we want to analyze your friends network on Facebook and pull information about your friends, such as their names, their birth dates, their recent likes, and so on. If Facebook used a relational database, this would mean firing a query on some table using the foreign key of the user requesting this information. From a relational database perspective, this first-level query is easy. But what if we now ask you to go to the friends at level four in your network and fetch their data (as shown in the following diagram)? The query to get this becomes more and more complicated from a relational database perspective, yet it is a trivial task on a graph or in a graph database (such as Neo4j). Graphs are extremely good at operations where you want to pull information out of a node that lies many joins, or hops, away. As such, graph analytics is good for certain use cases (but not for all of them; relational databases are still better for many other use cases).

As you can see, the preceding diagram depicts a huge social network (even though it might only be showing a network of a few friends). The dots represent actual people in a social network. So, if somebody asks you to pick one user on the left-most side of the diagram, follow his connections across to the right-most side, and pull the friends at, say, the tenth level or beyond, this is very difficult to do in a normal relational database, and doing it, and maintaining it, could easily get out of hand. As a quick illustration of why such multi-hop queries are natural on a graph, a small level-limited traversal sketch is shown next.
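The following is a small sketch of ours (not from the book; the toy friendship data and all names are made up purely for illustration) showing how a "friends at level k" query reduces to a level-by-level expansion over an adjacency list:

import java.util.*;

// Collects the "friends at level k" of a start node by expanding the
// frontier of an adjacency list one level at a time.
public class FriendsAtLevel {

    static Set<String> friendsAtLevel(Map<String, List<String>> friends,
                                      String start, int level) {
        Set<String> seen = new HashSet<>(Collections.singleton(start));
        Set<String> frontier = new HashSet<>(Collections.singleton(start));
        for (int i = 0; i < level; i++) {
            Set<String> next = new HashSet<>();
            for (String person : frontier) {
                for (String friend : friends.getOrDefault(person, Collections.emptyList())) {
                    if (seen.add(friend)) { // keep only people not reached at an earlier level
                        next.add(friend);
                    }
                }
            }
            frontier = next; // after iteration i, the frontier holds the people at level i + 1
        }
        return frontier;
    }

    public static void main(String[] args) {
        // Made-up friendship lists, purely for illustration.
        Map<String, List<String>> friends = new HashMap<>();
        friends.put("you", Arrays.asList("alice", "bob"));
        friends.put("alice", Arrays.asList("you", "carol"));
        friends.put("bob", Arrays.asList("you", "dave"));
        friends.put("carol", Arrays.asList("alice", "erin"));
        friends.put("dave", Arrays.asList("bob"));
        friends.put("erin", Arrays.asList("carol"));

        // Level 1 is [alice, bob], level 2 is [carol, dave], level 3 is [erin].
        System.out.println(friendsAtLevel(friends, "you", 3));
    }
}

The same query expressed over relational tables would need one self-join per level, which is exactly the kind of query that becomes hard to write and maintain as the number of hops grows.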
There are four particular use cases where graph analytics is extremely useful and frequently applied (though there are plenty more use cases too):

Path analytics: As the name suggests, this analytical approach is used to figure out the paths you get as you traverse along the nodes of a graph. There are many fields where this can be used, the simplest being road networks, for figuring out details such as the shortest path between cities, or flight analytics, for figuring out the quickest flight or the direct flights to a destination.

Connectivity analytics: As the name suggests, this approach describes how the nodes within a graph are connected to each other. Using it, you can figure out how many edges flow into a node and how many flow out of it. This kind of information is very useful in analysis; for example, in a social network, a person who receives just one message but sends out, say, ten messages within his network could be used to market his favourite products, as he is very active in passing messages along.

Community analytics: Some graphs on big data are huge, but within these huge graphs there might be nodes that are very close to each other and almost stacked in a cluster of their own. This is useful information, as based on it you can extract communities from your data. For example, in a social network, people who are part of some community, say marathon runners, can be clubbed into a single community and tracked further.

Centrality analytics: This kind of analytical approach is useful for finding the central nodes in a network or graph, that is, sources that are single-handedly connected to many other sources. It is helpful in figuring out the influential people in a social network, or a central computer in a computer network.

From the perspective of this article, we will be covering some of these use cases in our sample case studies, and for this we will be using a library on Apache Spark called GraphFrames.

GraphFrames

The GraphX library is advanced and performs well on massive graphs but, unfortunately, it is currently only implemented in Scala and does not have any direct Java API. GraphFrames is a relatively new library that is built on top of Apache Spark and provides support for DataFrame (now Dataset) based graphs. It contains a lot of methods that are direct wrappers over the underlying GraphX methods. As such, it provides similar functionality to GraphX, except that GraphX acts on Spark RDDs while GraphFrames works on DataFrames, so GraphFrames is more user friendly (as DataFrames are simpler to use). All the advantages of firing Spark SQL queries, joining datasets, and filtering queries are supported as well. To understand GraphFrames and the representation of massive big data graphs, we will take small baby steps first, building some simple programs with GraphFrames before moving on to full-fledged case studies; a minimal sketch of what such a first program might look like follows.
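As a flavour of those baby steps, the following is a minimal, hedged sketch of ours (not code from the book) that builds the small A-to-F example graph as a GraphFrame from Java. It assumes that the graphframes package is available on the classpath next to Spark, and that the Scala factory method GraphFrame.apply is reachable from Java code, which is how the graph is constructed below; the column names id, src, and dst follow the GraphFrames convention for vertex and edge DataFrames, while the class name and the local master setting are our own choices:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.graphframes.GraphFrame;

// Builds the A..F example graph as a GraphFrame and asks for vertex degrees.
public class GraphFrameSketch {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("GraphFrameSketch")
                .getOrCreate();

        // Vertices need an "id" column; edges need "src" and "dst" columns.
        StructType vertexSchema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("id", DataTypes.StringType, false)));
        StructType edgeSchema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("src", DataTypes.StringType, false),
                DataTypes.createStructField("dst", DataTypes.StringType, false)));

        List<Row> vertexRows = Arrays.asList(
                RowFactory.create("A"), RowFactory.create("B"), RowFactory.create("C"),
                RowFactory.create("D"), RowFactory.create("E"), RowFactory.create("F"));
        List<Row> edgeRows = Arrays.asList(
                RowFactory.create("A", "B"), RowFactory.create("A", "F"),
                RowFactory.create("B", "C"), RowFactory.create("C", "D"),
                RowFactory.create("C", "F"), RowFactory.create("D", "F"),
                RowFactory.create("D", "E"), RowFactory.create("E", "F"));

        Dataset<Row> vertices = spark.createDataFrame(vertexRows, vertexSchema);
        Dataset<Row> edges = spark.createDataFrame(edgeRows, edgeSchema);

        // Assumption: the Scala factory GraphFrame.apply can be called from Java.
        GraphFrame graph = GraphFrame.apply(vertices, edges);

        graph.vertices().show(); // the vertex DataFrame
        graph.edges().show();    // the edge DataFrame
        graph.degrees().show();  // the degree of each vertex, computed by GraphFrames

        spark.stop();
    }
}

Because the vertices and edges are ordinary DataFrames, the usual Spark SQL operations (filters, joins, aggregations) can be applied to them directly before or after the graph is built.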
Summary

In this article, we learned about graphs and graph analytics. We saw how graphs can be built even on top of massive big datasets, how Apache Spark can be used to build these massive graphs, and, in the process, we learned about the new GraphFrames library that helps us build them.