
How-To Tutorials - Data


Machine Learning Algorithms: Implementing Naive Bayes with Spark MLlib

Wilson D'souza
07 Nov 2017
7 min read
[box type="note" align="" class="" width=""]In this article by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei from their book Apache Spark 2.x Machine Learning Cookbook, we look at how to implement Naïve Bayes classification algorithm with Spark 2.0 MLlib. The associated code and exercise are available at the end of the article.[/box] How to implement Naive Bayes with Spark MLlib Naïve Bayes is one of the most widely used classification algorithms which can be trained and optimized quite efficiently. Spark’s machine learning library, MLlib, primarily focuses on simplifying machine learning and has great support for multinomial naïve Bayes and Bernoulli naïve Bayes. Here we use the famous Iris dataset and use Apache Spark API NaiveBayes() to classify/predict which of the three classes of flower a given set of observations belongs to. This is an example of a multi-class classifier and requires multi-class metrics for measurements of fit. Let’s have a look at the steps to achieve this: For the Naive Bayes exercise, we use a famous dataset called iris.data, which can be obtained from UCI. The dataset was originally introduced in the 1930s by R. Fisher. The set is a multivariate dataset with flower attribute measurements classified into three groups. In short, by measuring four columns, we attempt to classify a species into one of the three classes of Iris flower (that is, Iris Setosa, Iris Versicolour, Iris Virginica).We can download the data from here: https://archive.ics.uci.edu/ml/datasets/Iris/  The column definition is as follows: Sepal length in cm Sepal width in cm Petal length in cm Petal width in cm  Class: -- Iris Setosa => Replace it with 0 -- Iris Versicolour => Replace it with 1 -- Iris Virginica => Replace it with 2 The steps/actions we need to perform on the data are as follows: Download and then replace column five (that is, the label or classification classes) with a numerical value, thus producing the iris.data.prepared data file. The Naïve Bayes call requires numerical labels and not text, which is very common with most tools. Remove the extra lines at the end of the file. Remove duplicates within the program by using the distinct() call. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included. Set up the package location where the program will reside: package spark.ml.cookbook.chapter6 Import the necessary packages for SparkSession to gain access to the cluster and Log4j.Logger to reduce the amount of output produced by Spark:  import org.apache.spark.mllib.linalg.{Vector, Vectors} import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel} import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics, MultilabelMetrics, binary} import org.apache.spark.sql.{SQLContext, SparkSession} import org.apache.log4j.Logger import org.apache.log4j.Level Initialize a SparkSession specifying configurations with the builder pattern, thus making an entry point available for the Spark cluster: val spark = SparkSession .builder .master("local[4]") .appName("myNaiveBayes08") .config("spark.sql.warehouse.dir", ".") .getOrCreate() val data = sc.textFile("../data/sparkml2/chapter6/iris.data.prepared.txt") Parse the data using map() and then build a LabeledPoint data structure. In this case, the last column is the Label and the first four columns are the features. 
Again, we replace the text in the last column (that is, the class of Iris) with numeric values (that is, 0, 1, 2) accordingly:

    val NaiveBayesDataSet = data.map { line =>
      val columns = line.split(',')
      LabeledPoint(columns(4).toDouble,
        Vectors.dense(columns(0).toDouble, columns(1).toDouble, columns(2).toDouble, columns(3).toDouble))
    }

Then make sure that the file does not contain any redundant rows. In this case, it has three redundant rows. We will use the distinct dataset going forward:

    println(" Total number of data vectors =", NaiveBayesDataSet.count())
    val distinctNaiveBayesData = NaiveBayesDataSet.distinct()
    println("Distinct number of data vectors = ", distinctNaiveBayesData.count())

Output:

    (Total number of data vectors =,150)
    (Distinct number of data vectors = ,147)

We inspect the data by examining the output:

    distinctNaiveBayesData.collect().take(10).foreach(println(_))

Output:

    (2.0,[6.3,2.9,5.6,1.8])
    (2.0,[7.6,3.0,6.6,2.1])
    (1.0,[4.9,2.4,3.3,1.0])
    (0.0,[5.1,3.7,1.5,0.4])
    (0.0,[5.5,3.5,1.3,0.2])
    (0.0,[4.8,3.1,1.6,0.2])
    (0.0,[5.0,3.6,1.4,0.2])
    (2.0,[7.2,3.6,6.1,2.5])
    ...

Split the data into training and test sets using a 30% and 70% ratio. The 13L in this case is simply a seeding number (L stands for the long data type) to make sure the result does not change from run to run when using the randomSplit() method:

    val allDistinctData = distinctNaiveBayesData.randomSplit(Array(.30,.70),13L)
    val trainingDataSet = allDistinctData(0)
    val testingDataSet = allDistinctData(1)

Print the count for each set:

    println("number of training data =",trainingDataSet.count())
    println("number of test data =",testingDataSet.count())

Output:

    (number of training data =,44)
    (number of test data =,103)

Build the model using train() and the training dataset:

    val myNaiveBayesModel = NaiveBayes.train(trainingDataSet)

Use the testing dataset plus the map() and predict() methods to classify the flowers based on their features:

    val predictedClassification = testingDataSet.map(x =>
      (myNaiveBayesModel.predict(x.features), x.label))

Examine the predictions via the output:

    predictedClassification.collect().foreach(println(_))

    (2.0,2.0)
    (1.0,1.0)
    (0.0,0.0)
    (0.0,0.0)
    (0.0,0.0)
    (2.0,2.0)
    ...

Use MulticlassMetrics() to create metrics for the multi-class classifier. As a reminder, this is different from the previous recipe, in which we used BinaryClassificationMetrics():

    val metrics = new MulticlassMetrics(predictedClassification)

Use the commonly used confusion matrix to evaluate the model:

    val confusionMatrix = metrics.confusionMatrix
    println("Confusion Matrix= \n", confusionMatrix)

Output:

    (Confusion Matrix= ,
    35.0  0.0   0.0
    0.0   34.0  0.0
    0.0   14.0  20.0 )

We examine other properties to evaluate the model:

    val myModelStat = Seq(metrics.precision, metrics.fMeasure, metrics.recall)
    myModelStat.foreach(println(_))

Output:

    0.8640776699029126
    0.8640776699029126
    0.8640776699029126

How it works...

We used the Iris dataset for this recipe, but we prepared the data ahead of time and then selected the distinct rows by using the NaiveBayesDataSet.distinct() API. We then proceeded to train the model using the NaiveBayes.train() API. In the last step, we predicted using .predict() and then evaluated the model performance via MulticlassMetrics() by outputting the confusion matrix, precision, and F-measure. The idea here was to classify the observations, based on a selected feature set (that is, feature engineering), into classes that correspond to the left-hand label.
The difference here is that we apply joint probability, given conditional probability, to the classification. This concept is known as Bayes' theorem, which was originally proposed by Thomas Bayes in the 18th century. There is a strong assumption of independence that must hold true for the underlying features in order for a Bayes classifier to work properly.

At a high level, the way we achieved this method of classification was simply to apply Bayes' rule to our dataset. As a refresher from basic statistics, Bayes' rule can be written as follows:

    P(A | B) = P(B | A) * P(A) / P(B)

The formula states that the probability of A given that B is true is equal to the probability of B given that A is true, times the probability of A being true, divided by the probability of B being true. It is a mouthful, but if we step back and think about it, it makes sense.

The Bayes classifier is a simple yet powerful classifier that allows the user to take the entire probability feature space into consideration. To appreciate its simplicity, one must remember that probability and frequency are two sides of the same coin. The Bayes classifier belongs to the class of incremental learners, which update themselves upon encountering a new sample. This allows the model to update itself on the fly as new observations arrive, rather than operating only in batch mode.

We evaluated the model with different metrics. Since this is a multi-class classifier, we have to use MulticlassMetrics() to examine model accuracy.

[box type="download" align="" class="" width=""]Download the exercise and code files here. Exercise Files_Implementing Naive Bayes algorithm with Spark MLlib[/box]

For more information on MulticlassMetrics, please see the following link:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.evaluation.MulticlassMetrics

Documentation for the constructor can be found here:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.classification.NaiveBayes

If you enjoyed this article, you should have a look at Apache Spark 2.x Machine Learning Cookbook, which contains this excerpt.
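To make the Bayes scoring idea above a little more concrete, here is a minimal, self-contained Scala sketch. It is not taken from the book: the class priors and per-feature likelihoods are invented numbers purely for illustration, and the object and value names are hypothetical. It simply shows how a naive Bayes score, prior times the product of conditional feature likelihoods, picks the most probable class; dividing by P(B) would not change the winner.

    // Hypothetical illustration of naive Bayes scoring (invented numbers, not from the book).
    object NaiveBayesRuleSketch extends App {
      // Assumed class priors P(class).
      val priors = Map("setosa" -> 0.33, "versicolour" -> 0.33, "virginica" -> 0.34)

      // Assumed likelihoods P(feature value | class) for two discretized features.
      val likelihoods = Map(
        "setosa"      -> Seq(0.70, 0.60),
        "versicolour" -> Seq(0.20, 0.30),
        "virginica"   -> Seq(0.10, 0.10))

      // Unnormalized posterior score per class: P(class) * product of feature likelihoods.
      val scores = priors.map { case (cls, prior) => cls -> prior * likelihoods(cls).product }

      // The predicted class is simply the one with the highest score.
      val predicted = scores.maxBy(_._2)._1
      println(s"scores = $scores, predicted class = $predicted")
    }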

Pattern mining using Spark MLlib - Part 2

Aarthi Kumaraswamy
06 Nov 2017
15 min read
[box type="note" align="" class="" width=""]The following is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. [/box] In part 1 of the tutorial, we motivated and introduced three pattern mining problems along with the necessary notation to properly talk about them. In part 2, we will now discuss how each of these problems can be solved with an algorithm available in Spark MLlib. As is often the case, actually applying the algorithms themselves is fairly simple due to Spark MLlib's convenient run method available for most algorithms. What is more challenging is to understand the algorithms and the intricacies that come with them. To this end, we will explain the three pattern mining algorithms one by one, and study how they are implemented and how to use them on toy examples. Only after having done all this will we apply these algorithms to a real-life data set of click events retrieved from http:/ / MSNBC. com. The documentation for the pattern mining algorithms in Spark can be found at https:/ / spark. apache. org/ docs/ 2. 1. 0/ mllib- frequent- pattern- mining. html. It provides a good entry point with examples for users who want to dive right in. Frequent pattern mining with FP-growth When we introduced the frequent pattern mining problem, we also quickly discussed a strategy to address it based on the apriori principle. The approach was based on scanning the whole transaction database again and again to expensively generate pattern candidates of growing length and checking their support. We indicated that this strategy may not be feasible for very large data. The so-called FP-growth algorithm, where FP stands for frequent pattern, provides an interesting solution to this data mining problem. The algorithm was originally described in Mining Frequent Patterns without Candidate Generation, available at https:/ /www. cs. sfu. ca/~jpei/ publications/sigmod00. pdf. We will start by explaining the basics of this algorithm and then move on to discussing its distributed version, parallel FP-growth, which has been introduced in PFP: Parallel FP-Growth for Query Recommendation, found at https:/ /static.googleusercontent.com/ media/research. google. com/en/ /pubs/ archive/ 34668. pdf. While Spark's implementation is based on the latter paper, it is best to first understand the baseline algorithm and extend from there. The core idea of FP-growth is to scan the transaction database D of interest precisely once in the beginning, find all the frequent patterns of length 1, and build a special tree structure called FP-tree from these patterns. Once this step is done, instead of working with D, we only do recursive computations on the usually much smaller FP-tree. This step is called the FP-growth step of the algorithm, since it recursively constructs trees from the subtrees of the original tree to identify patterns. We will call this procedure fragment pattern growth, which does not require us to generate candidates but is rather built on a divide-and-conquer strategy that heavily reduces the workload in each recursion step. To be more precise, let's first define what an FP-tree is and what it looks like in an example. Recall the example database we used in the last section, shown in Table 1. Our item set consisted of the following 15 grocery items, represented by their first letter: b, c, a, e, d, f, p, m, i, l, o, h, j, k, s. 
We also discussed the frequent items, that is, the patterns of length 1: for a minimum support threshold of t = 0.6, they were given by {f, c, b, a, m, p}. In FP-growth, we first use the fact that the ordering of items does not matter for the frequent pattern mining problem; that is, we can choose the order in which to present the frequent items. We do so by ordering them by decreasing frequency. To summarize the situation, let's have a look at the following table:

Transaction ID | Transaction | Ordered frequent items
1 | a, c, d, f, g, i, m, p | f, c, a, m, p
2 | a, b, c, f, l, m, o | f, c, a, b, m
3 | b, f, h, j, o | f, b
4 | b, c, k, s, p | c, b, p
5 | a, c, e, f, l, m, n, p | f, c, a, m, p

Table 3: Continuation of the example started with Table 1, augmenting the table by ordered frequent items.

As we can see, ordering the frequent items like this already helps us identify some structure. For instance, we see that the item set {f, c, a, m, p} occurs twice and is slightly altered once as {f, c, a, b, m}. The key idea of FP-growth is to use this representation to build a tree from the ordered frequent items that reflects the structure and interdependencies of the items in the third column of Table 3. Every FP-tree has a so-called root node that is used as a base for connecting the ordered frequent items as constructed. On the right of the following diagram, we see what is meant by this:

Figure 1: FP-tree and header table for our frequent pattern mining running example.

The left-hand side of Figure 1 shows a header table that we will explain and formalize in just a bit, while the right-hand side shows the actual FP-tree. For each of the ordered frequent item sets in our example, there is a directed path starting from the root, thereby representing it. Each node of the tree keeps track of not only the frequent item itself but also of the number of paths traversed through this node. For instance, four of the five ordered frequent item sets start with the letter f and one with c. Thus, in the FP-tree, we see f: 4 and c: 1 at the top level. Another interpretation of this fact is that f is a prefix for four item sets and c for one. For another example of this sort of reasoning, let's turn our attention to the lower left of the tree, that is, to the leaf node p: 2. Two occurrences of p tell us that precisely two identical paths end here, which we already know: {f, c, a, m, p} is represented twice. This observation is interesting, as it already hints at a technique used in FP-growth: starting at the leaf nodes of the tree, or the suffixes of the item sets, we can trace back each frequent item set, and the union of all these distinct root node paths yields all the paths, an important idea for parallelization.

The header table you see on the left of Figure 1 is a smart way of storing items. Note that by the construction of the tree, a node is not the same as a frequent item; rather, items can and usually do occur multiple times, namely once for each distinct path they are part of. To keep track of items and how they relate, the header table is essentially a linked list of items, that is, each item occurrence is linked to the next by means of this table. We indicated the links for each frequent item by horizontal dashed lines in Figure 1 for illustration purposes.

With this example in mind, let's now give a formal definition of an FP-tree. An FP-tree T is a tree that consists of a root node together with frequent item prefix subtrees starting at the root and a frequent item header table.
Each node of the tree consists of a triple, namely the item name, its occurrence count, and a node link referring to the next node of the same name, or null if there is no such next node.

To quickly recap: to build T, we start by computing the frequent items for the given minimum support threshold t, and then, starting from the root, insert each path represented by the sorted frequent pattern list of a transaction into the tree. Now, what do we gain from this? The most important property to consider is that all the information needed to solve the frequent pattern mining problem is encoded in the FP-tree T, because we effectively encode all co-occurrences of frequent items with repetition. Since T can also have at most as many nodes as there are occurrences of frequent items, T is usually much smaller than our original database D. This means that we have mapped the mining problem to a problem on a smaller data set, which in itself reduces the computational complexity compared with the naive approach sketched earlier.

Next, we'll discuss how to grow patterns recursively from fragments obtained from the constructed FP-tree. To do so, let's make the following observation. For any given frequent item x, we can obtain all the patterns involving x by following the node links for x, starting from the header table entry for x, and analyzing the respective subtrees.

To explain how exactly, we further study our example and, starting at the bottom of the header table, analyze the patterns containing p. From our FP-tree T, it is clear that p occurs in two paths: (f:4, c:3, a:3, m:3, p:2) and (c:1, b:1, p:1), following the node links for p. Now, in the first path, p occurs only twice, that is, there can be at most two total occurrences of the pattern {f, c, a, m, p} in the original database D. So, conditional on p being present, the paths involving p actually read as follows: (f:2, c:2, a:2, m:2, p:2) and (c:1, b:1, p:1). In fact, since we know we want to analyze patterns given p, we can shorten the notation a little and simply write (f:2, c:2, a:2, m:2) and (c:1, b:1). This is what we call the conditional pattern base for p. Going one step further, we can construct a new FP-tree from this conditional database. Conditioning on three occurrences of p, this new tree consists of only a single node, namely (c:3). This means that we end up with {c, p} as the single pattern involving p, apart from p itself. To have a better means of talking about this situation, we introduce the following notation: the conditional FP-tree for p is denoted by {(c:3)}|p.

To gain more intuition, let's consider one more frequent item and discuss its conditional pattern base. Continuing bottom to top and analyzing m, we again see two relevant paths: (f:4, c:3, a:3, m:2) and (f:4, c:3, a:3, b:1, m:1). Note that in the first path, we discard the p:2 at the end, since we have already covered the case of p. Following the same logic of reducing all other counts to the count of the item in question and conditioning on m, we end up with the conditional pattern base {(f:2, c:2, a:2), (f:1, c:1, a:1, b:1)}. The conditional FP-tree in this situation is thus given by {f:3, c:3, a:3}|m. It is now easy to see that every possible combination of m with each of f, c, and a forms a frequent pattern. The full set of patterns, given m, is thus {m}, {am}, {cm}, {fm}, {cam}, {fam}, {fcm}, and {fcam}.
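To make the notion of a conditional pattern base a little more tangible, here is a minimal, plain-Scala sketch (not from the book; the value names are hypothetical) that derives the conditional pattern base for m directly from the ordered frequent item lists in Table 3: for each list containing m, it keeps only the prefix that precedes m.

    // Hypothetical sketch: conditional pattern base for "m" from the Table 3 lists.
    val orderedFrequentItems = Seq(
      Seq("f", "c", "a", "m", "p"),
      Seq("f", "c", "a", "b", "m"),
      Seq("f", "b"),
      Seq("c", "b", "p"),
      Seq("f", "c", "a", "m", "p"))

    // Keep only lists containing "m", and only the items preceding "m" in each.
    val conditionalPatternBaseForM = orderedFrequentItems
      .filter(_.contains("m"))
      .map(_.takeWhile(_ != "m"))

    conditionalPatternBaseForM.foreach(println)
    // List(f, c, a)
    // List(f, c, a, b)
    // List(f, c, a)

The printed prefixes, two copies of (f, c, a) and one (f, c, a, b), correspond exactly to the conditional pattern base {(f:2, c:2, a:2), (f:1, c:1, a:1, b:1)} described in the text.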
By now, it should be clear how to continue, and we will not carry out this exercise in full but rather summarize its outcome in the following table:

Frequent pattern | Conditional pattern base | Conditional FP-tree
p | {(f:2, c:2, a:2, m:2), (c:1, b:1)} | {(c:3)}|p
m | {(f:2, c:2, a:2), (f:1, c:1, a:1, b:1)} | {f:3, c:3, a:3}|m
b | {(f:1, c:1, a:1), (f:1), (c:1)} | null
a | {(f:3, c:3)} | {(f:3, c:3)}|a
c | {(f:3)} | {(f:3)}|c
f | null | null

Table 4: The complete list of conditional FP-trees and conditional pattern bases for our running example.

As this derivation required a lot of attention to detail, let's take a step back and summarize the situation so far. Starting from the original FP-tree T, we iterated through all the items using node links. For each item x, we constructed its conditional pattern base and its conditional FP-tree. Doing so, we used the following two properties: we discarded all the items following x in each potential pattern, that is, we only kept the prefix of x; and we modified the item counts in the conditional pattern base to match the count of x. A path modified using these two properties is called the transformed prefix path of x.

To finally state the FP-growth step of the algorithm, we need two more fundamental observations that we have already implicitly used in the example. Firstly, the support of an item in a conditional pattern base is the same as that of its representation in the original database. Secondly, starting from a frequent pattern x in the original database and an arbitrary set of items y, we know that xy is a frequent pattern if and only if y is. These two facts can easily be derived in general, and should be clear from the preceding example. What this means is that we can completely focus on finding patterns in conditional pattern bases, as joining them with frequent patterns again yields patterns, and in this way we can find all the patterns. This mechanism of recursively growing patterns by computing conditional pattern bases is called pattern growth, which is why FP-growth bears its name. With all this in mind, we can now summarize the FP-growth procedure in pseudocode, as follows:

    def fpGrowth(tree: FPTree, i: Item):
      if (tree consists of a single path P) {
        compute transformed prefix path P' of P
        return all combinations p in P' joined with i
      } else {
        for each item in tree {
          newI = i joined with item
          construct conditional pattern base and conditional FP-tree newTree
          call fpGrowth(newTree, newI)
        }
      }

With this procedure, we can summarize our description of the complete FP-growth algorithm as follows: compute the frequent items from D and compute the original FP-tree T from them (FP-tree computation), then run fpGrowth(T, null) (FP-growth computation).

Having understood the base construction, we can now proceed to discuss a parallel extension of base FP-growth, that is, the basis of Spark's implementation. Parallel FP-growth, or PFP for short, is a natural evolution of FP-growth for parallel computing engines such as Spark. It addresses the following problems with the baseline algorithm:

Distributed storage: For frequent pattern mining, our database D may not fit into memory, which can already render FP-growth in its original form inapplicable. Spark does help in this regard for obvious reasons.

Distributed computing: With distributed storage in place, we will have to take care of parallelizing all the steps of the algorithm suitably as well, and PFP does precisely this.
Adequate support values: When dealing with finding frequent patterns, we usually do not want to set the minimum support threshold t too high, so as to find interesting patterns in the long tail. However, a small t might prevent the FP-tree from fitting into memory for a sufficiently large D, which would force us to increase t. PFP successfully addresses this problem as well, as we will see.

The basic outline of PFP, with a Spark implementation in mind, is as follows:

Sharding: Instead of storing our database D on a single machine, we distribute it to multiple partitions. Regardless of the particular storage layer, using Spark we can, for instance, create an RDD to load D.

Parallel frequent item count: The first step of computing the frequent items of D can be naturally performed as a map-reduce operation on an RDD.

Building groups of frequent items: The set of frequent items is divided into a number of groups, each with a unique group ID.

Parallel FP-growth: The FP-growth step is split into two phases to leverage parallelism. Map phase: the output of a mapper is a pair comprising the group ID and the corresponding transaction. Reduce phase: reducers collect data according to the group ID and carry out FP-growth on these group-dependent transactions.

Aggregation: The final step in the algorithm is the aggregation of results over group IDs.

Having already spent a lot of time on FP-growth on its own, instead of going into too many implementation details of PFP in Spark, let's see how to use the actual algorithm on the toy example that we have used throughout:

    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.rdd.RDD

    val transactions: RDD[Array[String]] = sc.parallelize(Array(
      Array("a", "c", "d", "f", "g", "i", "m", "p"),
      Array("a", "b", "c", "f", "l", "m", "o"),
      Array("b", "f", "h", "j", "o"),
      Array("b", "c", "k", "s", "p"),
      Array("a", "c", "e", "f", "l", "m", "n", "p")
    ))

    val fpGrowth = new FPGrowth()
      .setMinSupport(0.6)
      .setNumPartitions(5)
    val model = fpGrowth.run(transactions)

    model.freqItemsets.collect().foreach { itemset =>
      println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    }

The code is straightforward. We load the data into transactions and initialize Spark's FPGrowth implementation with a minimum support value of 0.6 and 5 partitions. This returns a model that we can run on the transactions constructed earlier. Doing so gives us access to the patterns or frequent item sets for the specified minimum support, by calling freqItemsets, which, printed in a formatted way, yields 18 patterns in total.

[box type="info" align="" class="" width=""]Recall that we have defined transactions as sets, and we often call them item sets. This means that within such an item set, a particular item can only occur once, and FPGrowth depends on this. If we were to replace, for instance, the third transaction in the preceding example with Array("b", "b", "h", "j", "o"), calling run on these transactions would throw an error message. We will see later on how to deal with such situations.[/box]

The above is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. To learn how to fully implement and deploy pattern mining applications in Spark, among other machine learning tasks using Spark, check out the book.
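As a small follow-up to the FP-growth example above, and as a preview of the association rules discussed in part 1, the sketch below shows one way the fitted model could also be used to derive rules. It assumes the same transactions and model values defined in the excerpt; the 0.8 minimum confidence threshold is an arbitrary illustrative choice, not a value from the book.

    // Hedged continuation of the example above: derive association rules from the
    // frequent item sets found by FP-growth. The 0.8 threshold is illustrative only.
    val minConfidence = 0.8
    model.generateAssociationRules(minConfidence).collect().foreach { rule =>
      println(
        rule.antecedent.mkString("[", ",", "]") + " => " +
        rule.consequent.mkString("[", ",", "]") + ", " + rule.confidence)
    }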

Pattern Mining using Spark MLlib - Part 1

Aarthi Kumaraswamy
03 Nov 2017
15 min read
[box type="note" align="" class="" width=""]The following two-part tutorial is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. [/box] When collecting real-world data between individual measures or events, there are usually very intricate and highly complex relationships to observe. The guiding example for this tutorial is the observation of click events that users generate on a website and its subdomains. Such data is both interesting and challenging to investigate. It is interesting, as there are usually many patterns that groups of users show in their browsing behavior and certain rules they might follow. Gaining insights about user groups, in general, is of interest, at least for the company running the website and might be the focus of their data science team. Methodology aside, putting a production system in place that can detect patterns in real time, for instance, to find malicious behavior, can be very challenging technically. It is immensely valuable to be able to understand and implement both the algorithmic and technical sides. In this tutorial, we will look into doing pattern mining in Spark. The tutorial is split up into two main sections. In the first, we will first introduce the three available pattern mining algorithms that Spark currently comes with and then apply them to an interesting dataset. In particular, you will learn the following from this two-part tutorial: The basic principles of frequent pattern mining. Useful and relevant data formats for applications. Understanding and comparing three pattern mining algorithms available in Spark, namely FP-growth, association rules, and prefix span. Frequent pattern mining When presented with a new data set, a natural sequence of questions is: What kind of data do we look at; that is, what structure does it have? Which observations in the data can be found frequently; that is, which patterns or rules can we identify within the data? How do we assess what is frequent; that is, what are the good measures of relevance and how do we test for it? On a very high level, frequent pattern mining addresses precisely these questions. While it's very easy to dive head first into more advanced machine learning techniques, these pattern mining algorithms can be quite informative and help build an intuition about the data. To introduce some of the key notions of frequent pattern mining, let's first consider a somewhat prototypical example for such cases, namely shopping carts. The study of customers being interested in and buying certain products has been of prime interest to marketers around the globe for a very long time. While online shops certainly do help in further analyzing customer behavior, for instance, by tracking the browsing data within a shopping session, the question of what items have been bought and what patterns in buying behavior can be found applies to purely offline scenarios as well. We will see a more involved example of clickstream data accumulated on a website soon; for now, we will work under the assumption that only the events we can track are the actual payment transactions of an item. Just this given data, for instance, for groceries shopping carts in supermarkets or online, leads to quite a few interesting questions, and we will focus mainly on the following three: Which items are frequently bought together? For instance, there is anecdotal evidence suggesting that beer and diapers are often brought together in one shopping session. 
Finding patterns of products that often go together may, for instance, allow a shop to physically place these products closer to each other for an improved shopping experience or promotional value, even if they don't belong together at first sight. In the case of an online shop, this sort of analysis might be the basis for a simple recommender system.

The second, building on the previous question: are there any interesting implications or rules to observe in shopping behavior? Continuing with the shopping cart example, can we establish associations such as: if bread and butter have been bought, we also often find cheese in the shopping cart? Finding such association rules can be of great interest, but this also needs more clarification of what we consider to be often, that is, what frequent means.

Note that, so far, our shopping carts were simply considered a bag of items without additional structure. At least in the online shopping scenario, we can endow the data with more information. One aspect we will focus on is the sequentiality of items; that is, we will take note of the order in which the products have been placed into the cart. With this in mind, similar to the first question, one might ask the third question: which sequences of items can often be found in our transaction data? For instance, larger electronic devices bought might be followed up by additional utility items.

The reason we focus on these three questions in particular is that Spark MLlib comes with precisely three pattern mining algorithms that roughly correspond to the aforementioned questions in their ability to answer them. Specifically, we will carefully introduce FP-growth, association rules, and prefix span, in that order, to address these problems and show how to solve them using Spark. Before doing so, let's take a step back and formally introduce the concepts we have motivated so far, alongside a running example. We will refer to the preceding three questions throughout the following subsection.

Pattern mining terminology

We will start with a set of items I = {a1, ..., an}, which serves as the base for all the following concepts. A transaction T is just a set of items in I, and we say that T is a transaction of length l if it contains l items. A transaction database D is a database of transaction IDs and their corresponding transactions.

To give a concrete example of this, consider the following situation. Assume that the full item set to shop from is given by I = {bread, cheese, ananas, eggs, donuts, fish, pork, milk, garlic, ice cream, lemon, oil, honey, jam, kale, salt}. Since we will look at a lot of item subsets, to make things more readable later on, we will simply abbreviate these items by their first letter, that is, we'll write I = {b, c, a, e, d, f, p, m, g, i, l, o, h, j, k, s}. Given these items, a small transaction database D could look as follows:

Transaction ID | Transaction
1 | a, c, d, f, g, i, m, p
2 | a, b, c, f, l, m, o
3 | b, f, h, j, o
4 | b, c, k, s, p
5 | a, c, e, f, l, m, n, p

Table 1: A small shopping cart database with five transactions

Frequent pattern mining problem

Given the definition of a transaction database, a pattern P is a transaction contained in the transactions in D, and the support, supp(P), of the pattern is the number of transactions for which this is true, divided (normalized) by the number of transactions in D:

    supp(s) = suppD(s) = |{ s' in D | s < s' }| / |D|

We use the < symbol to denote s as a subpattern of s' or, conversely, call s' a superpattern of s.
Note that in the literature you will sometimes also find a slightly different version of support that does not normalize the value. For example, the pattern {a, c, f} can be found in transactions 1, 2, and 5. This means that {a, c, f} is a pattern of support 0.6 in our database D of five transactions. Support is an important notion, as it gives us a first example of measuring the frequency of a pattern, which, in the end, is what we are after. In this context, for a given minimum support threshold t, we say P is a frequent pattern if and only if supp(P) is at least t. In our running example, the frequent patterns of length 1 and minimum support 0.6 are {a}, {b}, {c}, {p}, and {m} with support 0.6, and {f} with support 0.8. In what follows, we will often drop the brackets for items or patterns and write f instead of {f}, for instance.

Given a minimum support threshold, the problem of finding all the frequent patterns is called the frequent pattern mining problem, and it is, in fact, the formalized version of the first question above. Continuing with our example, we have already found all the frequent patterns of length 1 for t = 0.6. How do we find longer patterns? On a theoretical level, given unlimited resources, this is not much of a problem, since all we need to do is count the occurrences of items. On a practical level, however, we need to be smart about how we do so to keep the computation efficient. Especially for databases large enough for Spark to come in handy, it can be very computationally intensive to address the frequent pattern mining problem. One intuitive way to go about this is as follows:

Find all the frequent patterns of length 1, which requires one full database scan. This is how we started in our preceding example.

For patterns of length 2, generate all the combinations of frequent 1-patterns, the so-called candidates, and test whether they exceed the minimum support by doing another scan of D. Importantly, we do not have to consider the combinations of infrequent patterns, since patterns containing infrequent patterns cannot become frequent. This rationale is called the apriori principle.

For longer patterns, continue this procedure iteratively until there are no more patterns left to combine.

This algorithm, using a generate-and-test approach to pattern mining and utilizing the apriori principle to bound combinations, is called the apriori algorithm. There are many variations of this baseline algorithm, all of which share similar drawbacks in terms of scalability. For instance, multiple full database scans are necessary to carry out the iterations, which might already be prohibitively expensive for huge datasets. On top of that, generating the candidates themselves is already expensive, but computing their combinations might simply be infeasible. In the next section, we will see how a parallel version of an algorithm called FP-growth, available in Spark, can overcome most of the problems just discussed.

The association rule mining problem

To advance our general introduction of concepts, let's next turn to association rules, as first introduced in Mining Association Rules between Sets of Items in Large Databases, available at http://arbor.ee.ntu.edu.tw/~chyun/dmpaper/agrama93.pdf. In contrast to solely counting the occurrences of items in our database, we now want to understand the rules or implications of patterns.
That is, given a pattern P1 and another pattern P2, we want to know whether P2 is frequently present whenever P1 can be found in D, and we denote this by writing P1 ⇒ P2. To make this more precise, we need a concept of rule frequency similar to that of support for patterns, namely confidence. For a rule P1 ⇒ P2, confidence is defined as follows:

    conf(P1 ⇒ P2) = supp(P1 ∪ P2) / supp(P1)

This can be interpreted as the conditional support of P2 given P1; that is, if we were to restrict D to all the transactions supporting P1, the support of P2 in this restricted database would be equal to conf(P1 ⇒ P2). We call P1 ⇒ P2 a rule in D if it exceeds a minimum confidence threshold t, just as in the case of frequent patterns. Finding all the rules for a given confidence threshold represents the formal answer to the second question, association rule mining. Moreover, in this situation, we call P1 the antecedent and P2 the consequent of the rule. In general, there is no restriction imposed on the structure of either the antecedent or the consequent. However, in what follows, we will assume that the consequent has length 1, for simplicity.

In our running example, the pattern {f, m} occurs three times, while {f, m, p} is present in just two cases, which means that the rule {f, m} ⇒ {p} has confidence 2/3. If we set the minimum confidence threshold to t = 0.6, we can easily check that the following association rules with an antecedent and consequent of length 1 are valid for our case:

{a} ⇒ {c}, {a} ⇒ {f}, {a} ⇒ {m}, {a} ⇒ {p}
{c} ⇒ {a}, {c} ⇒ {f}, {c} ⇒ {m}, {c} ⇒ {p}
{f} ⇒ {a}, {f} ⇒ {c}, {f} ⇒ {m}
{m} ⇒ {a}, {m} ⇒ {c}, {m} ⇒ {f}, {m} ⇒ {p}
{p} ⇒ {a}, {p} ⇒ {c}, {p} ⇒ {f}, {p} ⇒ {m}

From the preceding definition of confidence, it should now be clear that it is relatively straightforward to compute the association rules once we have the support values of all the frequent patterns. In fact, as we will soon see, Spark's implementation of association rules is based on calculating frequent patterns upfront.

[box type="info" align="" class="" width=""]At this point, it should be noted that while we will restrict ourselves to the measures of support and confidence, there are many other interesting criteria available that we can't discuss in this book; for instance, the concepts of conviction, leverage, or lift. For an in-depth comparison of the other measures, refer to http://www.cse.msu.edu/~ptan/papers/IS.pdf.[/box]

The sequential pattern mining problem

Let's move on to formalizing the third and last pattern mining question we tackle in this chapter, and look at sequences in more detail. A sequence is different from the transactions we looked at before in that the order now matters. For a given item set I, a sequence s in I of length l is defined as follows:

    s = <s1, s2, ..., sl>

Here, each individual si is a concatenation of items, that is, si = (ai1 ... aim), where aij is an item in I. Note that we do care about the order of the sequence items si but not about the internal ordering of the individual aij within si. A sequence database S consists of pairs of sequence IDs and sequences, analogous to what we had before. An example of such a database can be found in the following table, in which the letters represent the same items as in our previous shopping cart example:

Sequence ID | Sequence
1 | <a(abc)(ac)d(cf)>
2 | <(ad)c(bc)(ae)>
3 | <(ef)(ab)(df)cb>
4 | <eg(af)cbc>

Table 2: A small sequence database with four short sequences.
In the example sequences, note the round brackets that group individual items into a sequence item. Also note that we drop these redundant brackets if the sequence item consists of a single item. Importantly, the notion of a subsequence requires a little more care than for unordered structures. We call u = <u1, ..., un> a subsequence of s = <s1, ..., sl> and write u < s if there are indices 1 ≤ i1 < i2 < ... < in ≤ l so that we have the following:

    u1 < si1, ..., un < sin

Here, the < signs in the last line mean that uj is a subpattern of sij. Roughly speaking, u is a subsequence of s if all the elements of u are subpatterns of s in their given order. Equivalently, we call s a supersequence of u. In the preceding example, we see that <a(ab)ac> and <a(cb)(ac)dc> are examples of subsequences of <a(abc)(ac)d(cf)> and that <(fa)c> is an example of a subsequence of <eg(af)cbc>.

With the help of the notion of supersequences, we can now define the support of a sequence s in a given sequence database S as follows:

    suppS(s) = supp(s) = |{ s' in S | s < s' }| / |S|

Note that, structurally, this is the same definition as for plain unordered patterns, but the < symbol means something else here, namely a subsequence. As before, we drop the database subscript in the notation of support if the information is clear from the context. Equipped with a notion of support, the definition of sequential patterns follows the previous definition completely analogously: given a minimum support threshold t, a sequence s in S is said to be a sequential pattern if supp(s) is greater than or equal to t. The formalization of the third question is called the sequential pattern mining problem, that is, to find the full set of sequences that are sequential patterns in S for a given threshold t.

Even in our little example with just four sequences, it can already be challenging to manually inspect all the sequential patterns. To give just one example of a sequential pattern of support 1.0, a subsequence of length 2 of all four sequences is <ac>. Finding all the sequential patterns is an interesting problem, and we will learn about the so-called prefix span algorithm that Spark employs to address it in the following section.

Next time, in part 2 of the tutorial, we will see how to use Spark to solve the above three pattern mining problems using the algorithms introduced. If you enjoyed this tutorial, an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava, check out the book for more.
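To make the definitions of support and confidence concrete, here is a minimal, self-contained Scala sketch (not from the book; object and value names are hypothetical) that computes both measures for the toy transaction database of Table 1 with plain collections, no Spark involved. It reproduces the numbers quoted above: supp({a, c, f}) = 0.6 and conf({f, m} ⇒ {p}) = 2/3.

    // Plain-Scala illustration of the support and confidence definitions for Table 1.
    object SupportConfidenceSketch extends App {
      // The five transactions from Table 1, each encoded as a set of single-letter items.
      val transactions: Seq[Set[Char]] = Seq(
        "acdfgimp", "abcflmo", "bfhjo", "bcksp", "aceflmnp"
      ).map(_.toSet)

      // supp(P): fraction of transactions that contain every item of the pattern P.
      def support(pattern: Set[Char]): Double =
        transactions.count(t => pattern.subsetOf(t)).toDouble / transactions.size

      // conf(P1 => P2) = supp(P1 union P2) / supp(P1).
      def confidence(antecedent: Set[Char], consequent: Set[Char]): Double =
        support(antecedent ++ consequent) / support(antecedent)

      println(support(Set('a', 'c', 'f')))          // 0.6
      println(confidence(Set('f', 'm'), Set('p')))  // 0.666... = 2/3
    }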

Building a classification system with Decision Trees in Apache Spark 2.0

Wilson D'souza
02 Nov 2017
9 min read
[box type="note" align="" class="" width=""]In this article by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei from their book Apache Spark 2.x Machine Learning Cookbook we shall explore how to build a classification system with decision trees using Spark MLlib library. The code and data files are available at the end of the article.[/box] A decision tree in Spark is a parallel algorithm designed to fit and grow a single tree into a dataset that can be categorical (classification) or continuous (regression). It is a greedy algorithm based on stumping (binary split, and so on) that partitions the solution space recursively while attempting to select the best split among all possible splits using Information Gain Maximization (entropy based). Apache Spark provides a good mix of decision tree based algorithms fully capable of taking advantage of parallelism in Spark. The implementation ranges from the straightforward Single Decision Tree (the CART type algorithm) to Ensemble Trees, such as Random Forest Trees and GBT (Gradient Boosted Tree). They all have both the variant flavors to facilitate classification (for example, categorical, such as height = short/tall) or regression (for example, continuous, such as height = 2.5 meters). Getting and preparing real-world medical data for exploring Decision Trees in Spark 2.0 To explore the real power of decision trees, we use a medical dataset that exhibits real life non-linearity with a complex error surface. The Wisconsin Breast Cancer dataset was obtained from the University of Wisconsin Hospital from Dr. William H Wolberg. The dataset was gained periodically as Dr. Wolberg reported his clinical cases. The dataset can be retrieved from multiple sources, and is available directly from the University of California Irvine's webserver http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wi sconsin/breast-cancer-wisconsin.data The data is also available from the University of Wisconsin's web Server: ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/cancer1/ datacum The dataset currently contains clinical cases from 1989 to 1991. It has 699 instances, with 458 classified as benign tumors and 241 as malignant cases. Each instance is described by nine attributes with an integer value in the range of 1 to 10 and a binary class label. Out of the 699 instances, there are 16 instances that are missing some attributes. We will remove these 16 instances from the memory and process the rest (in total, 683 instances) for the model calculations. The sample raw data looks like the following: 1000025,5,1,1,1,2,1,3,1,1,2 1002945,5,4,4,5,7,10,3,2,1,2 1015425,3,1,1,1,2,2,3,1,1,2 1016277,6,8,8,1,3,4,3,7,1,2 1017023,4,1,1,3,2,1,3,1,1,2 1017122,8,10,10,8,7,10,9,7,1,4 ... 
The attribute information is as follows:

# | Attribute | Domain
1 | Sample code number | ID number
2 | Clump Thickness | 1 - 10
3 | Uniformity of Cell Size | 1 - 10
4 | Uniformity of Cell Shape | 1 - 10
5 | Marginal Adhesion | 1 - 10
6 | Single Epithelial Cell Size | 1 - 10
7 | Bare Nuclei | 1 - 10
8 | Bland Chromatin | 1 - 10
9 | Normal Nucleoli | 1 - 10
10 | Mitoses | 1 - 10
11 | Class | 2 for benign, 4 for malignant

Presented in the correct columns (ID Number, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nucleoli, Bland Chromatin, Normal Nucleoli, Mitoses, Class), the data looks like the following:

    1000025 5 1 1 1 2 1 3 1 1 2
    1002945 5 4 4 5 7 10 3 2 1 2
    1015425 3 1 1 1 2 2 3 1 1 2
    1016277 6 8 8 1 3 4 3 7 1 2
    1017023 4 1 1 3 2 1 3 1 1 2
    1017122 8 10 10 8 7 10 9 7 1 4
    1018099 1 1 1 1 2 10 3 1 1 2
    1018561 2 1 2 1 2 1 3 1 1 2
    1033078 2 1 1 1 2 1 1 1 5 2
    1033078 4 2 1 1 2 1 2 1 1 2
    1035283 1 1 1 1 1 1 3 1 1 2
    1036172 2 1 1 1 2 1 2 1 1 2
    1041801 5 3 3 3 2 3 4 4 1 4
    1043999 1 1 1 1 2 3 3 1 1 2
    1044572 8 7 5 10 7 9 5 5 4 4
    ...

We will now use the breast cancer data and classification to demonstrate the decision tree implementation in Spark. We will use information gain and Gini to show how to use the facilities already provided by Spark to avoid redundant coding. This exercise attempts to fit a single tree, using binary classification, to train and predict the label (benign (0.0) or malignant (1.0)) for the dataset.

Implementing Decision Trees in Apache Spark 2.0

Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.

Set up the package location where the program will reside:

    package spark.ml.cookbook.chapter10

Import the necessary packages for the Spark context to get access to the cluster and Log4j.Logger to reduce the amount of output produced by Spark:

    import org.apache.spark.mllib.evaluation.MulticlassMetrics
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.model.DecisionTreeModel
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession
    import org.apache.log4j.{Level, Logger}

Create Spark's configuration and the Spark session so we can have access to the cluster:

    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("MyDecisionTreeClassification")
      .config("spark.sql.warehouse.dir", ".")
      .getOrCreate()

We read in the original raw data file:

    val rawData = spark.sparkContext.textFile("../data/sparkml2/chapter10/breast-cancer-wisconsin.data")

We pre-process the dataset:

    val data = rawData.map(_.trim)
      .filter(text => !(text.isEmpty || text.startsWith("#") || text.indexOf("?") > -1))
      .map { line =>
        val values = line.split(',').map(_.toDouble)
        val slicedValues = values.slice(1, values.size)
        val featureVector = Vectors.dense(slicedValues.init)
        val label = values.last / 2 - 1
        LabeledPoint(label, featureVector)
      }

First, we trim the line and remove any empty spaces. Once the line is ready for the next step, we remove the line if it is empty or if it contains missing values ("?"). After this step, the 16 rows with missing data will be removed from the in-memory dataset. We then read the comma-separated values into an RDD. Since the first column in the dataset only contains the instance's ID number, it is better to remove this column from the real calculation.
We slice it out with the following command, which removes the first column from the RDD:

    val slicedValues = values.slice(1, values.size)

We then put the rest of the numbers into a dense vector. Since the Wisconsin Breast Cancer dataset's classifier is either a benign case (last column value = 2) or a malignant case (last column value = 4), we convert the preceding value using the following command:

    val label = values.last / 2 - 1

So the benign case value 2 is converted to 0, and the malignant case value 4 is converted to 1, which makes the later calculations much easier. We then put the preceding row into a labeled point:

    Raw data: 1000025,5,1,1,1,2,1,3,1,1,2
    Processed data: 5,1,1,1,2,1,3,1,1,0
    Labeled point: (0.0, [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0])

We verify the raw data count and the processed data count:

    println(rawData.count())
    println(data.count())

And you will see the following on the console:

    699
    683

We split the whole dataset into training data (70%) and test data (30%) randomly. Please note that the random split will generate around 211 test data points. It is approximately, but NOT exactly, 30% of the dataset:

    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

We define a metrics calculation function, which utilizes Spark's MulticlassMetrics:

    def getMetrics(model: DecisionTreeModel, data: RDD[LabeledPoint]): MulticlassMetrics = {
      val predictionsAndLabels = data.map(example =>
        (model.predict(example.features), example.label)
      )
      new MulticlassMetrics(predictionsAndLabels)
    }

This function reads in the model and the test dataset, and creates a metrics object which contains the confusion matrix mentioned earlier. It also contains the model accuracy, which is one of the indicators for a classification model.

We define an evaluate function, which takes some tunable parameters for the decision tree model and performs the training on the dataset:

    def evaluate(
      trainingData: RDD[LabeledPoint],
      testData: RDD[LabeledPoint],
      numClasses: Int,
      categoricalFeaturesInfo: Map[Int, Int],
      impurity: String,
      maxDepth: Int,
      maxBins: Int): Unit = {
      val model = DecisionTree.trainClassifier(trainingData, numClasses,
        categoricalFeaturesInfo, impurity, maxDepth, maxBins)
      val metrics = getMetrics(model, testData)
      println("Using Impurity :" + impurity)
      println("Confusion Matrix :")
      println(metrics.confusionMatrix)
      println("Decision Tree Accuracy: " + metrics.precision)
      println("Decision Tree Error: " + (1 - metrics.precision))
    }

The evaluate function reads in several parameters, including the impurity type (Gini or entropy for the model), and generates the metrics for evaluation.

We set the following parameters:

    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val maxDepth = 5
    val maxBins = 32

Since we only have benign (0.0) and malignant (1.0), we set numClasses to 2. The other parameters are tunable, and some of them are algorithm stopping criteria.
We evaluate the Gini impurity first:

    evaluate(trainingData, testData, numClasses, categoricalFeaturesInfo, "gini", maxDepth, maxBins)

From the console output:

    Using Impurity :gini
    Confusion Matrix :
    115.0  5.0
    0      88.0
    Decision Tree Accuracy: 0.9620853080568721
    Decision Tree Error: 0.03791469194312791

To interpret the above confusion matrix: accuracy is equal to (115 + 88) / 211 over all test cases, and error is equal to 1 - accuracy.

We evaluate the entropy impurity next:

    evaluate(trainingData, testData, numClasses, categoricalFeaturesInfo, "entropy", maxDepth, maxBins)

From the console output:

    Using Impurity :entropy
    Confusion Matrix :
    116.0  4.0
    9.0    82.0
    Decision Tree Accuracy: 0.9383886255924171
    Decision Tree Error: 0.06161137440758291

To interpret the preceding confusion matrix: accuracy is equal to (116 + 82) / 211 over all test cases, and error is equal to 1 - accuracy.

We then close the program by stopping the session:

    spark.stop()

How it works...

The dataset is a bit more complex than usual, but apart from some extra steps, parsing it remains the same as in other recipes presented in previous chapters. The parsing takes the data in its raw form and turns it into an intermediate format which ends up as a LabeledPoint data structure, which is common in Spark ML schemes:

    Raw data: 1000025,5,1,1,1,2,1,3,1,1,2
    Processed data: 5,1,1,1,2,1,3,1,1,0
    Labeled point: (0.0, [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0])

We use DecisionTree.trainClassifier() to train the classifier tree on the training set. We follow that by examining the various impurity and confusion matrix measurements to demonstrate how to measure the effectiveness of a tree model. The reader is encouraged to look at the output and consult additional machine learning books to understand the concepts of the confusion matrix and impurity measurement, and to master decision trees and their variations in Spark.

There's more...

To visualize it better, we included a sample decision tree workflow in Spark which reads the data into Spark first. In our case, we create the RDD from the file. We then split the dataset into training data and test data using a random sampling function. After the dataset is split, we use the training dataset to train the model, followed by the test data to test the accuracy of the model. A good model should have a meaningful accuracy value (close to 1). The following figure depicts the workflow:

A sample tree was generated based on the Wisconsin Breast Cancer dataset. The red spots represent malignant cases, and the blue ones the benign cases. We can examine the tree visually in the following figure:

[box type="download" align="" class="" width=""]Download the code and data files here: classification system with Decision Trees in Apache Spark_excercise files[/box]

If you liked this article, please be sure to check out Apache Spark 2.x Machine Learning Cookbook, which consists of this article and many more useful techniques on implementing machine learning solutions with the MLlib library in Apache Spark 2.0.
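As a small follow-up to the evaluation above, the hedged sketch below (not part of the book recipe) shows two additional, commonly used ways to inspect an MLlib decision tree model: printing per-class precision and recall from a MulticlassMetrics object, and dumping the learned split structure with toDebugString. It assumes the trainingData, testData, and getMetrics() values defined in the recipe are in scope, and it simply reuses the parameter values already chosen there.

    // Hedged sketch: train one more tree with the recipe's parameters, then inspect it.
    // Assumes trainingData, testData, and getMetrics() from the recipe are in scope.
    val inspectedModel = DecisionTree.trainClassifier(
      trainingData, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
      impurity = "gini", maxDepth = 5, maxBins = 32)

    val inspectedMetrics = getMetrics(inspectedModel, testData)

    // Precision and recall broken out per label (0.0 = benign, 1.0 = malignant).
    inspectedMetrics.labels.foreach { label =>
      println(s"label $label precision = ${inspectedMetrics.precision(label)}")
      println(s"label $label recall    = ${inspectedMetrics.recall(label)}")
    }

    // Human-readable view of the tree's split decisions.
    println(inspectedModel.toDebugString)

Per-class recall is particularly relevant for this dataset, since a false negative (a malignant case predicted as benign) is far more costly than a false positive, and the overall accuracy number alone hides that distinction.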

Building Motion Charts with Tableau

Ashwin Nair
31 Oct 2017
4 min read
[box type="info" align="" class="" width=""]The following is an excerpt from the book Tableau 10 Bootcamp, Chapter 2, Interactivity – written by Joshua N. Milligan and Donabel Santos. It offers intensive training on Data Visualization and Dashboarding with Tableau 10. In this article, we will learn how to build motion charts with Tableau.[/box] Tableau is an amazing platform for achieving incredible data discovery, analysis, and Storytelling. It allows you to build fully interactive dashboards and stories with your visualizations and insights so that you can share the data story with others. Creating Motion Charts with Tableau Let`s learn how to build motion charts with Tableau. A motion chart, as its name suggests, is a chart that displays the entire trail of changes in data over time by showing movement using the X and Y-axes. It is very much similar to the doodles in our notebooks which seem to come to life after flipping through the pages. It is amazing to see the same kind of movement in action in Tableau using the Pagesshelf. It is work that feels like play. On the Pages shelf, when you drop a field, Tableau creates a sequence of pages that filters the view for each value in that field. Tableau's page control allows us to flip pages, enabling us to see our view come to life. With three predefined speed settings, we can control the speed of the flip. The three settings include one that relates to the slowest speed, the others to the fastest speed. We can also format the marks and show the marks or trails, or both, using page control. In our viz, we have used a circle for marking each year. The circle that moves to a new position each year represents the specific country's new population value. These circles are all connected by trail lines that enable us to simulate a moving time series graph by setting the  mark and trail histories both to show in page control: Let's create an animated motion chart showing the population change over the years for a selected few countries: Open the Motion Chart worksheet and connect to the CO2 (Worldbank) data Source: Open Dimensions and drag Year to the Columns shelf. Open Measures and drag CO2 Emission to the Rows shelf. Right-click on the CO2 Emission axis, and change the title to CO2 Emission (metric tons per capita): In the Marks card, click on the dropdown to change the mark from Automatic to Circle. Open Dimensions and drag Country Name to Color in the Marks card. Also, drag Country Name to the Filter shelf from Dimensions Under the General tab of the Filter window, while the Select from list radio button is selected, select None. Select the Custom value list radio button, still under the General tab, and add China, Trinidad and Tobago, and United States: Click OK when done. This should close the Filter window. Open Dimensions and drag Year to Pages for adding a page control to the view. Click on the Show history checkbox to select it. Click on the drop-down beside Show history and perform the following steps: Select All for Marks to show history for Select Both for Show Using the Year page control, click on the forward arrow to play. This shows the change in the population of the three selected countries over the years. 
[box type="info" align="" class="" width=""]Tip -  In case you ever want to loopback the animation, you can click on the dropdown on the top-right of your page control card, and select Loop Playback:[/box] Note that Tableau Server does not support the animation effect that you see when working on motion charts with Tableau Desktop. Tableau strives for zero footprints when serving the charts and dashboards on the server so that there is no additional download to enable the functionalities. So, the play control does not work the same. No need to fret though. You can click manually on the slider and have a similar effect.  If you liked the above excerpt from the book Tableau 10 Bootcamp, check out the book to learn more data visualization techniques.
(13*3)+ Halloween costume ideas for Data science nerds

Packt Editorial Staff
31 Oct 2017
14 min read
Are you a data scientist, a machine learning engineer, an AI researcher or simply a data enthusiast? Channel the inner data science nerd within you with these geeky ideas for your Halloween costumes! The Data Science Spectrum Don't know what to go as to this evening's party because you've been busy cleaning that terrifying data? Don’t worry, here are some easy-to-put-together Halloween costume ideas just for you. [dropcap]1[/dropcap] Big Data Go as Baymax, the healthcare robot, (who can also turn into battle mode when required). Grab all white clothes that you have. Stuff your tummy with some pillows and wear a white mask with cutouts for eyes. You are all ready to save the world. In fact, convince a friend or your brother to go as Hiro! [dropcap]2[/dropcap] A.I. agent Enter as Agent Smith, the AI antagonist, this Halloween. Lure everyone with your bold black suit paired with a white shirt and a black tie. A pair of polarized sunglasses would replicate you as the AI agent. Capture the crowd by being the most intelligent and cold-hearted personality of all. [dropcap]3[/dropcap] Data Miner Put on your dungaree with a tee. Fix a flashlight atop your cap. Grab a pickaxe from the gardening toolkit, if you have one. Stripe some mud onto your face. Enter the party wheeling with loads of data boxes that you have freshly mined. You’ll definitely grab some traffic for data. Unstructured data anyone? [dropcap]4[/dropcap] Data Lake Go as a Data lake this Halloween. Simply grab any blue item from your closet. Draw some fishes, crabs, and weeds. (Use a child’s marker for that). After all, it represents the data you have. And you’re all set. [dropcap]5[/dropcap] Dark Data Unleash the darkness within your soul! Just kidding. You don’t actually have to turn to the evil side. Just coming up with your favorite black-costume character would do. Looking for inspiration? Maybe, a witch, The dark knight, or The Darth Vader. [dropcap]6[/dropcap] Cloud A fluffy, white cloud is what you need to be this Halloween. Raid your nearby drug store for loads of cotton balls. Better still, tear up that old pillow you have been meaning to throw away for a while. Use the fiber inside to glue onto an unused tee. You will be the cutest cloud ever seen. Don’t forget to carry an umbrella in case you turn grey! [dropcap]7[/dropcap] Predictive Analytics Make your own paper wizard hat with silver stars and moons pasted on it. If you can arrange for an advocate gown, it would be great. Else you could use a long black bed sheet as a cape. And most importantly, a crystal ball to show off some prediction stunts at the Halloween. [dropcap]8[/dropcap] Gradient boosting Enter Halloween as the energy booster. Wear what you want. Grab loads of empty energy drink tetra packs and stick it all over you. Place one on your head too. Wear a nameplate that says “ G-booster Energy drink”. Fuel up some weak models this Halloween. [dropcap]9[/dropcap] Cryptocurrency Wear head to toe black. In fact, paint your face black as well, like the Grim reaper. Then grab a cardboard piece. Cut out a circle, paint it orange, and then draw a gold B symbol, just like you see in a bitcoin. This Halloween costume will definitely grab you the much-needed attention just as this popular cryptocurrency. [dropcap]10[/dropcap] IoT Are you a fan of IoT and the massive popularity it has gained? Then you should definitely dress up as your web-slinging, friendly neighborhood Spiderman. Just grab a spiderman costume from any costume store and attach some handmade web slings. 
Remember to connect with people by displaying your IoT knowledge. [dropcap]11[/dropcap] Self-driving car Choose a mono-color outfit of your choice (P.S. The color you would choose for your car). Cut out four wheels and paste two on your lower calves and two on your arms. Cut out headlights too. Put on a wiper goggle. And yes you do not need a steering wheel or the brakes, clutch and the accelerator. Enter the Halloween at your own pace, go self-driving this Halloween. Bonus point: You can call yourself Bumblebee or Optimus Prime. Machine Learning and Deep learning Frameworks If machine learning or deep learning is your forte, here are some fresh Halloween costume ideas based on some of the popular frameworks in that space. [dropcap]12[/dropcap] Torch Flame up the party with a costume inspired by the fantastic four superhero, Johnny Storm a.k.a The Human Torch. Wear a yellow tee and orange slacks. Draw some orange flames on your tee. And finally, wear a flame-inspired headband. Someone is a hot machine learning library! [dropcap]13[/dropcap] TensorFlow No efforts for this one. Just arrange for a pumpkin costume, paste a paper cut-out of the TensorFlow logo and wear it as a crown. Go as the most powerful and widely popular deep learning library. You will be the star of the Halloween as you are a Google Kid. [dropcap]14[/dropcap] Caffe Go as your favorite Starbucks coffee this Halloween. Wear any of your brown dress/ tee. Draw or stick a Starbucks logo. And then add frothing to the top by bunching up a cream-colored sheet. Mamma Mia! [dropcap]15[/dropcap] Pandas Go as a Panda this Halloween! Better still go as a group of Pandas. The best option is to buy a panda costume. But if you don’t want that, wear a white tee, black slacks, black goggles and some cardboard cutouts for ears. This will make you not only the cutest animal in the party but also a top data manipulation library. Good luck finding your python in the party by the way. [dropcap]16[/dropcap] Jupyter Notebook Go as a top trending open-source web application by dressing up as the largest planet in our solar system. People would surely be intimidated by your mass and also by your computing power. [dropcap]17[/dropcap] H2O Go to Halloween as a world famous open source deep learning platform. No, no, you don’t have to go as the platform itself. Instead go as the chemical alter-ego, water. Wear all blue and then grab some leftover asymmetric, blue cloth pieces to stick at your sides. Thirsty anyone? Data Viz & Analytics Tools If you’re all about analytics and visualization, grab the attention of every data geek in your party by dressing up as your favorite data insight tools. [dropcap]18[/dropcap] Excel Grab an old white tee and paint some green horizontal stripes. You’re all ready to go as the most widely used spreadsheet. The simplest of costumes, yet the most useful - a timeless classic that never goes out of fashion. [dropcap]19[/dropcap] MatLab If you have seriously run out of all costume ideas, going out as MatLab is your only solution. Just grab a blue tablecloth. Stick or sew it with some orange curtain and throw it over your head. You’re all ready to go as the multi-paradigm numerical computing environment. [dropcap]20[/dropcap] Weka Wear a brown overall, a brown wig, and paint your face brown. Make an orange beak out of a chart paper, and wear a pair orange stockings/ socks with your trousers tucked in. You are all set to enter as a data mining bird with ML algorithms and Java under your wings. 
[dropcap]21[/dropcap] Shiny Go all Shimmery!! Get some glitter powder and put it all over you. (You’ll have a tough time removing it though). Else choose a glittery outfit, with glittery shoes, and touch-up with some glitter on your face. Let the party see the bling of R that you bring. You will be the attractive storyteller out there. [dropcap]22[/dropcap] Bokeh A colorful polka-dotted outfit and some dim lights to do the magic. You are all ready to grab the show with such a dazzle. Make sure you enter the party gates with Python. An eye-catching beauty with the beast pair. [dropcap]23[/dropcap] Tableau Enter the Halloween as one of your favorite characters from history. But there is a term and condition for this: You cannot talk or move. Enjoy your Halloween by being still. Weird, but you’ll definitely grab everyone’s eye. [dropcap]24[/dropcap] Microsoft Power BI Power up your Halloween party by entering as a data insights superhero. Wear a yellow turtleneck, a stylish black leather jacket, black pants, some mid-thigh high boots and a slick attitude. You’re ready to save your party! Data Science oriented Programming languages These hand-picked Halloween costume ideas are for you if you consider yourself a top coder. By a top coder we mean you’re all about learning new programming languages in your spare and, well, your not so spare time.   [dropcap]25[/dropcap] Python Easy peasy as the language looks, the reptile is not that easy to handle. A pair of python-printed shirt and trousers would do the job. You could be getting more people giving you candies some out of fear, other out of the ease. Definitely, go as a top trending and a go-to language which everyone loves! And yes, don’t forget the fangs. [dropcap]26[/dropcap] R Grab an eye patch and your favorite leather pants. Wear a loose white shirt with some rugged waistcoat and a sword. Here you are all decked up as a pirate for your next loot. You’ll surely thank me for giving you a brilliant Halloween idea. But yes! Don’t forget to make that Arrrr (R) noise! [dropcap]27[/dropcap] Java Go as a freshly roasted coffee bean! People in your Halloween party would be allured by your aroma. They would definitely compliment your unique idea and also the fact that you’re the most popular programming language. [dropcap]28[/dropcap] SAS March in your Halloween party up as a Special Airforce Service (SAS) agent. You would be disciplined, accurate, precise and smart. Just like the advanced software suite that goes by the same name. You would need a full black military costume, with a gas mask, some fake ammunition from a nearby toy store, and some attitude of course! [dropcap]29[/dropcap] SQL If you pride yourself on being very organized or are a stickler for the rules, you should go as SQL this Halloween. Prep-up yourself with an overall blue outfit. Spike up your hair and spray some temporary green hair color. Cut out bold letters S, Q, and L from a plain white paper and stick them on your chest. You are now ready to enter the Halloween party as the most popular database of all times. Sink in all the data that you collect this Halloween. [dropcap]30[/dropcap] Scala If Scala is your favorite programming language, add a spring to your Halloween by going as, well, a spring! Wear the brightest red that you have. Using a marker, draw some swirls around your body (You can ask your mom to help). Just remember to elucidate a 3D picture. And you’re all set. 
[dropcap]31[/dropcap] Julia If you want to make a red carpet entrance to your Halloween party, go as the Academy award-winning actress, Julia Roberts. You can even take up inspiration from her character in the 90s hit film Pretty Woman. For extra oomph, wear a pink, red, and purple necklace to highlight the Julia programming language [dropcap]32[/dropcap] Ruby Act pricey this Halloween. Be the elegant, dynamic yet simple programming language. Go blood red, wear on your brightest red lipstick, red pumps, dazzle up with all the red accessories that you have. You’ll definitely gather some secret admirers around the hall. [dropcap]33[/dropcap] Go Go as the mascot of Go, the top trending programming language. All you need is a blue mouse costume. Fear not if you don’t have one. Just wear a powder blue jumpsuit, grab a baby pink nose, and clip on a fake single, large front tooth. Ready for the party! [dropcap]34[/dropcap] Octave Go as a numerically competent programming language. And if that doesn’t sound very trendy, go as piano keys depicting an octave. You simply need to wear all white and divide your space into 8 sections. Then draw 5 horizontal black stripes. You won’t be able to do that vertically, well, because they are a big number. Here you go, you’re all set to fill the party with your melody. Fancy an AI system inspired Halloween costume? This is for you if you love the way AI works and the enigma that it has thrown around the world. This is for you if you are spellbound with AI magic. You should go dressed as one of these at your Halloween party this season. Just pick up the AI you want to look like and follow as advised. [dropcap]35[/dropcap] IBM Watson Wear a dark blue hat, a matching long overcoat, a vest and a pale blue shirt with a dark tie tucked into the vest. Complement it with a mustache and a brooding look. You are now ready to be IBM Watson at your Halloween party. [dropcap]36[/dropcap] Apple Siri If you want to be all cool and sophisticated like the Apple’s Siri, wear an alluring black turtleneck dress. Don’t forget to carry your latest iPhone and air pods. Be sure you don’t have a sore throat, in case someone needs your assistance. [dropcap]37[/dropcap] Microsoft Cortana If Microsoft Cortana is your choice of voice assistant, dress up as Cortana, the fictional synthetic intelligence character in the Halo video game series. Wear a blue bodysuit. Get a bob if you’re daring. (A wig would also do). Paint some dark blue robot like designs over your body and well, your face. And you’re all set. [dropcap]38[/dropcap] Salesforce Einstein Dress up as the world’s most famous physicist and also an AI-powered CRM. How? Just grab a white shirt, a blue pullover and a blue tie (Salesforce colors). Finish your look with a brown tweed coat, brown pants and shoes, a rugged white wig and mustache, and a deep thought on your face. [dropcap]39[/dropcap] Facebook Jarvis Get inspired by the Iron man’s Jarvis, the coolest A.I. in the Marvel universe. Just grab a plexiglass, draw some holograms and technological symbols over it with a neon marker. (Try to keep the color palette in shades of blues and reds). And fix this plexiglass in a curved fashion in front of your face by a headband. Do practice saying “Hello Mr. Stark.”  [dropcap]40[/dropcap] Amazon Echo This is also an easy one. Grab a long, black chart paper. Roll it around in a tube form around your body. Draw the Amazon symbol at the bottom with some glittery, silver sketch pen, color your hair blue, and there you go. 
If you have a girlfriend, convince her to go as Amazon Alexa. [dropcap]41[/dropcap] SAP Leonardo Put on a hat, wear a long cloak, some fake overgrown mustache, and beard. Accessorize with a color palette and a paintbrush. You will be the Leonardo da Vinci of the Halloween party. Wait a minute, don’t forget to cut out SAP initials and stick them on your cap. After all, you are entering as SAP’s very own digital revolution system. [dropcap]42[/dropcap] Intel Neon Deck the Halloween hall with a Harley Quinn costume. For some extra dramatization, roll up some neon blue lights around your head. Create an Intel logo out of some blue neon lights and wear it as your neckpiece. [dropcap]43[/dropcap] Microsoft Brainwave This one will require a DIY task. Arrange for a red and green t-shirt, cut them into a vertical half. Stitch it in such a way that the green is on the left and the red on the right. Similarly, do that with your blue and yellow pants; with yellow on the left and blue on the right. You will look like the most powerful Microsoft’s logo. Wear a skullcap with wires protruding out and a Hololens like eyewear to go with. And so, you are all ready to enter the Halloween party as Microsoft’s deep learning acceleration platform for real-time AI. [dropcap]44[/dropcap] Sophia, the humanoid Enter with all the confidence and a top-to-toe professional attire. Be ready to answer any question thrown at you with grace and without a stroke of skepticism. And to top it off, sport a clean shaved head. And there, you are all ready to blow off everyone’s mind with a mix of beauty with super intelligent brains.   Happy Halloween folks!
Halloween costume ideas inspired from Apache Big Data Projects

Packt Editorial Staff
30 Oct 2017
3 min read
If you are a busy person who is finding it difficult to decide a Halloween costume for your office party tomorrow or for your kid's trick-or-treating madness, here are some geeky Halloween costume ideas that will make the inner data nerd in you proud! Apache Hadoop Be the cute little yellow baby elephant everyone wants to cuddle. Just grab all the yellow clothes you have. If you don’t, borrow them. Don’t forget to stuff in mini cushions in you. Pop in loads of candy in your mouth. And there, you’re all set to be as the dominant but the cutest framework! Cuteness overloaded. Apache Hive Be the buzz of your Halloween party by going as a top Apache data warehouse. What to wear you ask? Hum around wearing a yellow and white striped dress or a shirt. Compliment your outfit with a pair of black wings, headband with antennae and a small pot of honey.   Apache Storm An X-Men fan are you? Go as Storm, the popular fictional superhero. Wear a black bodysuit (leather if possible). Drape a long cape. Put on a grey wig. And channel your inner power. Perhaps people would be able to see the powerful weather-controlling mutant in you and also recognize your ability to process streaming data in real time. Apache Kafka Go all out gothic with an Apache Kafka costume. Dress in a serious black dress and gothic makeup. Don’t forget your black butterfly wings and a choker necklace with linked circles. Keep asking existential questions to random people at the party to throw them off balance. Apache Giraph Put on a yellow tee and brown trousers, cut out some brown imperfect circles and paste them on your tee. Put on a brown cap, and paint your ears brown. Draw some graph representations using a marker all over your hands and palms. You are now Apache Giraph. Apache Singa Be the blend of a flexible Apache Singa with the ferocity of a lion this Halloween! All you need is a yellow tee paired with light brown trousers. Wear a lion’s wig. Grab a mascara and draw some strokes on your cheeks. Paint the tip of your nose using a brown watercolour or some melted chocolate. Apache Spark If you have obsessed over Pokémon Go and equally love the lightning blaze data processing speed of Apache Spark, you should definitely go as the leader of Pokémon Go's Team Instinct. Spark wears an orange hoodie, a black and yellow leather jacket, black jeans and orange gloves. Do remember to carry your Pokemon balls in case you are challenged for a battle. Apache Pig A dark blue dungaree paired with a baby pink tee, a pair of white gloves, purple shoes and yes, a baby pink chart paper cut out of the pig’s face. Wear all of this on and you will look like an Apache Pig. Complement the look with a wide grin when you make an entrance. [caption id="attachment_1414" align="aligncenter" width="708"] Two baby boys dressed in animal costumes in autumn park, focus on baby in elephant costume[/caption] Happy Haloween folks! Watch this space for more data science themed Haloween costume ideas tomorrow.  
Implementing Autoencoders using H2O

Amey Varangaonkar
27 Oct 2017
4 min read
[box type="note" align="" class="" width=""]This excerpt is taken from the book Neural Networks with R, Chapter 7, Use Cases of Neural Networks - Advanced Topics, written by Giuseppe Ciaburro and Balaji Venkateswaran. In this article, we see how R is an effective tool for neural network modelling, by implementing autoencoders using the popular H2O library.[/box] An autoencoder is an ANN used for learning without efficient coding control. The purpose of an autoencoder is to learn coding for a set of data, typically to reduce dimensionality. Architecturally, the simplest form of autoencoder is an advanced and non-recurring neural network very similar to the MLP, with an input level, an output layer, and one or more hidden layers that connect them, but with the layer outputs having the same number of input level nodes for rebuilding their inputs. In this section, we present an example of implementing Autoencoders using H2O on a movie dataset. The dataset used in this example is a set of movies and genre taken from https://grouplens.org/datasets/movielens We use the movies.csv file, which has three columns: movieId title genres There are 164,979 rows of data for clustering. We will use h2o.deeplearning to have the autoencoder parameter fix the clusters. The objective of the exercise is to cluster the movies based on genre, which can then be used to recommend similar movies or same genre movies to the users. The program uses h20.deeplearning, with the autoencoder parameter set to T: library("h2o") setwd ("c://R") #Load the training dataset of movies movies=read.csv ( "movies.csv", header=TRUE) head(movies) model=h2o.deeplearning(2:3, training_frame=as.h2o(movies), hidden=c(2), autoencoder = T, activation="Tanh") summary(model) features=h2o.deepfeatures(model, as.h2o(movies), layer=1) d=as.matrix(features[1:10,]) labels=as.vector(movies[1:10,2]) plot(d,pch=17) text(d,labels,pos=3) Now, let's go through the code: library("h2o") setwd ("c://R") These commands load the library in the R environment and set the working directory where we will have inserted the dataset for the next reading. Then we load the data: movies=read.csv( "movies.csv", header=TRUE) To visualize the type of data contained in the dataset, we analyze a preview of one of these variables: head(movies) The following figure shows the first 20 rows of the movie dataset: Now we build and train model: model=h2o.deeplearning(2:3, training_frame=as.h2o(movies), hidden=c(2), autoencoder = T, activation="Tanh") Let's analyze some of the information contained in model: summary(model) This is an extract from the results of the summary() function: In the next command, we use the h2o.deepfeatures() function to extract the nonlinear feature from an h2o dataset using an H2O deep learning model: features=h2o.deepfeatures(model, as.h2o(movies), layer=1) In the following code, the first six rows of the features extracted from the model are shown: > features DF.L1.C1 DF.L1.C2 1 0.2569208 -0.2837829 2 0.3437048 -0.2670669 3 0.2969089 -0.4235294 4 0.3214868 -0.3093819 5 0.5586608 0.5829145 6 0.2479671 -0.2757966 [9125 rows x 2 columns] Finally, we plot a diagram where we want to see how the model grouped the movies through the results obtained from the analysis: d=as.matrix(features[1:10,]) labels=as.vector(movies[1:10,2]) plot(d,pch=17) text(d,labels,pos=3) The plot of the movies, once clustering is done, is shown next. We have plotted only 100 movie titles due to space issues. 
We can see some movies placed close together, meaning they are of the same genre. The titles are clustered based on the distances between them, which reflect genre. Given the large number of titles, the movie names cannot all be distinguished, but what appears clear is that the model has grouped the movies into three distinct groups. If you found this excerpt useful, make sure you check out the book Neural Networks with R, which contains interesting coverage of many more such useful and insightful topics.
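For readers who prefer Python over R, a rough equivalent of the training and feature-extraction steps above can be sketched with H2O's Python API. This is a hedged sketch rather than code from the book; the file path and the zero-based layer index are assumptions.

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()

# assumes the same MovieLens movies.csv with movieId, title, and genres columns
movies = h2o.import_file("movies.csv")

# two-node hidden layer used as a bottleneck, mirroring hidden=c(2) in the R example
model = H2ODeepLearningEstimator(hidden=[2], autoencoder=True, activation="tanh")
model.train(x=["title", "genres"], training_frame=movies)

# extract the learned two-dimensional codes (the layer index is zero-based here)
features = model.deepfeatures(movies, 0)
print(features.head())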
Top 5 Machine Learning Movies

Chris Key
17 Oct 2017
3 min read
Sitting in Mumbai airport at 2am can lead to some truly random conversations. Discussing the plot of Short Circuit 2 led us to thinking about this article. Here's my list of the top 5 movies featuring advanced machine learning. Short Circuit 2 [imdb] "Hey laser-lips, your momma was a snow blower!" A plucky robot who has named himself Johnny 5 returns to the screens to help build toy robots in a big city. By this point he is considered to have actual intelligence rather than artificial intelligence, however the plot of the film centres around his naivety and lack of ability to see the dark motives behind his new buddy, Oscar. We learn that intelligence can be applied anywhere, but sometimes it is the wrong place. Or right if you like stealing car stereos for "Los Locos". The Matrix Revolutions [imdb] The robots learn to balance an equation. Bet you wish you had them in your math high-school class. Also kudos to the Wachowski brothers who learnt from the machines the ability to balance the equation and released this monstrosity to even out the universe in light of the amazing first film in the trilogy. Blade Runner [imdb] “I've seen things you people wouldn't believe.” In the ultimate example of machines (see footnote) learning to emulate humanity, we struggled for 30 years to understand if Deckard was really human or a Nexus (spoilers: he is almost certainly a replicant!). It is interesting to note that when Pris and Roy are teamed up with JF Sebastian, their behaviours, aside from the occasional murder, show them to be more socially aware than their genius inventor friend. Wall-E [imdb] Disney and Pixar made a movie with no dialog for the entire first half, yet it was enthralling to watch. Without saying a single word, we see a small utility robot display a full range of emotions that we can relate to. He also demonstrates other signs of life – his need for energy and rest, and his sense of purpose is divided between his prime directive of cleaning the planet, and his passion for collecting interesting objects. Terminator 2 [imdb] “I know now why you cry, but it is something I can never do” Sarah Connor tells us that “Watching John with the machine, it was suddenly so clear. The terminator, would never stop. It would never leave him, and it would never hurt him, never shout at him, or get drunk and hit him, or say it was too busy to spend time with him. It would always be there. And it would die, to protect him.” Yet John Connor teaches the deadly robot, played by the invincible ex-Governator Arnold Schwarzenegger, how to be normal in society. No Problemo. Gimme five. Hasta La Vista, baby. Footnote - replicants aren't really machines. The replicants are genetic engineered and created by the Tyrell corporation with limited lifespans and specific abilities. For all intents and purposes, they are really organic robots.
Machine Learning Models

Packt
16 Aug 2017
8 min read
In this article by Pratap Dangeti, the author of the book Statistics for Machine Learning, we will take a look at ridge regression and lasso regression in machine learning. (For more resources related to this topic, see here.)
Ridge regression and lasso regression
In linear regression, only the residual sum of squares (RSS) is minimized, whereas in ridge and lasso regression a penalty (also known as a shrinkage penalty) is applied to the coefficient values in order to regularize the coefficients with the tuning parameter λ. When λ = 0, the penalty has no impact and ridge/lasso produce the same result as linear regression, whereas λ → ∞ shrinks the coefficients towards zero. Before we go deeper into ridge and lasso, it is worth understanding some concepts about Lagrange multipliers. One can rewrite the preceding objective function in the following format, where the objective is just the RSS subject to a cost constraint (budget) s. For every value of λ, there is some s that provides equivalent equations, as shown below for the overall objective function with a penalty factor:
The following graph shows the two different Lagrangian formats:
Ridge regression works well in situations where the least squares estimates have high variance. Ridge regression has computational advantages over best subset selection, which requires 2^P models. In contrast, for any fixed value of λ, ridge regression fits only a single model, and the model-fitting procedure can be performed very quickly. One disadvantage of ridge regression is that it includes all the predictors and shrinks their weights according to their importance, but it does not set the values exactly to zero in order to eliminate unnecessary predictors from the model; this issue is overcome in lasso regression. When the number of predictors is significantly large, using ridge may provide good accuracy, but it includes all the variables, which is not desired in a compact representation of the model; this issue is not present in lasso, as it sets the weights of unnecessary variables to zero. Models generated from lasso are very much like subset selection, hence they are much easier to interpret than those produced by ridge regression.
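For reference, the penalized (Lagrangian) and constrained (budget) forms of the two estimators discussed above can be written in standard textbook notation as follows; this is a sketch in LaTeX rather than the book's own figures:

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
\quad \Longleftrightarrow \quad \min_{\beta} \ \mathrm{RSS}(\beta) \ \ \text{subject to} \ \ \sum_{j=1}^{p} \beta_j^2 \le s

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
\quad \Longleftrightarrow \quad \min_{\beta} \ \mathrm{RSS}(\beta) \ \ \text{subject to} \ \ \sum_{j=1}^{p} |\beta_j| \le s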
Example of ridge regression machine learning model
Ridge regression is a machine learning model in which we do not perform any statistical diagnostics on the independent variables; we just fit the model on the training data and check the accuracy of fit on the test data. Here, we have used the scikit-learn package:
>>> import pandas as pd
>>> from sklearn.linear_model import Ridge
>>> from sklearn.model_selection import train_test_split
>>> wine_quality = pd.read_csv("winequality-red.csv", sep=';')
>>> wine_quality.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
>>> all_colnms = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol']
>>> pdx = wine_quality[all_colnms]
>>> pdy = wine_quality["quality"]
>>> x_train,x_test,y_train,y_test = train_test_split(pdx,pdy,train_size = 0.7,random_state=42)
A simple version of grid search from scratch is described as follows, in which various values of alpha are tried in order to test the model fitness (a compact alternative using scikit-learn's GridSearchCV is sketched at the end of this ridge example):
>>> alphas = [1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0]
The initial value of R-squared is set to zero in order to keep track of its updated value and to print whenever the new value is greater than the existing value:
>>> initrsq = 0
>>> print ("\nRidge Regression: Best Parameters\n")
>>> for alph in alphas:
...     ridge_reg = Ridge(alpha=alph)
...     ridge_reg.fit(x_train,y_train)
...     tr_rsqrd = ridge_reg.score(x_train,y_train)
...     ts_rsqrd = ridge_reg.score(x_test,y_test)
The following code always keeps track of the test R-squared value and prints it if the new value is greater than the existing best value:
...     if ts_rsqrd > initrsq:
...         print ("Lambda: ",alph,"Train R-Squared value:",round(tr_rsqrd,5),"Test R-squared value:",round(ts_rsqrd,5))
...         initrsq = ts_rsqrd
It is shown in the following screenshot:
By looking at the test R-squared value (0.3513), we can conclude that there is no significant relationship between the independent and dependent variables. Also, please note that the test R-squared value generated from ridge regression is similar to the value obtained from multiple linear regression (0.3519), but with no stress on the diagnostics of variables, and so on. Hence, machine learning models are relatively compact and can be utilized for learning automatically without manual intervention to retrain the model; this is one of the biggest advantages of using ML models for deployment purposes. The R code for ridge regression on the wine quality data is shown as follows:
# Ridge regression
library(glmnet)
wine_quality = read.csv("winequality-red.csv",header=TRUE,sep = ";",check.names = FALSE)
names(wine_quality) <- gsub(" ", "_", names(wine_quality))
set.seed(123)
numrow = nrow(wine_quality)
trnind = sample(1:numrow,size = as.integer(0.7*numrow))
train_data = wine_quality[trnind,]; test_data = wine_quality[-trnind,]
xvars = c("fixed_acidity","volatile_acidity","citric_acid","residual_sugar","chlorides","free_sulfur_dioxide",
"total_sulfur_dioxide","density","pH","sulphates","alcohol")
yvar = "quality"
x_train = as.matrix(train_data[,xvars]); y_train = as.double(as.matrix(train_data[,yvar]))
x_test = as.matrix(test_data[,xvars])
print(paste("Ridge Regression"))
lambdas = c(1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0)
initrsq = 0
for (lmbd in lambdas){
  ridge_fit = glmnet(x_train,y_train,alpha = 0,lambda = lmbd)
  pred_y = predict(ridge_fit,x_test)
  R2 <- 1 - (sum((test_data[,yvar]-pred_y)^2)/sum((test_data[,yvar]-mean(test_data[,yvar]))^2))
  if (R2 > initrsq){
    print(paste("Lambda:",lmbd,"Test Adjusted R-squared :",round(R2,4)))
    initrsq = R2
  }
}
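As mentioned above, the manual alpha loop can also be expressed with scikit-learn's built-in GridSearchCV. The following is a minimal sketch, assuming the same x_train and y_train created earlier; it is an alternative illustration, not part of the book's code.

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": [1e-4, 1e-3, 1e-2, 0.1, 0.5, 1.0, 5.0, 10.0]}

# 5-fold cross-validated search over the same alpha values, scored by R-squared
grid = GridSearchCV(Ridge(), param_grid, scoring="r2", cv=5)
grid.fit(x_train, y_train)

print("Best alpha:", grid.best_params_["alpha"])
print("Best cross-validated R-squared:", round(grid.best_score_, 5))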
Example of lasso regression model
Lasso regression is a close cousin of ridge regression, in which the absolute values of the coefficients are penalized rather than the squares of the values. By doing so, we eliminate some insignificant variables, which gives a much more compact representation, similar to OLS methods. The following implementation is almost the same as ridge regression, apart from the penalty being applied to the mod/absolute value of the coefficients:
>>> from sklearn.linear_model import Lasso
>>> alphas = [1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0]
>>> initrsq = 0
>>> print ("\nLasso Regression: Best Parameters\n")
>>> for alph in alphas:
...     lasso_reg = Lasso(alpha=alph)
...     lasso_reg.fit(x_train,y_train)
...     tr_rsqrd = lasso_reg.score(x_train,y_train)
...     ts_rsqrd = lasso_reg.score(x_test,y_test)
...     if ts_rsqrd > initrsq:
...         print ("Lambda: ",alph,"Train R-Squared value:",round(tr_rsqrd,5),"Test R-squared value:",round(ts_rsqrd,5))
...         initrsq = ts_rsqrd
It is shown in the following screenshot:
Lasso regression produces results almost similar to ridge, but if we check the test R-squared values a bit carefully, lasso produces slightly lower values. The reason behind this could be its robustness in reducing coefficients to zero and eliminating them from the analysis:
>>> ridge_reg = Ridge(alpha=0.001)
>>> ridge_reg.fit(x_train,y_train)
>>> print ("\nRidge Regression coefficient values of Alpha = 0.001\n")
>>> for i in range(11):
...     print (all_colnms[i],": ",ridge_reg.coef_[i])
>>> lasso_reg = Lasso(alpha=0.001)
>>> lasso_reg.fit(x_train,y_train)
>>> print ("\nLasso Regression coefficient values of Alpha = 0.001\n")
>>> for i in range(11):
...     print (all_colnms[i],": ",lasso_reg.coef_[i])
The following results show the coefficient values of both methods. The coefficient of density has been set to 0 in lasso regression, whereas the density coefficient is -5.5672 in ridge regression; also, none of the coefficients in ridge regression are zero:
R Code – Lasso Regression on Wine Quality Data
# The above data processing steps are the same as for ridge regression; only the following section of the code changes
# Lasso Regression
print(paste("Lasso Regression"))
lambdas = c(1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0)
initrsq = 0
for (lmbd in lambdas){
  lasso_fit = glmnet(x_train,y_train,alpha = 1,lambda = lmbd)
  pred_y = predict(lasso_fit,x_test)
  R2 <- 1 - (sum((test_data[,yvar]-pred_y)^2)/sum((test_data[,yvar]-mean(test_data[,yvar]))^2))
  if (R2 > initrsq){
    print(paste("Lambda:",lmbd,"Test Adjusted R-squared :",round(R2,4)))
    initrsq = R2
  }
}
Regularization parameters in linear regression and ridge/lasso regression
Adjusted R-squared in linear regression, which always penalizes the addition of extra variables with less significance, is one way of regularizing the data in linear regression, but it adjusts to a unique fit of the model. In machine learning, by contrast, many parameters are adjusted to regularize the overfitting problem; in the example of lasso/ridge regression, infinitely many values of the penalty parameter (λ) can be applied to regularize the model in infinitely many ways:
Overall, there are many similarities between the statistical and machine learning ways of predicting patterns.
Summary
We have seen ridge regression and lasso regression with their examples, and we have also seen their regularization parameters.
Resources for Article: Further resources on this subject: Machine Learning Review [article] Getting Started with Python and Machine Learning [article] Machine learning in practice [article]
Introduction to the Latest Social Media Landscape and Importance

Packt
14 Aug 2017
10 min read
In this article by Siddhartha Chatterjee and Michal Krystyanczuk, authors of the book Python Social Media Analytics, we start with a question to you: Have you seen the movie The Social Network? If you have not, it could be a good idea to see it before you read this. If you have, you may have seen the success story around Mark Zuckerberg and his company Facebook. This was possible due to the power of the platform in connecting, enabling, sharing, and impacting the lives of almost two billion people on this planet. The earliest social networks existed as far back as 1995, such as Yahoo (Geocities), theglobe.com, and tripod.com. These platforms were mainly there to facilitate interaction among people through chat rooms. It was only at the end of the 90s that user profiles became the in thing in social networking platforms, allowing information about people to be discoverable and, therefore, providing a choice to make friends or not. Those embracing this new methodology were Makeoutclub, Friendster, SixDegrees.com, and so on. MySpace, LinkedIn, and Orkut were thereafter created, and social networks were on the verge of becoming mainstream. However, the biggest impact happened with the creation of Facebook in 2004; it was a total game changer for people's lives, business, and the world. The sophistication and ease of using the platform made it a mainstream medium for individuals and companies to advertise and sell their ideas and products. Hence, we are in the age of social media, which has changed the way the world functions. Over the last few years, there have been new entrants in social media with essentially different interaction models compared to Facebook, LinkedIn, or Twitter. These are Pinterest, Instagram, Tinder, and others. An interesting example is Pinterest, which, unlike Facebook, is centered not around people but around interests and/or topics. It is essentially able to structure people based on their interests around these topics. The CEO of Pinterest describes it as a catalog of ideas. Forums, which are not considered regular social networks such as Facebook or Twitter, are also very important social platforms. Unlike on Twitter or Facebook, forum users are often anonymous, which enables them to have in-depth conversations with communities. Other non-typical social networks are video sharing platforms, such as YouTube and Dailymotion. They are non-typical because they are centered around user-generated content, and the social nature is generated by the sharing of this content on various social networks and also by the discussion it generates around the user commentaries. Social media is gradually changing from being platform centric to being about experiences and features. In the future, we'll see more and more traditional content providers and services becoming social in nature through sharing and conversations. The term social media today includes not just social networks but every service that's social in nature with a wide audience.
Delving into Social Data
The data acquired from social media is called social data. Social data exists in many forms. Some types of social media data are information around the users of social networks, like name, city, interests, and so on. Types of data that are numeric or quantifiable are known as structured data. However, since social media are platforms for expression, a lot of the data is in the form of texts, images, videos, and such.
These sources are rich in information, but not as direct to analyze as the structured data described earlier. These types of data are known as unstructured data. The process of applying rigorous methods to make sense of social data is called social data analytics. We will go into great depth in social data analytics to demonstrate how we can extract valuable sense and information from these really interesting sources of social data. Since there are almost no restrictions on social media, there are a lot of meaningless accounts, content, and interactions. So, the data coming out of these streams is quite noisy and polluted. Hence, a lot of effort is required to separate the information from the noise. Once the data is cleaned and we are focused on the most important and interesting aspects, we then require various statistical and algorithmic methods to make sense out of the filtered data and draw meaningful conclusions.
Understanding the process
Once you are familiar with the topic of social media data, let us proceed to the next phase. The first step is to understand the process involved in the exploitation of data present on social networks. A proper execution of the process, with attention to small details, is the key to good results. In many computer science domains, a small error in code will lead to a visible or at least correctable dysfunction, but in data science, it will produce entirely wrong results, which in turn will lead to incorrect conclusions. The very first step of data analysis is always problem definition. Understanding the problem is crucial for choosing the right data sources and the methods of analysis. It also helps to realize what kind of information and conclusions we can infer from the data and what is impossible to derive. This part is very often underestimated, while it is key to successful data analysis. Any question that we try to answer in a data science project has to be very precise. Some people tend to ask very generic questions, such as "I want to find trends on Twitter." This is not a correct problem definition, and an analysis based on such a statement can fail to find relevant trends. With a naïve analysis, we can get repeating Twitter ads and content generated by bots. Moreover, it raises more questions than it answers. In order to approach the problem correctly, we have to ask in the first step: What is a trend? What is an interesting trend for us? And what is the time scope? Once we answer these questions, we can break up the problem into multiple subproblems: "I'm looking for the most frequent consumer reactions about my brand on Twitter in English over the last week, and I want to know if they were positive or negative." Such a problem definition will lead to a relevant, valuable analysis with insightful conclusions. The next part of the process consists of getting the right data according to the defined problem. Many social media platforms allow users to collect a lot of information in an automated way via APIs (Application Programming Interfaces), which is the easiest way to complete the task. Once the data is stored in a database, we perform the cleaning. This step requires a precise understanding of the project's goals. In many cases, it will involve very basic tasks such as duplicate removal, for example, retweets on Twitter, or more sophisticated ones such as spam detection to remove irrelevant comments, language detection to perform linguistic analysis, or other statistical or machine learning approaches that can help to produce a clean dataset.
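To make the cleaning step concrete, here is a small, hedged Python sketch. The DataFrame columns are hypothetical and the third-party langdetect package is an assumption, not code from the book; it simply removes duplicate comments and keeps only the ones detected as English.

import pandas as pd
from langdetect import detect

# hypothetical frame of collected comments: one row per comment
comments = pd.DataFrame({
    "user": ["a", "b", "b", "c"],
    "text": ["Great phone!", "RT great phone!", "RT great phone!", "Très bon téléphone"],
})

# drop exact duplicates (for example, the same retweet collected twice)
comments = comments.drop_duplicates(subset=["user", "text"])

def is_english(text):
    try:
        return detect(text) == "en"
    except Exception:  # langdetect raises on very short or empty strings
        return False

clean = comments[comments["text"].apply(is_english)]
print(clean)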
When the data is ready to be analyzed, we have to choose the kind of analysis and structure the data accordingly. If our goal is to understand the sense of the conversations, then it only requires a simple list of verbatims (textual data), but if we aim to perform analysis on different variables, like the number of likes, dates, number of shares, and so on, the data should be combined in a structure such as a data frame, where each row corresponds to an observation and each column to a variable. The choice of the analysis method depends on the objectives of the study and the type of data. It may require a statistical or machine learning approach, or a specific approach to time series. Different approaches will be explained using examples of Facebook, Twitter, YouTube, GitHub, Pinterest, and forum data. Once the analysis is done, it's time to infer conclusions. We can derive conclusions based on the outputs from the models, but one of the most useful tools is visualization. Data and output can be presented in many different ways, starting from charts, plots, and diagrams through more complex 2D charts, to multidimensional visualizations.
Project planning
Analysis of content on social media can get very confusing due to the difficulty of working with a large amount of data and also trying to make sense out of it. For this reason, it's extremely important to ask the right questions in the beginning to get the right answers. Even though this is an exploratory approach, and getting exact answers may be difficult, the right questions allow you to define the scope, process, and time. The main questions that we will be working on are the following:
What does Google post on Facebook?
How do people react to Google posts? (Likes, shares, and comments)
What does Google's audience say about Google and its ecosystem?
What are the emotions expressed by Google's audience?
With the preceding questions in mind, we will proceed to the next steps.
Scope and process
The analysis will consist of analyzing the feed of posts and comments on the official Facebook page of Google. The process of information extraction is organized in a data flow. It starts with data extraction from the API, followed by data preprocessing and wrangling, and then a series of different analyses. The analysis becomes actionable only after the last step of results interpretation. In order to retrieve the above information, we need to do the following:
Extract all the posts of Google permitted by the Facebook API
Extract the metadata for each post: timestamp, number of likes, number of shares, number of comments
Extract the user comments under each post and their metadata
Process the posts to retrieve the most common keywords, bi-grams, and hashtags
Process the user comments using the Alchemy API to retrieve the emotions
Analyse the above information to derive conclusions
Data type
The main part of information extraction comes from an analysis of textual data (posts and comments). However, in order to add a quantitative and temporal dimension, we process numbers (likes, shares) and dates (date of creation).
Summary
The avalanche of social network data is a result of the communication platforms that have been developed over the last two decades. These are platforms that evolved from chat rooms to personal information sharing and, finally, social and professional networks. Among many, Facebook, Twitter, Instagram, Pinterest, and LinkedIn have emerged as the modern day social media.
These platforms collectively have a reach of more than a billion individuals across the world, who share their activities and interact with each other. The sharing of their data by these media through APIs and other technologies has given rise to a new field called Social Media Analytics. This has multiple applications, such as marketing, personalized recommendations, research, and societal applications. Modern data science techniques such as machine learning and text mining are widely used for these applications. Python is one of the most widely used programming languages for these techniques. However, manipulating the unstructured data from social networks requires a lot of precise processing and preparation before coming to the most interesting bits. Resources for Article: Further resources on this subject: How to integrate social media with your WordPress website [article] Social Media for Wordpress: VIP Memberships [article] Social Media Insight Using Naive Bayes [article]
Understanding SAP Analytics Cloud

Packt
10 Aug 2017
14 min read
In this article, Riaz Ahmed, the author of the book Learning SAP Analytics Cloud, provides an overview of this unique cloud-based business intelligence platform. You will learn about the following SAP Analytics Cloud segments: models and data sources, visualization, collaboration, presentation, and administration and security. (For more resources related to this topic, see here.)
What is SAP Analytics Cloud?
SAP Analytics Cloud is a new-generation cloud-based application, which helps you explore your data, perform visualization and analysis, create financial plans, and produce predictive forecasts. It is a one-stop-shop solution to cope with your analytic needs and comprises business intelligence, planning, predictive analytics, and governance and risk. The application is built on the SAP Cloud Platform and delivers great performance and scalability. In addition to the on-premise SAP HANA, SAP BW, and S/4HANA sources, you can work with data from a wide range of non-SAP sources including Google Drive, Salesforce, SQL Server, Concur, and CSV, to name a few. SAP Analytics Cloud allows you to make secure connections to these cloud and on-premise data sources.
Anatomy of SAP Analytics Cloud
The following figure depicts the anatomy of SAP Analytics Cloud:
Data sources and models
Before commencing your analytical tasks in SAP Analytics Cloud, you need to create models. Models are the basis for all of your analysis in SAP Analytics Cloud to evaluate the performance of your organization. A model is a high-level design that exposes the analytic requirements of end users. You can create planning and analytics models based on cloud or on-premise data sources. Analytics models are simpler and more flexible, while planning models are full-featured models in which business analysts and finance professionals can quickly and easily build connected models to analyze data and then collaborate with each other to attain better business performance. Preconfigured with dimensions for Time and Categories, planning models support multi-currency and security features at both model and dimension levels. After creating these models, you can share them with other users in your organization. Before sharing, you can set up model access privileges for users according to their level of authorization and can also enable data auditing. With the help of SAP Analytics Cloud's analytical capabilities, users can discover hidden traits in the data and can predict likely outcomes. It equips them with the ability to uncover potential risks and hidden opportunities. To determine what content to include in your model, you must first identify the columns from the source data on which users need to query. The columns you need in your model reside in some sort of data source. SAP Analytics Cloud supports three types of data sources: files (such as CSV or Excel files) that usually reside on your computer, live data connections from a connected remote system, and cloud apps. In addition to the files on your computer, you can use on-premise data sources such as SAP Business Warehouse, SAP ERP, SAP Universe, SQL database, and more to acquire data for your models. In the cloud, you can get data from apps like Concur, Google Drive, SAP Business ByDesign, SAP Hybris Cloud, OData Services, and SuccessFactors. The following figure depicts these data sources.
The cloud app data sources you can use with SAP Analytics Cloud are displayed above the firewall mark, while those in your local network are shown under the firewall. As you can see in this figure, there are over twenty data sources currently supported by SAP Analytics Cloud. The methods of connecting to these data sources also vary from each other.
Create a direct live connection to SAP HANA
You can connect to an on-premise SAP HANA system to use live data in SAP Analytics Cloud. Live data means that you can get up-to-the-minute data when you open a story in SAP Analytics Cloud. In this case, any changes made to the data in the source system are reflected immediately. Usually, there are two ways to establish a connection to a data source – use the Connection option from the main menu, or specify the data source during the process of creating a model. However, live data connections must be established via the Connection menu option prior to creating the corresponding model.
Connect remote systems to import data
In addition to creating live connections, you can also create connections that allow you to import data into SAP Analytics Cloud. In these types of connections that you make to access remote systems, data is imported (copied) to SAP Analytics Cloud. Any changes users make in the source data do not affect the imported data. To establish connections with these remote systems, you need to install some additional components. For example, you must install SAP HANA Cloud Connector to access SAP Business Planning and Consolidation (BPC) for Netweaver. Similarly, SAP Analytics Cloud Agent should be installed for SAP Business Warehouse (BW), SQL Server, SAP ERP, and others.
Connect to a cloud app to import data
In addition to creating a live connection to create a model on live data and importing data from remote systems, you can set up connections to acquire data from cloud apps, such as Google Drive, SuccessFactors, OData, Concur, Fieldglass, Google BigQuery, and more.
Refreshing imported data
SAP Analytics Cloud allows you to refresh your imported data. With this option, you can reimport the data on demand to get the latest values. You can perform this refresh operation either manually, or create an import schedule to refresh the data at a specific date and time or on a recurring basis.
Visualization
Once you have created a model and set up appropriate security for it, you can create stories in which the underlying model data can be explored and visualized with the help of different types of charts, geo maps, and tables. There is a wide range of charts you can add to your story pages to address different scenarios. You can create multiple pages in your story to present your model data using charts, geo maps, tables, text, shapes, and images. On your story pages, you can link dimensions to present data from multiple sources. Adding reference lines and thresholds, applying filters, and drilling down into data can be done on the fly. The ability to interactively drag and drop page objects is the most useful feature of this application. Charts: SAP Analytics Cloud comes with a variety of charts to present your analysis according to your specific needs. You can add multiple types of charts to a single story page. Geo Map: The models that include latitude and longitude information can be used in stories to visualize data in geo maps.
By adding multiple layers of different types of data in geo maps, you can show different geographic features and points of interest, enabling you to perform sophisticated geographic analysis. Table: A table is a spreadsheet-like object that can be used to view and analyze text data. You can add this object to either canvas or grid pages in stories. Static and Dynamic Text: You can add static and dynamic text to your story pages. Static text is normally used to display page titles, while dynamic text automatically updates page headings based on the values from the source input control or filter. Images and Shapes: You can add images (such as your company logo) to your story page by uploading them from your computer. In addition to images, you can also add shapes such as a line, square, or circle to your page.
Collaboration
The collaboration, alert, and notification features of SAP Analytics Cloud keep business users in touch with each other while executing their tasks. During the process of creating models and stories, you need input from other people. For example, you might ask a colleague to update a model by providing first-quarter sales data, or request Sheela to enter her comments on one of your story pages. In SAP Analytics Cloud, these interactions come under the collaboration features of the application. Using these features, users can discuss business content and share information, which consequently smooths the decision-making process. Here is a list of the available collaboration features in the application that allow group members to discuss stories and other business content.
Create a workflow using events and tasks: The events and tasks features in SAP Analytics Cloud are the two major sources that help you collaborate with other group members and manage your planning and analytic activities. After creating an event and assigning tasks to relevant group members, you can monitor the task progress in the Events interface. Here is the workflow to utilize these two features:
Create events based on categories and processes within categories
Create a task, assign it to users, and set a due date for its submission
Monitor the task progress
Commenting on a chart's data point: Using this feature, you can add annotations or additional information to individual data points in a chart. Commenting on a story page: In addition to adding comments to an individual chart, you have the option to add comments on an entire story page to provide some vital information to other users. When you add a comment, other users can see and reply to it. Produce a story as a PDF: You can save your story in a PDF file to share it with other users and for offline access. You can save all story pages or a specific page as a PDF file. Sharing a story with colleagues: Once you complete a story, you can share it with other members in your organization. You are provided with three options (Public, Teams, and Private) when you save a story. When you save your story in the Public folder, it can be accessed by anyone. The Teams option lets you select specific teams to share your story with. In the Private option, you have to manually select users with whom you want to share your story. Collaborate via discussions: You can collaborate with colleagues using the discussions feature of SAP Analytics Cloud. The discussions feature enables you to connect with other members in real time.
Sharing files and other objects: The Files option under the Browse menu allows you to access SAP Analytics Cloud repository, where the stories you created and the files you uploaded are stored. After accessing the Files page, you can share its objects with other users. On the Files page, you can manage files and folders, upload files, share files, and change share settings. Presentation In the past, board meetings were held in which the participants used to deliver their reports through their laptops and a bunch of papers. With different versions of reality, it was very difficult for decision makers to arrive at a good decision. With the advent of SAP Digital Boardroom the board meetings have revolutionized. It is a next-generation presentation platform, which helps you visualize your data and plan ahead in a real-time environment. It runs on multiple screens simultaneously displaying information from difference sources. Due to this capability, more people can work together using a single dataset that consequently creates one version of reality to make the best decision. SAP Digital Boardroom is changing the landscape of board meetings. In addition to supporting traditional presentation methods, it goes beyond the corporate boardroom and allows remote members to join and participate the meeting online. These remote participants can actively engage in the meeting and can play with live data using their own devices. SAP Digital Boardroom is a visualization and presentation platform that enormously assists in the decision-making process. It transforms executive meetings by replacing static and stale presentations with interactive discussions based on live data, which allows them to make fact-based decisions to drive their business. Here are the main benefits of SAP Digital Boardroom: Collaborate with others in remote locations and on other devices in an interactive meeting room. Answer ad hoc questions on the fly. Visualize, recognize, experiment, and decide by jumping on and off script at any point. Find answers to the questions that matter to you by exploring directly on live data and focusing on relevant aspects by drilling into details. Discover opportunities or reveal hidden threats. Simulate various decisions and project the results. Weigh and share the pros and cons of your findings. The Digital Boardroom is interactive so you can retrieve real-time data, make changes to the schedule and even run through what-if scenarios. It presents a live picture of your organization across three interlinked touch screens to make faster, better executive decisions. There are two aspects of Digital Boardroom with which you can share existing stories with executives and decision-makers to reveal business performance. First, you have to design your agenda for your boardroom presentation by adding meeting information, agenda items, and linking stories as pages in a navigation structure. Once you have created a digital boardroom agenda, you can schedule a meeting to discuss the agenda. In Digital Boardroom interface, you can organize a meeting in which members can interact with live data during a boardroom presentation. Administration andsecurity An application that is accessed by multiple users is useless without a proper system administration module. The existence of this module ensures that things are under control and secured. It also includes upkeep, configuration, and reliable operation of the application. 
This module is usually assigned to a person who is called system administrator and whose responsibility is to watch uptime, performance, resources, and security of the application. Being a multi-user application, SAP Analytics Cloud also comes with this vital module that allows a system administrator to take care of the following segments: Creating users and setting their passwords Importing users from another data source Exporting user's profiles for other apps Deleting unwanted user accounts from the system Creating roles and assigning them to users Setting permissions for roles Forming teams Setting security for models Monitoring users activities Monitoring data changes Monitoring system performance System deployment via export and import Signing up for trial version If you want to get your feet wet by exploring this exciting cloud application, then sign up for a free 30 days trial version. Note that the trial version doesn't allow you to access all the features of SAP Analytics Cloud. For example, you cannot create a planning model in the trial version nor can you access its security and administration features. Execute the following steps to get access to the free SAP Analytics Cloud trial: Put the following URL in you browser's address bar and press enter: http://discover.sapanalytics.cloud/trialrequest-auto/ Enter and confirm your business e-mail address in relevant boxes. Select No from the Is your company an SAP Partner? list. Click the Submit button. After a short while, you will get an e-mail with a link to connect to SAP Analytics Cloud. Click the Activate Account button in the e-mail. This will open the Activate Your Account page, where you will have to set a strong password. The password must be at least 8 characters long and should also include uppercase and lowercase letters, numbers, and symbols. After entering and confirming your password, click the Save button to complete the activation process. The confirmation page appears telling you that your account is successfully activated. Click Continue. You will be taken to SAP Analytics Cloud site. The e-mail you receive carries a link under SAP Analytics Cloud System section that you can use to access the application any time. Your username (your e-mail address) is also mentioned in the same e-mail along with a log on button to access the application. Summary SAP Analytics Cloud is the next generation cloud-based analytic application, which provides an end-to-end cloud analytics experience. SAP Analytics Cloud can help transform how you discover, plan, predict, collaborate, visualize, and extend all in one solution. In addition to on-premise data sources, you can fetch data from a variety of other cloud apps and even from Excel and text files to build your data models and then create stories based on these models. The ultimate purpose of this amazing and easy-to-use application is to enable you to make the right decision. SAP Analytics Cloud is more than visualization of data it is insight to action, it is realization of success. Resources for Article: Further resources on this subject: Working with User Defined Values in SAP Business One [article] Understanding Text Search and Hierarchies in SAP HANA [article] Meeting SAP Lumira [article]
Read more
  • 0
  • 0
  • 9861

article-image-machine-learning-review
Packt
18 Jul 2017
20 min read
Save for later

Machine Learning Review

Packt
18 Jul 2017
20 min read
In this article by Uday Kamath and Krishna Choppella, authors for the book Mastering Java Machine Learning, will discuss how in recent years a revival of interest is seen in the area ofartificial intelligence (AI)and machine learning, in particular, both in academic circles and industry. In the last decade, AI has seen dramatic successes that eluded practitioners in the intervening years since the original promise of the field gave way to relative decline until its re-emergence in the last few years. (For more resources related to this topic, see here.) What made these successes possible, in large part, was the availability of prodigious amounts of data and the inexorable increase in raw computational power. Among the areas of AI leading the resurgence, machine learning has seen spectacular developments and continues to find the widest applicability in an array of domains. The use of machine learning to help in complex decision making at the highest levels of business, and at the same time, its enormous success in improving the accuracy of what are now everyday applications, such assearch, speech recognition, and personal assistants on mobile phones,has made its effects commonplace in the family room and the boardroom alike. Articles breathlessly extolling the power of "deep learning" can be found today not only in the popular science and technology press, but also in mainstream outlets such as The New York Times and The Huffington Post. Machine learning has indeed become ubiquitous in a relatively short time. An ordinary user encounters machine learning in many ways in his day-to-day activities. Interacting with well-known e-mail providers such as Gmail gives the user automated sorting and categorization of e-mails into categories, such as spam, junk, promotions, and so on,which is made possible using text mining, a branch of machine learning. When shopping online for products on ecommerce websites such as https://www.amazon.com/ or watching movies from content providers such as http://netflix.com/, one is offered recommendations for other products and content by so-called recommender systems, another branch of machine learning. Forecasting the weather, estimating real estate prices, predicting voter turnout and even election results—all use some form of machine learningto see into the future as it were. The ever-growing availability of data and the promise of systems that can enrich our lives by learning from that data place a growing demand on skills from a limited workforce of professionals in the field of data science. This demand is particularly acute for well-trained experts who know their way around the landscape of machine learning techniques in the more popular languages, including Java, Python, R, and increasingly, Scala. By far, the number and availability of machine learning libraries, tools, APIs, and frameworks in Java outstrip those in other languages. Consequently, mastery of these skills will put any aspiring professional with a desire to enter the field at a distinct advantage in the marketplace. Perhaps you already apply machine learning techniques in your professional work, or maybe you simply have a hobbyist's interest in the subject.Clearly, you can bend Java to your will, but now you feel you're ready to dig deeper and learn how to use thebest of breed open-source ML Java frameworks in your next data science project. 
Mastery of a subject, especially one that has such obvious applicability as machine learning, requires more than an understanding of its core concepts and familiarity with its mathematical underpinnings. Unlike an introductory treatment of the subject, a project that purports to help you master the subject must be heavily focused on practical aspects in addition to introducing more advanced topics that would have stretched the scope of the introductory material.To warm up before we embark on sharpening our instrument, we will devote this article to a quick review of what we already know.For the ambitious novice with little or no prior exposure to the subject (who is nevertheless determined to get the fullest benefit from this article), here's our advice: make sure you do not skip the rest of this article instead, use it as a springboard to explore unfamiliar concepts in more depth. Seek out external resources as necessary.Wikipedia it. Then jump right back in. For the rest of this article, we will review the following: History and definitions What is not machine learning? Concepts and terminology Important branches of machine learning Different data types in machine learning Applications of machine learning Issues faced in machine learning The meta-process used in most machine learning projects Information on some well-known tools, APIs,and resources that we will employ in this article Machine learning –history and definition It is difficult to give an exact history, but the definition of machine learning we use today finds its usage as early as in the 1860s.In Rene Descartes' Discourse on the Method, he refers to Automata and saysthe following: For we can easily understand a machine's being constituted so that it can utter words, and even emit some responses to action on it of a corporeal kind, which brings about a change in its organs; for instance, if touched in a particular part it may ask what we wish to say to it; if in another part it may exclaim that it is being hurt, and so on http://www.earlymoderntexts.com/assets/pdfs/descartes1637.pd https://www.marxists.org/reference/archive/descartes/1635/discourse-method.htm Alan Turing, in his famous publication Computing Machinery and Intelligence, gives basic insights into the goals of machine learning by asking the question "Can machines think?". http://csmt.uchicago.edu/annotations/turing.htm http://www.csee.umbc.edu/courses/471/papers/turing.pdf Arthur Samuel, in 1959,wrote,"Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.". Tom Mitchell, in recent times, gave a more exact definition of machine learning:"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." 
Machine Learning has a relationship with several areas: Statistics: This uses the elements of data sampling, estimation, hypothesis testing, learning theory, and statistical based modeling, to name a few Algorithms and computation: This uses basics of search, traversal, parallelization, distributed computing, and so on from basic computer science Database and knowledge discovery: This has the ability to store, retrieve, access information in various formats Pattern recognition: This has the ability to find interesting patterns from the data either to explore, visualize, or predict Artificial Intelligence: Though it is considered a branch of artificial intelligence, it also has relationships with other branches, such as heuristics, optimization, evolutionary computing, and so on. What is not machine learning? It is important to recognize areas that share a connection with machine learning but cannot themselves be considered as being part of machine learning. Some disciplines may overlap to a smaller or larger extent, yet the principles underlying machine learning are quite distinct: Business intelligence and reporting: Reporting Key Performance Indicators (KPIs), querying OLAP for slicing, dicing, and drilling into the data, dashboards,and so on. that form central components of BI are not machine learning. Storage and ETL: Data storage and ETL are key elements needed in any machine learning process, but by themselves, they don't qualify as machine learning. Information retrieval, search, and queries: The ability to retrieve the data or documents based on search criteria or indexes, which form the basis of information retrieval, are not really machine learning. Many forms of machine learning, such as semi-supervised learning, can rely on search of similar data for modeling but that doesn't qualify search as machine learning. Knowledge representation and reasoning: Representing knowledge for performing complex tasks such as Ontology, Expert Systems, and Semantic Web do not qualify as machine learning. Machine learning –concepts and terminology In this article, we will describe different concepts and terms normally used in machine learning: Data or dataset: The basics of machine learning rely on understanding the data. The data or dataset normally refers to content available in structured or unstructured format for using in machine learning. Structured datasets have specific formats, and an unstructured dataset is normally in the form of some free flowing text. Data can be available in various storage types or formats. In structured data, every element known as an instance or an example or row follows a predefined structure. Data can be also be categorized by size; small or medium data have a few hundreds to thousands of instances, whereas big data refers to large volume, mostly in the millions or billions, which cannot be stored or accessed using common devices or fit in the memory of such devices. Features, attributes, variables or dimensions: In structured datasets, as mentioned earlier, there are predefined elements with their own semantic and data type, which are known variously as features, attributes, variables, or dimensions. Data types: The preceding features defined need some form of typing in many machine learning algorithms or techniques. The most commonly used data types are as follows: Categorical or nominal: This indicates well-defined categories or values present in the dataset. 
For example, eye color, such as black, blue, brown, green, or grey; document content type, such as text, image, or video. Continuous or numeric: This indicates the numeric nature of the data field. For example, a person's weight measured by a bathroom scale, temperature from a sensor, the monthly balance in dollars on a credit card account. Ordinal: This denotes the data that can be ordered in some way. For example, garment size, such as small, medium, or large; boxing weight classes, such as heavyweight, light heavyweight, middleweight, lightweight,and bantamweight. Target or label: A feature or set of features in the dataset, which is used for learning from training data and predicting in unseen dataset, is known as a target or a label. A label can have any form as specified earlier, that is, categorical, continuous, or ordinal. Machine learning model: Each machine learning algorithm, based on what it learned from the dataset, maintains the state of its learning for predicting or giving insights into future or unseen data. This is referred to as the machine learning model. Sampling: Data sampling is an essential step in machine learning. Sampling means choosing a subset of examples from a population with the intent of treating the behavior seen in the (smaller) sample as being representative of the behavior of the (larger) population. In order for the sample to be representative of the population, care must be taken in the way the sample is chosen. Generally, a population consists of every object sharing the properties of interest in the problem domain,for example,all people eligible to vote in the general election, all potential automobile owners in the next four years.Since it is usually prohibitive (or impossible) to collect data for all the objects in a population, a well-chosen subset is selected for the purposes of analysis.A crucial consideration in the sampling process is that the sample be unbiased with respect to the population. The following are types of probability based sampling: Uniform random sampling: A sampling method when the sampling is done over a uniformly distributed population, that is, each object has an equal probability of being chosen. Stratified random sampling: A sampling method when the data can be categorized into multiple classes.In such cases, in order to ensure all categories are represented in the sample, the population is divided into distinct strata based on these classifications, and each stratum is sampled in proportion to the fraction of its class in the overall population. Stratified sampling is common when the population density varies across categories, and it is important to compare these categories with the same statistical power. Cluster sampling: Sometimes there are natural groups among the population being studied, and each group is representative of the whole population.An example is data that spans many geographical regions. In cluster sampling you take a random subset of the groups followed by a random sample from within each of those groups to construct the full data sample.This kind of sampling can reduce costs of data collection without compromising the fidelity of distribution in the population. Systematic sampling: Systematic or interval sampling is used when there is a certain ordering present in the sampling frame (a finite set of objects treated as the population and taken to be the source of data for sampling, for example, the corpus of Wikipedia articles arranged lexicographically by title). 
If the sample is then selected by starting at a random object and skipping a constant k number of object before selecting the next one, that is called systematic sampling.K is calculated as the ratio of the population and the sample size. Model evaluation metrics: Evaluating models for performance is generally based on different evaluation metrics for different types of learning. In classification, it is generally based on accuracy, receiver operating characteristics (ROC) curves, training speed, memory requirements, false positive ratio,and so on. In clustering, the number of clusters found, cohesion, separation, and so on form the general metrics. In stream-based learning apart from preceding standard metrics mentioned, adaptability, speed of learning, and robustness to sudden changes are some of the conventional metrics for evaluating the performance of the learner. To illustrate these concepts, a concrete example in the form of a well-known weather dataset is given.The data gives a set of weather conditions and a label that indicates whether the subject decided to play a game of tennis on the day or not: @relation weather @attribute outlook {sunny, overcast, rainy} @attribute temperature numeric @attribute humidity numeric @attribute windy {TRUE, FALSE} @attribute play {yes, no} @data sunny,85,85,FALSE,no sunny,80,90,TRUE,no overcast,83,86,FALSE,yes rainy,70,96,FALSE,yes rainy,68,80,FALSE,yes rainy,65,70,TRUE,no overcast,64,65,TRUE,yes sunny,72,95,FALSE,no sunny,69,70,FALSE,yes rainy,75,80,FALSE,yes sunny,75,70,TRUE,yes overcast,72,90,TRUE,yes overcast,81,75,FALSE,yes rainy,71,91,TRUE,no The dataset is in the format of an ARFF (Attribute-Relation File Format) file. It consists of a header giving the information about features or attributes with their data types and actual comma separated data following the data tag. The dataset has five features: outlook, temperature, humidity, windy, and play. The features outlook and windy are categorical features, while humidity and temperature are continuous. The feature play is the target and is categorical. Machine learning –types and subtypes We will now explore different subtypes or branches of machine learning. Though the following list is not comprehensive, it covers the most well-known types: Supervised learning: This is the most popular branch of machine learning, which is about learning from labeled data. If the data type of the label is categorical, it becomes a classification problem, and if numeric, it becomes a regression problem. For example, if the target of the dataset is detection of fraud, which has categorical values of either true or false, we are dealing with a classification problem. If, on the other hand, the target is to predict thebest price to list the sale of a home at, which is a numeric dollar value, the problem is one of regression. The following diagram illustrates labeled data that is conducive to classification techniques that are suitable for linearly separable data, such as logistic regression: Linearly separable data An example of dataset that is not linearly separable. This type of problem calls for classification techniques such asSupport Vector Machines. Unsupervised learning: Understanding the data and exploring it in order to buildmachine learning models when the labels are not given is called unsupervised learning. Clustering, manifold learning, and outlier detection are techniques that are covered in this topic. 
Examples of problems that require unsupervised learning are many; grouping customers according to their purchasing behavior is one example.In the case of biological data, tissues samples can be clustered based on similar gene expression values using unsupervised learning techniques The following diagram represents data with inherent structure that can be revealed as distinct clusters using an unsupervised learning technique such as K-Means: Clusters in data Different techniques are used to detect global outliers—examples that are anomalous with respect to the entire data set, and local outliers—examples that are misfits in their neighborhood. In the following diagam, the notion of local and global outliers is illustrated for a two-feature dataset: Local and Global outliers Semi-supervised learning: When the dataset has some labeled data and large data, which is not labeled, learning from such dataset is called semi-supervised learning. When dealing with financial data with the goal to detect fraud, for example, there may be a large amount of unlabeled data and only a small number of known fraud and non-fraud transactions.In such cases, semi-supervised learning may be applied. Graph mining: Mining data represented as graph structures is known as graph mining. It is the basis of social network analysis and structure analysis in different bioinformatics, web mining, and community mining applications. Probabilistic graphmodeling and inferencing: Learning and exploiting structures present between features to model the data comes under the branch of probabilistic graph modeling. Time-series forecasting: This is a form of learning where data has distinct temporal behavior and the relationship with time is modeled.A common example is in financial forecasting, where the performance of stocks in a certain sector may be the target of the predictive model. Association analysis:This is a form of learning where data is in the form of an item set or market basket and association rules are modeled to explore and predict the relationships between the items. A common example in association analysis is to learn relationships between the most common items bought by the customers when they visit the grocery store. Reinforcement learning: This is a form of learning where machines learn to maximize performance based on feedback in the form of rewards or penalties received from the environment. A recent example that famously used reinforcement learning was AlphaGo, the machine developed by Google that beat the world Go champion Lee Sedoldecisively, in March 2016.Using a reward and penalty scheme, the model first trained on millions of board positions in the supervised learning stage, then played itselfin the reinforcement learning stage to ultimately become good enough to triumph over the best human player. http://www.theatlantic.com/technology/archive/2016/03/the-invisible-opponent/475611/ https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf Stream learning or incremental learning: Learning in supervised, unsupervised, or semi-supervised manner from stream data in real time or pseudo-real time is called stream or incremental learning. Learning the behaviors of sensors from different types of industrial systems for categorizing into normal and abnormal needs real time feed and detection. 
Datasets used in machine learning To learn from data, we must be able to understand and manage data in all forms.Data originates from many different sources, and consequently, datasets may differ widely in structure or have little or no structure at all.In this section, we present a high level classification of datasets with commonly occurring examples. Based on structure, dataset may be classified as containing the following: Structured or record data: Structured data is the most common form of dataset available for machine learning. The data is in theform of records or rows following a well-known format with features that are either columns in a table or fields delimited by separators or tokens. There is no explicit relationship between the records or instances. The dataset is available mostly in flat files or relational databases. The records of financial transactions at a bank shown in the following screenshotare an example of structured data: Financial Card Transactional Data with labels of Fraud. Transaction or market data: This is a special form of structured data whereeach corresponds to acollection of items. Examples of market dataset are the list of grocery item purchased by different customers, or movies viewed by customers as shown in the following screenshot: Market Dataset for Items bought from grocery store. Unstructured data: Unstructured data is normally not available in well-known formats such as structured data. Text data, image, and video data are different formats of unstructured data. Normally, a transformation of some form is needed to extract features from these forms of data to the aforementioned structured datasets so that traditional machine learning algorithms can be applied: Sample Text Data from SMS with labels of spam and ham from by Tiago A. Almeida from the Federal University of Sao Carlos. Sequential data: Sequential data have an explicit notion of order to them. The order can be some relationship between features and time variable in time series data, or symbols repeating in some form in genomic datasets. Two examples are weather data and genomic sequence data. The following diagram shows the relationship between time and the sensor level for weather: Time Series from Sensor Data Three genomic sequences are taken into consideration to show the repetition of the sequences CGGGT and TTGAAAGTGGTG in all the three genomic sequences: Genomic Sequences of DNA as sequence of symbols. Graph data: Graph data is characterized by the presence of relationships between entities in the data to form a graph structure. Graph datasets may be in structured record format or unstructured format. Typically, the graph relationship has to be mined from the dataset. Claims in the insurance domain can be considered structured records containingrelevant claims details withclaimants related through addresses/phonenumbers,and so on.This can be viewed in graph structure. Using theWorld Wide Web as an example, we have web pages available as unstructured datacontaininglinks,and graphs of relationships between web pages that can be built using web links, producing some of the most mined graph datasets today: Insurance Claim Data, converted into graph structure with relationship between vehicles, drivers, policies and addresses Machine learning applications Given the rapidly growing use of machine learning in diverse areas of human endeavor, any attempt to list typical applications in the different industries, where some form of machine learning is in use,must necessarily be incomplete. 
Nevertheless, in this section we list a broad set of machine learning applications by domain, uses and the type of learning used: Domain/Industry Applications Machine Learning Type Financial Credit Risk Scoring, Fraud Detection, Anti-Money Laundering Supervised, Unsupervised, Graph Models, Time Series, and Stream Learning Web Online Campaigns, Health Monitoring, Ad Targeting Supervised, Unsupervised, Semi-Supervised Healthcare Evidence-based Medicine, Epidemiological Surveillance, Drug Events Prediction, Claim Fraud Detection Supervised, Unsupervised, Graph Models, Time Series, and Stream Learning Internet of Thing (IoT) Cyber Security, Smart Roads, Sensor Health Monitoring Supervised, Unsupervised, Semi-Supervised, and Stream Learning Environment Weather forecasting, Pollution modeling, Water quality measurement Time Series, Supervised, Unsupervised, Semi-Supervised, and Stream Learning Retail Inventory, Customer Management and Recommendations, Layout and Forecasting Time Series, Supervised, Unsupervised, Semi-Supervised, and Stream Learning Summary: A revival of interest is seen in the area of artificial intelligence (AI)and machine learning, in particular, both in academic circles and industry. The use of machine learning  is to help in complex decision making at the highest levels of business. It has also achieved enormous success in improving the accuracy of everyday applications, such as search, speech recognition, and personal assistants on mobile phones. The basics of machine learning rely on understanding of data.Structured datasets have specific formats, and an unstructured dataset is normally in the form of some free flowing text. Machine learning is of two types: Supervised learning is the popular branch of machine learning, which is about learning from labeled data and Unsupervised learning is understanding the data and exploring it in order to build machine learning models when the labels are not given.  Resources for Article: Further resources on this subject: Specialized Machine Learning Topics [article] Machine learning in practice [article] Introduction to Machine Learning with R [article]
Read more
  • 0
  • 0
  • 17953
article-image-and-running-pandas
Packt
18 Jul 2017
15 min read
Save for later

Up and Running with pandas

Packt
18 Jul 2017
15 min read
In this article by Michael Heydt, author of the book Learning Pandas - Second Edition, we will cover how to install pandas and start using its basic functionality.  The content is provided as IPython and Jupyter notebooks, and hence we will also take a quick look at using both of those tools. We will utilize the Anaconda Scientific Python distribution from Continuum. Anaconda is a popular Python distribution with both free and paid components. Anaconda provides cross-platform support—including Windows, Mac, and Linux. The base distribution of Anaconda installs pandas, IPython and Jupyter notebooks, thereby making it almost trivial to get started. (For more resources related to this topic, see here.) IPython and Jupyter Notebook So far we have executed Python from the command line / terminal. This is the default Read-Eval-Print-Loop (REPL)that comes with Python. Let’s take a brief look at both. IPython IPython is an alternate shell for interactively working with Python. It provides several enhancements to the default REPL provided by Python. If you want to learn about IPython in more detail check out the documentation at https://ipython.org/ipython-doc/3/interactive/tutorial.html To start IPython, simply execute the ipython command from the command line/terminal. When started you will see something like the following: The input prompt which shows In [1]:. Each time you enter a statement in the IPythonREPL the number in the prompt will increase. Likewise, output from any particular entry you make will be prefaced with Out [x]:where x matches the number of the corresponding In [x]:.The following demonstrates: This numbering of in and out statements will be important to the examples as all examples will be prefaced with In [x]:and Out [x]: so that you can follow along. Note that these numberings are purely sequential. If you are following through the code in the text and occur errors in input or enter additional statements, the numbering may get out of sequence (they can be reset by exiting and restarting IPython).  Please use them purely as reference. Jupyter Notebook Jupyter Notebook is the evolution of IPython notebook. It is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and markdown. The original IPython notebook was constrained to Python as the only language. Jupyter Notebook has evolved to allow many programming languages to be used including Python, R, Julia, Scala, and F#. If you want to take a deeper look at Jupyter Notebook head to the following URL http://jupyter.org/: where you will be presented with a page similar to the following: Jupyter Notebookcan be downloaded and used independently of Python. Anaconda does installs it by default. To start a Jupyter Notebookissue the following command at the command line/Terminal: $ jupyter notebook To demonstrate, let's look at how to run the example code that comes with the text. Download the code from the Packt website and and unzip the file to a directory of your choosing. In the directory, you will see the following contents: Now issue the jupyter notebook command. You should see something similar to the following: A browser page will open displaying the Jupyter Notebook homepage which is http://localhost:8888/tree. This will open a web browser window showing this page, which will be a directory listing: Clicking on a .ipynb link opens a notebook page like the following: The notebook that is displayed is HTML that was generated by Jupyter and IPython.  
It consists of a number of cells that can be one of 4 types: code, markdown, raw nbconvert or heading. Jupyter runs an IPython kernel for each notebook.  Cells that contain Python code are executed within that kernel and the results added to the notebook as HTML. Double-clicking on any of the cells will make the cell editable.  When done editing the contents of a cell, press shift-return,  at which point Jupyter/IPython will evaluate the contents and display the results. If you wan to learn more about the notebook format that underlies the pages, see https://ipython.org/ipython-doc/3/notebook/nbformat.html. The toolbar at the top of a notebook gives you a number of abilities to manipulate the notebook. These include adding, removing, and moving cells up and down in the notebook. Also available are commands to run cells, rerun cells, and restart the underlying IPython kernel. To create a new notebook, go to File > New Notebook > Python 3 menu item. A new notebook page will be created in a new browser tab. Its name will be Untitled: The notebook consists of a single code cell that is ready to have Python entered.  EnterIPython1+1 in the cell and press Shift + Enter to execute. The cell has been executed and the result shown as Out [2]:.  Jupyter also opened a new cellfor you to enter more code or markdown. Jupyter Notebook automatically saves your changes every minute, but it's still a good thing to save manually every once and a while. One final point before we look at a little bit of pandas. Code in the text will be in the format of  command -line IPython. As an example, the cell we just created in our notebook will be shown as follows: In [1]: 1+1 Out [1]: 2 Introducing the pandas Series and DataFrame Let’s jump into using some pandas with a brief introduction to pandas two main data structures, the Series and the DataFrame. We will examine the following: Importing pandas into your application Creating and manipulating a pandas Series Creating and manipulating a pandas DataFrame Loading data from a file into a DataFrame The pandas Series The pandas Series is the base data structure of pandas. A Series is similar to a NumPy array, but it differs by having an index which allows for much richer lookup of items instead of just a zero-based array index value. The following creates a Series from a Python list.: In [2]: # create a four item Series s = Series([1, 2, 3, 4]) s Out [2]:0 1 1 2 2 3 3 4 dtype: int64 The output consists of two columns of information.  The first is the index and the second is the data in the Series. Each row of the output represents the index label (in the first column) and then the value associated with that label. Because this Series was created without specifying an index (something we will do next), pandas automatically creates an integer index with labels starting at 0 and increasing by one for each data item. The values of a Series object can then be accessed through the using the [] operator, paspassing the label for the value you require. The following gets the value for the label 1: In [3]:s[1] Out [3]: 2 This looks very much like normal array access in many programming languages. But as we will see, the index does not have to start at 0, nor increment by one, and can be many other data types than just an integer. This ability to associate flexible indexes in this manner is one of the great superpowers of pandas. Multiple items can be retrieved by specifying their labels in a python list. 
The following retrieves the values at labels 1 and 3: In [4]: # return a Series with the row with labels 1 and 3 s[[1, 3]] Out [4]: 1 2 3 4 dtype: int64 A Series object can be created with a user-defined index by using the index parameter and specifying the index labels. The following creates a Series with the same valuesbut with an index consisting of string values: In [5]:# create a series using an explicit index s = pd.Series([1, 2, 3, 4], index = ['a', 'b', 'c', 'd']) s Out [5}:a 1 b 2 c 3 d 4 dtype: int64 Data in the Series object can now be accessed by those alphanumeric index labels. The following retrieves the values at index labels 'a' and 'd': In [6]:# look up items the series having index 'a' and 'd' s[['a', 'd']] Out [6]:a 1 d 4 dtype: int64 It is still possible to refer to the elements of this Series object by their numerical 0-based position. : In [7]:# passing a list of integers to a Series that has # non-integer index labels will look up based upon # 0-based index like an array s[[1, 2]] Out [7]:b 2 c 3 dtype: int64 We can examine the index of a Series using the .index property: In [8]:# get only the index of the Series s.index Out [8]: Index(['a', 'b', 'c', 'd'], dtype='object') The index is itself actually a pandas objectand this output shows us the values of the index and the data type used for the index. In this case note that the type of the data in the index (referred to as the dtype) is object and not string. A common usage of a Series in pandas is to represent a time-series that associates date/time index labels withvalues. The following demonstrates by creating a date range can using the pandas function pd.date_range(): In [9]:# create a Series who's index is a series of dates # between the two specified dates (inclusive) dates = pd.date_range('2016-04-01', '2016-04-06') dates Out [9]:DatetimeIndex(['2016-04-01', '2016-04-02', '2016- 04-03', '2016-04-04', '2016-04-05', '2016-04-06'], dtype='datetime64[ns]', freq='D') This has created a special index in pandas referred to as a DatetimeIndex which is a specialized type of pandas index that is optimized to index data with dates and times. Now let's create a Series using this index. The data values represent high-temperatures on those specific days: In [10]:# create a Series with values (representing temperatures) # for each date in the index temps1 = Series([80, 82, 85, 90, 83, 87], index = dates) temps1 Out [10]:2016-04-01 80 2016-04-02 82 2016-04-03 85 2016-04-04 90 2016-04-05 83 2016-04-06 87 Freq: D, dtype: int64 This type of Series with a DateTimeIndexis referred to as a time-series. We can look up a temperature on a specific data by using the date as a string: In [11]:temps1['2016-04-04'] Out [11]:90 Two Series objects can be applied to each other with an arithmetic operation. The following code creates a second Series and calculates the difference in temperature between the two: In [12]:# create a second series of values using the same index temps2 = Series([70, 75, 69, 83, 79, 77], index = dates) # the following aligns the two by their index values # and calculates the difference at those matching labels temp_diffs = temps1 - temps2 temp_diffs Out [12]:2016-04-01 10 2016-04-02 7 2016-04-03 16 2016-04-04 7 2016-04-05 4 2016-04-06 10 Freq: D, dtype: int64 The result of an arithmetic operation (+, -, /, *, …) on two Series objects that are non-scalar values returns another Series object. 
Since the index is not integerwe can also then look up values by 0-based value: In [13]:# and also possible by integer position as if the # series was an array temp_diffs[2] Out [13]: 16 Finally, pandas provides many descriptive statistical methods. As an example, the following returns the mean of the temperature differences: In [14]: # calculate the mean of the values in the Serie temp_diffs Out [14]: 9.0 The pandas DataFrame A pandas Series can only have a single value associated with each index label. To have multiple values per index label we can use a DataFrame. A DataFrame represents one or more Series objects aligned by index label.  Each Series will be a column in the DataFrame, and each column can have an associated name.— In a way, a DataFrame is analogous to a relational database table in that it contains one or more columns of data of heterogeneous types (but a single type for all items in each respective column). The following creates a DataFrame object with two columns and uses the temperature Series objects: In [15]:# create a DataFrame from the two series objects temp1 # and temp2 and give them column names temps_df = DataFrame( {'Missoula': temps1, 'Philadelphia': temps2}) temps_df Out [15]: Missoula Philadelphia 2016-04-01 80 70 2016-04-02 82 75 2016-04-03 85 69 2016-04-04 90 83 2016-04-05 83 79 2016-04-06 87 77 The resulting DataFrame has two columns named Missoula and Philadelphia. These columns are new Series objects contained within the DataFrame with the values copied from the original Series objects. Columns in a DataFrame object can be accessed using an array indexer [] with the name of the column or a list of column names. The following code retrieves the Missoula column: In [16]:# get the column with the name Missoula temps_df['Missoula'] Out [16]:2016-04-01 80 2016-04-02 82 2016-04-03 85 2016-04-04 90 2016-04-05 83 2016-04-06 87 Freq: D, Name: Missoula, dtype: int64 And the following code retrieves the Philadelphia column: In [17]:# likewise we can get just the Philadelphia column temps_df['Philadelphia'] Out [17]:2016-04-01 70 2016-04-02 75 2016-04-03 69 2016-04-04 83 2016-04-05 79 2016-04-06 77 Freq: D, Name: Philadelphia, dtype: int64 A Python list of column names can also be used to return multiple columns.: In [18]:# return both columns in a different order temps_df[['Philadelphia', 'Missoula']] Out [18]: Philadelphia Missoula 2016-04-01 70 80 2016-04-02 75 82 2016-04-03 69 85 2016-04-04 83 90 2016-04-05 79 83 2016-04-06 77 87 There is a subtle difference in a DataFrame object as compared to a Series object. Passing a list to the [] operator of DataFrame retrieves the specified columns whereas aSerieswould return rows. If the name of a column does not have spaces it can be accessed using property-style: In [19]:# retrieve the Missoula column through property syntax temps_df.Missoula Out [19]:2016-04-01 80 2016-04-02 82 2016-04-03 85 2016-04-04 90 2016-04-05 83 2016-04-06 87 Freq: D, Name: Missoula, dtype: int64 Arithmetic operations between columns within a DataFrame are identical in operation to those on multiple Series. 
To demonstrate, the following code calculates the difference between temperatures using property notation: In [20]:# calculate the temperature difference between the two #cities temps_df.Missoula - temps_df.Philadelphia Out [20]:2016-04-01 10 2016-04-02 7 2016-04-03 16 2016-04-04 7 2016-04-05 4 2016-04-06 10 Freq: D, dtype: int64 A new column can be added to DataFrame simply by assigning another Series to a column using the array indexer [] notation. The following adds a new column in the DataFrame with the temperature differences: In [21]:# add a column to temp_df which contains the difference # in temps temps_df['Difference'] = temp_diffs temps_df Out [21]: Missoula Philadelphia Difference 2016-04-01 80 70 10 2016-04-02 82 75 7 2016-04-03 85 69 16 2016-04-04 90 83 7 2016-04-05 83 79 4 2016-04-06 87 77 10 The names of the columns in a DataFrame are accessible via the.columns property. : In [22]:# get the columns, which is also an Index object temps_df.columns Out [22]:Index(['Missoula', 'Philadelphia', 'Difference'], dtype='object') The DataFrame (and Series) objects can be sliced to retrieve specific rows. The following slices the second through fourth rows of temperature difference values: In [23]:# slice the temp differences column for the rows at # location 1 through 4 (as though it is an array) temps_df.Difference[1:4] Out [23]: 2016-04-02 7 2016-04-03 16 2016-04-04 7 Freq: D, Name: Difference, dtype: int64 Entire rows from a DataFrame can be retrieved using the .loc and .iloc properties. .loc ensures that the lookup is by index label, where .iloc uses the 0-based position. - The following retrieves the second row of the DataFrame. In[24]:# get the row at array position 1 temps_df.iloc[1] Out [24]:Missoula 82 Philadelphia 75 Difference 7 Name: 2016-04-02 00:00:00, dtype: int64 Notice that this result has converted the row into a Series with the column names of the DataFrame pivoted into the index labels of the resulting Series.The following shows the resulting index of the result. In [25]:# the names of the columns have become the index # they have been 'pivoted' temps_df.iloc[1].index Out [25]:Index(['Missoula', 'Philadelphia', 'Difference'], dtype='object') Rows can be explicitly accessed via index label using the .loc property. The following code retrieves a row by the index label: In [26]:# retrieve row by index label using .loc temps_df.loc['2016-04-05'] Out [26]:Missoula 83 Philadelphia 79 Difference 4 Name: 2016-04-05 00:00:00, dtype: int64 Specific rows in a DataFrame object can be selected using a list of integer positions. The following selects the values from the Differencecolumn in rows at integer-locations 1, 3, and 5: In [27]:# get the values in the Differences column in rows 1, 3 # and 5using 0-based location temps_df.iloc[[1, 3, 5]].Difference Out [27]:2016-04-02 7 2016-04-04 7 2016-04-06 10 Freq: 2D, Name: Difference, dtype: int64 Rows of a DataFrame can be selected based upon a logical expression that is applied to the data in each row. The following shows values in the Missoulacolumn that are greater than 82 degrees: In [28]:# which values in the Missoula column are > 82? 
temps_df.Missoula > 82 Out [28]:2016-04-01 False 2016-04-02 False 2016-04-03 True 2016-04-04 True 2016-04-05 True 2016-04-06 True Freq: D, Name: Missoula, dtype: bool The results from an expression can then be applied to the[] operator of a DataFrame (and a Series) which results in only the rows where the expression evaluated to Truebeing returned: In [29]:# return the rows where the temps for Missoula > 82 temps_df[temps_df.Missoula > 82] Out [29]: Missoula Philadelphia Difference 2016-04-03 85 69 16 2016-04-04 90 83 7 2016-04-05 83 79 4 2016-04-06 87 77 10 This technique is referred to as boolean election in pandas terminologyand will form the basis of selecting rows based upon values in specific columns (like a query in —SQL using a WHERE clause - but as we will see also being much more powerful). Visualization We will dive into visualization in quite some depth in Chapter 14, Visualization, but prior to then we will occasionally perform a quick visualization of data in pandas. Creating a visualization of data is quite simple with pandas. All that needs to be done is to call the .plot() method. The following demonstrates by plotting the Close value of the stock data. In [40]: df[['Close']].plot(); Summary In this article, took an introductory look at the pandas Series and DataFrame objects, demonstrating some of the fundamental capabilities. This exposition showed you how to perform a few basic operations that you can use to get up and running with pandas prior to diving in and learning all the details. Resources for Article: Further resources on this subject: Using indexes to manipulate pandas objects [article] Predicting Sports Winners with Decision Trees and pandas [article] The pandas Data Structures [article]
Read more
  • 0
  • 0
  • 3151

article-image-massive-graphs-big-data
Packt
11 Jul 2017
19 min read
Save for later

Massive Graphs on Big Data

Packt
11 Jul 2017
19 min read
In this article by Rajat Mehta, author of the book Big Data Analytics with Java, we will learn about graphs. Graphs theory is one of the most important and interesting concepts of computer science. Graphs have been implemented in real life in a lot of use cases. If you use a GPS on your phone or a GPS device and it shows you a driving direction to a place, behind the scene there is an efficient graph that is working for you to give you the best possible direction. In a social network you are connected to your friends and your friends are connected to other friends and so on. This is a massive graph running in production in all the social networks that you use. You can send messages to your friends or follow them or get followed all in this graph. Social networks or a database storing driving directions all involve massive amounts of data and this is not data that can be stored on a single machine, instead this is distributed across a cluster of thousands of nodes or machines. This massive data is nothing but big data and in this article we will learn how data can be represented in the form of a graph so that we make analysis or deductions on top of these massive graphs. In this article, we will cover: A small refresher on the graphs and its basic concepts A small introduction on graph analytics, its advantages and how Apache Spark fits in Introduction on the GraphFrames library that is used on top of Apache Spark Before we dive deeply into each individual section, let's look at the basic graph concepts in brief. (For more resources related to this topic, see here.) Refresher on graphs In this section, we will cover some of the basic concepts of graphs, and this is supposed to be a refresher section on graphs. This is a basic section, hence if some users already know this information they can skip this section. Graphs are used in many important concepts in our day-to-day lives. Before we dive into the ways of representing a graph, let's look at some of the popular use cases of graphs (though this is not a complete list) Graphs are used heavily in social networks In finding driving directions via GPS In many recommendation engines In fraud detection in many financial companies In search engines and in network traffic flows In biological analysis As you must have noted earlier, graphs are used in many applications that we might be using on a daily basis. Graphs is a form of a data structure in computer science that helps in depicting entities and the connection between them. So, if there are two entities such as Airport A and Airport B and they are connected by a flight that takes, for example, say a few hours then Airport A and Airport B are the two entities and the flight connecting between them that takes those specific hours depict the weightage between them or their connection. In formal terms, these entities are called as vertexes and the relationship between them are called as edges. So, in mathematical terms, graph G = {V, E},that is, graph is a function of vertexes and edges. Let's look at the following diagram for the simple example of a graph: As you can see, the preceding graph is a set of six vertexes and eight edges,as shown next: Vertexes = {A, B, C, D, E, F} Edges = { AB, AF, BC, CD, CF, DF, DE, EF} These vertexes can represent any entities, for example, they can be places with the edges being 'distances' between the places or they could be people in a social network with the edges being the type of relationship, for example, friends or followers. 
Thus, graphs can represent real-world entities like this. The preceding graph is also called a bidirected graph because in this graph the edges go in either direction that is the edge from A to B can be traversed both ways from A to B as well as from B to A. Thus, the edges in the preceding diagrams that is AB can be BA or AF can be FA too. There are other types of graphs called as directed graphs and in these graphs the direction of the edges go in one way only and does not retrace back. A simple example of a directed graph is shown as follows:. As seen in the preceding graph, the edge A to B goes only in one direction as well as the edge B to C. Hence, this is a directed graph. A simple linked list data structure or a tree datastructure are also forms of graph only. In a tree, nodes can have children only and there are no loops, while there is no such rule in a general graph. Representing graphs Visualizing a graph makes it easily comprehensible but depicting it using a program requires two different approaches Adjacency matrix: Representing a graph as a matrix is easy and it has its own advantages and disadvantages. Let's look at the bidirected graph that we showed in the preceding diagram. If you would represent this graph as a matrix, it would like this: The preceding diagram is a simple representation of our graph in matrix form. The concept of matrix representation of graph is simple—if there is an edge to a node we mark the value as 1, else, if the edge is not present, we mark it as 0. As this is a bi-directed graph, it has edges flowing in one direction only. Thus, from the matrix, the rows and columns depict the vertices. There if you look at the vertex A, it has an edge to vertex B and the corresponding matrix value is 1. As you can see, it takes just one step or O[1]to figure out an edge between two nodes. We just need the index (rows and columns) in the matrix and we can extract that value from it. Also, if you would have looked at the matrix closely, you would have seen that most of the entries are zero, hence this is a sparse matrix. Thus, this approach eats a lot of space in computer memory in marking even those elements that do not have an edge to each other, and this is its main disadvantage. Adjacency list: Adjacency list solves the problem of space wastage of adjacency matrix. To solve this problem, it stores the node and its neighbors in a list (linked list) as shown in the following diagram: For maintaining brevity, we have not shown all the vertices but you can make out from the diagram that each vertex is storing its neighbors in a linked list. So when you want to figure out the neighbors of a particular vertex, you can directly iterate over the list. Of course this has the disadvantage of iterating when you have to figure out whether an edge exists between two nodes or not. This approach is also widely used in many algorithms in computer science. We have briefly seen how graphs can be represented, let's now see some important terms that are used heavily on graphs. Common terminology on graphs We will now introduce you to some common terms and concepts in graphs that you can use in your analytics on top of graphs: Vertices:As we mentioned earlier, vertices are the mathematical terms for the nodes in a graph. For analytic purposes, thevertices count shows the number of nodes in the system, for example, the number of people involved in a social graph. Edges: As we mentioned earlier, edges are the connection between vertices and edges can carry weights. 
Let's now look at some important terms that are used heavily when working with graphs.

Common terminology on graphs

We will now introduce some common terms and concepts that you can use in your analytics on top of graphs:

Vertices: As we mentioned earlier, vertices are the mathematical term for the nodes in a graph. For analytic purposes, the vertex count shows the number of nodes in the system, for example, the number of people involved in a social graph.

Edges: As we mentioned earlier, edges are the connections between vertices, and edges can carry weights. The number of edges represents the number of relationships in the graph. The weight of an edge represents the intensity of the relationship between the nodes involved; for example, in a social network, the relationship of being friends is stronger than that of merely being followed.

Degrees: The degree is the total number of connections flowing into as well as out of a node. For example, in the previous diagram the degree of node F is four. The degree count is useful; for example, in a social network graph a very high degree count can indicate how well connected a person is.

Indegrees: This is the number of connections flowing into a node. For example, in the previous diagram, node F has an indegree of three (treating its edges as directed). In a social network graph, this might represent how many people can send messages to this person.

Outdegrees: This is the number of connections flowing out of a node. For example, in the previous diagram, node F has an outdegree of one (again treating its edges as directed). In a social network graph, this might represent how many people this person can send messages to.

Common algorithms on graphs

Let's look at four common algorithms that are frequently run on graphs and some of their uses:

Breadth first search: Breadth first search is an algorithm for graph traversal or searching. As the name suggests, the traversal proceeds across the breadth of the graph, which is to say the neighbours of the node from which the traversal starts are searched first, before exploring further in the same manner. We will refer to the same graph we used earlier: if we start at vertex A, then according to breadth first search we next visit the neighbours of A, that is B and F. After that we visit the neighbour of B, which is C, and then the remaining neighbours of F, which are D and E. We go through each node only once, which mimics real-life travel as well: to get from one point to another we seldom cover the same road or path twice. Thus, our breadth first traversal starting from A will be {A, B, F, C, D, E}. Breadth first search is very useful in graph analytics; it can tell us things such as the people who are not your immediate friends but sit just one level beyond them in a social network, or, in a graph of a flight network, the flights with just one or two stops to the destination. (A small traversal sketch in plain Java follows this list.)

Depth first search: This is another way of searching, where we start from the source vertex and keep searching until we reach the end node or a leaf node, and then backtrack. For a simple connectivity question this can be less efficient than breadth first search, because if you want to know whether a node A is connected to a node B, you might end up wandering through a lot of nodes that have nothing to do with A and B before arriving at the answer.

Dijkstra's shortest path: This is a greedy algorithm for finding the shortest path in a graph. In a weighted graph, if you need to find the shortest path between two nodes, you start from the source node and repeatedly pick as the next node the unvisited one with the least accumulated weight (the weights being, for example, the distances between nodes in a graph of interconnected cities and roads). So, in a road network, you can find the shortest path between two cities using this algorithm.

PageRank algorithm: This is a very popular algorithm that came out of Google, and it is essentially used to find the importance of a web page by figuring out how connected it is to other important pages. It gives a PageRank score to each website based on this approach, and the search results are ultimately ranked using this score. The best part about this algorithm is that it can be applied to other areas too, for example, figuring out the important airports in a flight graph, or the most influential people in a social network.
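The following is a minimal breadth first search sketch of ours (not the book's code; the class and variable names are our own) that runs over the adjacency list of the example graph and reproduces the traversal order discussed above:

import java.util.*;

// Breadth first search over the example graph; starting at A it visits
// the vertices in the order A, B, F, C, D, E.
public class BreadthFirstSearch {

    public static void main(String[] args) {
        // Adjacency list of the bidirected example graph.
        Map<String, List<String>> adjacency = new LinkedHashMap<>();
        adjacency.put("A", Arrays.asList("B", "F"));
        adjacency.put("B", Arrays.asList("A", "C"));
        adjacency.put("C", Arrays.asList("B", "D", "F"));
        adjacency.put("D", Arrays.asList("C", "E", "F"));
        adjacency.put("E", Arrays.asList("D", "F"));
        adjacency.put("F", Arrays.asList("A", "C", "D", "E"));

        // Standard BFS: a queue of nodes still to process and a set of nodes
        // already discovered, so that each node is visited only once.
        Deque<String> queue = new ArrayDeque<>();
        Set<String> visited = new LinkedHashSet<>();
        queue.add("A");
        visited.add("A");

        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String neighbour : adjacency.get(current)) {
                if (visited.add(neighbour)) { // true only the first time we see this node
                    queue.add(neighbour);
                }
            }
        }
        System.out.println("BFS order from A: " + visited); // [A, B, F, C, D, E]
    }
}

Swapping the queue for a stack (or for recursion) turns this into the depth first search described above, and keeping a map of tentative distances per vertex is the starting point for Dijkstra-style shortest-path bookkeeping.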
So much for the basics and our refresher on graphs. In the following sections we will see how graphs can be used in the real world on massive datasets, such as social network data or data from the field of biology, and we will also study how graph analytics can be used on top of these graphs to derive useful deductions.

Plotting graphs

There is a handy open source Java library called GraphStream that can be used to plot graphs, which is very useful if you want to view the structure of your graphs. While viewing a plot, you can also spot whether some vertices are very close to each other (clustered) or, more generally, how they are placed. Using the GraphStream library is easy: just download the jar from http://graphstream-project.org and put it on the classpath of your project. Next, we show a simple example demonstrating how easy it is to plot a graph using this library.

First, create an instance of a graph. For our example we will create a simple DefaultGraph and name it SimpleGraph. Next, we add the nodes, or vertices, of the graph, and on each one we also set the label attribute that is displayed on the vertex (the imports shown assume the GraphStream 1.x package layout):

import org.graphstream.graph.Graph;
import org.graphstream.graph.implementations.DefaultGraph;

Graph graph = new DefaultGraph("SimpleGraph");
graph.addNode("A").setAttribute("ui.label", "A");
graph.addNode("B").setAttribute("ui.label", "B");
graph.addNode("C").setAttribute("ui.label", "C");

After building the nodes, it is time to connect them using edges. The API is simple to use: on the graph instance we can define the edges, provided an ID is given to each edge along with its starting and ending nodes.

graph.addEdge("AB", "A", "B");
graph.addEdge("BC", "B", "C");
graph.addEdge("CA", "C", "A");

All the information about the nodes and edges is now present on the graph instance. To plot the graph on the UI, we simply invoke the display method on the graph instance, as shown next.

graph.display();

This plots the graph on the UI as follows:

This library is extensive, and exploring it further is a good learning experience, so we urge readers to explore it on their own.

Massive graphs on big data

Big data comprises huge amounts of data distributed across a cluster of thousands (if not more) of machines. Building graphs on top of this massive data poses different challenges, as follows:

Due to the vast amount of data involved, the data for the graph is distributed across a cluster of machines. Hence it is not, in actuality, a single-node graph; we have to build a graph that spans a cluster of machines.

A graph that spans a cluster of machines has its vertices and edges spread across different machines, and this graph data will not fit into the memory of one single machine. Consider your friends list on Facebook: some of your friends' data in that friends-list graph might lie on different machines, and this data might simply be tremendous in size.
Look at the following example diagram of a graph of 10 Facebook friends and their network:

As you can see in the preceding diagram, even for just 10 friends the data can be huge; since the graph is drawn by hand, we have not shown many of the connections, in order to keep the image comprehensible, yet in real life each person can easily have thousands of connections or more. Imagine, then, what happens to a graph with thousands, if not more, people on the list.

For the reasons we just saw, building massive graphs on big data is a different ball game altogether, and there are a few main approaches for building such graphs. From a big data perspective, building a massive graph involves running computations and storing data in parallel on many nodes. The two main approaches are bulk synchronous parallel (BSP) and the Pregel approach; Apache Spark follows the Pregel approach. Covering these approaches in detail is out of the scope of this book, and readers who are interested in these topics should refer to other books and to Wikipedia.

Graph analytics

The biggest advantage of using graphs is that you can analyze them and use them to make sense of complex datasets. You might ask what is so special about graph analytics that we cannot do with relational databases. Let's try to understand this using an example. Suppose we want to analyze your friends network on Facebook and pull information about your friends, such as their names, their birth dates, their recent likes, and so on. If Facebook used a relational database, this would mean firing a query on some table using the foreign key of the user requesting this information. From a relational database perspective, this first-level query is easy. But what if we now ask you to go to the friends at level four in your network and fetch their data (as shown in the following diagram)? The query to get this becomes more and more complicated from a relational database perspective, yet it is a trivial task on a graph or in a graph database (such as Neo4j). Graphs are extremely good at operations where you want to pull information out of a node that lies many joins, or hops, away. As such, graph analytics is good for certain use cases (but not for all of them; relational databases are still better for many other use cases).

As you can see, the preceding diagram depicts a huge social network (even though it might only be showing a network of a few friends). The dots represent actual people in a social network. So, if somebody asks you to pick one user on the left-most side of the diagram, follow his connections across to the right-most side, and pull the friends at, say, the tenth level or beyond, this is very difficult to do in a normal relational database, and doing it, and maintaining it, could easily get out of hand. As a quick illustration of why such multi-hop queries are natural on a graph, a small level-limited traversal sketch is shown next.
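The following is a small sketch of ours (not from the book; the toy friendship data and all names are made up purely for illustration) showing how a "friends at level k" query reduces to a level-by-level expansion over an adjacency list:

import java.util.*;

// Collects the "friends at level k" of a start node by expanding the
// frontier of an adjacency list one level at a time.
public class FriendsAtLevel {

    static Set<String> friendsAtLevel(Map<String, List<String>> friends,
                                      String start, int level) {
        Set<String> seen = new HashSet<>(Collections.singleton(start));
        Set<String> frontier = new HashSet<>(Collections.singleton(start));
        for (int i = 0; i < level; i++) {
            Set<String> next = new HashSet<>();
            for (String person : frontier) {
                for (String friend : friends.getOrDefault(person, Collections.emptyList())) {
                    if (seen.add(friend)) { // keep only people not reached at an earlier level
                        next.add(friend);
                    }
                }
            }
            frontier = next; // after iteration i, the frontier holds the people at level i + 1
        }
        return frontier;
    }

    public static void main(String[] args) {
        // Made-up friendship lists, purely for illustration.
        Map<String, List<String>> friends = new HashMap<>();
        friends.put("you", Arrays.asList("alice", "bob"));
        friends.put("alice", Arrays.asList("you", "carol"));
        friends.put("bob", Arrays.asList("you", "dave"));
        friends.put("carol", Arrays.asList("alice", "erin"));
        friends.put("dave", Arrays.asList("bob"));
        friends.put("erin", Arrays.asList("carol"));

        // Level 1 is [alice, bob], level 2 is [carol, dave], level 3 is [erin].
        System.out.println(friendsAtLevel(friends, "you", 3));
    }
}

The same query expressed over relational tables would need one self-join per level, which is exactly the kind of query that becomes hard to write and maintain as the number of hops grows.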
There are four particular use cases where graph analytics is extremely useful and frequently applied (though there are plenty more use cases too):

Path analytics: As the name suggests, this analytical approach is used to figure out the paths you get as you traverse along the nodes of a graph. There are many fields where this can be used, the simplest being road networks, for figuring out details such as the shortest path between cities, or flight analytics, for figuring out the quickest flight or the direct flights to a destination.

Connectivity analytics: As the name suggests, this approach describes how the nodes within a graph are connected to each other. Using it, you can figure out how many edges flow into a node and how many flow out of it. This kind of information is very useful in analysis; for example, in a social network, a person who receives just one message but sends out, say, ten messages within his network could be used to market his favourite products, as he is very active in passing messages along.

Community analytics: Some graphs on big data are huge, but within these huge graphs there might be nodes that are very close to each other and almost stacked in a cluster of their own. This is useful information, as based on it you can extract communities from your data. For example, in a social network, people who are part of some community, say marathon runners, can be clubbed into a single community and tracked further.

Centrality analytics: This kind of analytical approach is useful for finding the central nodes in a network or graph, that is, sources that are single-handedly connected to many other sources. It is helpful in figuring out the influential people in a social network, or a central computer in a computer network.

From the perspective of this article, we will be covering some of these use cases in our sample case studies, and for this we will be using a library on Apache Spark called GraphFrames.

GraphFrames

The GraphX library is advanced and performs well on massive graphs but, unfortunately, it is currently only implemented in Scala and does not have any direct Java API. GraphFrames is a relatively new library that is built on top of Apache Spark and provides support for DataFrame (now Dataset) based graphs. It contains a lot of methods that are direct wrappers over the underlying GraphX methods. As such, it provides similar functionality to GraphX, except that GraphX acts on Spark RDDs while GraphFrames works on DataFrames, so GraphFrames is more user friendly (as DataFrames are simpler to use). All the advantages of firing Spark SQL queries, joining datasets, and filtering queries are supported as well. To understand GraphFrames and the representation of massive big data graphs, we will take small baby steps first, building some simple programs with GraphFrames before moving on to full-fledged case studies; a minimal sketch of what such a first program might look like follows.
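As a flavour of those baby steps, the following is a minimal, hedged sketch of ours (not code from the book) that builds the small A-to-F example graph as a GraphFrame from Java. It assumes that the graphframes package is available on the classpath next to Spark, and that the Scala factory method GraphFrame.apply is reachable from Java code, which is how the graph is constructed below; the column names id, src, and dst follow the GraphFrames convention for vertex and edge DataFrames, while the class name and the local master setting are our own choices:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.graphframes.GraphFrame;

// Builds the A..F example graph as a GraphFrame and asks for vertex degrees.
public class GraphFrameSketch {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("GraphFrameSketch")
                .getOrCreate();

        // Vertices need an "id" column; edges need "src" and "dst" columns.
        StructType vertexSchema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("id", DataTypes.StringType, false)));
        StructType edgeSchema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("src", DataTypes.StringType, false),
                DataTypes.createStructField("dst", DataTypes.StringType, false)));

        List<Row> vertexRows = Arrays.asList(
                RowFactory.create("A"), RowFactory.create("B"), RowFactory.create("C"),
                RowFactory.create("D"), RowFactory.create("E"), RowFactory.create("F"));
        List<Row> edgeRows = Arrays.asList(
                RowFactory.create("A", "B"), RowFactory.create("A", "F"),
                RowFactory.create("B", "C"), RowFactory.create("C", "D"),
                RowFactory.create("C", "F"), RowFactory.create("D", "F"),
                RowFactory.create("D", "E"), RowFactory.create("E", "F"));

        Dataset<Row> vertices = spark.createDataFrame(vertexRows, vertexSchema);
        Dataset<Row> edges = spark.createDataFrame(edgeRows, edgeSchema);

        // Assumption: the Scala factory GraphFrame.apply can be called from Java.
        GraphFrame graph = GraphFrame.apply(vertices, edges);

        graph.vertices().show(); // the vertex DataFrame
        graph.edges().show();    // the edge DataFrame
        graph.degrees().show();  // the degree of each vertex, computed by GraphFrames

        spark.stop();
    }
}

Because the vertices and edges are ordinary DataFrames, the usual Spark SQL operations (filters, joins, aggregations) can be applied to them directly before or after the graph is built.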
Summary

In this article, we learned about graphs and graph analytics. We saw how graphs can be built even on top of massive big datasets, how Apache Spark can be used to build these massive graphs, and, in the process, we learned about the new GraphFrames library that helps us build them.