Data | 0 articles | Tech News, Tutorials & Expert Insights

01 Apr 2015

7 min read

Factor variables in R

01 Apr 2015

This article by Jaynal Abedin and Kishor Kumar Das, authors of the book Data Manipulation with R Second Edition, will discuss factor variables in R. In any data analysis task, the majority of the time is dedicated to data cleaning and preprocessing. Sometimes, it is considered that about 80 percent of the effort is devoted to data cleaning before conducting the actual analysis. Also, in real-world data, we often work with categorical variables. A variable that takes only a limited number of distinct values is usually known as a categorical variable, and in R, it is known as a factor. Working with categorical variables in R is a bit technical, and in this article, we have tried to demystify this process of dealing with categorical variables. (For more resources related to this topic, see here.) During data analysis, the factor variable sometimes plays an important role, particularly in studying the relationship between two categorical variables. In this section, we will see some important aspects of factor manipulation. When a factor variable is first created, it stores all its levels along with the factor. But if we take any subset of that factor variable, it inherits all its levels from the original factor levels. This feature sometimes creates confusion in understanding the results. Numeric variables are convenient during statistical analysis, but sometimes, we need to create categorical (factor) variables from numeric variables. We can create a limited number of categories from a numeric variable using a series of conditional statements, but this is not an efficient way to perform this operation. In R, cut is a generic command to create factor variables from numeric variables. The split-apply-combine strategy Data manipulation is an integral part of data cleaning and analysis. For large data, it is always preferable to perform the operation within a subgroup of a dataset to speed up the process. In R, this type of data manipulation can be done with base functionality, but for large-scale data, it requires considerable amount of coding and eventually takes a longer time to process. In the case of big data, we can split the dataset, perform the manipulation or analysis, and then again combine the results into a single output. This type of split using base R is not efficient, and to overcome this limitation, Wickham developed an R package, plyr, where he efficiently implemented the split-apply-combine strategy. Often, we require similar types of operations in different subgroups of a dataset, such as group-wise summarization, standardization, and statistical modeling. This type of task requires us to break down a big problem into manageable pieces, perform operations on each piece separately, and finally combine the output of each piece into a single piece of output. To understand the split-apply-combine strategy intuitively, we can compare it with the map-reduce strategy for processing large amounts of data, recently popularized by Google. In the map-reduce strategy, the map step corresponds to split and apply and the reduce step consists of combining. The map-reduce approach is primarily designed to deal with a highly parallel environment where the work has been done by several hundreds or thousands of computers independently. The split-apply-combine strategy creates an opportunity to see the similarities of problems across subgroups that were not previously connected. This strategy can be used in many existing tools, such as the GROUP BY operation in SAS, PivotTable in MS Excel, and the SQL GROUP BY operator. The plyr package works on every type of data structure, whereas the dplyr package is designed to work only on data frames. The dplyr package offers a complete set of functions to perform every kind of data manipulation we would need in the process of analysis. These functions take a data frame as the input and also produce a data frame as the output, hence the name dplyr. There are two different types of functions in the dplyr package: single-table and aggregate. The single-table function takes a data frame as the input and an action such as subsetting a data frame, generating new columns in the data frame, or rearranging a data frame. The aggregate function takes a column as the input and produces a single value as the output, which is mostly used to summarize columns. These functions do not allow us to perform any group-wise operation, but a combination of these functions with the group_by() function allows us to implement the split-apply-combine approach. Reshaping a dataset Reshaping data is a common and tedious task in real-life data manipulation and analysis. A dataset might come with different levels of grouping, and we need to implement some reorientation to perform certain types of analyses. A dataset's layout could be long or wide. In a long layout, multiple rows represent a single subject's record, whereas in a wide layout, a single row represents a single subject's record. Statistical analysis sometimes requires wide data and sometimes long data, and in such cases, we need to be able to fluently and fluidly reshape the data to meet the requirements of statistical analysis. Data reshaping is just a rearrangement of the form of the data—it does not change the content of the dataset. In this article, we will show you different layouts of the same dataset and see how they can be transferred from one layout to another. This article mainly highlights the melt and cast paradigm of reshaping datasets, which is implemented in the reshape contributed package. Later on, this same package is reimplemented with a new name, reshape2, which is much more time and memory efficient. A single dataset can be rearranged in many different ways, but before going into rearrangement, let's look back at how we usually perceive a dataset. Whenever we think about any dataset, we think of a two-dimensional arrangement where a row represents a subject's (a subject could be a person and is typically the respondent in a survey) information for all the variables in a dataset, and a column represents the information for each characteristic for all subjects. This means that rows indicate records and columns indicate variables, characteristics, or attributes. This is the typical layout of a dataset. In this arrangement, one or more variables might play a role as an identifier, and others are measured characteristics. For the purpose of reshaping, we can group the variables into two groups: identifier variables and measured variables: The identifier variables: These help us identify the subject from whom we took information on different characteristics. Typically, identifier variables are qualitative in nature and take a limited number of unique values. In database terminology, an identifier is termed as the primary key, and this can be a single variable or a composite of multiple variables. The measured variables: These are those characteristics whose information we took from a subject of interest. These can be qualitative, quantitative, or a mixture of both. Now, beyond this typical structure of a dataset, we can think differently, where we will have only identification variables and a value. The identification variable identifies a subject along with which the measured variable the value represents. In this new paradigm, each row represents one observation of one variable. In the new paradigm, this is termed as melting and it produces molten data. The difference between this new layout of the data and the typical layout is that it now contains only the ID variable and a new column, value, which represents the value of that observation. Text processing Text data is one of the most important areas in the field of data analytics. Nowadays, we are producing a huge amount of text data through various media every day; for example, Twitter posts, blog writing, and Facebook posts are all major sources of text data. Text data can be used to retrieve information, in sentiment analysis and even entity recognition. Summary This article briefly explained the factor variables, the split-apply-combine strategy, reshaping a dataset in R, and text processing. Resources for Article: Further resources on this subject: Introduction to S4 Classes [Article] Warming Up [Article] Driving Visual Analyses with Automobile Data (Python) [Article]

0
0
5065

article-image-pattern-mining-using-spark-part-1

Aarthi Kumaraswamy

03 Nov 2017

15 min read

Pattern Mining using Spark MLlib - Part 1

Aarthi Kumaraswamy

03 Nov 2017

15 min read

[box type="note" align="" class="" width=""]The following two-part tutorial is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. [/box] When collecting real-world data between individual measures or events, there are usually very intricate and highly complex relationships to observe. The guiding example for this tutorial is the observation of click events that users generate on a website and its subdomains. Such data is both interesting and challenging to investigate. It is interesting, as there are usually many patterns that groups of users show in their browsing behavior and certain rules they might follow. Gaining insights about user groups, in general, is of interest, at least for the company running the website and might be the focus of their data science team. Methodology aside, putting a production system in place that can detect patterns in real time, for instance, to find malicious behavior, can be very challenging technically. It is immensely valuable to be able to understand and implement both the algorithmic and technical sides. In this tutorial, we will look into doing pattern mining in Spark. The tutorial is split up into two main sections. In the first, we will first introduce the three available pattern mining algorithms that Spark currently comes with and then apply them to an interesting dataset. In particular, you will learn the following from this two-part tutorial: The basic principles of frequent pattern mining. Useful and relevant data formats for applications. Understanding and comparing three pattern mining algorithms available in Spark, namely FP-growth, association rules, and prefix span. Frequent pattern mining When presented with a new data set, a natural sequence of questions is: What kind of data do we look at; that is, what structure does it have? Which observations in the data can be found frequently; that is, which patterns or rules can we identify within the data? How do we assess what is frequent; that is, what are the good measures of relevance and how do we test for it? On a very high level, frequent pattern mining addresses precisely these questions. While it's very easy to dive head first into more advanced machine learning techniques, these pattern mining algorithms can be quite informative and help build an intuition about the data. To introduce some of the key notions of frequent pattern mining, let's first consider a somewhat prototypical example for such cases, namely shopping carts. The study of customers being interested in and buying certain products has been of prime interest to marketers around the globe for a very long time. While online shops certainly do help in further analyzing customer behavior, for instance, by tracking the browsing data within a shopping session, the question of what items have been bought and what patterns in buying behavior can be found applies to purely offline scenarios as well. We will see a more involved example of clickstream data accumulated on a website soon; for now, we will work under the assumption that only the events we can track are the actual payment transactions of an item. Just this given data, for instance, for groceries shopping carts in supermarkets or online, leads to quite a few interesting questions, and we will focus mainly on the following three: Which items are frequently bought together? For instance, there is anecdotal evidence suggesting that beer and diapers are often brought together in one shopping session. Finding patterns of products that often go together may, for instance, allow a shop to physically place these products closer to each other for an increased shopping experience or promotional value even if they don't belong together at first sight. In the case of an online shop, this sort of analysis might be the base for a simple recommender system. Based on the previous question, are there any interesting implications or rules to observe in shopping behavior?, continuing with the shopping cart example, can we establish associations such as if bread and butter have been bought, we also often find cheese in the shopping cart? Finding such association rules can be of great interest, but also need more clarification of what we consider to be often, that is, what does frequent mean. Note that, so far, our shopping carts were simply considered a bag of items without additional structure. At least in the online shopping scenario, we can endow data with more information. One aspect we will focus on is that of the sequentiality of items; that is, we will take note of the order in which the products have been placed into the cart. With this in mind, similar to the first question, one might ask, which sequence of items can often be found in our transaction data? For instance, larger electronic devices bought might be followed up by additional utility items. The reason we focus on these three questions, in particular, is that Spark MLlib comes with precisely three pattern mining algorithms that roughly correspond to the aforementioned questions by their ability to answer them. Specifically, we will carefully introduce FP- growth, association rules, and prefix span, in that order, to address these problems and show how to solve them using Spark. Before doing so, let's take a step back and formally introduce the concepts we have been motivated for so far, alongside a running example. We will refer to the preceding three questions throughout the following subsection. Pattern mining terminology We will start with a set of items I = {a1, ..., an}, which serves as the base for all the following concepts. A transaction T is just a set of items in I, and we say that T is a transaction of length l if it contains l item. A transaction database D is a database of transaction IDs and their corresponding transactions. To give a concrete example of this, consider the following situation. Assume that the full item set to shop from is given by I = {bread, cheese, ananas, eggs, donuts, fish, pork, milk, garlic, ice cream, lemon, oil, honey, jam, kale, salt}. Since we will look at a lot of item subsets, to make things more readable later on, we will simply abbreviate these items by their first letter, that is, we'll write I = {b, c, a, e, d, f, p, m, g, i, l, o, h, j, k, s}. Given these items, a small transaction database D could look as follows: Transaction ID Transaction 1 a, c, d, f, g, i, m, p 2 a, b, c, f, l, m, o 3 b, f, h, j, o 4 b, c, k, s, p 5 a, c, e, f, l, m, n, p Table 1: A small shopping cart database with five transactions Frequent pattern mining problem Given the definition of a transaction database, a pattern P is a transaction contained in the transactions in D and the support, supp(P), of the pattern is the number of transactions for which this is true, divided or normalized by the number of transactions in D: supp(s) = suppD(s) = |{ s' ∈ S | s < s'}| / |D| We use the < symbol to denote s as a subpattern of s' or, conversely, call s' a superpattern of s. Note that in the literature, you will sometimes also find a slightly different version of support that does not normalize the value. For example, the pattern {a, c, f} can be found in transactions 1, 2, and 5. This means that {a, c, f} is a pattern of support 0.6 in our database D of five items. Support is an important notion, as it gives us a first example of measuring the frequency of a pattern, which, in the end, is what we are after. In this context, for a given minimum support threshold t, we say P is a frequent pattern if and only if supp(P) is at least t. In our running example, the frequent patterns of length 1 and minimum support 0.6 are {a}, {b}, {c}, {p}, and {m} with support 0.6 and {f} with support 0.8. In what follows, we will often drop the brackets for items or patterns and write f instead of {f}, for instance. Given a minimum support threshold, the problem of finding all the frequent patterns is called the frequent pattern mining problem and it is, in fact, the formalized version of the aforementioned first question. Continuing with our example, we have found all frequent patterns of length 1 for t = 0.6 already. How do we find longer patterns? On a theoretical level, given unlimited resources, this is not much of a problem, since all we need to do is count the occurrences of items. On a practical level, however, we need to be smart about how we do so to keep the computation efficient. Especially for databases large enough for Spark to come in handy, it can be very computationally intense to address the frequent pattern mining problem. One intuitive way to go about this is as follows: Find all the frequent patterns of length 1, which requires one full database scan. This is how we started with in our preceding example. For patterns of length 2, generate all the combinations of frequent 1-patterns, the so-called candidates, and test if they exceed the minimum support by doing another scan of D. Importantly, we do not have to consider the combinations of infrequent patterns, since patterns containing infrequent patterns can not become frequent. This rationale is called the apriori principle. For longer patterns, continue this procedure iteratively until there are no more patterns left to combine. This algorithm, using a generate-and-test approach to pattern mining and utilizing the apriori principle to bound combinations, is called the apriori algorithm. There are many variations of this baseline algorithm, all of which share similar drawbacks in terms of scalability. For instance, multiple full database scans are necessary to carry out the iterations, which might already be prohibitively expensive for huge datasets. On top of that, generating candidates themselves is already expensive, but computing their combinations might simply be infeasible. In the next section, we will see how a parallel version of an algorithm called FP-growth, available in Spark, can overcome most of the problems just discussed. The association rule mining problem To advance our general introduction of concepts, let's next turn to association rules, as first introduced in Mining Association Rules between Sets of Items in Large Databases, available at http:/ /arbor. ee. ntu. edu. tw/~chyun/ dmpaper/agrama93. pdf. In contrast to solely counting the occurrences of items in our database, we now want to understand the rules or implications of patterns. What I mean is, given a pattern P1 and another pattern P2, we want to know whether P2 is frequently present whenever P1 can be found in D, and we denote this by writing P1 ⇒ P2. To make this more precise, we need a concept for rule frequency similar to that of support for patterns, namely confidence. For a rule P1 ⇒ P2, confidence is defined as follows: conf(P1 ⇒ P2) = supp(P1 ∪ P2) / supp(P1) This can be interpreted as the conditional support of P2 given to P1; that is, if it were to restrict D to all the transactions supporting P1, the support of P2 in this restricted database would be equal to conf(P1 ⇒ P2). We call P1 ⇒ P2 a rule in D if it exceeds a minimum confidence threshold t, just as in the case of frequent patterns. Finding all the rules for a confidence threshold represents the formal answer to the second question, association rule mining. Moreover, in this situation, we call P1 the antecedent and P2 the consequent of the rule. In general, there is no restriction imposed on the structure of either the antecedent or the consequent. However, in what follows, we will assume that the consequent's length is 1, for simplicity. In our running example, the pattern {f, m} occurs three times, while {f, m, p} is just present in two cases, which means that the rule {f, m} ⇒ {p} has confidence 2/3. If we set the minimum confidence threshold to t = 0.6, we can easily check that the following association rules with an antecedent and consequent of length 1 are valid for our case: {a} ⇒ {c}, {a} ⇒ {f}, {a} ⇒ {m}, {a} ⇒ {p} {c} ⇒ {a}, {c} ⇒ {f}, {c} ⇒ {m}, {c} ⇒ {p} {f} ⇒ {a}, {f} ⇒ {c}, {f} ⇒ {m} {m} ⇒ {a}, {m} ⇒ {c}, {m} ⇒ {f}, {m} ⇒ {p} {p} ⇒ {a}, {p} ⇒ {c}, {p} ⇒ {f}, {p} ⇒ {m} From the preceding definition of confidence, it should now be clear that it is relatively straightforward to compute the association rules once we have the support value of all the frequent patterns. In fact, as we will soon see, Spark's implementation of association rules is based on calculating frequent patterns upfront. [box type="info" align="" class="" width=""]At this point, it should be noted that while we will restrict ourselves to the measures of support and confidence, there are many other interesting criteria available that we can't discuss in this book; for instance, the concepts of conviction, leverage, or lift. For an in-depth comparison of the other measures, refer to http:/ / www. cse. msu. edu/ ~ptan/ papers/ IS. pdf.[/box] The sequential pattern mining problem Let's move on to formalizing, the third and last pattern matching question we tackle in this chapter. Let's look at sequences in more detail. A sequence is different from the transactions we looked at before in that the order now matters. For a given item set I, a sequence S in I of length l is defined as follows: s = <s1, s2, ..., sl> Here, each individual si is a concatenation of items, that is, si = (ai1 ... aim), where aij is an item in I. Note that we do care about the order of sequence items si but not about the internal ordering of the individual aij in si. A sequence database S consists of pairs of sequence IDs and sequences, analogous to what we had before. An example of such a database can be found in the following table, in which the letters represent the same items as in our previous shopping cart example: Sequence ID Sequence 1 <a(abc)(ac)d(cf)> 2 <(ad)c(bc)(ae)> 3 <(ef)(ab)(df)cb> 4 <eg(af)cbc> Table 2: A small sequence database with four short sequences. In the example sequences, note the round brackets to group individual items into a sequence item. Also note that we drop these redundant braces if the sequence item consists of a single item. Importantly, the notion of a subsequence requires a little more carefulness than for unordered structures. We call u = (u1, ..., un) a subsequence of s = (s1, ..., sl) and write u < s if there are indices 1 ≤ i1 < i2 < ... < in ≤ m so that we have the following: u1 < si1, ..., un < sin Here, the < signs in the last line mean that uj is a subpattern of sij. Roughly speaking, u is a subsequence of s if all the elements of u are subpatterns of s in their given order. Equivalently, we call s a supersequence of u. In the preceding example, we see that <a(ab)ac> and a(cb)(ac)dc> are examples of subsequences of <a(abc)(ac)d(cf)> and that <(fa)c> is an example of a subsequence of <eg(af)cbc>. With the help of the notion of supersequences, we can now define the support of a sequence s in a given sequence database S as follows: suppS(s) = supp(s) = |{ s' ∈ S | s < s'}| / |S| Note that, structurally, this is the same definition as for plain unordered patterns, but the < symbol means something else, that is, a subsequence. As before, we drop the database subscript in the notation of support if the information is clear from the context. Equipped with a notion of support, the definition of sequential patterns follows the previous definition completely analogously. Given a minimum support threshold t, a sequence s in S is said to be a sequential pattern if supp(s) is greater than or equal to t. The formalization of the third question is called the sequential pattern mining problem, that is, find the full set of sequences that are sequential patterns in S for a given threshold t. Even in our little example with just four sequences, it can already be challenging to manually inspect all the sequential patterns. To give just one example of a sequential pattern of support 1.0, a subsequence of length 2 of all the four sequences is <ac>. Finding all the sequential patterns is an interesting problem, and we will learn about the so-called prefix span algorithm that Spark employs to address the problem in the following section. Next time, in part 2 of the tutorial, we will see how to use Spark to solve the above three pattern mining problems using the algorithms introduced. If you enjoyed this tutorial, an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava, check out the book for more.

0
0
5041

article-image-sql-tuning-enhancements-oracle-12c

Packt

13 Dec 2016

13 min read

SQL Tuning Enhancements in Oracle 12c

Packt

13 Dec 2016

13 min read

0
0
5035

article-image-how-to-implement-discrete-convolution-on-a-2d-dataset

Pravin Dhandre

08 Dec 2017

7 min read

How to implement discrete convolution on a 2D dataset

Pravin Dhandre

08 Dec 2017

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt from the book by Rodolfo Bonnin titled Machine Learning for Developers. Surprisingly the question frequently asked by developers across the globe is, “How do I get started in Machine Learning?”. One reason could be attributed to the vastness of the subject area. This book is a systematic guide teaching you how to implement various Machine Learning techniques and their day-to-day application and development. [/box] In the tutorial given below, we have implemented convolution in a practical example to see it applied to a real image and get intuitive ideas of its effect.We will use different kernels to detect high-detail features and execute subsampling operation to get optimized and brighter image. This is a simple intuitive implementation of discrete convolution concept by applying it to a sample image with different types of kernel. Let's import the required libraries. As we will implement the algorithms in the clearest possible way, we will just use the minimum necessary ones, such as NumPy: import matplotlib.pyplot as plt import imageio import numpy as np Using the imread method of the imageio package, let's read the image (imported as three equal channels, as it is grayscale). We then slice the first channel, convert it to a floating point, and show it using matplotlib: arr = imageio.imread("b.bmp") [:,:,0].astype(np.float) plt.imshow(arr, cmap=plt.get_cmap('binary_r')) plt.show() Now it's time to define the kernel convolution operation. As we did previously, we will simplify the operation on a 3 x 3 kernel in order to better understand the border conditions. apply3x3kernel will apply the kernel over all the elements of the image, returning a new equivalent image. Note that we are restricting the kernels to 3 x 3 for simplicity, and so the 1 pixel border of the image won't have a new value because we are not taking padding into consideration: class ConvolutionalOperation: def apply3x3kernel(self, image, kernel): # Simple 3x3 kernel operation newimage=np.array(image) for m in range(1,image.shape[0]-2): for n in range(1,image.shape[1]-2): newelement = 0 for i in range(0, 3): for j in range(0, 3): newelement = newelement + image[m - 1 + i][n - 1+ j]*kernel[i][j] newimage[m][n] = newelement return (newimage) As we saw in the previous sections, the different kernel configurations highlight different elements and properties of the original image, building filters that in conjunction can specialize in very high-level features after many epochs of training, such as eyes, ears, and doors. Here, we will generate a dictionary of kernels with a name as the key, and the coefficients of the kernel arranged in a 3 x 3 array. The Blur filter is equivalent to calculating the average of the 3 x 3 point neighborhood, Identity simply returns the pixel value as is, Laplacian is a classic derivative filter that highlights borders, and then the two Sobel filters will mark horizontal edges in the first case, and vertical ones in the second case: kernels = {"Blur":[[1./16., 1./8., 1./16.], [1./8., 1./4., 1./8.], [1./16., 1./8., 1./16.]] ,"Identity":[[0, 0, 0], [0., 1., 0.], [0., 0., 0.]] ,"Laplacian":[[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]] ,"Left Sobel":[[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]] ,"Upper Sobel":[[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]]} Let's generate a ConvolutionalOperation object and generate a comparative kernel graphical chart to see how they compare: conv = ConvolutionalOperation() plt.figure(figsize=(30,30)) fig, axs = plt.subplots(figsize=(30,30)) j=1 for key,value in kernels.items(): axs = fig.add_subplot(3,2,j) out = conv.apply3x3kernel(arr, value) plt.imshow(out, cmap=plt.get_cmap('binary_r')) j=j+1 plt.show() <matplotlib.figure.Figure at 0x7fd6a710a208> In the final image you can clearly see how our kernel has detected several high-detail features on the image—in the first one, you see the unchanged image because we used the unit kernel, then the Laplacian edge finder, the left border detector, the upper border detector, and then the blur operator: Having reviewed the main characteristics of the convolution operation for the continuous and discrete fields, we can conclude by saying that, basically, convolution kernels highlight or hide patterns. Depending on the trained or (in our example) manually set parameters, we can begin to discover many elements in the image, such as orientation and edges in different dimensions. We may also cover some unwanted details or outliers by blurring kernels, for example. Additionally, by piling layers of convolutions, we can even highlight higher-order composite elements, such as eyes or ears. This characteristic of convolutional neural networks is their main advantage over previous data-processing techniques: we can determine with great flexibility the primary components of a certain dataset, and represent further samples as a combination of these basic building blocks. Now it's time to look at another type of layer that is commonly used in combination with the former—the pooling layer. Subsampling operation (pooling) The subsampling operation consists of applying a kernel (of varying dimensions) and reducing the extension of the input dimensions by dividing the image into mxn blocks and taking one element representing that block, thus reducing the image resolution by some determinate factor. In the case of a 2 x 2 kernel, the image size will be reduced by half. The most well-known operations are maximum (max pool), average (avg pool), and minimum (min pool). The following image gives you an idea of how to apply a 2 x 2 maxpool kernel, applied to a one-channel 16 x 16 matrix. It just maintains the maximum value of the internal zone it covers: Now that we have seen this simple mechanism, let's ask ourselves, what's the main purpose of it? The main purpose of subsampling layers is related to the convolutional layers: to reduce the quantity and complexity of information while retaining the most important information elements. In other word, they build a compact representation of the underlying information. Now it's time to write a simple pooling operator. It's much easier and more direct to write than a convolutional operator, and in this case we will only be implementing max pooling, which chooses the brightest pixel in the 4 x 4 vicinity and projects it to the final image: class PoolingOperation: def apply2x2pooling(self, image, stride): # Simple 2x2 kernel operation newimage=np.zeros((int(image.shape[0]/2),int(image.shape[1]/2)),np.float32) for m in range(1,image.shape[0]-2,2): for n in range(1,image.shape[1]-2,2): newimage[int(m/2),int(n/2)] = np.max(image[m:m+2,n:n+2]) return (newimage) Let's apply the newly created pooling operation, and as you can see, the final image resolution is much more blocky, and the details, in general, are brighter: plt.figure(figsize=(30,30)) pool=PoolingOperation() fig, axs = plt.subplots(figsize=(20,10)) axs = fig.add_subplot(1,2,1) plt.imshow(arr, cmap=plt.get_cmap('binary_r')) out=pool.apply2x2pooling(arr,1) axs = fig.add_subplot(1,2,2) plt.imshow(out, cmap=plt.get_cmap('binary_r')) plt.show() Here you can see the differences, even though they are subtle. The final image is of lower precision, and the chosen pixels, being the maximum of the environment, produce a brighter image: This simple implementation with various kernels simplified the working mechanism of discrete convolution operation on a 2D dataset. Using various kernels and subsampling operation, the hidden patterns of dataset are unveiled and the image is made more sharpened, with maximum pixels and much brighter image thereby producing compact representation of the dataset. If you found this article interesting, do check Machine Learning for Developers and get to know about the advancements in deep learning, adversarial networks, popular programming frameworks to prepare yourself in the ubiquitous field of machine learning.

0
0
4998

article-image-basic-concepts-machine-learning-and-logistic-regression-example-mahout

Packt

30 Mar 2015

33 min read

Basic Concepts of Machine Learning and Logistic Regression Example in Mahout

Packt

30 Mar 2015

33 min read

0
0
4995

Packt

03 Feb 2016

9 min read

Customizing IPython

Packt

03 Feb 2016

9 min read

In this article written by Cyrille Rossant, author of Learning IPython for Interactive Computing and Data Visualization - Second edition, we look at how the Jupyter Notebook is a highly-customizable platform. You can configure many aspects of the software and can extend the backend (kernels) and the frontend (HTML-based Notebook). This allows you to create highly-personalized user experiences based on the Notebook. In this article, we will cover the following topics: Creating a custom magic command in an IPython extension Writing a new Jupyter kernel Customizing the Notebook interface with JavaScript Creating a custom magic command in an IPython extension IPython comes with a rich set of magic commands. You can get the complete list with the %lsmagic command. IPython also allows you to create your own magic commands. In this section, we will create a new cell magic that compiles and executes C++ code in the Notebook. We first import the register_cell_magic function: In [1]: from IPython.core.magic import register_cell_magic To create a new cell magic, we create a function that takes a line (containing possible options) and a cell's contents as its arguments, and we decorate it with @register_cell_magic, as shown here: In [2]: @register_cell_magic def cpp(line, cell): """Compile, execute C++ code, and return the standard output.""" # We first retrieve the current IPython interpreter # instance. ip = get_ipython() # We define the source and executable filenames. source_filename = '_temp.cpp' program_filename = '_temp' # We write the code to the C++ file. with open(source_filename, 'w') as f: f.write(cell) # We compile the C++ code into an executable. compile = ip.getoutput("g++ {0:s} -o {1:s}".format( source_filename, program_filename)) # We execute the executable and return the output. output = ip.getoutput('./{0:s}'.format(program_filename)) print('n'.join(output)) C++ compiler This recipe requires the gcc C++ compiler. On Ubuntu, type sudo apt-get install build-essential in a terminal. On OS X, install Xcode. On Windows, install MinGW (http://www.mingw.org) and make sure that g++ is in your system path. This magic command uses the getoutput() method of the IPython InteractiveShell instance. This object represents the current interactive session. It defines many methods for interacting with the session. You will find the comprehensive list at http://ipython.org/ipython-doc/dev/api/generated/IPython.core.interactiveshell.html#IPython.core.interactiveshell.InteractiveShell. Let's now try this new cell magic. In [3]: %%cpp #include<iostream> int main() { std::cout << "Hello world!"; } Out[3]: Hello world! This cell magic is currently only available in your interactive session. To distribute it, you need to create an IPython extension. This is a regular Python module or package that extends IPython. To create an IPython extension, copy the definition of the cpp() function (without the decorator) to a Python module, named cpp_ext.py for example. Then, add the following at the end of the file: def load_ipython_extension(ipython): """This function is called when the extension is loaded. It accepts an IPython InteractiveShell instance. We can register the magic with the `register_magic_function` method of the shell instance.""" ipython.register_magic_function(cpp, 'cell') Then, you can load the extension with %load_ext cpp_ext. The cpp_ext.py file needs to be in the PYTHONPATH, for example in the current directory. Writing a new Jupyter kernel Jupyter supports a wide variety of kernels written in many languages, including the most-frequently used IPython. The Notebook interface lets you choose the kernel for every notebook. This information is stored within each notebook file. The jupyter kernelspec command allows you to get information about the kernels. For example, jupyter kernelspec list lists the installed kernels. Type jupyter kernelspec --help for more information. At the end of this section, you will find references with instructions to install various kernels including IR, IJulia, and IHaskell. Here, we will detail how to create a custom kernel. There are two methods to create a new kernel: Writing a kernel from scratch for a new language by re-implementing the whole Jupyter messaging protocol. Writing a wrapper kernel for a language that can be accessed from Python. We will use the second, easier method in this section. Specifically, we will reuse the example from the last section to write a C++ wrapper kernel. We need to slightly refactor the last section's code because we won't have access to the InteractiveShell instance. Since we're creating a kernel, we need to put the code in a Python script in a new folder named cpp: In [1]: %mkdir cpp The %%writefile cell magic lets us create a cpp_kernel.py Python script from the Notebook: In [2]: %%writefile cpp/cpp_kernel.py import os import os.path as op import tempfile # We import the `getoutput()` function provided by IPython. # It allows us to do system calls from Python. from IPython.utils.process import getoutput def exec_cpp(code): """Compile, execute C++ code, and return the standard output.""" # We create a temporary directory. This directory will # be deleted at the end of the 'with' context. # All created files will be in this directory. with tempfile.TemporaryDirectory() as tmpdir: # We define the source and executable filenames. source_path = op.join(tmpdir, 'temp.cpp') program_path = op.join(tmpdir, 'temp') # We write the code to the C++ file. with open(source_path, 'w') as f: f.write(code) # We compile the C++ code into an executable. os.system("g++ {0:s} -o {1:s}".format( source_path, program_path)) # We execute the program and return the output. return getoutput(program_path) Out[2]: Writing cpp/cpp_kernel.py Now we create our wrapper kernel by appending some code to the cpp_kernel.py file created above (that's what the -a option in the %%writefile cell magic is for): In [3]: %%writefile -a cpp/cpp_kernel.py """C++ wrapper kernel.""" from ipykernel.kernelbase import Kernel class CppKernel(Kernel): # Kernel information. implementation = 'C++' implementation_version = '1.0' language = 'c++' language_version = '1.0' language_info = {'name': 'c++', 'mimetype': 'text/plain'} banner = "C++ kernel" def do_execute(self, code, silent, store_history=True, user_expressions=None, allow_stdin=False): """This function is called when a code cell is executed.""" if not silent: # We run the C++ code and get the output. output = exec_cpp(code) # We send back the result to the frontend. stream_content = {'name': 'stdout', 'text': output} self.send_response(self.iopub_socket, 'stream', stream_content) return {'status': 'ok', # The base class increments the execution # count 'execution_count': self.execution_count, 'payload': [], 'user_expressions': {}, } if __name__ == '__main__': from ipykernel.kernelapp import IPKernelApp IPKernelApp.launch_instance(kernel_class=CppKernel) Out[3]: Appending to cpp/cpp_kernel.py In production code, it would be best to test the compilation and execution, and to fail gracefully by showing an error. See the references at the end of this section for more information. Our wrapper kernel is now implemented in cpp/cpp_kernel.py. The next step is to create a cpp/kernel.json file describing our kernel: In [4]: %%writefile cpp/kernel.json { "argv": ["python", "cpp/cpp_kernel.py", "-f", "{connection_file}" ], "display_name": "C++" } Out[4]: Writing cpp/kernel.json The argv field describes the command that is used to launch a C++ kernel. More information can be found in the references below. Finally, let's install this kernel with the following command: In [5]: !jupyter kernelspec install --replace --user cpp Out[5]: [InstallKernelSpec] Installed kernelspec cpp in /Users/cyrille/Library/Jupyter/kernels/cpp The --replace option forces the installation even if the kernel already exists. The --user option serves to install the kernel in the user directory. We can test the installation of the kernel with the following command: In [6]: !jupyter kernelspec list Out[6]: Available kernels: cpp python3 Now, C++ notebooks can be created in the Notebook, as shown in the following screenshot: C++ kernel in the Notebook Finally, wrapper kernels can also be used in the IPython terminal or the Qt console, using the --kernel option, for example ipython console --kernel cpp. Here are a few references: Kernel documentation at http://jupyter-client.readthedocs.org/en/latest/kernels.html Wrapper kernels at http://jupyter-client.readthedocs.org/en/latest/wrapperkernels.html List of kernels at https://github.com/ipython/ipython/wiki/IPython%20kernels%20for%20other%20languages bash kernel at https://github.com/takluyver/bash_kernel R kernel at https://github.com/takluyver/IRkernel Julia kernel at https://github.com/JuliaLang/IJulia.jl Haskell kernel at https://github.com/gibiansky/IHaskell Customizing the Notebook interface with JavaScript The Notebook application exposes a JavaScript API that allows for a high level of customization. In this section, we will create a new button in the Notebook toolbar to renumber the cells. The JavaScript API is not stable and not well-documented. Although the example in this section has been tested with IPython 4.0, nothing guarantees that it will work in future versions without changes. The commented JavaScript code below adds a new Renumber button. In [1]: %%javascript // This function allows us to add buttons // to the Notebook toolbar. IPython.toolbar.add_buttons_group([ { // The button's label. 'label': 'Renumber all code cells', // The button's icon. // See a list of Font-Awesome icons here: // http://fortawesome.github.io/Font-Awesome/icons/ 'icon': 'fa-list-ol', // The callback function called when the button is // pressed. 'callback': function () { // We retrieve the lists of all cells. var cells = IPython.notebook.get_cells(); // We only keep the code cells. cells = cells.filter(function(c) { return c instanceof IPython.CodeCell; }) // We set the input prompt of all code cells. for (var i = 0; i < cells.length; i++) { cells[i].set_input_prompt(i + 1); } } }]); Executing this cell displays a new button in the Notebook toolbar, as shown in the following screenshot: Adding a new button in the Notebook toolbar You can use the jupyter nbextension command to install notebook extensions (use the --help option to see the list of possible commands). Here are a few repositories with custom JavaScript extensions contributed by the community: https://github.com/minrk/ipython_extensions https://github.com/ipython-contrib/IPython-notebook-extensions So, we have covered several customization options of IPython and the Jupyter Notebook, but there’s so much more that can be done. Take a look at the IPython Interactive Computing and Visualization Cookbook to learn how to create your own custom widgets in the Notebook.

0
0
4994

article-image-building-financial-functions-excel-2010

Packt

07 Jul 2011

5 min read

Building Financial Functions into Excel 2010

Packt

07 Jul 2011

5 min read

Excel 2010 Financials Cookbook Powerful techniques for financial organization, analysis, and presentation in Microsoft Excel Till now, in the previous articles, we have focused on manipulating data within and outside of Excel in order to prepare to make financial decisions. Now that the data has been prepared, re-arranged, or otherwise adjusted, we are able to leverage the functions within Excel to make actual decisions. Utilizing these functions and the individual scenarios, we will be able to effectively eliminate the uncertainty due to poor analysis. Since this article utilizes financial scenarios for demonstrating the use of the various functions, it is important to note that these scenarios take certain "unknowns" for granted, and makes a number of assumptions in order to minimize the complexity of the calculation. Real-world scenarios will require a greater focus on calculating and accounting for all variables. Determining standard deviation for assessing risk In the recipes mentioned so far, we have shown the importance of monitoring and analyzing frequency to determine the likelihood that an event will occur. Standard deviation will now allow for an analysis of the frequency in a different manner, or more specifically, through variance. With standard deviation, we will be able to determine the basic top and bottom thresholds of data, and plot general movement within that threshold to determine the variance within the data range. This variance will allow the calculation of risk within investments. As a financial manager, you must determine the risk associated with investing capital in order to gain a return. In this particular instance, you will invest in stock. In order to minimize loss of investment capital, you must determine the risk associated between investing between two different stocks, Stock A, and Stock B. In this recipe, we will utilize standard deviation to determine which stock, either A or B, presents a higher risk, and hence a greater risk of loss. How to do it... We will begin by entering the selling prices of Stock A and Stock B in columns A and B, respectively: Within this list of selling prices, at first glance we can see that Stock B has a higher selling price. The stock opening price and selling price over the course of 52 weeks almost always remains above that of Stock A. As an investor looking to gain a higher return, we may wish to choose Stock B based on this cursory review; however, high selling price does not negate the need for consistency. In cell C2, enter the formula =STDEV(A2:A53) and press Enter: In cell C3, enter the formula =STDEV(B2:B53) and press Enter: We can see from the calculation of standard deviation, that Stock B has a deviation range or variance of over $20, whereas Stock A's variance is just over $9: Given this information, we can determine that Stock A presents a lower risk than Stock B. If we invest in Stock A, at any given time, utilizing past performance, our average risk of loss is $9, whereas in Stock B we an average risk of $20. How it works... The function of STDEV or standard deviation in Excel utilizes the given numbers as a complete population. This means that it does not account for any other changes or unknowns. Excel will use this data set as a complete set and determine the greatest change from high to low within the numbers. This range of change is your standard deviation. Excel also includes the function STDEVP that treats the data as a selection of a larger population. This function should be sed if you are calculating standard deviation on a subset of data (for example, six months out of an entire year). If we translate these numbers into a line graph with standard deviation bars, as shown in the following screenshot for Stock A, you can see the selling prices of the stock, and how they travel within the deviation range: If we translate these numbers into a line graph with standard deviation bars, as shown in the following screenshot for Stock B, you can see the selling prices of the stock, and understand how they travel within the deviation range: The bars shown on the graphs represent the standard deviation as calculated by Excel. We can see visually that not only does Stock B represent a greater risk with the larger deviation, but also many of the stock prices fall below our deviation, representing further risk to the investor. With funds to invest as a finance manager, Stock A represents a lower risk investment. There's more... Standard deviation can be calculated for almost any data set. For this recipe, we calculated deviation over the course of one year; however, if we expand the data to include multiple years we can further determine long-term risk. While Stock B represents high short-term risk, in the long-term analysis, Stock B may present as a less risky investment. Combining standard deviation with a five-number summary analysis, we can further gain risk and performance information.

0
0
4923

Packt

20 Feb 2015

13 min read

The Spark Programming Model

Packt

20 Feb 2015

13 min read

In this article by Nick Pentreath, author of the book Machine Learning with Spark, we will delve into a high-level overview of Spark's design, we will introduce the SparkContext object as well as the Spark shell, which we will use to interactively explore the basics of the Spark programming model. While this section provides a brief overview and examples of using Spark, we recommend that you read the following documentation to get a detailed understanding:Spark Quick Start: http://spark.apache.org/docs/latest/quick-start.htmlSpark Programming guide, which covers Scala, Java, and Python: http://spark.apache.org/docs/latest/programming-guide.html (For more resources related to this topic, see here.) SparkContext and SparkConf The starting point of writing any Spark program is SparkContext (or JavaSparkContext in Java). SparkContext is initialized with an instance of a SparkConf object, which contains various Spark cluster-configuration settings (for example, the URL of the master node). Once initialized, we will use the various methods found in the SparkContext object to create and manipulate distributed datasets and shared variables. The Spark shell (in both Scala and Python, which is unfortunately not supported in Java) takes care of this context initialization for us, but the following lines of code show an example of creating a context running in the local mode in Scala: val conf = new SparkConf().setAppName("Test Spark App").setMaster("local[4]")val sc = new SparkContext(conf) This creates a context running in the local mode with four threads, with the name of the application set to Test Spark App. If we wish to use default configuration values, we could also call the following simple constructor for our SparkContext object, which works in exactly the same way: val sc = new SparkContext("local[4]", "Test Spark App") The Spark shell Spark supports writing programs interactively using either the Scala or Python REPL (that is, the Read-Eval-Print-Loop, or interactive shell). The shell provides instant feedback as we enter code, as this code is immediately evaluated. In the Scala shell, the return result and type is also displayed after a piece of code is run. To use the Spark shell with Scala, simply run ./bin/spark-shell from the Spark base directory. This will launch the Scala shell and initialize SparkContext, which is available to us as the Scala value, sc. Your console output should look similar to the following screenshot: To use the Python shell with Spark, simply run the ./bin/pyspark command. Like the Scala shell, the Python SparkContext object should be available as the Python variable sc. You should see an output similar to the one shown in this screenshot: Resilient Distributed Datasets The core of Spark is a concept called the Resilient Distributed Dataset (RDD). An RDD is a collection of "records" (strictly speaking, objects of some type) that is distributed or partitioned across many nodes in a cluster (for the purposes of the Spark local mode, the single multithreaded process can be thought of in the same way). An RDD in Spark is fault-tolerant; this means that if a given node or task fails (for some reason other than erroneous user code, such as hardware failure, loss of communication, and so on), the RDD can be reconstructed automatically on the remaining nodes and the job will still complete. Creating RDDs RDDs can be created from existing collections, for example, in the Scala Spark shell that you launched earlier: val collection = List("a", "b", "c", "d", "e")val rddFromCollection = sc.parallelize(collection) RDDs can also be created from Hadoop-based input sources, including the local filesystem, HDFS, and Amazon S3. A Hadoop-based RDD can utilize any input format that implements the Hadoop InputFormat interface, including text files, other standard Hadoop formats, HBase, Cassandra, and many more. The following code is an example of creating an RDD from a text file located on the local filesystem: val rddFromTextFile = sc.textFile("LICENSE") The preceding textFile method returns an RDD where each record is a String object that represents one line of the text file. Spark operations Once we have created an RDD, we have a distributed collection of records that we can manipulate. In Spark's programming model, operations are split into transformations and actions. Generally speaking, a transformation operation applies some function to all the records in the dataset, changing the records in some way. An action typically runs some computation or aggregation operation and returns the result to the driver program where SparkContext is running. Spark operations are functional in style. For programmers familiar with functional programming in Scala or Python, these operations should seem natural. For those without experience in functional programming, don't worry; the Spark API is relatively easy to learn. One of the most common transformations that you will use in Spark programs is the map operator. This applies a function to each record of an RDD, thus mapping the input to some new output. For example, the following code fragment takes the RDD we created from a local text file and applies the size function to each record in the RDD. Remember that we created an RDD of Strings. Using map, we can transform each string to an integer, thus returning an RDD of Ints: val intsFromStringsRDD = rddFromTextFile.map(line => line.size) You should see output similar to the following line in your shell; this indicates the type of the RDD: intsFromStringsRDD: org.apache.spark.rdd.RDD[Int] = MappedRDD[5] at map at <console>:14 In the preceding code, we saw the => syntax used. This is the Scala syntax for an anonymous function, which is a function that is not a named method (that is, one defined using the def keyword in Scala or Python, for example). The line => line.size syntax means that we are applying a function where the input variable is to the left of the => operator, and the output is the result of the code to the right of the => operator. In this case, the input is line, and the output is the result of calling line.size. In Scala, this function that maps a string to an integer is expressed as String => Int.This syntax saves us from having to separately define functions every time we use methods such as map; this is useful when the function is simple and will only be used once, as in this example. Now, we can apply a common action operation, count, to return the number of records in our RDD: intsFromStringsRDD.count The result should look something like the following console output: 14/01/29 23:28:28 INFO SparkContext: Starting job: count at <console>:17...14/01/29 23:28:28 INFO SparkContext: Job finished: count at <console>:17, took 0.019227 sres4: Long = 398 Perhaps we want to find the average length of each line in this text file. We can first use the sum function to add up all the lengths of all the records and then divide the sum by the number of records: val sumOfRecords = intsFromStringsRDD.sumval numRecords = intsFromStringsRDD.countval aveLengthOfRecord = sumOfRecords / numRecords The result will be as follows: aveLengthOfRecord: Double = 52.06030150753769 Spark operations, in most cases, return a new RDD, with the exception of most actions, which return the result of a computation (such as Long for count and Double for sum in the preceding example). This means that we can naturally chain together operations to make our program flow more concise and expressive. For example, the same result as the one in the preceding line of code can be achieved using the following code: val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count An important point to note is that Spark transformations are lazy. That is, invoking a transformation on an RDD does not immediately trigger a computation. Instead, transformations are chained together and are effectively only computed when an action is called. This allows Spark to be more efficient by only returning results to the driver when necessary so that the majority of operations are performed in parallel on the cluster. This means that if your Spark program never uses an action operation, it will never trigger an actual computation, and you will not get any results. For example, the following code will simply return a new RDD that represents the chain of transformations: val transformedRDD = rddFromTextFile.map(line => line.size).filter(size => size > 10).map(size => size * 2) This returns the following result in the console: transformedRDD: org.apache.spark.rdd.RDD[Int] = MappedRDD[8] at map at <console>:14 Notice that no actual computation happens and no result is returned. If we now call an action, such as sum, on the resulting RDD, the computation will be triggered: val computation = transformedRDD.sum You will now see that a Spark job is run, and it results in the following console output: ...14/11/27 21:48:21 INFO SparkContext: Job finished: sum at <console>:16, took 0.193513 scomputation: Double = 60468.0 The complete list of transformations and actions possible on RDDs as well as a set of more detailed examples are available in the Spark programming guide (located at http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations), and the API documentation (the Scala API documentation) is located at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD). Caching RDDs One of the most powerful features of Spark is the ability to cache data in memory across a cluster. This is achieved through use of the cache method on an RDD: rddFromTextFile.cache Calling cache on an RDD tells Spark that the RDD should be kept in memory. The first time an action is called on the RDD that initiates a computation, the data is read from its source and put into memory. Hence, the first time such an operation is called, the time it takes to run the task is partly dependent on the time it takes to read the data from the input source. However, when the data is accessed the next time (for example, in subsequent queries in analytics or iterations in a machine learning model), the data can be read directly from memory, thus avoiding expensive I/O operations and speeding up the computation, in many cases, by a significant factor. If we now call the count or sum function on our cached RDD, we will see that the RDD is loaded into memory: val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count Indeed, in the following output, we see that the dataset was cached in memory on the first call, taking up approximately 62 KB and leaving us with around 270 MB of memory free: ...14/01/30 06:59:27 INFO MemoryStore: ensureFreeSpace(63454) called with curMem=32960, maxMem=31138775014/01/30 06:59:27 INFO MemoryStore: Block rdd_2_0 stored as values to memory (estimated size 62.0 KB, free 296.9 MB)14/01/30 06:59:27 INFO BlockManagerMasterActor$BlockManagerInfo:Added rdd_2_0 in memory on 10.0.0.3:55089 (size: 62.0 KB, free: 296.9 MB)... Now, we will call the same function again: val aveLengthOfRecordChainedFromCached = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count We will see from the console output that the cached data is read directly from memory: ...14/01/30 06:59:34 INFO BlockManager: Found block rdd_2_0 locally... Spark also allows more fine-grained control over caching behavior. You can use the persist method to specify what approach Spark uses to cache data. More information on RDD caching can be found here: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence. Broadcast variables and accumulators Another core feature of Spark is the ability to create two special types of variables: broadcast variables and accumulators. A broadcast variable is a read-only variable that is made available from the driver program that runs the SparkContext object to the nodes that will execute the computation. This is very useful in applications that need to make the same data available to the worker nodes in an efficient manner, such as machine learning algorithms. Spark makes creating broadcast variables as simple as calling a method on SparkContext as follows: val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e")) The console output shows that the broadcast variable was stored in memory, taking up approximately 488 bytes, and it also shows that we still have 270 MB available to us: 14/01/30 07:13:32 INFO MemoryStore: ensureFreeSpace(488) called with curMem=96414, maxMem=31138775014/01/30 07:13:32 INFO MemoryStore: Block broadcast_1 stored as values to memory(estimated size 488.0 B, free 296.9 MB)broadCastAList: org.apache.spark.broadcast.Broadcast[List[String]] = Broadcast(1) A broadcast variable can be accessed from nodes other than the driver program that created it (that is, the worker nodes) by calling value on the variable: sc.parallelize(List("1", "2", "3")).map(x => broadcastAList.value ++ x).collect This code creates a new RDD with three records from a collection (in this case, a Scala List) of ("1", "2", "3"). In the map function, it returns a new collection with the relevant record from our new RDD appended to the broadcastAList that is our broadcast variable. Notice that we used the collect method in the preceding code. This is a Spark action that returns the entire RDD to the driver as a Scala (or Python or Java) collection. We will often use collect when we wish to apply further processing to our results locally within the driver program. Note that collect should generally only be used in cases where we really want to return the full result set to the driver and perform further processing. If we try to call collect on a very large dataset, we might run out of memory on the driver and crash our program.It is preferable to perform as much heavy-duty processing on our Spark cluster as possible, preventing the driver from becoming a bottleneck. In many cases, however, collecting results to the driver is necessary, such as during iterations in many machine learning models. On inspecting the result, we will see that for each of the three records in our new RDD, we now have a record that is our original broadcasted List, with the new element appended to it (that is, there is now either "1", "2", or "3" at the end): ...14/01/31 10:15:39 INFO SparkContext: Job finished: collect at <console>:15, took 0.025806 sres6: Array[List[Any]] = Array(List(a, b, c, d, e, 1), List(a, b, c, d, e, 2), List(a, b, c, d, e, 3)) An accumulator is also a variable that is broadcasted to the worker nodes. The key difference between a broadcast variable and an accumulator is that while the broadcast variable is read-only, the accumulator can be added to. There are limitations to this, that is, in particular, the addition must be an associative operation so that the global accumulated value can be correctly computed in parallel and returned to the driver program. Each worker node can only access and add to its own local accumulator value, and only the driver program can access the global value. Accumulators are also accessed within the Spark code using the value method. For more details on broadcast variables and accumulators, see the Shared Variables section of the Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html#shared-variables. Summary In this article, we learned the basics of Spark's programming model and API using the interactive Scala console. Resources for Article: Further resources on this subject: Ridge Regression [article] Clustering with K-Means [article] Machine Learning Examples Applicable to Businesses [article]

0
0
4922

article-image-building-queries-visually-mysql-query-browser

Packt

23 Oct 2009

3 min read

Building Queries Visually in MySQL Query Browser

Packt

23 Oct 2009

3 min read

MySQL Query Browser, one of the open source MySQL GUI tools from MySQL AB, is used for building MySQL database queries visually. In MySQL Query Browser, you build database queries using just your mouse—click, drag and drop! MySQL Query Browser has plenty of visual query building functions and features. This article shows two examples, building Join and Master-detail queries. These examples will demonstrate some of these functions and features. Join Query A pop-up query toolbar will appear when you drag a table or column from the Object Browser’s Schemata tab to the Query Area. You drop the table or column on the pop-up query toolbar’s button to build your query. The following example demonstrates the use of the pop-up query toolbar to build a join query that involves three tables and two types of join (equi and left outer). Drag and drop the product table from the Schemata to Add Table(s) button. A SELECT query on the product table is written in the Query Area. Drag and drop the item table from Schemata to the JOIN Table(s) button on the Pop-up Query Toolbar. The two tables are joined on the foreign-key, product_code. If no foreign-key relationship exists, the drag and drop won’t have any effect. Drag and drop the order table from Schemata to the LEFT OUTER JOIN button on the Pop-up Query Toolbar. Maximize query area by pressing F11. You get a larger query area, and your lines are sequentially numbered (for easier identification). Move the FROM clause to its next line, by putting your cursor just before the FROM word and press Enter. Similarly, move the ON clause to its next line. Now, you can see all lines completely, and that the item table is left join to the order table on their foreign-key relationship column, the order_number column. As of now our query is SELECT *, i.e. selecting all columns from all tables. Let’s now select the columns we’d like to show at the query’s output. For example, drag and drop the order_number from the item table, product_name from the product table, and then quantity from the item table. (If necessary, expand the table folders to see their columns). The sequence of the selecting the columns is reflected in the SELECT clause (from left to right). Note that you can’t select column from the left join of the order table (if you try, nothing will happen) Next, add an additional condition. Drag and drop the amount column on the WHERE button in the Pop-up Query Toolbar. The column is added, with an AND, in the WHERE clause of the query. Type in its condition value, for example, > 1000. To finalize our query, drag and drop product_name on the ORDER button, and then, order_number (from item table, not order table) on the GROUP button. You’ll see that the GROUP BY and ORDER clauses are ordered correctly, i.e. the GROUP BY clause first before the ORDER BY, regardless of your drag & drop sequence. To test your query, click the Execute button. Your query should run without any error, and display its output in the query area (below the query).

0
0
4920

article-image-how-to-win-kaggle-competition-with-apache-sparkml

Savia Lobo

27 Feb 2018

11 min read

How to win Kaggle competition with Apache SparkML

Savia Lobo

27 Feb 2018

11 min read

[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Mastering Apache Spark 2.x - Second Edition written by Romeo Kienzler. The book will introduce you to Project Tungsten and Catalyst, two of the major advancements of Apache Spark 2.x.[/box] In today’s tutorial we will show how to take advantage of Apache SparkML to win a Kaggle competition. We'll use an archived competition offered by BOSCH, a German multinational engineering and electronics company, on production line performance data. The data for this competition represents measurement of parts as they move through Bosch's production line. Each part has a unique Id. The goal is to predict which part will fail quality control (represented by a 'Response' = 1). For more details on the competition data you may visit the website: https://www.kaggle.com/c/bosch-production-line-p erformance/data. Data preparation The challenge data comes in three ZIP packages but we only use two of them. One contains categorical data, one contains continuous data, and the last one contains timestamps of measurements, which we will ignore for now. If you extract the data, you'll get three large CSV files. So the first thing that we want to do is re-encode them into parquet in order to be more space-efficient: def convert(filePrefix : String) = { val basePath = "yourBasePath" var df = spark .read .option("header",true) .option("inferSchema", "true") .csv("basePath+filePrefix+".csv") df = df.repartition(1) df.write.parquet(basePath+filePrefix+".parquet") } convert("train_numeric") convert("train_date") convert("train_categorical") First, we define a function convert that just reads the .csv file and rewrites it as a .parquet file. As you can see, this saves a lot of space: Now we read the files in again as DataFrames from the parquet files : var df_numeric = spark.read.parquet(basePath+"train_numeric.parquet") var df_categorical = spark.read.parquet(basePath+"train_categorical.parquet") Here is the output of the same: This is very high-dimensional data; therefore, we will take only a subset of the columns for this illustration: df_categorical.createOrReplaceTempView("dfcat") var dfcat = spark.sql("select Id, L0_S22_F545 from dfcat") In the following picture, you can see the unique categorical values of that column: Now let's do the same with the numerical dataset: df_numeric.createOrReplaceTempView("dfnum") var dfnum = spark.sql("select Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,Response from dfnum") Here is the output of the same: Finally, we rejoin these two relations: var df = dfcat.join(dfnum,"Id") df.createOrReplaceTempView("df") Then we have to do some NA treatment: var df_notnull = spark.sql(""" select Response as label, case when L0_S22_F545 is null then 'NA' else L0_S22_F545 end as L0_S22_F545, case when L0_S0_F0 is null then 0.0 else L0_S0_F0 end as L0_S0_F0, case when L0_S0_F2 is null then 0.0 else L0_S0_F2 end as L0_S0_F2, case when L0_S0_F4 is null then 0.0 else L0_S0_F4 end as L0_S0_F4 from df """) Feature engineering Now it is time to run the first transformer (which is actually an estimator). It is StringIndexer and needs to keep track of an internal mapping table between strings and indexes. Therefore, it is not a transformer but an estimator: import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer} var indexer = new StringIndexer() .setHandleInvalid("skip") .setInputCol("L0_S22_F545") .setOutputCol("L0_S22_F545Index") var indexed = indexer.fit(df_notnull).transform(df_notnull) indexed.printSchema As we can see clearly in the following image, an additional column called L0_S22_F545Index has been created: Finally, let's examine some content of the newly created column and compare it with the source column. We can clearly see how the category string gets transformed into a float index: Now we want to apply OneHotEncoder, which is a transformer, in order to generate better features for our machine learning model: var encoder = new OneHotEncoder() .setInputCol("L0_S22_F545Index") .setOutputCol("L0_S22_F545Vec") var encoded = encoder.transform(indexed) As you can see in the following figure, the newly created column L0_S22_F545Vec contains org.apache.spark.ml.linalg.SparseVector objects, which is a compressed representation of a sparse vector: Note: Sparse vector representations: The OneHotEncoder, as many other algorithms, returns a sparse vector of the org.apache.spark.ml.linalg.SparseVector type as, according to the definition, only one element of the vector can be one, the rest has to remain zero. This gives a lot of opportunity for compression as only the position of the elements that are non-zero has to be known. Apache Spark uses a sparse vector representation in the following format: (l,[p],[v]), where l stands for length of the vector, p for position (this can also be an array of positions), and v for the actual values (this can be an array of values). So if we get (13,[10],[1.0]), as in our earlier example, the actual sparse vector looks like this: (0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0). So now that we are done with our feature engineering, we want to create one overall sparse vector containing all the necessary columns for our machine learner. This is done using VectorAssembler: import org.apache.spark.ml.feature.VectorAssembler import org.apache.spark.ml.linalg.Vectors var vectorAssembler = new VectorAssembler() .setInputCols(Array("L0_S22_F545Vec", "L0_S0_F0", "L0_S0_F2","L0_S0_F4")) setOutputCol("features") var assembled = vectorAssembler.transform(encoded) We basically just define a list of column names and a target column, and the rest is done for us: As the view of the features column got a bit squashed, let's inspect one instance of the feature field in more detail: We can clearly see that we are dealing with a sparse vector of length 16 where positions 0, 13, 14, and 15 are non-zero and contain the following values: 1.0, 0.03, -0.034, and -0.197. Done! Let's create a Pipeline out of these components. Testing the feature engineering pipeline Let's create a Pipeline out of our transformers and estimators: import org.apache.spark.ml.Pipeline import org.apache.spark.ml.PipelineModel //Create an array out of individual pipeline stages var transformers = Array(indexer,encoder,assembled) var pipeline = new Pipeline().setStages(transformers).fit(df_notnull) var transformed = pipeline.transform(df_notnull) Note that the setStages method of Pipeline just expects an array of transformers and estimators, which we had created earlier. As parts of the Pipeline contain estimators, we have to run fit on our DataFrame first. The obtained Pipeline object takes a DataFrame in the transform method and returns the results of the transformations: As expected, we obtain the very same DataFrame as we had while running the stages individually in a sequence. Training the machine learning model Now it's time to add another component to the Pipeline: the actual machine learning algorithm-RandomForest: import org.apache.spark.ml.classification.RandomForestClassifier var rf = new RandomForestClassifier() .setLabelCol("label") .setFeaturesCol("features") var model = new Pipeline().setStages(transformers :+ rf).fit(df_notnull) var result = model.transform(df_notnull) This code is very straightforward. First, we have to instantiate our algorithm and obtain it as a reference in rf. We could have set additional parameters to the model but we'll do this later in an automated fashion in the CrossValidation step. Then, we just add the stage to our Pipeline, fit it, and finally transform. The fit method, apart from running all upstream stages, also calls fit on the RandomForestClassifier in order to train it. The trained model is now contained within the Pipeline and the transform method actually creates our predictions column: As we can see, we've now obtained an additional column called prediction, which contains the output of the RandomForestClassifier model. Of course, we've only used a very limited subset of available features/columns and have also not yet tuned the model, so we don't expect to do very well; however, let's take a look at how we can evaluate our model easily with Apache SparkML. Model evaluation Without evaluation, a model is worth nothing as we don't know how accurately it performs. Therefore, we will now use the built-in BinaryClassificationEvaluator in order to assess prediction performance and a widely used measure called areaUnderROC (going into detail here is beyond the scope of this book): import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator val evaluator = new BinaryClassificationEvaluator() import org.apache.spark.ml.param.ParamMap var evaluatorParamMap = ParamMap(evaluator.metricName -> "areaUnderROC") var aucTraining = evaluator.evaluate(result, evaluatorParamMap) As we can see, there is a built-in class called org.apache.spark.ml.evaluation.BinaryClassificationEvaluator and there are some other classes for other prediction use cases such as RegressionEvaluator or MulticlassClassificationEvaluator. The evaluator takes a parameter map--in this case, we are telling it to use the areaUnderROC metric--and finally, the evaluate method evaluates the result: As we can see, areaUnderROC is 0.5424418446501833. An ideal classifier would return a score of one. So we are only doing a bit better than random guesses but, as already stated, the number of features that we are looking at is fairly limited. Note : In the previous example we are using the areaUnderROC metric which is used for evaluation of binary classifiers. There exist an abundance of other metrics used for different disciplines of machine learning such as accuracy, precision, recall and F1 score. The following provides a good overview http://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf This areaUnderROC is in fact a very bad value. Let's see if choosing better parameters for our RandomForest model increases this a bit in the next section. This areaUnderROC is in fact a very bad value. Let's see if choosing better parameters for our RandomForest model increases this a bit in the next section. CrossValidation and hyperparameter tuning As explained before, a common step in machine learning is cross-validating your model using testing data against training data and also tweaking the knobs of your machine learning algorithms. Let's use Apache SparkML in order to do this for us, fully automated! First, we have to configure the parameter map and CrossValidator: import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder} var paramGrid = new ParamGridBuilder() .addGrid(rf.numTrees, 3 :: 5 :: 10 :: 30 :: 50 :: 70 :: 100 :: 150 :: Nil) .addGrid(rf.featureSubsetStrategy, "auto" :: "all" :: "sqrt" :: "log2" :: "onethird" :: Nil) .addGrid(rf.impurity, "gini" :: "entropy" :: Nil) .addGrid(rf.maxBins, 2 :: 5 :: 10 :: 15 :: 20 :: 25 :: 30 :: Nil) .addGrid(rf.maxDepth, 3 :: 5 :: 10 :: 15 :: 20 :: 25 :: 30 :: Nil) .build() var crossValidator = new CrossValidator() .setEstimator(new Pipeline().setStages(transformers :+ rf)) .setEstimatorParamMaps(paramGrid) .setNumFolds(5) .setEvaluator(evaluator) var crossValidatorModel = crossValidator.fit(df_notnull) var newPredictions = crossValidatorModel.transform(df_notnull) The org.apache.spark.ml.tuning.ParamGridBuilder is used in order to define the hyperparameter space where the CrossValidator has to search and finally, the org.apache.spark.ml.tuning.CrossValidator takes our Pipeline, the hyperparameter space of our RandomForest classifier, and the number of folds for the CrossValidation as parameters. Now, as usual, we just need to call fit and transform on the CrossValidator and it will basically run our Pipeline multiple times and return a model that performs the best. Do you know how many different models are trained? Well, we have five folds on CrossValidation and five-dimensional hyperparameter space cardinalities between two and eight, so let's do the math: 5 * 8 * 5 * 2 * 7 * 7 = 19600 times! Using the evaluator to assess the quality of the cross-validated and tuned model Now that we've optimized our Pipeline in a fully automatic fashion, let's see how our best model can be obtained: var bestPipelineModel = crossValidatorModel.bestModel.asInstanceOf[PipelineModel] var stages = bestPipelineModel.stages import org.apache.spark.ml.classification.RandomForestClassificationModel val rfStage = stages(stages.length-1).asInstanceOf[RandomForestClassificationModel] rfStage.getNumTrees rfStage.getFeatureSubsetStrategy rfStage.getImpurity rfStage.getMaxBins rfStage.getMaxDepth The crossValidatorModel.bestModel code basically returns the best Pipeline. Now we use bestPipelineModel.stages to obtain the individual stages and obtain the tuned RandomForestClassificationModel using stages(stages.length 1).asInstanceOf[RandomForestClassificationModel]. Note that stages.length-1 addresses the last stage in the Pipeline, which is our RandomForestClassifier. So now, we can basically run evaluator using the best model and see how it performs: You might have noticed that 0.5362224872557545 is less than 0.5424418446501833, as we've obtained before. So why is this the case? Actually, this time we used cross-validation, which means that the model is less likely to over fit and therefore the score is a bit lower. So let's take a look at the parameters of the best model: Note that we've limited the hyperparameter space, so numTrees, maxBins, and maxDepth have been limited to five, and bigger trees will most likely perform better. So feel free to play around with this code and add features, and also use a bigger hyperparameter space, say, bigger trees. Finally, we've applied the concepts that we discussed on a real dataset from a Kaggle competition, which is a good starting point for your own machine learning project with Apache SparkML. If you found our post useful, do check out this book Mastering Apache Spark 2.x - Second Edition to know more about advanced analytics on your Big Data with latest Apache Spark 2.x.

0
0
4908

Packt

14 Aug 2013

6 min read

Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate)

Packt

14 Aug 2013

6 min read

(For more resources related to this topic, see here.) Often, we use Hadoop to calculate analytics, which are basic statistics about data. In such cases, we walk through the data using Hadoop and calculate interesting statistics about the data. Some of the common analytics are show as follows: Calculating statistical properties like minimum, maximum, mean, median, standard deviation, and so on of a dataset. For a dataset, generally there are multiple dimensions (for example, when processing HTTP access logs, names of the web page, the size of the web page, access time, and so on, are few of the dimensions). We can measure the previously mentioned properties by using one or more dimensions. For example, we can group the data into multiple groups and calculate the mean value in each case. Frequency distributions histogram counts the number of occurrences of each item in the dataset, sorts these frequencies, and plots different items as X axis and frequency as Y axis. Finding a correlation between two dimensions (for example, correlation between access count and the file size of web accesses). Hypothesis testing: To verify or disprove a hypothesis using a given dataset. However, Hadoop will only generate numbers. Although the numbers contain all the information, we humans are very bad at figuring out overall trends by just looking at numbers. On the other hand, the human eye is remarkably good at detecting patterns, and plotting the data often yields us a deeper understanding of the data. Therefore, we often plot the results of Hadoop jobs using some plotting program. Getting ready This article assumes that you have access to a computer that has Java installed and the JAVA_HOME variable configured. Download a Hadoop distribution 1.1.x from http://hadoop.apache.org/releases.html page. Unzip the distribution, we will call this directory HADOOP_HOME. Download the sample code for the article and copy the data files. How to do it... If you have not already done so, let us upload the amazon dataset to the HDFS filesystem using the following commands: >bin/hadoopdfs -mkdir /data/>bin/hadoopdfs -mkdir /data/amazon-dataset>bin/hadoopdfs -put <SAMPLE_DIR>/amazon-meta.txt /data/amazondataset/>bin/hadoopdfs -ls /data/amazon-dataset Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME. Run the first MapReduce job to calculate the buying frequency. To do that run the following command from HADOOP_HOME: $ bin/hadoop jar hadoop-microbook.jar microbook.frequency.BuyingFrequencyAnalyzer/data/amazon-dataset /data/frequencyoutput1 Use the following command to run the second MapReduce job to sort the results of the first MapReduce job: $ bin/hadoop jar hadoop-microbook.jar microbook.frequency.SimpleResultSorter /data/frequency-output1 frequency-output2 You can find the results from the output directory. Copy results to HADOOP_HOME using the following command: $ bin/Hadoop dfs -get /data/frequency-output2/part-r-00000 1.data Copy all the *.plot files from SAMPLE_DIR to HADOOP_HOME. Generate the plot by running the following command from HADOOP_HOME. $gnuplot buyfreq.plot It will generate a file called buyfreq.png, which will look like the following: As the figure depicts, few buyers have brought a very large number of items. The distribution is much steeper than normal distribution, and often follows what we call a Power Law distribution. This is an example that analytics and plotting results would give us insight into, underlying patterns in the dataset. How it works... You can find the mapper and reducer code at src/microbook/frequency/BuyingFrequencyAnalyzer.java. This figure shows the execution of two MapReduce jobs. Also the following code listing shows the map function and the reduce function of the first job: public void map(Object key, Text value, Context context) throwsIOException, InterruptedException {List<BuyerRecord> records = BuyerRecord.parseAItemLine(value.toString());for(BuyerRecord record: records){context.write(new Text(record.customerID), new IntWritable(record.itemsBrought.size()));}}public void reduce(Text key, Iterable<IntWritable> values, Context context) {int sum = 0;for (IntWritableval : values) {sum += val.get();}result.set(sum);context.write(key, result);} As shown by the figure, Hadoop will read the input file from the input folder and read records using the custom formatter we introduced in the Writing a formatter (Intermediate) article. It invokes the mapper once per each record, passing the record as input. The mapper extracts the customer ID and the number of items the customer has brought, and emits the customer ID as the key and number of items as the value. Then, Hadoop sorts the key-value pairs by the key and invokes a reducer once for each key passing all values for that key as inputs to the reducer. Each reducer sums up all item counts for each customer ID and emits the customer ID as the key and the count as the value in the results. Then the second job sorted the results. It reads output of the first job as the result and passes each line as argument to the map function. The map function extracts the customer ID and the number of items from the line and emits the number of items as the key and the customer ID as the value. Hadoop will sort the key-value pairs by the key, thus sorting them by the number of items, and invokes the reducer once per key in the same order. Therefore, the reducer prints them out in the same order essentially sorting the dataset. Since we have generated the results, let us look at the plotting. You can find the source for the gnuplot file from buyfreq.plot. The source for the plot will look like the following: set terminal pngset output "buyfreq.png"set title "Frequency Distribution of Items brought by Buyer";setylabel "Number of Items Brought";setxlabel "Buyers Sorted by Items count";set key left topset log yset log xplot "1.data" using 2 title "Frequency" with linespoints Here the first two lines define the output format. This example uses png, but gnuplot supports many other terminals such as screen, pdf, and eps. The next four lines define the axis labels and the title, and the next two lines define the scale of each axis, and this plot uses log scale for both. The last line defines the plot. Here, it is asking gnuplot to read the data from the 1.data file, and to use the data in the second column of the file via using 2, and to plot it using lines. Columns must be separated by whitespaces. Here if you want to plot one column against another, for example data from column 1 against column 2, you should write using 1:2 instead of using 2. There's more... We can use a similar method to calculate the most types of analytics and plot the results. Refer to the freely available article of Hadoop MapReduce Cookbook, Srinath Perera and Thilina Gunarathne, Packt Publishing at http://www.packtpub.com/article/advanced-hadoop-mapreduce-administration for more information. Summary In this article, we have learned how to process Amazon data with MapReduce, generate data for a histogram, and plot it using gnuplot. Resources for Article : Further resources on this subject: Advanced Hadoop MapReduce Administration [Article] Comparative Study of NoSQL Products [Article] HBase Administration, Performance Tuning [Article]

0
0
4878

Packt

22 Jun 2015

25 min read

The pandas Data Structures

Packt

22 Jun 2015

25 min read

0
0
4873

Packt

06 Apr 2015

15 min read

Working with Blender

Packt

06 Apr 2015

15 min read

In this article by Jos Dirksen, author of Learning Three.js – the JavaScript 3D Library for WebGL - Second Edition, we will learn about Blender and also about how to load models in Three.js using different formats. (For more resources related to this topic, see here.) Before we get started with the configuration, we'll show the result that we'll be aiming for. In the following screenshot, you can see a simple Blender model that we exported with the Three.js plugin and imported in Three.js with THREE.JSONLoader: Installing the Three.js exporter in Blender To get Blender to export Three.js models, we first need to add the Three.js exporter to Blender. The following steps are for Mac OS X but are pretty much the same on Windows and Linux. You can download Blender from www.blender.org and follow the platform-specific installation instructions. After installation, you can add the Three.js plugin. First, locate the addons directory from your Blender installation using a terminal window: On my Mac, it's located here: ./blender.app/Contents/MacOS/2.70/scripts/addons. For Windows, this directory can be found at the following location: C:UsersUSERNAMEAppDataRoamingBlender FoundationBlender2.7Xscriptsaddons. And for Linux, you can find this directory here: /home/USERNAME/.config/blender/2.7X/scripts/addons. Next, you need to get the Three.js distribution and unpack it locally. In this distribution, you can find the following folder: utils/exporters/blender/2.65/scripts/addons/. In this directory, there is a single subdirectory with the name io_mesh_threejs. Copy this directory to the addons folder of your Blender installation. Now, all we need to do is start Blender and enable the exporter. In Blender, open Blender User Preferences (File | User Preferences). In the window that opens, select the Addons tab, and in the search box, type three. This will show the following screen: At this point, the Three.js plugin is found, but it is still disabled. Check the small checkbox to the right, and the Three.js exporter will be enabled. As a final check to see whether everything is working correctly, open the File | Export menu option, and you'll see Three.js listed as an export option. This is shown in the following screenshot: With the plugin installed, we can load our first model. Loading and exporting a model from Blender As an example, we've added a simple Blender model named misc_chair01.blend in the assets/models folder, which you can find in the sources for this article. In this section, we'll load this model and show the minimal steps it takes to export this model to Three.js. First, we need to load this model in Blender. Use File | Open and navigate to the folder containing the misc_chair01.blend file. Select this file and click on Open. This will show you a screen that looks somewhat like this: Exporting this model to the Three.js JSON format is pretty straightforward. From the File menu, open Export | Three.js, type in the name of the export file, and select Export Three.js. This will create a JSON file in a format Three.js understands. A part of the contents of this file is shown next: { "metadata" : { "formatVersion" : 3.1, "generatedBy" : "Blender 2.7 Exporter", "vertices" : 208, "faces" : 124, "normals" : 115, "colors" : 0, "uvs" : [270,151], "materials" : 1, "morphTargets" : 0, "bones" : 0 }, ... However, we aren't completely done. In the previous screenshot, you can see that the chair contains a wooden texture. If you look through the JSON export, you can see that the export for the chair also specifies a material, as follows: "materials": [{ "DbgColor": 15658734, "DbgIndex": 0, "DbgName": "misc_chair01", "blending": "NormalBlending", "colorAmbient": [0.53132, 0.25074, 0.147919], "colorDiffuse": [0.53132, 0.25074, 0.147919], "colorSpecular": [0.0, 0.0, 0.0], "depthTest": true, "depthWrite": true, "mapDiffuse": "misc_chair01_col.jpg", "mapDiffuseWrap": ["repeat", "repeat"], "shading": "Lambert", "specularCoef": 50, "transparency": 1.0, "transparent": false, "vertexColors": false }], This material specifies a texture, misc_chair01_col.jpg, for the mapDiffuse property. So, besides exporting the model, we also need to make sure the texture file is also available to Three.js. Luckily, we can save this texture directly from Blender. In Blender, open the UV/Image Editor view. You can select this view from the drop-down menu on the left-hand side of the File menu option. This will replace the top menu with the following: Make sure the texture you want to export is selected, misc_chair_01_col.jpg in our case (you can select a different one using the small image icon). Next, click on the Image menu and use the Save as Image menu option to save the image. Save it in the same folder where you saved the model using the name specified in the JSON export file. At this point, we're ready to load the model into Three.js. The code to load this into Three.js at this point looks like this: var loader = new THREE.JSONLoader(); loader.load('../assets/models/misc_chair01.js', function (geometry, mat) { mesh = new THREE.Mesh(geometry, mat[0]); mesh.scale.x = 15; mesh.scale.y = 15; mesh.scale.z = 15; scene.add(mesh); }, '../assets/models/'); We've already seen JSONLoader before, but this time, we use the load function instead of the parse function. In this function, we specify the URL we want to load (points to the exported JSON file), a callback that is called when the object is loaded, and the location, ../assets/models/, where the texture can be found (relative to the page). This callback takes two parameters: geometry and mat. The geometry parameter contains the model, and the mat parameter contains an array of material objects. We know that there is only one material, so when we create THREE.Mesh, we directly reference that material. If you open the 05-blender-from-json.html example, you can see the chair we just exported from Blender. Using the Three.js exporter isn't the only way of loading models from Blender into Three.js. Three.js understands a number of 3D file formats, and Blender can export in a couple of those formats. Using the Three.js format, however, is very easy, and if things go wrong, they are often quickly found. In the following section, we'll look at a couple of the formats Three.js supports and also show a Blender-based example for the OBJ and MTL file formats. Importing from 3D file formats At the beginning of this article, we listed a number of formats that are supported by Three.js. In this section, we'll quickly walk through a couple of examples for those formats. Note that for all these formats, an additional JavaScript file needs to be included. You can find all these files in the Three.js distribution in the examples/js/loaders directory. The OBJ and MTL formats OBJ and MTL are companion formats and often used together. The OBJ file defines the geometry, and the MTL file defines the materials that are used. Both OBJ and MTL are text-based formats. A part of an OBJ file looks like this: v -0.032442 0.010796 0.025935 v -0.028519 0.013697 0.026201 v -0.029086 0.014533 0.021409 usemtl Material s 1 f 2731 2735 2736 2732 f 2732 2736 3043 3044 The MTL file defines materials like this: newmtl Material Ns 56.862745 Ka 0.000000 0.000000 0.000000 Kd 0.360725 0.227524 0.127497 Ks 0.010000 0.010000 0.010000 Ni 1.000000 d 1.000000 illum 2 The OBJ and MTL formats by Three.js are understood well and are also supported by Blender. So, as an alternative, you could choose to export models from Blender in the OBJ/MTL format instead of the Three.js JSON format. Three.js has two different loaders you can use. If you only want to load the geometry, you can use OBJLoader. We used this loader for our example (06-load-obj.html). The following screenshot shows this example: To import this in Three.js, you have to add the OBJLoader JavaScript file: <script type="text/javascript" src="../libs/OBJLoader.js"> </script> Import the model like this: var loader = new THREE.OBJLoader(); loader.load('../assets/models/pinecone.obj', function (loadedMesh) { var material = new THREE.MeshLambertMaterial({color: 0x5C3A21}); // loadedMesh is a group of meshes. For // each mesh set the material, and compute the information // three.js needs for rendering. loadedMesh.children.forEach(function (child) { child.material = material; child.geometry.computeFaceNormals(); child.geometry.computeVertexNormals(); }); mesh = loadedMesh; loadedMesh.scale.set(100, 100, 100); loadedMesh.rotation.x = -0.3; scene.add(loadedMesh); }); In this code, we use OBJLoader to load the model from a URL. Once the model is loaded, the callback we provide is called, and we add the model to the scene. Usually, a good first step is to print out the response from the callback to the console to understand how the loaded object is built up. Often with these loaders, the geometry or mesh is returned as a hierarchy of groups. Understanding this makes it much easier to place and apply the correct material and take any other additional steps. Also, look at the position of a couple of vertices to determine whether you need to scale the model up or down and where to position the camera. In this example, we've also made the calls to computeFaceNormals and computeVertexNormals. This is required to ensure that the material used (THREE.MeshLambertMaterial) is rendered correctly. The next example (07-load-obj-mtl.html) uses OBJMTLLoader to load a model and directly assign a material. The following screenshot shows this example: First, we need to add the correct loaders to the page: <script type="text/javascript" src="../libs/OBJLoader.js"> </script> <script type="text/javascript" src="../libs/MTLLoader.js"> </script> <script type="text/javascript" src="../libs/OBJMTLLoader.js"> </script> We can load the model from the OBJ and MTL files like this: var loader = new THREE.OBJMTLLoader(); loader.load('../assets/models/butterfly.obj', '../assets/ models/butterfly.mtl', function(object) { // configure the wings var wing2 = object.children[5].children[0]; var wing1 = object.children[4].children[0]; wing1.material.opacity = 0.6; wing1.material.transparent = true; wing1.material.depthTest = false; wing1.material.side = THREE.DoubleSide; wing2.material.opacity = 0.6; wing2.material.depthTest = false; wing2.material.transparent = true; wing2.material.side = THREE.DoubleSide; object.scale.set(140, 140, 140); mesh = object; scene.add(mesh); mesh.rotation.x = 0.2; mesh.rotation.y = -1.3; }); The first thing to mention before we look at the code is that if you receive an OBJ file, an MTL file, and the required texture files, you'll have to check how the MTL file references the textures. These should be referenced relative to the MTL file and not as an absolute path. The code itself isn't that different from the one we saw for THREE.ObjLoader. We specify the location of the OBJ file, the location of the MTL file, and the function to call when the model is loaded. The model we've used as an example in this case is a complex model. So, we set some specific properties in the callback to fix some rendering issues, as follows: The opacity in the source files was set incorrectly, which caused the wings to be invisible. So, to fix that, we set the opacity and transparent properties ourselves. By default, Three.js only renders one side of an object. Since we look at the wings from two sides, we need to set the side property to the THREE.DoubleSide value. The wings caused some unwanted artifacts when they needed to be rendered on top of each other. We've fixed that by setting the depthTest property to false. This has a slight impact on performance but can often solve some strange rendering artifacts. But, as you can see, you can easily load complex models directly into Three.js and render them in real time in your browser. You might need to fine-tune some material properties though. Loading a Collada model Collada models (extension is .dae) are another very common format for defining scenes and models (and animations as well). In a Collada model, it is not just the geometry that is defined, but also the materials. It's even possible to define light sources. To load Collada models, you have to take pretty much the same steps as for the OBJ and MTL models. You start by including the correct loader: <script type="text/javascript" src="../libs/ColladaLoader.js"> </script> For this example, we'll load the following model: Loading a truck model is once again pretty simple: var mesh; loader.load("../assets/models/dae/Truck_dae.dae", function (result) { mesh = result.scene.children[0].children[0].clone(); mesh.scale.set(4, 4, 4); scene.add(mesh); }); The main difference here is the result of the object that is returned to the callback. The result object has the following structure: var result = { scene: scene, morphs: morphs, skins: skins, animations: animData, dae: { ... } }; In this article, we're interested in the objects that are in the scene parameter. I first printed out the scene to the console to look where the mesh was that I was interested in, which was result.scene.children[0].children[0]. All that was left to do was scale it to a reasonable size and add it to the scene. A final note on this specific example—when I loaded this model for the first time, the materials didn't render correctly. The reason was that the textures used the .tga format, which isn't supported in WebGL. To fix this, I had to convert the .tga files to .png and edit the XML of the .dae model to point to these .png files. As you can see, for most complex models, including materials, you often have to take some additional steps to get the desired results. By looking closely at how the materials are configured (using console.log()) or replacing them with test materials, problems are often easy to spot. Loading the STL, CTM, VTK, AWD, Assimp, VRML, and Babylon models We're going to quickly skim over these file formats as they all follow the same principles: Include [NameOfFormat]Loader.js in your web page. Use [NameOfFormat]Loader.load() to load a URL. Check what the response format for the callback looks like and render the result. We have included an example for all these formats: Name Example Screenshot STL 08-load-STL.html CTM 09-load-CTM.html VTK 10-load-vtk.html AWD 11-load-awd.html Assimp 12-load-assimp.html VRML 13-load-vrml.html Babylon The Babylon loader is slightly different from the other loaders in this table. With this loader, you don't load a single THREE.Mesh or THREE.Geometry instance, but with this loader, you load a complete scene, including lights. 14-load-babylon.html If you look at the source code for these examples, you might see that for some of them, we need to change some material properties or do some scaling before the model is rendered correctly. The reason we need to do this is because of the way the model is created in its external application, giving it different dimensions and grouping than we normally use in Three.js. Summary In this article, we've almost shown all the supported file formats. Using models from external sources isn't that hard to do in Three.js. Especially for simple models, you only have to take a few simple steps. When working with external models, or creating them using grouping and merging, it is good to keep a couple of things in mind. The first thing you need to remember is that when you group objects, they still remain available as individual objects. Transformations applied to the parent also affect the children, but you can still transform the children individually. Besides grouping, you can also merge geometries together. With this approach, you lose the individual geometries and get a single new geometry. This is especially useful when you're dealing with thousands of geometries you need to render and you're running into performance issues. Three.js supports a large number of external formats. When using these format loaders, it's a good idea to look through the source code and log out the information received in the callback. This will help you to understand the steps you need to take to get the correct mesh and set it to the correct position and scale. Often, when the model doesn't show correctly, this is caused by its material settings. It could be that incompatible texture formats are used, opacity is incorrectly defined, or the format contains incorrect links to the texture images. It is usually a good idea to use a test material to determine whether the model itself is loaded correctly and log the loaded material to the JavaScript console to check for unexpected values. It is also possible to export meshes and scenes, but remember that GeometryExporter, SceneExporter, and SceneLoader of Three.js are still work in progress. Resources for Article: Further resources on this subject: Creating the maze and animating the cube [article] Mesh animation [article] Working with the Basic Components That Make Up a Three.js Scene [article]

0
0
4871

article-image-creating-our-first-universe

Packt

22 Sep 2014

18 min read

Creating Our First Universe

Packt

22 Sep 2014

18 min read

0
0
4864

article-image-heart-diseases-prediction-using-spark-200

Packt

18 Oct 2016

16 min read

Heart Diseases Prediction using Spark 2.0.0

Packt

18 Oct 2016

16 min read

0
0
4852

How-To Tutorials - Data

Factor variables in R

Pattern Mining using Spark MLlib - Part 1

SQL Tuning Enhancements in Oracle 12c

How to implement discrete convolution on a 2D dataset

Basic Concepts of Machine Learning and Logistic Regression Example in Mahout

Customizing IPython

Building Financial Functions into Excel 2010

The Spark Programming Model

Building Queries Visually in MySQL Query Browser

How to win Kaggle competition with Apache SparkML

Trending Topics

Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate)

The pandas Data Structures

Working with Blender

Creating Our First Universe

Heart Diseases Prediction using Spark 2.0.0

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access