
How-To Tutorials


Implementing Object detection with Go using TensorFlow

Kunal Parikh
07 Nov 2017
10 min read
Note: The following is an excerpt from the book Machine Learning with Go, Chapter 8, Neural Networks and Deep Learning, written by Daniel Whitenack. The associated code bundle is available at the end of the article.

Deep learning models are powerful, especially for tasks like computer vision. However, you should also keep in mind that complicated combinations of these neural net components are extremely hard to interpret; that is, determining why the model made a certain prediction can be near impossible. This can be a problem when you need to maintain compliance in certain industries and jurisdictions, and it might also inhibit debugging or maintenance of your applications. That being said, there are some major efforts to improve the interpretability of deep learning models. Notable among these efforts is the LIME project.

Deep learning with Go

There are a variety of options when you are looking to build or utilize deep learning models from Go. This, as with deep learning itself, is an ever-changing landscape. However, the options for building, training, and utilizing deep learning models in Go are generally as follows:

- Use a Go package: There are Go packages that allow you to use Go as your main interface to build and train deep learning models. The most feature-rich and developed of these packages is Gorgonia. It treats Go as a first-class citizen and is written in Go, even if it does make significant use of cgo to interface with numerical libraries. (A short sketch of what Gorgonia code looks like follows this list.)
- Use an API or Go client for a non-Go DL framework: You can interface with popular deep learning services and frameworks from Go, including TensorFlow, MachineBox, H2O, and the various cloud providers or third-party API offerings (such as IBM Watson). TensorFlow and MachineBox actually have Go bindings or SDKs, which are continually improving. For the other services, you may need to interact via REST or even call binaries using exec.
- Use cgo: Of course, Go can talk to and integrate with C/C++ libraries for deep learning, including the TensorFlow libraries and various libraries from Intel. However, this is a difficult road, and it is only recommended when absolutely necessary.
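To give a feel for the first option, here is a minimal Gorgonia sketch that builds and runs a tiny computation graph. This example is illustrative only and is not from the book; the import path shown is the one used by recent Gorgonia releases (older versions lived under github.com/chewxy/gorgonia):

```go
package main

import (
	"fmt"
	"log"

	"gorgonia.org/gorgonia"
)

func main() {
	// Build a symbolic graph computing z = x + y.
	g := gorgonia.NewGraph()
	x := gorgonia.NewScalar(g, gorgonia.Float64, gorgonia.WithName("x"))
	y := gorgonia.NewScalar(g, gorgonia.Float64, gorgonia.WithName("y"))
	z, err := gorgonia.Add(x, y)
	if err != nil {
		log.Fatal(err)
	}

	// Bind concrete values and execute the graph on a tape machine.
	gorgonia.Let(x, 2.0)
	gorgonia.Let(y, 2.5)
	machine := gorgonia.NewTapeMachine(g)
	defer machine.Close()
	if err := machine.RunAll(); err != nil {
		log.Fatal(err)
	}
	fmt.Println(z.Value()) // 4.5
}
```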
As TensorFlow is by far the most popular framework for deep learning (at the moment), we will briefly explore the second category listed here. However, the TensorFlow Go bindings are under active development and some functionality is quite crude at the moment. The TensorFlow team recommends that if you are going to use a TensorFlow model in Go, you first train and export this model using Python. That pre-trained model can then be utilized from Go, as we will demonstrate in the next section. There are a number of members of the community working very hard to make Go more of a first-class citizen for TensorFlow. As such, it is likely that the rough edges of the TensorFlow bindings will be smoothed over the coming year.

Setting up TensorFlow for use with Go

The TensorFlow team has provided some good docs to install TensorFlow and get it ready for usage with Go. These docs can be found here. There are a couple of preliminary steps, but once you have the TensorFlow C libraries installed, you can get the following Go package:

```
$ go get github.com/tensorflow/tensorflow/tensorflow/go
```

Everything should be good to go if you were able to get github.com/tensorflow/tensorflow/tensorflow/go without error, but you can make sure that you are ready to use TensorFlow by executing the following tests:

```
$ go test github.com/tensorflow/tensorflow/tensorflow/go
ok      github.com/tensorflow/tensorflow/tensorflow/go 0.045s
```

Retrieving and calling a pretrained TensorFlow model

The model that we are going to use is a Google model for object recognition in images called Inception. The model can be retrieved as follows:

```
$ mkdir model
$ cd model
$ wget https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip
--2017-09-09 18:29:03--  https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.6.112, 2607:f8b0:4009:812::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.6.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49937555 (48M) [application/zip]
Saving to: 'inception5h.zip'

inception5h.zip  100%[===================>]  47.62M  19.0MB/s  in 2.5s

2017-09-09 18:29:06 (19.0 MB/s) - 'inception5h.zip' saved [49937555/49937555]

$ unzip inception5h.zip
Archive: inception5h.zip
  inflating: imagenet_comp_graph_label_strings.txt
  inflating: tensorflow_inception_graph.pb
  inflating: LICENSE
```

After unzipping the compressed model, you should see a *.pb file. This is a protobuf file that represents a frozen state of the model. Think back to our simple neural network: the network was fully defined by a series of weights and biases. Although more complicated, this model can be defined in a similar way, and these definitions are stored in this protobuf file.

To call this model, we will use some example code from the TensorFlow Go bindings docs. This code loads the model and uses it to detect and label the contents of a *.jpg image. As the code is included in the TensorFlow docs, I will spare the details and just highlight a couple of snippets. To load the model, we perform the following:

```go
// Load the serialized GraphDef from a file.
modelfile, labelsfile, err := modelFiles(*modeldir)
if err != nil {
	log.Fatal(err)
}
model, err := ioutil.ReadFile(modelfile)
if err != nil {
	log.Fatal(err)
}
```

Then we load the graph definition of the deep learning model and create a new TensorFlow session with the graph, as shown in the following code:

```go
// Construct an in-memory graph from the serialized form.
graph := tf.NewGraph()
if err := graph.Import(model, ""); err != nil {
	log.Fatal(err)
}

// Create a session for inference over graph.
session, err := tf.NewSession(graph, nil)
if err != nil {
	log.Fatal(err)
}
defer session.Close()
```
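The modelFiles helper called above is part of the example program but not reproduced here. A minimal sketch, assuming the filenames from the unzipped inception5h archive, might look like the following:

```go
import (
	"os"
	"path/filepath"
)

// modelFiles returns the paths to the frozen graph and the labels file in
// the given model directory. The filenames are assumed to match the
// contents of the unzipped inception5h archive.
func modelFiles(dir string) (modelfile, labelsfile string, err error) {
	modelfile = filepath.Join(dir, "tensorflow_inception_graph.pb")
	labelsfile = filepath.Join(dir, "imagenet_comp_graph_label_strings.txt")
	if _, err = os.Stat(modelfile); err != nil {
		return "", "", err
	}
	return modelfile, labelsfile, nil
}
```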
Finally, we can make an inference using the model as follows:

```go
// Run inference on *imageFile.
// For multiple images, session.Run() can be called in a loop (and
// concurrently). Alternatively, images can be batched since the model
// accepts batches of image data as input.
tensor, err := makeTensorFromImage(*imagefile)
if err != nil {
	log.Fatal(err)
}
output, err := session.Run(
	map[tf.Output]*tf.Tensor{
		graph.Operation("input").Output(0): tensor,
	},
	[]tf.Output{
		graph.Operation("output").Output(0),
	},
	nil)
if err != nil {
	log.Fatal(err)
}

// output[0].Value() is a vector containing probabilities of
// labels for each image in the "batch". The batch size was 1.
// Find the most probable label index.
probabilities := output[0].Value().([][]float32)[0]
printBestLabel(probabilities, labelsfile)
```
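Like modelFiles, the printBestLabel helper is not shown in this excerpt. A sketch of what it needs to do, finding the index with the highest probability and printing the corresponding line of the labels file, might look like this:

```go
import (
	"bufio"
	"fmt"
	"log"
	"os"
)

// printBestLabel finds the most probable label index and prints the
// matching line from the labels file along with its probability.
func printBestLabel(probabilities []float32, labelsFile string) {
	bestIdx := 0
	for i, p := range probabilities {
		if p > probabilities[bestIdx] {
			bestIdx = i
		}
	}
	// The labels file contains one label per line, in index order.
	file, err := os.Open(labelsFile)
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()
	var labels []string
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		labels = append(labels, scanner.Text())
	}
	fmt.Printf("BEST MATCH: (%2.0f%% likely) %s\n",
		probabilities[bestIdx]*100.0, labels[bestIdx])
}
```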
Object detection with Go using TensorFlow

The Go program for object detection, as specified in the TensorFlow GoDocs, can be called as follows:

```
$ ./myprogram -dir=<path/to/the/model/dir> -image=<path/to/a/jpg/image>
```

When the program is called, it will utilize the pretrained and loaded model to infer the contents of the specified image. It will then output the most likely contents of that image along with its calculated probability. To illustrate this, let's try performing the object detection on an image of an airplane, saved as airplane.jpg. Running the TensorFlow model from Go gives the following results:

```
$ go build
$ ./myprogram -dir=model -image=airplane.jpg
2017-09-09 20:17:30.655757: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
[... similar warnings for SSE4.2, AVX, AVX2, and FMA ...]
BEST MATCH: (86% likely) airliner
```

After some suggestions about speeding up CPU computations, we get a result: airliner. Wow! That's pretty cool. We just performed object recognition with TensorFlow right from our Go program!

Let's try another one. This time, we will use pug.jpg. Running our program again with this image gives the following:

```
$ ./myprogram -dir=model -image=pug.jpg
2017-09-09 20:20:32.323855: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
[... similar warnings for SSE4.2, AVX, AVX2, and FMA ...]
BEST MATCH: (84% likely) pug
```

Success! Not only did the model detect that there was a dog in the picture, it correctly identified that there was a pug in the picture.

Let's try just one more. As this is a Go article, we cannot resist trying gopher.jpg (huge thanks to Renee French, the artist behind the Go gopher). Running the model gives the following result:

```
$ ./myprogram -dir=model -image=gopher.jpg
2017-09-09 20:25:57.967753: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
[... similar warnings for SSE4.2, AVX, AVX2, and FMA ...]
BEST MATCH: (12% likely) safety pin
```

Well, I guess we can't win them all. Looks like we need to refactor our model to be able to recognize Go gophers. More specifically, we should probably add a bunch of Go gophers to our training dataset, because a Go gopher is definitely not a safety pin!

Download: The code for this exercise is available here.

Summary

Congratulations! We have gone from parsing data with Go to calling deep learning models from Go. You now know the basics of neural networks and can implement them and utilize them in your Go programs. In the next chapter, we will discuss how to get these models and applications off of your laptops and run them at production scale in data pipelines.

If you enjoyed the above excerpt from the book Machine Learning with Go, check out the book to learn how to build machine learning apps with Go.


Pattern mining using Spark MLlib - Part 2

Aarthi Kumaraswamy
06 Nov 2017
15 min read
Note: The following is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava.

In part 1 of the tutorial, we motivated and introduced three pattern mining problems along with the necessary notation to properly talk about them. In part 2, we will now discuss how each of these problems can be solved with an algorithm available in Spark MLlib. As is often the case, actually applying the algorithms themselves is fairly simple due to Spark MLlib's convenient run method available for most algorithms. What is more challenging is to understand the algorithms and the intricacies that come with them. To this end, we will explain the three pattern mining algorithms one by one, and study how they are implemented and how to use them on toy examples. Only after having done all this will we apply these algorithms to a real-life data set of click events retrieved from http://msnbc.com. The documentation for the pattern mining algorithms in Spark can be found at https://spark.apache.org/docs/2.1.0/mllib-frequent-pattern-mining.html. It provides a good entry point with examples for users who want to dive right in.

Frequent pattern mining with FP-growth

When we introduced the frequent pattern mining problem, we also quickly discussed a strategy to address it based on the apriori principle. The approach was based on scanning the whole transaction database again and again to expensively generate pattern candidates of growing length and checking their support. We indicated that this strategy may not be feasible for very large data.

The so-called FP-growth algorithm, where FP stands for frequent pattern, provides an interesting solution to this data mining problem. The algorithm was originally described in Mining Frequent Patterns without Candidate Generation, available at https://www.cs.sfu.ca/~jpei/publications/sigmod00.pdf. We will start by explaining the basics of this algorithm and then move on to discussing its distributed version, parallel FP-growth, which was introduced in PFP: Parallel FP-Growth for Query Recommendation, found at https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34668.pdf. While Spark's implementation is based on the latter paper, it is best to first understand the baseline algorithm and extend from there.

The core idea of FP-growth is to scan the transaction database D of interest precisely once in the beginning, find all the frequent patterns of length 1, and build a special tree structure called an FP-tree from these patterns. Once this step is done, instead of working with D, we only do recursive computations on the usually much smaller FP-tree. This step is called the FP-growth step of the algorithm, since it recursively constructs trees from the subtrees of the original tree to identify patterns. We will call this procedure fragment pattern growth, which does not require us to generate candidates but is rather built on a divide-and-conquer strategy that heavily reduces the workload in each recursion step.

To be more precise, let's first define what an FP-tree is and what it looks like in an example. Recall the example database we used in the last section, shown in Table 1. Our item set consisted of the following 16 grocery items, represented by their first letter: b, c, a, e, d, f, p, m, g, i, l, o, h, j, k, s.
We also discussed the frequent items, that is, patterns of length 1; for a minimum support threshold of t = 0.6, they were given by {f, c, b, a, m, p}. In FP-growth, we first use the fact that the ordering of items does not matter for the frequent pattern mining problem; that is, we can choose the order in which to present the frequent items. We do so by ordering them by decreasing frequency. To summarize the situation, let's have a look at the following table:

| Transaction ID | Transaction | Ordered frequent items |
|---|---|---|
| 1 | a, c, d, f, g, i, m, p | f, c, a, m, p |
| 2 | a, b, c, f, l, m, o | f, c, a, b, m |
| 3 | b, f, h, j, o | f, b |
| 4 | b, c, k, s, p | c, b, p |
| 5 | a, c, e, f, l, m, n, p | f, c, a, m, p |

Table 3: Continuation of the example started with Table 1, augmenting the table by ordered frequent items.

As we can see, ordering frequent items like this already helps us to identify some structure. For instance, we see that the item set {f, c, a, m, p} occurs twice and is slightly altered once as {f, c, a, b, m}. The key idea of FP-growth is to use this representation to build a tree from the ordered frequent items that reflects the structure and interdependencies of the items in the third column of Table 3. Every FP-tree has a so-called root node that is used as a base for connecting ordered frequent items as constructed.

[Figure 1: FP-tree and header table for our frequent pattern mining running example.]

The left-hand side of Figure 1 shows a header table that we will explain and formalize in just a bit, while the right-hand side shows the actual FP-tree. For each of the ordered frequent items in our example, there is a directed path starting from the root, thereby representing it. Each node of the tree keeps track of not only the frequent item itself but also the number of paths traversed through this node. For instance, four of the five ordered frequent item sets start with the letter f and one with c. Thus, in the FP-tree, we see f: 4 and c: 1 at the top level. Another interpretation of this fact is that f is a prefix for four item sets and c for one. For another example of this sort of reasoning, let's turn our attention to the lower left of the tree, that is, to the leaf node p: 2. Two occurrences of p tell us that precisely two identical paths end here, which we already know: {f, c, a, m, p} is represented twice. This observation is interesting, as it already hints at a technique used in FP-growth: starting at the leaf nodes of the tree, or the suffixes of the item sets, we can trace back each frequent item set, and the union of all these distinct root node paths yields all the paths, an important idea for parallelization.

The header table you see on the left of Figure 1 is a smart way of storing items. Note that by the construction of the tree, a node is not the same as a frequent item; rather, items can and usually do occur multiple times, namely once for each distinct path they are part of. To keep track of items and how they relate, the header table is essentially a linked list of items; that is, each item occurrence is linked to the next by means of this table. We indicated the links for each frequent item by horizontal dashed lines in Figure 1 for illustration purposes.

With this example in mind, let's now give a formal definition of an FP-tree. An FP-tree T is a tree that consists of a root node together with frequent item prefix subtrees starting at the root and a frequent item header table.
Each node of the tree consists of a triple, namely the item name, its occurrence count, and a node link referring to the next node of the same name, or null if there is no such next node.

To quickly recap: to build T, we start by computing the frequent items for the given minimum support threshold t and then, starting from the root, insert each path represented by the sorted frequent pattern list of a transaction into the tree. Now, what do we gain from this? The most important property to consider is that all the information needed to solve the frequent pattern mining problem is encoded in the FP-tree T, because we effectively encode all co-occurrences of frequent items with repetition. Since T can also have at most as many nodes as there are occurrences of frequent items, T is usually much smaller than our original database D. This means that we have mapped the mining problem to a problem on a smaller data set, which in itself reduces the computational complexity compared with the naive approach sketched earlier.

Next, we'll discuss how to grow patterns recursively from fragments obtained from the constructed FP-tree. To do so, let's make the following observation: for any given frequent item x, we can obtain all the patterns involving x by following the node links for x, starting from the header table entry for x, and analyzing the respective subtrees. To explain how exactly, we further study our example and, starting at the bottom of the header table, analyze patterns containing p. From our FP-tree T, it is clear that p occurs in two paths: (f:4, c:3, a:3, m:3, p:2) and (c:1, b:1, p:1), following the node links for p. Now, in the first path, p occurs only twice, that is, there can be at most two total occurrences of the pattern {f, c, a, m, p} in the original database D. So, conditional on p being present, the paths involving p actually read as follows: (f:2, c:2, a:2, m:2, p:2) and (c:1, b:1, p:1). In fact, since we know we want to analyze patterns given p, we can shorten the notation a little and simply write (f:2, c:2, a:2, m:2) and (c:1, b:1). This is what we call the conditional pattern base for p. Going one step further, we can construct a new FP-tree from this conditional database. Conditioning on three occurrences of p, this new tree consists of only a single node, namely (c:3). This means that we end up with {c, p} as a single pattern involving p, apart from p itself. To have a better means of talking about this situation, we introduce the following notation: the conditional FP-tree for p is denoted by {(c:3)}|p.

To gain more intuition, let's consider one more frequent item and discuss its conditional pattern base. Continuing bottom to top and analyzing m, we again see two paths that are relevant: (f:4, c:3, a:3, m:2) and (f:4, c:3, a:3, b:1, m:1). Note that in the first path, we discard the p:2 at the end, since we have already covered the case of p. Following the same logic of reducing all other counts to the count of the item in question and conditioning on m, we end up with the conditional pattern base {(f:2, c:2, a:2), (f:1, c:1, a:1, b:1)}. The conditional FP-tree in this situation is thus given by {f:3, c:3, a:3}|m. It is now easy to see that actually every possible combination of m with each of f, c, and a forms a frequent pattern. The full set of patterns, given m, is thus {m}, {am}, {cm}, {fm}, {cam}, {fam}, {fcm}, and {fcam}.
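As a quick sanity check, the support of the largest of these patterns can be verified directly against Table 3: {f, c, a, m} is contained in transactions 1, 2, and 5, so

$$\mathrm{supp}(\{f, c, a, m\}) = \frac{|\{T_1, T_2, T_5\}|}{|D|} = \frac{3}{5} = 0.6,$$

which meets the minimum support threshold of t = 0.6, as required.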
By now, it should be clear how to continue, and we will not carry out this exercise in full but rather summarize its outcome in the following table:

| Frequent pattern | Conditional pattern base | Conditional FP-tree |
|---|---|---|
| p | {(f:2, c:2, a:2, m:2), (c:1, b:1)} | {(c:3)}\|p |
| m | {(f:2, c:2, a:2), (f:1, c:1, a:1, b:1)} | {f:3, c:3, a:3}\|m |
| b | {(f:1, c:1, a:1), (f:1), (c:1)} | null |
| a | {(f:3, c:3)} | {(f:3, c:3)}\|a |
| c | {(f:3)} | {(f:3)}\|c |
| f | null | null |

Table 4: The complete list of conditional FP-trees and conditional pattern bases for our running example.

As this derivation required a lot of attention to detail, let's take a step back and summarize the situation so far: starting from the original FP-tree T, we iterated through all the items using node links. For each item x, we constructed its conditional pattern base and its conditional FP-tree. Doing so, we used the following two properties:

- We discarded all the items following x in each potential pattern, that is, we only kept the prefix of x.
- We modified the item counts in the conditional pattern base to match the count of x.

A path modified using the latter two properties is called the transformed prefix path of x. To finally state the FP-growth step of the algorithm, we need two more fundamental observations that we have already implicitly used in the example. Firstly, the support of an item in a conditional pattern base is the same as that of its representation in the original database. Secondly, starting from a frequent pattern x in the original database and an arbitrary set of items y, we know that xy is a frequent pattern if and only if y is. These two facts can easily be derived in general, but they should be clearly demonstrated by the preceding example. What this means is that we can completely focus on finding patterns in conditional pattern bases, as joining them with frequent patterns again yields a pattern, and this way, we can find all the patterns. This mechanism of recursively growing patterns by computing conditional pattern bases is therefore called pattern growth, which is why FP-growth bears its name. With all this in mind, we can now summarize the FP-growth procedure in pseudocode, as follows:

```
def fpGrowth(tree: FPTree, i: Item):
    if (tree consists of a single path P) {
        compute transformed prefix path P' of P
        return all combinations p in P' joined with i
    } else {
        for each item in tree {
            newI = i joined with item
            construct conditional pattern base and conditional FP-tree newTree
            call fpGrowth(newTree, newI)
        }
    }
```

With this procedure, we can summarize our description of the complete FP-growth algorithm as follows:

1. Compute frequent items from D and compute the original FP-tree T from them (FP-tree computation).
2. Run fpGrowth(T, null) (FP-growth computation).
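To connect the pseudocode back to the worked example: the conditional FP-tree {f:3, c:3, a:3}|m consists of a single path, so the first branch of fpGrowth applies and simply enumerates all combinations of {f, c, a} joined with m, that is,

$$\binom{3}{0} + \binom{3}{1} + \binom{3}{2} + \binom{3}{3} = 2^3 = 8$$

patterns: exactly the eight patterns {m}, {am}, {cm}, {fm}, {cam}, {fam}, {fcm}, and {fcam} listed earlier.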
Having understood the base construction, we can now proceed to discuss a parallel extension of base FP-growth, that is, the basis of Spark's implementation. Parallel FP-growth, or PFP for short, is a natural evolution of FP-growth for parallel computing engines such as Spark. It addresses the following problems with the baseline algorithm:

- Distributed storage: For frequent pattern mining, our database D may not fit into memory, which can already render FP-growth in its original form inapplicable. Spark helps in this regard for obvious reasons.
- Distributed computing: With distributed storage in place, we will have to take care of suitably parallelizing all the steps of the algorithm as well, and PFP does precisely this.
- Adequate support values: When dealing with finding frequent patterns, we usually do not want to set the minimum support threshold t too high, so as to find interesting patterns in the long tail. However, a small t might prevent the FP-tree from fitting into memory for a sufficiently large D, which would force us to increase t. PFP successfully addresses this problem as well, as we will see.

The basic outline of PFP, with Spark for implementation in mind, is as follows:

- Sharding: Instead of storing our database D on a single machine, we distribute it to multiple partitions. Regardless of the particular storage layer, using Spark we can, for instance, create an RDD to load D.
- Parallel frequent item count: The first step of computing frequent items of D can be naturally performed as a map-reduce operation on an RDD.
- Building groups of frequent items: The set of frequent items is divided into a number of groups, each with a unique group ID.
- Parallel FP-growth: The FP-growth step is split into two steps to leverage parallelism. Map phase: the output of a mapper is a pair comprising the group ID and the corresponding transaction. Reduce phase: reducers collect data according to the group ID and carry out FP-growth on these group-dependent transactions.
- Aggregation: The final step in the algorithm is the aggregation of results over group IDs.

In light of having already spent a lot of time with FP-growth on its own, instead of going into too many implementation details of PFP in Spark, let's see how to use the actual algorithm on the toy example that we have used throughout:

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val transactions: RDD[Array[String]] = sc.parallelize(Array(
  Array("a", "c", "d", "f", "g", "i", "m", "p"),
  Array("a", "b", "c", "f", "l", "m", "o"),
  Array("b", "f", "h", "j", "o"),
  Array("b", "c", "k", "s", "p"),
  Array("a", "c", "e", "f", "l", "m", "n", "p")
))

val fpGrowth = new FPGrowth()
  .setMinSupport(0.6)
  .setNumPartitions(5)
val model = fpGrowth.run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
```

The code is straightforward. We load the data into transactions and initialize Spark's FPGrowth implementation with a minimum support value of 0.6 and 5 partitions. This returns a model that we can run on the transactions constructed earlier. Doing so gives us access to the patterns, or frequent item sets, for the specified minimum support by calling freqItemsets, which, printed in a formatted way, yields 18 patterns in total: the six frequent single items {f}, {c}, {a}, {b}, {m}, and {p}; the pairs {fc}, {fa}, {fm}, {ca}, {cm}, {cp}, and {am}; the triples {fca}, {fcm}, {fam}, and {cam}; and the quadruple {fcam} (compare Table 4).

Info: Recall that we have defined transactions as sets, and we often call them item sets. This means that within such an item set, a particular item can only occur once, and FPGrowth depends on this. If we were to replace, for instance, the third transaction in the preceding example with Array("b", "b", "h", "j", "o"), calling run on these transactions would throw an error message. We will see later on how to deal with such situations.

The above is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. To learn how to fully implement and deploy pattern mining applications in Spark, among other machine learning tasks using Spark, check out the book.


Pattern Mining using Spark MLlib - Part 1

Aarthi Kumaraswamy
03 Nov 2017
15 min read
Note: The following two-part tutorial is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava.

When collecting real-world data, there are usually very intricate and highly complex relationships to observe between individual measures or events. The guiding example for this tutorial is the observation of click events that users generate on a website and its subdomains. Such data is both interesting and challenging to investigate. It is interesting, as there are usually many patterns that groups of users show in their browsing behavior and certain rules they might follow. Gaining insights about user groups in general is of interest, at least for the company running the website, and might be the focus of their data science team. Methodology aside, putting a production system in place that can detect patterns in real time, for instance, to find malicious behavior, can be very challenging technically. It is immensely valuable to be able to understand and implement both the algorithmic and technical sides.

In this tutorial, we will look into doing pattern mining in Spark. The tutorial is split up into two main sections. In the first, we will introduce the three pattern mining algorithms that Spark currently comes with and then apply them to an interesting dataset. In particular, you will learn the following from this two-part tutorial:

- The basic principles of frequent pattern mining.
- Useful and relevant data formats for applications.
- Understanding and comparing the three pattern mining algorithms available in Spark, namely FP-growth, association rules, and prefix span.

Frequent pattern mining

When presented with a new data set, a natural sequence of questions is: What kind of data do we look at; that is, what structure does it have? Which observations in the data can be found frequently; that is, which patterns or rules can we identify within the data? How do we assess what is frequent; that is, what are good measures of relevance and how do we test for it? On a very high level, frequent pattern mining addresses precisely these questions. While it's very easy to dive head first into more advanced machine learning techniques, these pattern mining algorithms can be quite informative and help build an intuition about the data.

To introduce some of the key notions of frequent pattern mining, let's first consider a somewhat prototypical example for such cases, namely shopping carts. The study of customers being interested in and buying certain products has been of prime interest to marketers around the globe for a very long time. While online shops certainly do help in further analyzing customer behavior, for instance, by tracking the browsing data within a shopping session, the question of what items have been bought and what patterns in buying behavior can be found applies to purely offline scenarios as well. We will see a more involved example of clickstream data accumulated on a website soon; for now, we will work under the assumption that the only events we can track are the actual payment transactions of an item. Just this data, for instance, for grocery shopping carts in supermarkets or online, leads to quite a few interesting questions, and we will focus mainly on the following three.

First, which items are frequently bought together? For instance, there is anecdotal evidence suggesting that beer and diapers are often bought together in one shopping session.
Finding patterns of products that often go together may, for instance, allow a shop to physically place these products closer to each other for an increased shopping experience or promotional value, even if they don't belong together at first sight. In the case of an online shop, this sort of analysis might be the basis for a simple recommender system.

Second, based on the previous question, are there any interesting implications or rules to observe in shopping behavior? Continuing with the shopping cart example, can we establish associations such as: if bread and butter have been bought, we also often find cheese in the shopping cart? Finding such association rules can be of great interest, but we also need more clarification of what we consider to be often, that is, what frequent means.

Third, note that, so far, our shopping carts were simply considered a bag of items without additional structure. At least in the online shopping scenario, we can endow data with more information. One aspect we will focus on is the sequentiality of items; that is, we will take note of the order in which the products have been placed into the cart. With this in mind, similar to the first question, one might ask, which sequences of items can often be found in our transaction data? For instance, larger electronic devices bought might be followed up by additional utility items.

The reason we focus on these three questions in particular is that Spark MLlib comes with precisely three pattern mining algorithms that roughly correspond to the aforementioned questions by their ability to answer them. Specifically, we will carefully introduce FP-growth, association rules, and prefix span, in that order, to address these problems and show how to solve them using Spark. Before doing so, let's take a step back and formally introduce the concepts we have been motivating so far, alongside a running example. We will refer to the preceding three questions throughout the following subsection.

Pattern mining terminology

We will start with a set of items I = {a1, ..., an}, which serves as the base for all the following concepts. A transaction T is just a set of items in I, and we say that T is a transaction of length l if it contains l items. A transaction database D is a database of transaction IDs and their corresponding transactions.

To give a concrete example, consider the following situation. Assume that the full item set to shop from is given by I = {bread, cheese, ananas, eggs, donuts, fish, pork, milk, garlic, ice cream, lemon, oil, honey, jam, kale, salt}. Since we will look at a lot of item subsets, to make things more readable later on, we will simply abbreviate these items by their first letter, that is, we'll write I = {b, c, a, e, d, f, p, m, g, i, l, o, h, j, k, s}. Given these items, a small transaction database D could look as follows:

| Transaction ID | Transaction |
|---|---|
| 1 | a, c, d, f, g, i, m, p |
| 2 | a, b, c, f, l, m, o |
| 3 | b, f, h, j, o |
| 4 | b, c, k, s, p |
| 5 | a, c, e, f, l, m, n, p |

Table 1: A small shopping cart database with five transactions

Frequent pattern mining problem

Given the definition of a transaction database, a pattern P is a transaction contained in the transactions in D, and the support, supp(P), of the pattern is the number of transactions for which this is true, divided or normalized by the number of transactions in D:

supp(s) = suppD(s) = |{ s' ∈ D | s < s' }| / |D|

We use the < symbol to denote s as a subpattern of s' or, conversely, call s' a superpattern of s.
Note that in the literature, you will sometimes also find a slightly different version of support that does not normalize the value. For example, the pattern {a, c, f} can be found in transactions 1, 2, and 5. This means that {a, c, f} is a pattern of support 0.6 in our database D of five transactions. Support is an important notion, as it gives us a first example of measuring the frequency of a pattern, which, in the end, is what we are after. In this context, for a given minimum support threshold t, we say P is a frequent pattern if and only if supp(P) is at least t. In our running example, the frequent patterns of length 1 and minimum support 0.6 are {a}, {b}, {c}, {p}, and {m} with support 0.6, and {f} with support 0.8. In what follows, we will often drop the brackets for items or patterns and write f instead of {f}, for instance.

Given a minimum support threshold, the problem of finding all the frequent patterns is called the frequent pattern mining problem, and it is, in fact, the formalized version of the aforementioned first question. Continuing with our example, we have found all frequent patterns of length 1 for t = 0.6 already. How do we find longer patterns? On a theoretical level, given unlimited resources, this is not much of a problem, since all we need to do is count the occurrences of items. On a practical level, however, we need to be smart about how we do so to keep the computation efficient. Especially for databases large enough for Spark to come in handy, it can be very computationally intense to address the frequent pattern mining problem. One intuitive way to go about this is as follows:

1. Find all the frequent patterns of length 1, which requires one full database scan. This is how we started in our preceding example.
2. For patterns of length 2, generate all the combinations of frequent 1-patterns, the so-called candidates, and test if they exceed the minimum support by doing another scan of D. Importantly, we do not have to consider combinations involving infrequent patterns, since patterns containing infrequent patterns cannot become frequent. This rationale is called the apriori principle.
3. For longer patterns, continue this procedure iteratively until there are no more patterns left to combine.

This algorithm, using a generate-and-test approach to pattern mining and utilizing the apriori principle to bound combinations, is called the apriori algorithm. There are many variations of this baseline algorithm, all of which share similar drawbacks in terms of scalability. For instance, multiple full database scans are necessary to carry out the iterations, which might already be prohibitively expensive for huge datasets. On top of that, generating the candidates themselves is already expensive, but computing their combinations might simply be infeasible. In the next section, we will see how a parallel version of an algorithm called FP-growth, available in Spark, can overcome most of the problems just discussed.
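To make the generate-and-test idea concrete, here is a minimal, non-distributed sketch of the apriori algorithm run against the toy database from Table 1. This is purely illustrative and is not how Spark implements pattern mining:

```scala
object AprioriToy extends App {
  // The transaction database from Table 1, items abbreviated to single letters.
  val D: Seq[Set[Char]] = Seq(
    "acdfgimp".toSet, "abcflmo".toSet, "bfhjo".toSet,
    "bcksp".toSet, "aceflmnp".toSet)
  val t = 0.6 // minimum support threshold

  def supp(p: Set[Char]): Double =
    D.count(tx => p.subsetOf(tx)).toDouble / D.size

  // Level-wise generate-and-test: join frequent k-patterns into (k+1)-item
  // candidates and keep those meeting the threshold. The apriori principle
  // ensures no superset of an infrequent pattern is ever generated this way.
  var frequent: Set[Set[Char]] =
    D.flatten.toSet.map((c: Char) => Set(c)).filter(p => supp(p) >= t)
  var all = frequent
  while (frequent.nonEmpty) {
    val candidates = for {
      p <- frequent
      q <- frequent
      u = p union q
      if u.size == p.size + 1
    } yield u
    frequent = candidates.filter(p => supp(p) >= t)
    all = all ++ frequent
  }
  all.toSeq.sortBy(_.size).foreach(p =>
    println(p.toSeq.sorted.mkString("{", ", ", "}") + " -> " + supp(p)))
}
```

Running this on the five toy transactions prints the frequent patterns level by level, ending with the length-4 pattern {a, c, f, m}, and each iteration corresponds to one of the full database scans criticized above.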
The association rule mining problem

To advance our general introduction of concepts, let's next turn to association rules, as first introduced in Mining Association Rules between Sets of Items in Large Databases, available at http://arbor.ee.ntu.edu.tw/~chyun/dmpaper/agrama93.pdf. In contrast to solely counting the occurrences of items in our database, we now want to understand the rules or implications of patterns. That is, given a pattern P1 and another pattern P2, we want to know whether P2 is frequently present whenever P1 can be found in D, and we denote this by writing P1 ⇒ P2. To make this more precise, we need a concept of rule frequency similar to that of support for patterns, namely confidence. For a rule P1 ⇒ P2, confidence is defined as follows:

conf(P1 ⇒ P2) = supp(P1 ∪ P2) / supp(P1)

This can be interpreted as the conditional support of P2 given P1; that is, if we were to restrict D to all the transactions supporting P1, the support of P2 in this restricted database would be equal to conf(P1 ⇒ P2). We call P1 ⇒ P2 a rule in D if it exceeds a minimum confidence threshold t, just as in the case of frequent patterns. Finding all the rules for a confidence threshold represents the formal answer to the second question, association rule mining. Moreover, in this situation, we call P1 the antecedent and P2 the consequent of the rule. In general, there is no restriction imposed on the structure of either the antecedent or the consequent. However, in what follows, we will assume that the consequent has length 1, for simplicity.

In our running example, the pattern {f, m} occurs three times, while {f, m, p} is present in just two cases, which means that the rule {f, m} ⇒ {p} has confidence 2/3. If we set the minimum confidence threshold to t = 0.6, we can easily check that the following association rules with an antecedent and consequent of length 1 are valid for our case:

- {a} ⇒ {c}, {a} ⇒ {f}, {a} ⇒ {m}, {a} ⇒ {p}
- {c} ⇒ {a}, {c} ⇒ {f}, {c} ⇒ {m}, {c} ⇒ {p}
- {f} ⇒ {a}, {f} ⇒ {c}, {f} ⇒ {m}
- {m} ⇒ {a}, {m} ⇒ {c}, {m} ⇒ {f}, {m} ⇒ {p}
- {p} ⇒ {a}, {p} ⇒ {c}, {p} ⇒ {f}, {p} ⇒ {m}

From the preceding definition of confidence, it should now be clear that it is relatively straightforward to compute the association rules once we have the support values of all the frequent patterns. In fact, as we will soon see, Spark's implementation of association rules is based on calculating frequent patterns upfront.
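Spark exposes exactly this two-step approach through its FP-growth model. As a brief preview of what part 2 covers, and assuming the toy database from Table 1 is available as an RDD of transactions (with sc an active SparkContext), rules above a minimum confidence can be derived from a fitted model roughly as follows; note that generateAssociationRules only produces rules whose consequent has length 1:

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// The toy database from Table 1 as an RDD of transactions.
val transactions: RDD[Array[String]] = sc.parallelize(Array(
  Array("a", "c", "d", "f", "g", "i", "m", "p"),
  Array("a", "b", "c", "f", "l", "m", "o"),
  Array("b", "f", "h", "j", "o"),
  Array("b", "c", "k", "s", "p"),
  Array("a", "c", "e", "f", "l", "m", "n", "p")))

// Compute frequent patterns first, then derive rules above the
// minimum confidence threshold from them.
val model = new FPGrowth().setMinSupport(0.6).run(transactions)
model.generateAssociationRules(0.6).collect().foreach { rule =>
  println(rule.antecedent.mkString("{", ",", "}") + " => " +
    rule.consequent.mkString("{", ",", "}") + ", conf: " + rule.confidence)
}
```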
Info: At this point, it should be noted that while we will restrict ourselves to the measures of support and confidence, there are many other interesting criteria available that we can't discuss in this book; for instance, the concepts of conviction, leverage, or lift. For an in-depth comparison of the other measures, refer to http://www.cse.msu.edu/~ptan/papers/IS.pdf.

The sequential pattern mining problem

Let's move on to formalizing the third and last pattern mining question we tackle in this chapter, by looking at sequences in more detail. A sequence is different from the transactions we looked at before in that the order now matters. For a given item set I, a sequence s in I of length l is defined as follows:

s = <s1, s2, ..., sl>

Here, each individual si is a concatenation of items, that is, si = (ai1 ... aim), where aij is an item in I. Note that we do care about the order of the sequence items si, but not about the internal ordering of the individual aij within si. A sequence database S consists of pairs of sequence IDs and sequences, analogous to what we had before. An example of such a database can be found in the following table, in which the letters represent the same items as in our previous shopping cart example:

| Sequence ID | Sequence |
|---|---|
| 1 | <a(abc)(ac)d(cf)> |
| 2 | <(ad)c(bc)(ae)> |
| 3 | <(ef)(ab)(df)cb> |
| 4 | <eg(af)cbc> |

Table 2: A small sequence database with four short sequences.

In the example sequences, note the round brackets used to group individual items into a sequence item. Also note that we drop these redundant braces if the sequence item consists of a single item. Importantly, the notion of a subsequence requires a little more care than for unordered structures. We call u = <u1, ..., un> a subsequence of s = <s1, ..., sl> and write u < s if there are indices 1 ≤ i1 < i2 < ... < in ≤ l so that we have the following:

u1 < si1, ..., un < sin

Here, the < signs in the last line mean that uj is a subpattern of sij. Roughly speaking, u is a subsequence of s if all the elements of u are subpatterns of s in their given order. Equivalently, we call s a supersequence of u. In the preceding example, we see that <a(ab)ac> and <a(cb)(ac)dc> are examples of subsequences of <a(abc)(ac)d(cf)>, and that <(fa)c> is an example of a subsequence of <eg(af)cbc>.

With the help of the notion of supersequences, we can now define the support of a sequence s in a given sequence database S as follows:

suppS(s) = supp(s) = |{ s' ∈ S | s < s' }| / |S|

Note that, structurally, this is the same definition as for plain unordered patterns, but the < symbol means something else here, namely subsequence. As before, we drop the database subscript in the notation of support if the information is clear from the context. Equipped with a notion of support, the definition of sequential patterns follows the previous definition completely analogously. Given a minimum support threshold t, a sequence s in S is said to be a sequential pattern if supp(s) is greater than or equal to t. The formalization of the third question is called the sequential pattern mining problem, that is, finding the full set of sequences that are sequential patterns in S for a given threshold t.

Even in our little example with just four sequences, it can already be challenging to manually inspect all the sequential patterns. To give just one example of a sequential pattern of support 1.0, a subsequence of length 2 of all four sequences is <ac>. Finding all the sequential patterns is an interesting problem, and we will learn about the so-called prefix span algorithm that Spark employs to address it in the following section.

Next time, in part 2 of the tutorial, we will see how to use Spark to solve the above three pattern mining problems using the algorithms introduced. If you enjoyed this tutorial, an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava, check out the book for more.


Building a classification system with Decision Trees in Apache Spark 2.0

Wilson D'souza
02 Nov 2017
9 min read
Note: In this article by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei, from their book Apache Spark 2.x Machine Learning Cookbook, we shall explore how to build a classification system with decision trees using the Spark MLlib library. The code and data files are available at the end of the article.

A decision tree in Spark is a parallel algorithm designed to fit and grow a single tree into a dataset that can be categorical (classification) or continuous (regression). It is a greedy algorithm based on stumping (binary splits, and so on) that partitions the solution space recursively while attempting to select the best split among all possible splits using information gain maximization (entropy based). Apache Spark provides a good mix of decision tree based algorithms fully capable of taking advantage of parallelism in Spark. The implementation ranges from the straightforward Single Decision Tree (the CART type algorithm) to Ensemble Trees, such as Random Forest Trees and GBT (Gradient Boosted Trees). They all have variant flavors to facilitate classification (for example, categorical, such as height = short/tall) or regression (for example, continuous, such as height = 2.5 meters).

Getting and preparing real-world medical data for exploring Decision Trees in Spark 2.0

To explore the real power of decision trees, we use a medical dataset that exhibits real-life non-linearity with a complex error surface. The Wisconsin Breast Cancer dataset was obtained from the University of Wisconsin Hospital from Dr. William H. Wolberg. The dataset was gathered periodically as Dr. Wolberg reported his clinical cases.

The dataset can be retrieved from multiple sources, and is available directly from the University of California Irvine's web server:

http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

The data is also available from the University of Wisconsin's web server:

ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/cancer1/datacum

The dataset currently contains clinical cases from 1989 to 1991. It has 699 instances, with 458 classified as benign tumors and 241 as malignant cases. Each instance is described by nine attributes with an integer value in the range of 1 to 10 and a binary class label. Out of the 699 instances, there are 16 instances that are missing some attributes. We will remove these 16 instances from memory and process the rest (in total, 683 instances) for the model calculations.

The sample raw data looks like the following:

```
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
...
```
The attribute information is as follows:

| # | Attribute | Domain |
|---|---|---|
| 1 | Sample code number | ID number |
| 2 | Clump Thickness | 1 - 10 |
| 3 | Uniformity of Cell Size | 1 - 10 |
| 4 | Uniformity of Cell Shape | 1 - 10 |
| 5 | Marginal Adhesion | 1 - 10 |
| 6 | Single Epithelial Cell Size | 1 - 10 |
| 7 | Bare Nuclei | 1 - 10 |
| 8 | Bland Chromatin | 1 - 10 |
| 9 | Normal Nucleoli | 1 - 10 |
| 10 | Mitoses | 1 - 10 |
| 11 | Class | 2 for benign, 4 for malignant |

Presented in the correct columns, the data looks like the following:

| ID Number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |
| 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 2 |
| 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 2 |
| 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2 |
| 1035283 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 2 |
| 1036172 | 2 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2 |
| 1041801 | 5 | 3 | 3 | 3 | 2 | 3 | 4 | 4 | 1 | 4 |
| 1043999 | 1 | 1 | 1 | 1 | 2 | 3 | 3 | 1 | 1 | 2 |
| 1044572 | 8 | 7 | 5 | 10 | 7 | 9 | 5 | 5 | 4 | 4 |

...

We will now use the breast cancer data and classification to demonstrate the decision tree implementation in Spark. We will use IG (information gain) and Gini to show how to use the facilities already provided by Spark to avoid redundant coding. This exercise attempts to fit a single tree using binary classification to train and predict the label (benign (0.0) and malignant (1.0)) for the dataset.

Implementing Decision Trees in Apache Spark 2.0

1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.

2. Set up the package location where the program will reside:

```scala
package spark.ml.cookbook.chapter10
```

3. Import the necessary packages for the Spark context to get access to the cluster, and Log4j.Logger to reduce the amount of output produced by Spark:

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
```

4. Create Spark's configuration and the Spark session so we can have access to the cluster:

```scala
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("MyDecisionTreeClassification")
  .config("spark.sql.warehouse.dir", ".")
  .getOrCreate()
```

5. We read in the original raw data file:

```scala
val rawData = spark.sparkContext.textFile("../data/sparkml2/chapter10/breast-cancer-wisconsin.data")
```

6. We pre-process the dataset:

```scala
val data = rawData.map(_.trim)
  .filter(text => !(text.isEmpty || text.startsWith("#") || text.indexOf("?") > -1))
  .map { line =>
    val values = line.split(',').map(_.toDouble)
    val slicedValues = values.slice(1, values.size)
    val featureVector = Vectors.dense(slicedValues.init)
    val label = values.last / 2 - 1
    LabeledPoint(label, featureVector)
  }
```

First, we trim the line and remove any empty spaces. Once the line is ready for the next step, we remove the line if it's empty, or if it contains missing values ("?"). After this step, the 16 rows with missing data will be removed from the dataset in memory. We then read the comma-separated values into an RDD. Since the first column in the dataset only contains the instance's ID number, it is better to remove this column from the real calculation.
We slice out the ID with the following command, which removes the first column from the values:

```scala
val slicedValues = values.slice(1, values.size)
```

We then put the rest of the numbers into a dense vector. Since the Wisconsin Breast Cancer dataset's classifier is either benign (last column value = 2) or malignant (last column value = 4), we convert that value using the following command:

```scala
val label = values.last / 2 - 1
```

So the benign case value 2 is converted to 0, and the malignant case value 4 is converted to 1, which will make the later calculations much easier. We then put the preceding row into a labeled point:

```
Raw data:       1000025,5,1,1,1,2,1,3,1,1,2
Processed data: 5,1,1,1,2,1,3,1,1,0
Labeled point:  (0.0, [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0])
```

7. We verify the raw data count and the processed data count:

```scala
println(rawData.count())
println(data.count())
```

And you will see the following on the console:

```
699
683
```

8. We split the whole dataset into training data (70%) and test data (30%) randomly. Please note that the random split will generate around 211 test data points. It is approximately, but not exactly, 30% of the dataset:

```scala
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
```

9. We define a metrics calculation function, which utilizes Spark's MulticlassMetrics:

```scala
def getMetrics(model: DecisionTreeModel, data: RDD[LabeledPoint]): MulticlassMetrics = {
  val predictionsAndLabels = data.map(example =>
    (model.predict(example.features), example.label)
  )
  new MulticlassMetrics(predictionsAndLabels)
}
```

This function reads in the model and the test dataset, and creates a metrics object which contains the confusion matrix mentioned earlier. It also contains the model accuracy, which is one of the indicators for the classification model.

10. We define an evaluate function, which can take some tunable parameters for the decision tree model and does the training for the dataset:

```scala
def evaluate(
    trainingData: RDD[LabeledPoint],
    testData: RDD[LabeledPoint],
    numClasses: Int,
    categoricalFeaturesInfo: Map[Int, Int],
    impurity: String,
    maxDepth: Int,
    maxBins: Int): Unit = {
  val model = DecisionTree.trainClassifier(trainingData, numClasses,
    categoricalFeaturesInfo, impurity, maxDepth, maxBins)
  val metrics = getMetrics(model, testData)
  println("Using Impurity :" + impurity)
  println("Confusion Matrix :")
  println(metrics.confusionMatrix)
  println("Decision Tree Accuracy: " + metrics.precision)
  println("Decision Tree Error: " + (1 - metrics.precision))
}
```

The evaluate function reads in several parameters, including the impurity type (Gini or entropy for the model), and generates the metrics for evaluation.

11. We set the following parameters:

```scala
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val maxDepth = 5
val maxBins = 32
```

Since we only have benign (0.0) and malignant (1.0), we set numClasses to 2. The other parameters are tunable, and some of them are algorithm stopping criteria.
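As an aside, once a tree has been trained, you can also inspect the learned split rules directly. The evaluate function above discards the model, but training one directly and printing it via DecisionTreeModel's toDebugString is a quick way to see the tree structure (a sketch, reusing the variables defined above):

```scala
// Train a single tree directly so we can inspect it; toDebugString prints
// the learned split rules level by level.
val inspectModel = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, "gini", maxDepth, maxBins)
println(inspectModel.toDebugString)
```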
We evaluate the Gini impurity first:

evaluate(trainingData, testData, numClasses, categoricalFeaturesInfo, "gini", maxDepth, maxBins)

From the console output:

Using Impurity :gini
Confusion Matrix :
115.0  5.0
3.0    88.0
Decision Tree Accuracy: 0.9620853080568721
Decision Tree Error: 0.03791469194312791

To interpret the preceding confusion matrix: accuracy is equal to (115 + 88) / 211 over all test cases, and error is equal to 1 - accuracy.

We evaluate the entropy impurity next:

evaluate(trainingData, testData, numClasses, categoricalFeaturesInfo, "entropy", maxDepth, maxBins)

From the console output:

Using Impurity :entropy
Confusion Matrix :
116.0  4.0
9.0    82.0
Decision Tree Accuracy: 0.9383886255924171
Decision Tree Error: 0.06161137440758291

To interpret the preceding confusion matrix: accuracy is equal to (116 + 82) / 211 over all test cases, and error is equal to 1 - accuracy.

We then close the program by stopping the session:

spark.stop()

How it works...
The dataset is a bit more complex than usual but, apart from some extra steps, parsing it remains the same as in other recipes presented in previous chapters. The parsing takes the data in its raw form and turns it into an intermediate format, which ends up as a LabeledPoint data structure, common in Spark ML schemes:

Raw data:       1000025,5,1,1,1,2,1,3,1,1,2
Processed data: 5,1,1,1,2,1,3,1,1,0
LabeledPoint:   (0.0, [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0])

We use DecisionTree.trainClassifier() to train the classifier tree on the training set. We follow that by examining the various impurity and confusion matrix measurements to demonstrate how to measure the effectiveness of a tree model. The reader is encouraged to look at the output and consult additional machine learning books to understand the concepts of the confusion matrix and impurity measurement, in order to master Decision Trees and their variations in Spark.

There's more...
To visualize it better, we included a sample decision tree workflow in Spark, which reads the data into Spark first. In our case, we create the RDD from the file. We then split the dataset into training data and test data using a random sampling function. After the dataset is split, we use the training dataset to train the model, followed by the test data to test the accuracy of the model. A good model should have a meaningful accuracy value (close to 1). The following figure depicts the workflow:

[Figure: the decision tree workflow, from data ingestion through random split, training, and testing]

A sample tree was generated based on the Wisconsin Breast Cancer dataset. The red spots represent malignant cases, and the blue ones the benign cases. We can examine the tree visually in the following figure:

[Figure: a sample decision tree built from the Wisconsin Breast Cancer dataset]

[box type="download" align="" class="" width=""]Download the code and data files here: classification system with Decision Trees in Apache Spark_excercise files[/box] If you liked this article, please be sure to check out Apache Spark 2.0 Machine Learning Cookbook which consists of this article and many more useful techniques on implementing machine learning solutions with the MLlib library in Apache Spark 2.0.
(13*3)+ Halloween costume ideas for Data science nerds

Packt Editorial Staff
31 Oct 2017
14 min read
Are you a data scientist, a machine learning engineer, an AI researcher or simply a data enthusiast? Channel the inner data science nerd within you with these geeky ideas for your Halloween costumes! The Data Science Spectrum Don't know what to go as to this evening's party because you've been busy cleaning that terrifying data? Don’t worry, here are some easy-to-put-together Halloween costume ideas just for you. [dropcap]1[/dropcap] Big Data Go as Baymax, the healthcare robot, (who can also turn into battle mode when required). Grab all white clothes that you have. Stuff your tummy with some pillows and wear a white mask with cutouts for eyes. You are all ready to save the world. In fact, convince a friend or your brother to go as Hiro! [dropcap]2[/dropcap] A.I. agent Enter as Agent Smith, the AI antagonist, this Halloween. Lure everyone with your bold black suit paired with a white shirt and a black tie. A pair of polarized sunglasses would replicate you as the AI agent. Capture the crowd by being the most intelligent and cold-hearted personality of all. [dropcap]3[/dropcap] Data Miner Put on your dungaree with a tee. Fix a flashlight atop your cap. Grab a pickaxe from the gardening toolkit, if you have one. Stripe some mud onto your face. Enter the party wheeling with loads of data boxes that you have freshly mined. You’ll definitely grab some traffic for data. Unstructured data anyone? [dropcap]4[/dropcap] Data Lake Go as a Data lake this Halloween. Simply grab any blue item from your closet. Draw some fishes, crabs, and weeds. (Use a child’s marker for that). After all, it represents the data you have. And you’re all set. [dropcap]5[/dropcap] Dark Data Unleash the darkness within your soul! Just kidding. You don’t actually have to turn to the evil side. Just coming up with your favorite black-costume character would do. Looking for inspiration? Maybe, a witch, The dark knight, or The Darth Vader. [dropcap]6[/dropcap] Cloud A fluffy, white cloud is what you need to be this Halloween. Raid your nearby drug store for loads of cotton balls. Better still, tear up that old pillow you have been meaning to throw away for a while. Use the fiber inside to glue onto an unused tee. You will be the cutest cloud ever seen. Don’t forget to carry an umbrella in case you turn grey! [dropcap]7[/dropcap] Predictive Analytics Make your own paper wizard hat with silver stars and moons pasted on it. If you can arrange for an advocate gown, it would be great. Else you could use a long black bed sheet as a cape. And most importantly, a crystal ball to show off some prediction stunts at the Halloween. [dropcap]8[/dropcap] Gradient boosting Enter Halloween as the energy booster. Wear what you want. Grab loads of empty energy drink tetra packs and stick it all over you. Place one on your head too. Wear a nameplate that says “ G-booster Energy drink”. Fuel up some weak models this Halloween. [dropcap]9[/dropcap] Cryptocurrency Wear head to toe black. In fact, paint your face black as well, like the Grim reaper. Then grab a cardboard piece. Cut out a circle, paint it orange, and then draw a gold B symbol, just like you see in a bitcoin. This Halloween costume will definitely grab you the much-needed attention just as this popular cryptocurrency. [dropcap]10[/dropcap] IoT Are you a fan of IoT and the massive popularity it has gained? Then you should definitely dress up as your web-slinging, friendly neighborhood Spiderman. Just grab a spiderman costume from any costume store and attach some handmade web slings. 
Remember to connect with people by displaying your IoT knowledge. [dropcap]11[/dropcap] Self-driving car Choose a mono-color outfit of your choice (P.S. The color you would choose for your car). Cut out four wheels and paste two on your lower calves and two on your arms. Cut out headlights too. Put on a wiper goggle. And yes you do not need a steering wheel or the brakes, clutch and the accelerator. Enter the Halloween at your own pace, go self-driving this Halloween. Bonus point: You can call yourself Bumblebee or Optimus Prime. Machine Learning and Deep learning Frameworks If machine learning or deep learning is your forte, here are some fresh Halloween costume ideas based on some of the popular frameworks in that space. [dropcap]12[/dropcap] Torch Flame up the party with a costume inspired by the fantastic four superhero, Johnny Storm a.k.a The Human Torch. Wear a yellow tee and orange slacks. Draw some orange flames on your tee. And finally, wear a flame-inspired headband. Someone is a hot machine learning library! [dropcap]13[/dropcap] TensorFlow No efforts for this one. Just arrange for a pumpkin costume, paste a paper cut-out of the TensorFlow logo and wear it as a crown. Go as the most powerful and widely popular deep learning library. You will be the star of the Halloween as you are a Google Kid. [dropcap]14[/dropcap] Caffe Go as your favorite Starbucks coffee this Halloween. Wear any of your brown dress/ tee. Draw or stick a Starbucks logo. And then add frothing to the top by bunching up a cream-colored sheet. Mamma Mia! [dropcap]15[/dropcap] Pandas Go as a Panda this Halloween! Better still go as a group of Pandas. The best option is to buy a panda costume. But if you don’t want that, wear a white tee, black slacks, black goggles and some cardboard cutouts for ears. This will make you not only the cutest animal in the party but also a top data manipulation library. Good luck finding your python in the party by the way. [dropcap]16[/dropcap] Jupyter Notebook Go as a top trending open-source web application by dressing up as the largest planet in our solar system. People would surely be intimidated by your mass and also by your computing power. [dropcap]17[/dropcap] H2O Go to Halloween as a world famous open source deep learning platform. No, no, you don’t have to go as the platform itself. Instead go as the chemical alter-ego, water. Wear all blue and then grab some leftover asymmetric, blue cloth pieces to stick at your sides. Thirsty anyone? Data Viz & Analytics Tools If you’re all about analytics and visualization, grab the attention of every data geek in your party by dressing up as your favorite data insight tools. [dropcap]18[/dropcap] Excel Grab an old white tee and paint some green horizontal stripes. You’re all ready to go as the most widely used spreadsheet. The simplest of costumes, yet the most useful - a timeless classic that never goes out of fashion. [dropcap]19[/dropcap] MatLab If you have seriously run out of all costume ideas, going out as MatLab is your only solution. Just grab a blue tablecloth. Stick or sew it with some orange curtain and throw it over your head. You’re all ready to go as the multi-paradigm numerical computing environment. [dropcap]20[/dropcap] Weka Wear a brown overall, a brown wig, and paint your face brown. Make an orange beak out of a chart paper, and wear a pair orange stockings/ socks with your trousers tucked in. You are all set to enter as a data mining bird with ML algorithms and Java under your wings. 
[dropcap]21[/dropcap] Shiny Go all Shimmery!! Get some glitter powder and put it all over you. (You’ll have a tough time removing it though). Else choose a glittery outfit, with glittery shoes, and touch-up with some glitter on your face. Let the party see the bling of R that you bring. You will be the attractive storyteller out there. [dropcap]22[/dropcap] Bokeh A colorful polka-dotted outfit and some dim lights to do the magic. You are all ready to grab the show with such a dazzle. Make sure you enter the party gates with Python. An eye-catching beauty with the beast pair. [dropcap]23[/dropcap] Tableau Enter the Halloween as one of your favorite characters from history. But there is a term and condition for this: You cannot talk or move. Enjoy your Halloween by being still. Weird, but you’ll definitely grab everyone’s eye. [dropcap]24[/dropcap] Microsoft Power BI Power up your Halloween party by entering as a data insights superhero. Wear a yellow turtleneck, a stylish black leather jacket, black pants, some mid-thigh high boots and a slick attitude. You’re ready to save your party! Data Science oriented Programming languages These hand-picked Halloween costume ideas are for you if you consider yourself a top coder. By a top coder we mean you’re all about learning new programming languages in your spare and, well, your not so spare time.   [dropcap]25[/dropcap] Python Easy peasy as the language looks, the reptile is not that easy to handle. A pair of python-printed shirt and trousers would do the job. You could be getting more people giving you candies some out of fear, other out of the ease. Definitely, go as a top trending and a go-to language which everyone loves! And yes, don’t forget the fangs. [dropcap]26[/dropcap] R Grab an eye patch and your favorite leather pants. Wear a loose white shirt with some rugged waistcoat and a sword. Here you are all decked up as a pirate for your next loot. You’ll surely thank me for giving you a brilliant Halloween idea. But yes! Don’t forget to make that Arrrr (R) noise! [dropcap]27[/dropcap] Java Go as a freshly roasted coffee bean! People in your Halloween party would be allured by your aroma. They would definitely compliment your unique idea and also the fact that you’re the most popular programming language. [dropcap]28[/dropcap] SAS March in your Halloween party up as a Special Airforce Service (SAS) agent. You would be disciplined, accurate, precise and smart. Just like the advanced software suite that goes by the same name. You would need a full black military costume, with a gas mask, some fake ammunition from a nearby toy store, and some attitude of course! [dropcap]29[/dropcap] SQL If you pride yourself on being very organized or are a stickler for the rules, you should go as SQL this Halloween. Prep-up yourself with an overall blue outfit. Spike up your hair and spray some temporary green hair color. Cut out bold letters S, Q, and L from a plain white paper and stick them on your chest. You are now ready to enter the Halloween party as the most popular database of all times. Sink in all the data that you collect this Halloween. [dropcap]30[/dropcap] Scala If Scala is your favorite programming language, add a spring to your Halloween by going as, well, a spring! Wear the brightest red that you have. Using a marker, draw some swirls around your body (You can ask your mom to help). Just remember to elucidate a 3D picture. And you’re all set. 
[dropcap]31[/dropcap] Julia If you want to make a red carpet entrance to your Halloween party, go as the Academy award-winning actress, Julia Roberts. You can even take up inspiration from her character in the 90s hit film Pretty Woman. For extra oomph, wear a pink, red, and purple necklace to highlight the Julia programming language [dropcap]32[/dropcap] Ruby Act pricey this Halloween. Be the elegant, dynamic yet simple programming language. Go blood red, wear on your brightest red lipstick, red pumps, dazzle up with all the red accessories that you have. You’ll definitely gather some secret admirers around the hall. [dropcap]33[/dropcap] Go Go as the mascot of Go, the top trending programming language. All you need is a blue mouse costume. Fear not if you don’t have one. Just wear a powder blue jumpsuit, grab a baby pink nose, and clip on a fake single, large front tooth. Ready for the party! [dropcap]34[/dropcap] Octave Go as a numerically competent programming language. And if that doesn’t sound very trendy, go as piano keys depicting an octave. You simply need to wear all white and divide your space into 8 sections. Then draw 5 horizontal black stripes. You won’t be able to do that vertically, well, because they are a big number. Here you go, you’re all set to fill the party with your melody. Fancy an AI system inspired Halloween costume? This is for you if you love the way AI works and the enigma that it has thrown around the world. This is for you if you are spellbound with AI magic. You should go dressed as one of these at your Halloween party this season. Just pick up the AI you want to look like and follow as advised. [dropcap]35[/dropcap] IBM Watson Wear a dark blue hat, a matching long overcoat, a vest and a pale blue shirt with a dark tie tucked into the vest. Complement it with a mustache and a brooding look. You are now ready to be IBM Watson at your Halloween party. [dropcap]36[/dropcap] Apple Siri If you want to be all cool and sophisticated like the Apple’s Siri, wear an alluring black turtleneck dress. Don’t forget to carry your latest iPhone and air pods. Be sure you don’t have a sore throat, in case someone needs your assistance. [dropcap]37[/dropcap] Microsoft Cortana If Microsoft Cortana is your choice of voice assistant, dress up as Cortana, the fictional synthetic intelligence character in the Halo video game series. Wear a blue bodysuit. Get a bob if you’re daring. (A wig would also do). Paint some dark blue robot like designs over your body and well, your face. And you’re all set. [dropcap]38[/dropcap] Salesforce Einstein Dress up as the world’s most famous physicist and also an AI-powered CRM. How? Just grab a white shirt, a blue pullover and a blue tie (Salesforce colors). Finish your look with a brown tweed coat, brown pants and shoes, a rugged white wig and mustache, and a deep thought on your face. [dropcap]39[/dropcap] Facebook Jarvis Get inspired by the Iron man’s Jarvis, the coolest A.I. in the Marvel universe. Just grab a plexiglass, draw some holograms and technological symbols over it with a neon marker. (Try to keep the color palette in shades of blues and reds). And fix this plexiglass in a curved fashion in front of your face by a headband. Do practice saying “Hello Mr. Stark.”  [dropcap]40[/dropcap] Amazon Echo This is also an easy one. Grab a long, black chart paper. Roll it around in a tube form around your body. Draw the Amazon symbol at the bottom with some glittery, silver sketch pen, color your hair blue, and there you go. 
If you have a girlfriend, convince her to go as Amazon Alexa. [dropcap]41[/dropcap] SAP Leonardo Put on a hat, wear a long cloak, some fake overgrown mustache, and beard. Accessorize with a color palette and a paintbrush. You will be the Leonardo da Vinci of the Halloween party. Wait a minute, don’t forget to cut out SAP initials and stick them on your cap. After all, you are entering as SAP’s very own digital revolution system. [dropcap]42[/dropcap] Intel Neon Deck the Halloween hall with a Harley Quinn costume. For some extra dramatization, roll up some neon blue lights around your head. Create an Intel logo out of some blue neon lights and wear it as your neckpiece. [dropcap]43[/dropcap] Microsoft Brainwave This one will require a DIY task. Arrange for a red and green t-shirt, cut them into a vertical half. Stitch it in such a way that the green is on the left and the red on the right. Similarly, do that with your blue and yellow pants; with yellow on the left and blue on the right. You will look like the most powerful Microsoft’s logo. Wear a skullcap with wires protruding out and a Hololens like eyewear to go with. And so, you are all ready to enter the Halloween party as Microsoft’s deep learning acceleration platform for real-time AI. [dropcap]44[/dropcap] Sophia, the humanoid Enter with all the confidence and a top-to-toe professional attire. Be ready to answer any question thrown at you with grace and without a stroke of skepticism. And to top it off, sport a clean shaved head. And there, you are all ready to blow off everyone’s mind with a mix of beauty with super intelligent brains.   Happy Halloween folks!
Building Motion Charts with Tableau

Ashwin Nair
31 Oct 2017
4 min read
[box type="info" align="" class="" width=""]The following is an excerpt from the book Tableau 10 Bootcamp, Chapter 2, Interactivity – written by Joshua N. Milligan and Donabel Santos. It offers intensive training on Data Visualization and Dashboarding with Tableau 10. In this article, we will learn how to build motion charts with Tableau.[/box] Tableau is an amazing platform for achieving incredible data discovery, analysis, and Storytelling. It allows you to build fully interactive dashboards and stories with your visualizations and insights so that you can share the data story with others. Creating Motion Charts with Tableau Let`s learn how to build motion charts with Tableau. A motion chart, as its name suggests, is a chart that displays the entire trail of changes in data over time by showing movement using the X and Y-axes. It is very much similar to the doodles in our notebooks which seem to come to life after flipping through the pages. It is amazing to see the same kind of movement in action in Tableau using the Pagesshelf. It is work that feels like play. On the Pages shelf, when you drop a field, Tableau creates a sequence of pages that filters the view for each value in that field. Tableau's page control allows us to flip pages, enabling us to see our view come to life. With three predefined speed settings, we can control the speed of the flip. The three settings include one that relates to the slowest speed, the others to the fastest speed. We can also format the marks and show the marks or trails, or both, using page control. In our viz, we have used a circle for marking each year. The circle that moves to a new position each year represents the specific country's new population value. These circles are all connected by trail lines that enable us to simulate a moving time series graph by setting the  mark and trail histories both to show in page control: Let's create an animated motion chart showing the population change over the years for a selected few countries: Open the Motion Chart worksheet and connect to the CO2 (Worldbank) data Source: Open Dimensions and drag Year to the Columns shelf. Open Measures and drag CO2 Emission to the Rows shelf. Right-click on the CO2 Emission axis, and change the title to CO2 Emission (metric tons per capita): In the Marks card, click on the dropdown to change the mark from Automatic to Circle. Open Dimensions and drag Country Name to Color in the Marks card. Also, drag Country Name to the Filter shelf from Dimensions Under the General tab of the Filter window, while the Select from list radio button is selected, select None. Select the Custom value list radio button, still under the General tab, and add China, Trinidad and Tobago, and United States: Click OK when done. This should close the Filter window. Open Dimensions and drag Year to Pages for adding a page control to the view. Click on the Show history checkbox to select it. Click on the drop-down beside Show history and perform the following steps: Select All for Marks to show history for Select Both for Show Using the Year page control, click on the forward arrow to play. This shows the change in the population of the three selected countries over the years. 
[box type="info" align="" class="" width=""]Tip -  In case you ever want to loopback the animation, you can click on the dropdown on the top-right of your page control card, and select Loop Playback:[/box] Note that Tableau Server does not support the animation effect that you see when working on motion charts with Tableau Desktop. Tableau strives for zero footprints when serving the charts and dashboards on the server so that there is no additional download to enable the functionalities. So, the play control does not work the same. No need to fret though. You can click manually on the slider and have a similar effect.  If you liked the above excerpt from the book Tableau 10 Bootcamp, check out the book to learn more data visualization techniques.
Halloween costume ideas inspired from Apache Big Data Projects

Packt Editorial Staff
30 Oct 2017
3 min read
If you are a busy person who is finding it difficult to decide on a Halloween costume for your office party tomorrow or for your kid's trick-or-treating madness, here are some geeky Halloween costume ideas that will make the inner data nerd in you proud!

Apache Hadoop
Be the cute little yellow baby elephant everyone wants to cuddle. Just grab all the yellow clothes you have. If you don't have any, borrow them. Don't forget to stuff in some mini cushions. Pop loads of candy into your mouth. And there, you're all set to go as the dominant but cutest framework! Cuteness overloaded.

Apache Hive
Be the buzz of your Halloween party by going as a top Apache data warehouse. What to wear, you ask? Hum around wearing a yellow and white striped dress or shirt. Complement your outfit with a pair of black wings, a headband with antennae, and a small pot of honey.

Apache Storm
An X-Men fan, are you? Go as Storm, the popular fictional superhero. Wear a black bodysuit (leather if possible). Drape a long cape. Put on a grey wig. And channel your inner power. Perhaps people will be able to see the powerful weather-controlling mutant in you and also recognize your ability to process streaming data in real time.

Apache Kafka
Go all-out gothic with an Apache Kafka costume. Dress in a serious black dress and gothic makeup. Don't forget your black butterfly wings and a choker necklace with linked circles. Keep asking existential questions to random people at the party to throw them off balance.

Apache Giraph
Put on a yellow tee and brown trousers, cut out some imperfect brown circles, and paste them on your tee. Put on a brown cap, and paint your ears brown. Draw some graph representations using a marker all over your hands and palms. You are now Apache Giraph.

Apache Singa
Be the blend of a flexible Apache Singa and the ferocity of a lion this Halloween! All you need is a yellow tee paired with light brown trousers. Wear a lion's wig. Grab a mascara and draw some strokes on your cheeks. Paint the tip of your nose using brown watercolour or some melted chocolate.

Apache Spark
If you have obsessed over Pokémon Go and equally love the lightning-blaze data processing speed of Apache Spark, you should definitely go as Spark, the leader of Pokémon Go's Team Instinct. Spark wears an orange hoodie, a black and yellow leather jacket, black jeans, and orange gloves. Do remember to carry your Pokémon balls in case you are challenged to a battle.

Apache Pig
A dark blue dungaree paired with a baby pink tee, a pair of white gloves, purple shoes and, yes, a baby pink chart-paper cutout of a pig's face. Wear all of this and you will look like Apache Pig. Complement the look with a wide grin when you make an entrance.

Happy Halloween folks! Watch this space for more data science themed Halloween costume ideas tomorrow.
Implementing Autoencoders using H2O

Amey Varangaonkar
27 Oct 2017
4 min read
[box type="note" align="" class="" width=""]This excerpt is taken from the book Neural Networks with R, Chapter 7, Use Cases of Neural Networks - Advanced Topics, written by Giuseppe Ciaburro and Balaji Venkateswaran. In this article, we see how R is an effective tool for neural network modelling, by implementing autoencoders using the popular H2O library.[/box] An autoencoder is an ANN used for learning without efficient coding control. The purpose of an autoencoder is to learn coding for a set of data, typically to reduce dimensionality. Architecturally, the simplest form of autoencoder is an advanced and non-recurring neural network very similar to the MLP, with an input level, an output layer, and one or more hidden layers that connect them, but with the layer outputs having the same number of input level nodes for rebuilding their inputs. In this section, we present an example of implementing Autoencoders using H2O on a movie dataset. The dataset used in this example is a set of movies and genre taken from https://grouplens.org/datasets/movielens We use the movies.csv file, which has three columns: movieId title genres There are 164,979 rows of data for clustering. We will use h2o.deeplearning to have the autoencoder parameter fix the clusters. The objective of the exercise is to cluster the movies based on genre, which can then be used to recommend similar movies or same genre movies to the users. The program uses h20.deeplearning, with the autoencoder parameter set to T: library("h2o") setwd ("c://R") #Load the training dataset of movies movies=read.csv ( "movies.csv", header=TRUE) head(movies) model=h2o.deeplearning(2:3, training_frame=as.h2o(movies), hidden=c(2), autoencoder = T, activation="Tanh") summary(model) features=h2o.deepfeatures(model, as.h2o(movies), layer=1) d=as.matrix(features[1:10,]) labels=as.vector(movies[1:10,2]) plot(d,pch=17) text(d,labels,pos=3) Now, let's go through the code: library("h2o") setwd ("c://R") These commands load the library in the R environment and set the working directory where we will have inserted the dataset for the next reading. Then we load the data: movies=read.csv( "movies.csv", header=TRUE) To visualize the type of data contained in the dataset, we analyze a preview of one of these variables: head(movies) The following figure shows the first 20 rows of the movie dataset: Now we build and train model: model=h2o.deeplearning(2:3, training_frame=as.h2o(movies), hidden=c(2), autoencoder = T, activation="Tanh") Let's analyze some of the information contained in model: summary(model) This is an extract from the results of the summary() function: In the next command, we use the h2o.deepfeatures() function to extract the nonlinear feature from an h2o dataset using an H2O deep learning model: features=h2o.deepfeatures(model, as.h2o(movies), layer=1) In the following code, the first six rows of the features extracted from the model are shown: > features DF.L1.C1 DF.L1.C2 1 0.2569208 -0.2837829 2 0.3437048 -0.2670669 3 0.2969089 -0.4235294 4 0.3214868 -0.3093819 5 0.5586608 0.5829145 6 0.2479671 -0.2757966 [9125 rows x 2 columns] Finally, we plot a diagram where we want to see how the model grouped the movies through the results obtained from the analysis: d=as.matrix(features[1:10,]) labels=as.vector(movies[1:10,2]) plot(d,pch=17) text(d,labels,pos=3) The plot of the movies, once clustering is done, is shown next. We have plotted only 100 movie titles due to space issues. 
We can see some movies placed close together, meaning they are of the same genre. The titles are clustered based on the distances between them, which reflect genre. Given the large number of titles, the individual movie names cannot be distinguished, but what appears clear is that the model has grouped the movies into three distinct groups. If you found this excerpt useful, make sure you check out the book Neural Networks with R, containing an interesting coverage of many such useful and insightful topics.
Top 5 Machine Learning Movies

Chris Key
17 Oct 2017
3 min read
Sitting in Mumbai airport at 2am can lead to some truly random conversations. Discussing the plot of Short Circuit 2 led us to thinking about this article. Here's my list of the top 5 movies featuring advanced machine learning. Short Circuit 2 [imdb] "Hey laser-lips, your momma was a snow blower!" A plucky robot who has named himself Johnny 5 returns to the screens to help build toy robots in a big city. By this point he is considered to have actual intelligence rather than artificial intelligence, however the plot of the film centres around his naivety and lack of ability to see the dark motives behind his new buddy, Oscar. We learn that intelligence can be applied anywhere, but sometimes it is the wrong place. Or right if you like stealing car stereos for "Los Locos". The Matrix Revolutions [imdb] The robots learn to balance an equation. Bet you wish you had them in your math high-school class. Also kudos to the Wachowski brothers who learnt from the machines the ability to balance the equation and released this monstrosity to even out the universe in light of the amazing first film in the trilogy. Blade Runner [imdb] “I've seen things you people wouldn't believe.” In the ultimate example of machines (see footnote) learning to emulate humanity, we struggled for 30 years to understand if Deckard was really human or a Nexus (spoilers: he is almost certainly a replicant!). It is interesting to note that when Pris and Roy are teamed up with JF Sebastian, their behaviours, aside from the occasional murder, show them to be more socially aware than their genius inventor friend. Wall-E [imdb] Disney and Pixar made a movie with no dialog for the entire first half, yet it was enthralling to watch. Without saying a single word, we see a small utility robot display a full range of emotions that we can relate to. He also demonstrates other signs of life – his need for energy and rest, and his sense of purpose is divided between his prime directive of cleaning the planet, and his passion for collecting interesting objects. Terminator 2 [imdb] “I know now why you cry, but it is something I can never do” Sarah Connor tells us that “Watching John with the machine, it was suddenly so clear. The terminator, would never stop. It would never leave him, and it would never hurt him, never shout at him, or get drunk and hit him, or say it was too busy to spend time with him. It would always be there. And it would die, to protect him.” Yet John Connor teaches the deadly robot, played by the invincible ex-Governator Arnold Schwarzenegger, how to be normal in society. No Problemo. Gimme five. Hasta La Vista, baby. Footnote - replicants aren't really machines. The replicants are genetic engineered and created by the Tyrell corporation with limited lifespans and specific abilities. For all intents and purposes, they are really organic robots.
How to mine bitcoin with your Raspberry Pi

Raka Mahesa
12 Oct 2017
5 min read
Bitcoin is big - you probably know that already. Maybe you know someone who has made a bit of money by mining it on their computer. If getting started sounds like a bit of a headache and you're not sure where to start, dig out your Raspberry Pi. It might surprise you, but the Raspberry Pi is a perfectly usable tool for Bitcoin mining.

Before we go further, let's make sure we understand all of the aspects involved. After all, Raspberry Pi and Bitcoin mining are quite advanced topics and not some technological terms that you read every day. So with that in mind, let's take a quick refresher course.

Okay, so let's start with the easier topic: Raspberry Pi. Basically, Raspberry Pi is a computer with a very, very small size, sold at a very, very low price. Despite the size and the price, Raspberry Pi is a full-fledged computer that you can use like any computer out there, and of course, this includes Bitcoin mining.

How do you mine Bitcoin?

There are two ways you can mine Bitcoin. You can either use consumer-grade, general-purpose hardware like a CPU to solve the calculations, or you can use hardware customized for mining Bitcoin. The latter are called ASIC (application-specific integrated circuit) miners. These ASIC miners can mine Bitcoin much, much more efficiently than general-purpose hardware. In fact, these days, profitable Bitcoin mining operations can only be run on those ASIC miners.

Since this post is about Bitcoin, we're going to use an ASIC miner for our mining operation. But keep in mind that there are some cryptocurrencies, like Ethereum for example, that you can't mine with an ASIC miner. Since each cryptocurrency is different, it's best to research them separately and not assume that what works with Bitcoin will also work with another cryptocurrency.

Is Raspberry Pi Bitcoin mining profitable?

The bad news is that mining Bitcoin with a Raspberry Pi isn't that profitable. As we've touched upon already, the main expense of mining Bitcoin is the cost of the electricity needed to run the hardware. This means your hardware needs to be efficient enough to earn Bitcoin that exceeds the value of your electricity costs. Unfortunately, your Raspberry Pi isn't powerful enough to deliver this sort of return.

So, why would you even want to start Raspberry Pi Bitcoin mining? Well, for one, it would make a fun side project and you'll learn a lot from doing it. And don't say it too loud, but if you have 'free' electricity (maybe you live in a dorm, for example), that could easily mean earning Bitcoin without spending much at all.

Mining Bitcoin with Raspberry Pi

Okay, enough talk, let's actually do some mining. To mine Bitcoin with a Raspberry Pi, you're going to need:

Raspberry Pi
USB Bitcoin ASIC Miner
Powered USB Hub

Having a powered USB hub is important, because a Raspberry Pi can only supply a limited amount of power to a connected USB device. Since a USB ASIC miner can draw a lot of power, using an external power source solves the power problem. Not to mention that with a USB hub, you can connect more than a single ASIC miner to the Raspberry Pi.

There are two more things to do before we can start mining. The first one is to set up a Bitcoin wallet, a place to store all the Bitcoin we're going to get. The other one is to join a Bitcoin mining pool. By joining a Bitcoin mining pool, you no longer need to single-handedly finish an entire Bitcoin block calculation to earn Bitcoin. Instead, you can earn Bitcoin by solving just a part of the calculation, since you are now working as a group.

All right, the next thing we want to set up is the mining software.
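Before we do, a quick aside on what that software will actually be doing. The 'calculations' miners race through are brute-force searches for a hash value below a network-defined target. The toy Python sketch below (our addition, purely illustrative) captures the idea with SHA-256 and an artificially easy difficulty; real Bitcoin mining double-hashes 80-byte block headers at billions of hashes per second, which is exactly why ASICs dominate:

import hashlib

def mine(block_data: str, difficulty: int) -> int:
    """Find a nonce such that sha256(block_data + nonce) starts with
    'difficulty' zero hex digits -- a toy stand-in for proof-of-work."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

print("Found nonce:", mine("example block data", difficulty=5))

In a pool, each member grinds through a different slice of the nonce space and gets paid in proportion to the 'shares' (lower-difficulty near-solutions) it submits, which is how even modest hardware earns a steady trickle.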
For this one, we're going to use BFGMiner, a popular mining software focused on mining with ASIC miners instead of CPUs/GPUs. To install BFGMiner, you need to install a couple of additional libraries on your Raspberry Pi. You can do this by executing the following commands on the LXTerminal, assuming you're using the Raspbian operating system:

sudo apt-get update
sudo apt-get install autoconf autogen libtool uthash-dev libjansson-dev libcurl4-openssl-dev libusb-dev libncurses-dev git-core -y

With the libraries set up, you can install BFGMiner by executing these lines:

git clone https://github.com/luke-jr/bfgminer.git
cd bfgminer
./autogen.sh
./configure
make

And now, to actually start the mining operation, connect BFGMiner with your mining pool account and run the application. It can be done by running the following command:

./bfgminer -o <http://pool:port> -u <username> -p <password>

And that's it! Now your Raspberry Pi will use the ASIC miner attached to it and automatically mine Bitcoin. The field of cryptocurrency is a vast one, and this little project we've just finished is nothing but a little peek into that field. There are other cryptocurrencies and other mining methods that you can use to gain profit more effectively.

About the author
Raka Mahesa is a game developer at Chocoarts (http://chocoarts.com/), who is interested in digital technology in general. Outside of work hours, he likes to work on his own projects, with Corridoom VR being his latest released game. Raka also regularly tweets as @legacy99.
Bootstrap 4 Objects, Components, Flexbox, and Layout

Packt
21 Aug 2017
14 min read
In this article by Ajdin Imsirovic, author of the book Bootstrap 4 Cookbook, we have three recipes from the book. First, we will be looking at using CSS to override Bootstrap 4 styling and create customized blockquotes. Next, we will look at how to utilize SCSS to control the number of card columns at different screen sizes. We will wrap it up with the third recipe, in which we will look at the classes that Bootstrap 4 uses to implement flex-based layouts. Specifically, we will switch the flex direction of card components, based on the screen size.

(For more resources related to this topic, see here.)

Customizing the blockquote element with CSS

In this recipe, we will examine how to use and modify Bootstrap's blockquote element. The technique we'll employ is using the :before and :after CSS pseudo-classes. We will add HTML entities to the CSS content property, and then style their position, size, and color.

Getting ready
Navigate to the recipe4 page of the chapter 3 website, and preview the final result that we are trying to achieve (its preview is available in chapter3-complete/app, after running harp server in the said folder). To get this look, we are using all the regular Bootstrap 4 CSS classes, with the addition of .bg-white, added in the preceding recipe. In this recipe, we will add custom styles to .blockquote.

How to do it...

In the empty chapter3/start/app/recipe4.ejs file, add the following code:

<div class="container mt-5">
  <h1>Chapter 3, Recipe 4:</h1>
  <p class="lead">Customize the Blockquote Element with CSS</p>
</div>

<!-- Customizing the blockquote element -->
<div class="container">
  <div class="row mt-5 pt-5">
    <div class="col-lg-12">
      <blockquote class="blockquote">
        <p>Blockquotes can go left-to-right. Lorem ipsum dolor sit amet, consectetur adipisicing elit. Repellat dolor pariatur, distinctio doloribus aliquid recusandae soluta tempore. Vero a, eum.</p>
        <footer class="blockquote-footer">Some Guy, <cite>A famous publication</cite></footer>
      </blockquote>
    </div>
    <div class="col-lg-12">
      <blockquote class="blockquote blockquote-reverse bg-white">
        <p>Blockquotes can go right-to-left. Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quisquam repellendus sequi officia nulla quaerat quo.</p>
        <footer class="blockquote-footer">Another Guy, <cite>A famous movie quote</cite></footer>
      </blockquote>
    </div>
    <div class="col-lg-12">
      <blockquote class="blockquote card-blockquote">
        <p>You can use the <code>.card-blockquote</code> class. Lorem ipsum dolor sit amet, consectetur adipisicing elit. Aliquid accusamus veritatis quasi.</p>
        <footer class="blockquote-footer">Some Guy, <cite>A reliable source</cite></footer>
      </blockquote>
    </div>
    <div class="col-12">
      <blockquote class="blockquote bg-info">
        <p>Blockquotes can go left-to-right. Lorem ipsum dolor sit amet.</p>
        <footer class="blockquote-footer">Some Guy, <cite>A famous publication</cite></footer>
      </blockquote>
    </div>
  </div>
</div>

In main-03-04.scss, add the following code:

blockquote.blockquote {
  padding: 2rem 2rem 2rem 4rem;
  margin: 2rem;
  quotes: "\201C" "\201D";
  position: relative;
}

blockquote:before {
  content: open-quote;
  font-family: Georgia, serif;
  font-size: 12rem;
  opacity: .04;
  font-weight: bold;
  position: absolute;
  top: -6rem;
  left: 0;
}

blockquote:after {
  content: close-quote;
  font-size: 12rem;
  opacity: .04;
  font-family: Georgia, serif;
  font-weight: bold;
  position: absolute;
  bottom: -11.3rem;
  right: 0;
}

In main.scss, uncomment the @import for main-03-04.scss.

Run grunt sass and harp server.

How it works...
In this recipe, we are using the regular blockquote HTML element and Bootstrap's classes for styling it. To make it look different, we primarily use the following tweaks:

Setting the blockquote.blockquote position to relative
Setting the position of the :before and :after pseudo-classes to absolute
In blockquote.blockquote, setting the padding and margin, and assigning the values for the opening and closing quotes, using CSS (ISO) encoding for the two HTML entities
Using the Georgia font to style the content property in the pseudo-classes
Setting the font-size of the pseudo-classes to a very high value and giving the font a very high opacity, so as to make it become more background-like
With absolute positioning in place, placing the quotes in the exact location is easy, using negative rem values

Controlling the number of card columns on different breakpoints with SCSS

This recipe will involve some SCSS mixins, which will alter the behavior of the card-columns component. To be able to showcase the desired effect, we will have to have a few hundred lines of compiled HTML code. This poses an issue: how do we show all that code inside a recipe? Here, Harp partials come to the rescue! Since most of the code in this recipe is repetitive, we will make a separate file. The file will contain the code needed to make a single card. Then, we will have a div with the class of card-columns, and this div will hold 20 cards, which will, in fact, be 20 calls to the single card file in our source code before compilation. This will make it easy for us to showcase how the number of cards in this card-columns div changes, based on screen width.

To see the final result, open the chapter4/complete code's app folder, and run the console (that is, bash) on it. Follow it up with the harp server command, and navigate to localhost:9000 in your browser to see the result we will achieve in this recipe. Upon opening the web page as explained in the preceding paragraph, you should see 20 cards in a varying number of columns, depending on your screen size.

Getting ready
To get acquainted with how card-columns work, navigate to the card-columns section of the Bootstrap documentation at https://v4-alpha.getbootstrap.com/components/card/#card-columns.
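For context, card-columns is built on the CSS multi-column layout module rather than on a grid. At the time of writing, Bootstrap 4's default styling amounts to roughly the following (paraphrased from the framework's source; exact values may differ between alpha releases), and this is what the recipe below overrides:

// Approximate Bootstrap 4 defaults for .card-columns (illustrative only).
.card-columns {
  @include media-breakpoint-up(sm) {
    column-count: 3;          // three masonry-style columns from 'sm' upward
    column-gap: 1.25rem;

    .card {
      display: inline-block;  // keeps each card from breaking across columns
      width: 100%;            // don't let cards exceed the column width
    }
  }
}

Because the column count stays fixed at 3 for every breakpoint above sm, overriding it per breakpoint, as we do next with media-breakpoint-only, gives much finer control.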
How to do it…

Open the currently empty file located at chapter4start/app/recipe04-07.ejs, and add the following code:

<div class="container-fluid">
  <div class="mt-5">
    <h1><%- title %></h1>
    <p><a href="https://v4-alpha.getbootstrap.com/components/card/#card-columns" target="_blank">Link to bootstrap card-columns docs</a></p>
  </div>
</div><!-- /.container-fluid -->

<div class="container-fluid mt-5 mb-5">
  <div class="card-columns">
    <!-- cards 1 to 5 -->
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <!-- cards 6 to 10 -->
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <!-- cards 11 to 15 -->
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <!-- cards 16 to 20 -->
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
    <%- partial("partial/_recipe04-07-samplecard.ejs") %>
  </div>
</div>

Open the main.scss file, and comment out all the other imports, since some of them clash with this recipe:

@import "recipe04-04.scss";
@import "./bower_components/bootstrap/scss/bootstrap.scss";
@import "./bower_components/bootstrap/scss/_mixins.scss";
@import "./bower_components/font-awesome/scss/font-awesome.scss";
@import "./bower_components/hover/scss/hover.scss";
// @import "recipe04-01.scss";
// @import "recipe04-02.scss";
// @import "recipe04-03.scss";
// @import "recipe04-05.scss";
// @import "recipe04-06.scss";
@import "recipe04-07.scss";
// @import "recipe04-08.scss";
// @import "recipe04-09.scss";
// @import "recipe04-10.scss";
// @import "recipe04-11.scss";
// @import "recipe04-12.scss";

Next, we will add the partial file with the single card code in app/partial/_recipe04-07-samplecard.ejs:

<div class="card">
  <img class="card-img-top img-fluid" src="http://placehold.it/300x250" alt="Card image description">
  <div class="card-block">
    <h4 class="card-title">Lorem ipsum dolor sit amet.</h4>
    <p class="card-text">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Officia autem, placeat dolorem sed praesentium aliquid suscipit tenetur iure perspiciatis sint?</p>
  </div>
</div>

If you are serving the files on Cloud9 IDE, then reference the placehold.it images over HTTPS so you don't have warnings appearing in the console.

Open this recipe's SCSS file, titled recipe04-07.scss, and paste the following code:

.card-columns {
  @include media-breakpoint-only(sm) {
    column-count: 2;
  }
  @include media-breakpoint-only(md) {
    column-count: 3;
  }
  @include media-breakpoint-only(lg) {
    column-count: 5;
  }
  @include media-breakpoint-only(xl) {
    column-count: 7;
  }
}

Recompile Sass and run the harp server command to view the result.

How it works…

In step 1, we added our recipe's structure in recipe04-07.ejs.
The focus in this file is the div with the class of card-columns, which holds 20 calls to the sample card partial file. In step 2, we included the SCSS file for this recipe and, to make sure that it works, we commented out the imports for all the other recipes' SCSS files. In step 3, we made our single card, as per the Bootstrap documentation. Finally, we customized the .card-columns class in our SCSS by changing the value of the column-count property using the media-breakpoint-only mixin. The media-breakpoint-only mixin takes the sm, md, lg, and xl values as its parameter. This allows us to easily change the value of the column-count property in our layouts.

Breakpoint-dependent switching of flex direction on card components

In this recipe, we will ease into using the flexbox grid in Bootstrap 4 with a simple example of switching the flex-direction property. To achieve this effect, we will use a few helper classes that enable the use of Flexbox in our recipe. To get acquainted with the way Flexbox works in Bootstrap, check out the official documentation at https://v4-alpha.getbootstrap.com/utilities/flexbox/.

Getting ready
To get started with the recipe, let's first get an idea of what we will make. Navigate to chapter8complete/app/ and run harp server. Then, preview the completed recipe at localhost:9000/recipe08-01. You should see a simple layout with four card components lined up horizontally. Now, resize the browser, either by changing the browser's window width or by pressing F12 (which will open developer tools and allow you to narrow down the viewport by adjusting the size of developer tools). At a certain breakpoint (lg, given the classes we will use), you should see the cards stacked on top of one another. That is the effect that we will achieve in this recipe.

How to do it…

Open the folder titled chapter8/start inside the source code. Open the currently empty file titled recipe08-01.ejs inside the app folder, and copy the code below into it:

<div class="container">
  <h2 class="mt-5">Recipe 08-01: Breakpoint-dependent Switching of Flex Direction on Card Components</h2>
  <p>In this recipe we'll switch DIRECTION, between a vertical (.flex-{breakpoint}-column), and a horizontal (.flex-{breakpoint}-row) stacking of cards.</p>
  <p>This recipe will introduce us to the flexbox grid in Bootstrap 4.</p>
</div><!-- /.container -->

<div class="container">
  <%- partial("partial/_card0") %>
  <%- partial("partial/_card0") %>
  <%- partial("partial/_card0") %>
  <%- partial("partial/_card0") %>
</div>

While still in the same file, find the second div with the class of container and add more classes to it, as follows:

<div class="container d-flex flex-column flex-lg-row">

Now, open the app/partial folder and copy and paste the following code into the file titled _card0.ejs:

<div class="p-3" id="card0">
  <div class="card">
    <div class="card-block">
      <h3 class="card-title">Special title treatment</h3>
      <p class="card-text">With supporting text below as a natural lead-in to additional content.</p>
      <a href="#" class="btn btn-primary">Go somewhere</a>
    </div>
  </div>
</div>

Now, run the harp server command and preview the result at localhost:9000/recipe08-01, inside the chapter8start folder. Resize the browser window to see the stacking of card components at smaller resolutions.

How it works…

To start discussing how this recipe works, let's first do a little exercise. In the file titled recipe08-01, inside the chapter8start folder, locate the first div with the container class.
Add the class of d-flex to this div, so that this section of code now looks like this:

<div class="container d-flex">

Save the file and refresh the page in your browser. You should see that adding the helper class of d-flex to our first container has completely changed the way this container is displayed. What has happened is that our recipe's heading and the two paragraphs (which are all inside the first container div) are now sitting on the same flex row. The reason for this behavior is the addition of Bootstrap's utility class of d-flex, which sets our container to display: flex. With display: flex, the default behavior is to set the flex container to flex-direction: row. This flex direction is implicit, meaning that we don't have to specify it. However, if we want to assign a different value to the flex-direction property, we can use another Bootstrap 4 helper class, for example, flex-row-reverse. So, let's add it to the first div, like this:

<div class="container d-flex flex-row-reverse">

Now, if we save and refresh our page, we will see that the heading and the two paragraphs still show on the flex row, but now the last paragraph comes first, on the left edge of the container. It is then followed by the first paragraph and, finally, by the heading itself.

There are four ways to specify flex-direction in Bootstrap, that is, by adding one of the following four classes to our wrapping HTML element: flex-row, flex-row-reverse, flex-column, and flex-column-reverse. The first two classes align our flex items horizontally, and the last two classes align our flex items vertically.

Back to our recipe: we can see that on the second container, we added the following three classes to the original div (that had only the class of container in step 1): d-flex, flex-column, and flex-lg-row. Now we can understand what each of these classes does. The d-flex class sets our second container to display: flex. The flex-column class stacks our flex items (the four card components) vertically, with each card taking up the width of the container.

Since Bootstrap is a mobile-first framework, the classes we provide also take effect mobile-first. If we want to override a class, by convention, we need to provide a breakpoint at which the initial class behavior will be overridden. In this recipe, we want to specify a class, with a specific breakpoint, at which the class makes our cards line up horizontally rather than stacking them vertically. Because of the number of cards inside our second container, and because of the minimum width that each of these cards takes up, the most obvious solution was to have the cards line up horizontally on resolutions of lg and up. That is why we provide the third class of flex-lg-row to our second container. We could have used any other helper class, such as flex-row, flex-sm-row, flex-md-row, or flex-xl-row, but the one that we actually used made the most sense.

Summary
In this article, we have covered customizing the blockquote element with CSS, controlling the number of card columns on different breakpoints with SCSS, and breakpoint-dependent switching of flex direction on card components.

Resources for Article:
Further resources on this subject:
Web Development with React and Bootstrap [article]
Gearing Up for Bootstrap 4 [article]
Deep Customization of Bootstrap [article]
Puppet Server and Agents

Packt
16 Aug 2017
18 min read
In this article by Martin Alfke, the author of the book Puppet Essentials - Third Edition, we will cover the following topics:

The Puppet server
Setting up the Puppet agent

(For more resources related to this topic, see here.)

The Puppet server

Many Puppet-based workflows are centered on the server, which is the central source of configuration data and authority. The server hands instructions to all the computer systems in the infrastructure (where agents are installed). It serves multiple purposes in the distributed system of Puppet components. The server will perform the following tasks:

Storing manifests and compiling catalogs
Serving as the SSL certification authority
Processing reports from the agent machines
Gathering and storing information about the agents

As such, the security of your server machine is paramount. The requirements for hardening are comparable to those of a Kerberos Key Distribution Center.

During its first initialization, the Puppet server generates the CA certificate. This self-signed certificate will be distributed among, and trusted by, all the components of your infrastructure. This is why its private key must be protected very carefully. New agent machines request individual certificates, which are signed with the CA certificate.

The terminology around the master software might be a little confusing. That's because both the terms, Puppet Master and Puppet Server, are floating around, and they are closely related too. Let's consider some technological background in order to give you a better understanding of what is what. Puppet's master service mainly comprises a RESTful HTTP API. Agents initiate the HTTPS transactions, with both sides identifying each other using trusted SSL certificates. During the time when Puppet 3 and older versions were current, the HTTPS layer was typically handled by Apache. Puppet's Ruby core was invoked through the Passenger module. This approach offered good stability and scalability.

Puppet Inc. has improved upon this standard solution with a specialized software called puppetserver. The Ruby-based core of the master remains basically unchanged, although it now runs on JRuby instead of Ruby's own MRI. The HTTPS layer is run by Jetty, sharing the same Java Virtual Machine with the master. By cutting out some middlemen, puppetserver is faster and more scalable than a Passenger solution. It is also significantly easier to set up.

Setting up the server machine

Getting the puppetserver software onto a Linux machine is just as simple as the agent package. Packages are available for Red Hat Enterprise Linux and its derivatives, for Debian and Ubuntu, and for any other operating system that is supported to run a Puppet server. For now, the Puppet server must run on a Linux-based operating system; it cannot run on Windows or any other Unix.

A great way to get Puppet Inc. packages on any platform is the Puppet Collection. Shortly after the release of Puppet 4, Puppet Inc. created this new way of supplying software. This can be considered a distribution in its own right. Unlike Linux distributions, it does not contain a kernel, system tools, and libraries. Instead, it comprises various software from the Puppet ecosystem. Software versions that are available from the same Puppet Collection are guaranteed to work well together. Use the following commands to install puppetserver from the first Puppet Collection (PC1) on a Debian 7 machine. (The Collection for Debian 8 has not yet received a puppetserver package at the time of writing this.)
root@puppetmaster# wget http://apt.puppetlabs.com/puppetlabs-release-pc1-jessie.deb
root@puppetmaster# dpkg -i puppetlabs-release-pc1-jessie.deb
root@puppetmaster# apt-get update
root@puppetmaster# apt-get install puppetserver

The puppetserver package comprises only the Jetty server and the Clojure API, but the all-in-one puppet-agent package is pulled in as a dependency. The package name, puppet-agent, is misleading. This AIO package contains all the parts of Puppet, including the master core, a vendored Ruby build, and several pieces of additional software. Specifically, you can use the puppet command on the master node. You will soon learn how this is useful. However, when using the packages from Puppet Labs, everything gets installed under /opt/puppetlabs. It is advisable to make sure that your PATH variable always includes the /opt/puppetlabs/bin directory so that the puppet command is found there.

Regardless of this, once the puppetserver package is installed, you can start the master service:

root@puppetmaster# systemctl start puppetserver

Depending on the power of your machine, the startup can take a few minutes. Once initialization completes, the server will operate very smoothly, though. As soon as the master port 8140 is open, your Puppet master is ready to serve requests.

If the service fails to start, there might be an issue with certificate generation. (We observed such issues with some versions of the software.) Check the log file at /var/log/puppetlabs/puppetserver/puppetserver-daemon.log. If it indicates that there are problems while looking up its certificate file, you can work around the problem by temporarily running a standalone master as follows: puppet master --no-daemonize. After initialization, you can stop this process. The certificate is available now, and puppetserver should be able to start as well. Another reason for start failures is an insufficient amount of memory; the Puppet server process needs 2 GB of memory.

Creating the master manifest

The master compiles manifests for many machines, but the agent does not get to choose which source file is to be used; this is completely at the master's discretion. The starting point for any compilation by the master is always the site manifest, which can be found in /opt/puppetlabs/code/environments/production/manifests/. Each connecting agent will use all the manifests found here. Of course, you don't want to manage only one identical set of resources on all your machines. To define a piece of manifest exclusively for a specific agent, put it in a node block. This block's contents will only be considered when the calling agent has a matching common name in its SSL certificate. You can dedicate a piece of the manifest to a machine with the name of agent, for example:

node 'agent' {
  $packages = [ 'apache2',
    'libapache2-mod-php5',
    'libapache2-mod-passenger', ]
  package { $packages:
    ensure => 'installed',
    before => Service['apache2'],
  }
  service { 'apache2':
    ensure => 'running',
    enable => true,
  }
}

Before you set up and connect your first agent to the master, step back and think about how the master should be addressed. By default, agents will try to resolve the unqualified puppet hostname in order to get the master's address. If you have a default domain that is being searched by your machines, you can use this as a default and add a record for puppet as a subdomain (such as puppet.example.net).
Otherwise, pick a domain name that seems fitting to you, such as master.example.net or adm01.example.net. What's important is the following:

All your agent machines can resolve the name to an address
The master process is listening for connections on that address
The master uses a certificate with the chosen name as CN or DNS Alt Names

The mode of resolution depends on your circumstances; the hosts file on each machine is one ubiquitous possibility. The Puppet server listens on all the available addresses by default.

This leaves the task of creating a suitable certificate, which is simple. Configure the master to use the appropriate certificate name and restart the service. If the certificate does not exist yet, Puppet will take the necessary steps to create it. Put the following setting into your /etc/puppetlabs/puppet/puppet.conf file on the master machine:

[main]
certname=puppetmaster.example.net

In Puppet versions before 4.0, the default location for the configuration file is /etc/puppet/puppet.conf.

Upon its next start, the master will use the appropriate certificate for all SSL connections. The automatic proliferation of SSL data is not dangerous even in an existing setup, except for the certification authority. If the master were to generate a new CA certificate at any point in time, it would break the trust of all existing agents. Make sure that the CA data is neither lost nor compromised. All previously signed certificates become obsolete whenever Puppet needs to create a new certification authority. The default storage location is /etc/puppetlabs/puppet/ssl/ca for Puppet 4.0 and higher, and /var/lib/puppet/ssl/ca for older versions.

Inspecting the configuration settings

All the customization of the master's parameters can be made in the puppet.conf file. The operating system packages ship with some settings that are deemed sensible by the respective maintainers. Apart from these explicit settings, Puppet relies on defaults that are either built-in or derived from the environment. Most users will want to rely on these defaults for as many settings as possible. This is possible without any drawbacks because Puppet makes all settings fully transparent using the --configprint parameter. For example, you can find out where the master manifest files are located:

root@puppetmaster# puppet master --configprint manifest
/etc/puppetlabs/code/environments/production/manifests

To get an overview of all available settings and their values, use the following command:

root@puppetmaster# puppet master --configprint all | less

While this command is especially useful on the master side, the same introspection is available for puppet apply and puppet agent. Setting specific configuration entries is possible with the puppet config command:

root@puppetmaster# puppet config set --section main certname puppetmaster.example.net
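To verify that the value was written as intended, you can read it back with puppet config print (a quick sanity check; the output shown is simply what one would expect for this example):

root@puppetmaster# puppet config print certname --section main
puppetmaster.example.net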
After a successful package installation, you need to specify where the Puppet agent can find the puppetserver:

root@agent# puppet config set --section agent server puppetmaster.example.net

Afterwards, the following invocation is sufficient for an initial test:

root@agent# puppet agent --test
Info: Creating a new SSL key for agent
Error: Could not request certificate: getaddrinfo: Name or service not known
Exiting; failed to retrieve certificate and waitforcert is disabled

Puppet first created a new SSL certificate key for itself. For its own name, it picked agent, which is the machine's hostname. That's fine for now. An error occurred because the puppet name cannot currently be resolved to anything. Add this to /etc/hosts so that Puppet can contact the master:

root@agent# puppet agent --test
Info: Caching certificate for ca
Info: csr_attributes file loading from /etc/puppetlabs/puppet/csr_attributes.yaml
Info: Creating a new SSL certificate request for agent
Info: Certificate Request fingerprint (SHA256): 52:65:AE:24:5E:2A:C6:17:E2:5D:0A:C9:86:E3:52:44:A2:EC:55:AE:3D:40:A9:F6:E1:28:31:50:FC:8E:80:69
Error: Could not request certificate: Error 500 on SERVER: Internal Server Error: java.io.FileNotFoundException: /etc/puppetlabs/puppet/ssl/ca/requests/agent.pem (Permission denied)
Exiting; failed to retrieve certificate and waitforcert is disabled

Note how Puppet conveniently downloaded and cached the CA certificate. The agent will establish trust based on this certificate from now on.

Puppet created a certificate request and sent it to the master. It then immediately tried to download the signed certificate. This is expected to fail; the master won't just sign a certificate for any request it receives. This behavior is important for proper security. There is a configuration setting that enables such automatic signing, but users are generally discouraged from using this setting because it allows the creation of arbitrary numbers of signed (and therefore, trusted) certificates by any user who has network access to the master.

To authorize the agent, look for the CSR on the master using the puppet cert command:

root@puppetmaster# puppet cert --list
"agent" (SHA256) 52:65:AE:24:5E:2A:C6:17:E2:5D:0A:C9:86:E3:52:44:A2:EC:55:AE:3D:40:A9:F6:E1:28:31:50:FC:8E:80:69

This looks alright, so now you can sign a new certificate for the agent:

root@puppetmaster# puppet cert --sign agent
Notice: Signed certificate request for agent
Notice: Removing file Puppet::SSL::CertificateRequest agent at '/etc/puppetlabs/puppet/ssl/ca/requests/agent.pem'

When choosing the action for puppet cert, the dashes in front of the option name can be omitted; you can just use puppet cert list and puppet cert sign.

Now the agent can receive its certificate for its catalog run as follows:

root@agent# puppet agent --test
Info: Caching certificate for agent
Info: Caching certificate_revocation_list for ca
Info: Caching certificate for agent
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Caching catalog for agent
Info: Applying configuration version '1437065761'
Notice: Applied catalog in 0.11 seconds

The agent is now fully operational. It received a catalog and applied all resources within. Before you read on to learn how the agent usually operates, there is a note that is especially important for the users of Puppet 3: since the default puppet name is not the common name in a Puppet 3.x master's certificate, the preceding approach will not even work with a Puppet 3.x master.
It works with puppetserver and Puppet 4 because the default puppet name is now included in the certificate's Subject Alternative Names by default. It is tidier to not rely on this alias name, though. After all, in production, you will probably want to make sure that the master has a fully qualified name that can be resolved, at least inside your network. You should therefore add the following to the agent section of puppet.conf on each agent machine:

[agent]
server=master.example.net

In the absence of DNS to resolve this name, your agent will need an appropriate entry in its hosts file or a similar alternative way of address resolution. These steps are necessary in a Puppet 3.x setup. If you have been following along with a Puppet 4 agent, you might notice that after this change, it generates a new Certificate Signing Request:

root@agent# puppet agent --test
Info: Creating a new SSL key for agent.example.net
Info: csr_attributes file loading from /etc/puppetlabs/puppet/csr_attributes.yaml
Info: Creating a new SSL certificate request for agent.example.net
Info: Certificate Request fingerprint (SHA256): 85:AC:3E:D7:6E:16:62:BD:28:15:B6:18:12:8E:5D:1C:4E:DE:DF:C3:4E:8F:3E:20:78:1B:79:47:AE:36:98:FD
Exiting; no certificate found and waitforcert is disabled

If this happens, you will have to use puppet cert sign on the master again. The agent will then retrieve a new certificate.

The agent's life cycle

In a Puppet-centric workflow, you typically want all changes to the configuration of servers (perhaps even workstations) to originate on the Puppet master and propagate to the agents automatically. Each new machine gets integrated into the Puppet infrastructure with the master at its center, and gets removed during decommissioning.

The very first step, generating a key and a certificate signing request, is always performed implicitly and automatically at the start of an agent run if no local SSL data exists yet. Puppet creates the required data if no appropriate files are found. There will be a short description on how to trigger this behavior manually later in this section. The next step is usually the signing of the agent's certificate, which is performed on the master. It is a good practice to monitor the pending requests by listing them on the console:

root@puppetmaster# puppet cert list
root@puppetmaster# puppet cert sign '<agent fqdn>'

From this point on, the agent will periodically check with the master to load updated catalogs. The default interval for this is 30 minutes. The agent will perform a run of a catalog each time and check the sync state of all the contained resources. The run is performed for unchanged catalogs as well, because the sync states can change between runs.

Until you manage to sign the certificate, the agent process will query the master at short intervals for a while. This avoids a 30-minute delay if the certificate is not ready right when the agent starts up.

Launching this background process can be done manually through a simple command:

root@agent# puppet agent

However, it is preferable to do this through the puppet system service.

When an agent machine is taken out of active service, its certificate should be invalidated. As is customary with SSL, this is done through revocation. The master adds the serial number of the certificate to its certificate revocation list. This list, too, is shared with each agent machine.
Revocation is initiated on the master through the puppet cert command:

root@puppetmaster# puppet cert revoke agent

The updated CRL is not honored until the master service is restarted. If security is a concern, this step must not be postponed. The agent can then no longer use its old certificate:

root@agent# puppet agent --test
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: SSL_connect SYSCALL returned=5 errno=0 state=unknown state
[...]
Error: Could not retrieve catalog from remote server: SSL_connect SYSCALL returned=5 errno=0 state=unknown state
[...]

Renewing an agent's certificate

Sometimes, it is necessary during an agent machine's life cycle to regenerate its certificate and related data; the reasons can include data loss, human error, or certificate expiration, among others. Performing the regeneration is quite simple: all relevant files are kept at /etc/puppetlabs/puppet/ssl (for Puppet 3.x, this is /var/lib/puppet/ssl) on the agent machine. Once these files are removed (or rather, the whole ssl/ directory tree), Puppet will renew everything on the next agent run. Of course, a new certificate must be signed. This requires some preparation; just initiating the request from the agent will fail:

root@agent# puppet agent --test
Info: Creating a new SSL key for agent
Info: Caching certificate for ca
Info: Caching certificate for agent.example.net
Error: Could not request certificate: The certificate retrieved from the master does not match the agent's private key.
Certificate fingerprint: 6A:9F:12:C8:75:C0:B6:10:45:ED:C3:97:24:CC:98:F2:B6:1A:B5:4C:E3:98:96:4F:DA:CD:5B:59:E0:7F:F5:E6

The master still has the old certificate cached. This is a simple protection against the impersonation of your agents by unauthorized entities. To fix this, remove the certificate from both the master and the agent, and then start a Puppet run, which will automatically regenerate a certificate.

On the master:

puppet cert clean agent.example.net

On the agent, on most platforms:

find /etc/puppetlabs/puppet/ssl -name agent.example.net.pem -delete

On Windows:

del "/etc/puppetlabs/puppet/ssl/agent.example.net.pem" /f

puppet agent -t
Exiting; failed to retrieve certificate and waitforcert is disabled

Once you perform the cleanup operation on the master, as advised in the preceding output, and remove the indicated file from the agent machine, the agent will be able to successfully place its new CSR:

root@puppetmaster# puppet cert clean agent
Notice: Revoked certificate with serial 18
Notice: Removing file Puppet::SSL::Certificate agent at '/etc/puppetlabs/puppet/ssl/ca/signed/agent.pem'
Notice: Removing file Puppet::SSL::Certificate agent at '/etc/puppetlabs/puppet/ssl/certs/agent.pem'

The rest of the process is identical to the original certificate creation. The agent uploads its CSR to the master, where the certificate is created through the puppet cert sign command.

Running the agent from cron

There is an alternative way to operate the agent. We covered starting one long-running puppet agent process that does its work at set intervals and then goes back to sleep. However, it is also possible to have cron launch a discrete agent process at the same interval. This agent will contact the master once, run the received catalog, and then terminate.
This has several advantages, as follows:

The agent operating system saves some resources
The interval is precise and not subject to skew (when running the background agent, deviations result from the time that elapses during the catalog run), and distributed interval skew can lead to thundering herd effects
Any agent crash or inadvertent termination is not fatal

Setting Puppet up to run the agent from cron is also very easy to do: with Puppet! You can use a manifest such as the following:

service { 'puppet': enable => false }
cron { 'puppet-agent-run':
  user    => 'root',
  command => 'puppet agent --no-daemonize --onetime --logdest=syslog',
  minute  => fqdn_rand(60),
  hour    => absent,
}

The fqdn_rand function computes a distinct minute for each of your agents. Setting the hour property to absent means that the job should run every hour.

Summary

In this article, we learned about the Puppet server and how to set up the Puppet agent.

Resources for Article:

Further resources on this subject:

Quick start – Using the core Puppet resource types [article]
Understanding the Puppet Resources [article]
Puppet: Integrating External Tools [article]


Machine Learning Models

Packt
16 Aug 2017
8 min read
In this article by Pratap Dangeti, the author of the book Statistics for Machine Learning, we will take a look at ridge regression and lasso regression in machine learning.

(For more resources related to this topic, see here.)

Ridge regression and lasso regression

In linear regression, only the residual sum of squares (RSS) is minimized, whereas in ridge and lasso regression a penalty (also known as a shrinkage penalty) is applied to the coefficient values in order to regularize the coefficients with the tuning parameter λ. When λ = 0, the penalty has no impact and ridge/lasso produces the same result as linear regression, whereas λ → ∞ brings the coefficients towards zero:

\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} \quad \text{(ridge)}

\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \quad \text{(lasso)}

Before we go deeper into ridge and lasso, it is worth understanding some concepts around Lagrangian multipliers. One can rewrite the preceding objective functions in the following format, where the objective is just the RSS subject to a cost constraint (s) of budget; for every value of λ, there is some s such that the constrained problem yields the same solution as the penalized objective function:

\text{minimize RSS subject to } \sum_{j=1}^{p} \beta_j^2 \le s \ \text{(ridge)}, \qquad \sum_{j=1}^{p} |\beta_j| \le s \ \text{(lasso)}

(Figure: the two different Lagrangian formats.)

Ridge regression works well in situations where the least squares estimates have high variance. Ridge regression also has computational advantages over best subset selection, which requires 2^p models; in contrast, for any fixed value of λ, ridge regression only fits a single model, and the model-fitting procedure can be performed very quickly.

One disadvantage of ridge regression is that it includes all the predictors and shrinks their weights according to their importance, but it does not set any of the values exactly to zero in order to eliminate unnecessary predictors from the model; this issue is overcome in lasso regression. When the number of predictors is significantly large, using ridge may provide good accuracy, but it includes all the variables, which is not desirable in a compact representation of the model; this issue is not present in lasso, as it will set the weights of unnecessary variables to zero. Models generated from lasso are very much like those from subset selection; hence, they are much easier to interpret than those produced by ridge regression.
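To make this contrast concrete before the full wine-quality example, the following minimal sketch fits ridge and lasso on synthetic data (all values here are illustrative and not from the book), showing that lasso zeroes out the uninformative coefficients while ridge only shrinks them:

>>> import numpy as np
>>> from sklearn.linear_model import Ridge, Lasso
>>> rng = np.random.RandomState(42)
>>> X = rng.randn(100, 5)  # five predictors, only the first two are informative
>>> y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.randn(100)
>>> ridge = Ridge(alpha=1.0).fit(X, y)  # penalizes the squared coefficients
>>> lasso = Lasso(alpha=0.5).fit(X, y)  # penalizes the absolute coefficients
>>> print("Ridge:", np.round(ridge.coef_, 3))  # all five are shrunk but non-zero
>>> print("Lasso:", np.round(lasso.coef_, 3))  # the three noise coefficients become exactly 0.0

Increasing alpha in either model shrinks the coefficients further; only lasso ever sets them exactly to zero.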
Example of ridge regression machine learning model

Ridge regression is a machine learning model in which we do not perform any statistical diagnostics on the independent variables; we just utilize the model to fit on the test data and check the accuracy of the fit. Here we have used the scikit-learn package:

>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import Ridge
>>> wine_quality = pd.read_csv("winequality-red.csv",sep=';')
>>> wine_quality.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
>>> all_colnms = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol']
>>> pdx = wine_quality[all_colnms]
>>> pdy = wine_quality["quality"]
>>> x_train,x_test,y_train,y_test = train_test_split(pdx,pdy,train_size = 0.7,random_state=42)

A simple version of grid search from scratch is described as follows, in which various values of alpha are tried in order to test the model's fitness:

>>> alphas = [1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0]

The initial value of R-squared is set to zero in order to keep track of its updated value and to print it whenever the new value is greater than the existing value:

>>> initrsq = 0
>>> print ("\nRidge Regression: Best Parameters\n")
>>> for alph in alphas:
...     ridge_reg = Ridge(alpha=alph)
...     ridge_reg.fit(x_train,y_train)
...     tr_rsqrd = ridge_reg.score(x_train,y_train)
...     ts_rsqrd = ridge_reg.score(x_test,y_test)

The following code keeps track of the test R-squared value and prints whenever the new value is greater than the existing best value:

...     if ts_rsqrd > initrsq:
...         print ("Lambda: ",alph,"Train R-Squared value:",round(tr_rsqrd,5),"Test R-squared value:",round(ts_rsqrd,5))
...         initrsq = ts_rsqrd

By looking at the test R-squared value (0.3513), we can conclude that there is no significant relationship between the independent and dependent variables. Also, please note that the test R-squared value generated from ridge regression is similar to the value obtained from multiple linear regression (0.3519), but without any stress on the diagnostics of variables, and so on. Hence, machine learning models are relatively compact and can be utilized for learning automatically without manual intervention to retrain the model; this is one of the biggest advantages of using ML models for deployment purposes.

The R code for ridge regression on the wine quality data is shown as follows:

# Ridge regression
library(glmnet)
wine_quality = read.csv("winequality-red.csv",header=TRUE,sep = ";",check.names = FALSE)
names(wine_quality) <- gsub(" ", "_", names(wine_quality))
set.seed(123)
numrow = nrow(wine_quality)
trnind = sample(1:numrow,size = as.integer(0.7*numrow))
train_data = wine_quality[trnind,]; test_data = wine_quality[-trnind,]
xvars = c("fixed_acidity","volatile_acidity","citric_acid","residual_sugar","chlorides","free_sulfur_dioxide","total_sulfur_dioxide","density","pH","sulphates","alcohol")
yvar = "quality"
x_train = as.matrix(train_data[,xvars]); y_train = as.double(as.matrix(train_data[,yvar]))
x_test = as.matrix(test_data[,xvars])
print(paste("Ridge Regression"))
lambdas = c(1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0)
initrsq = 0
for (lmbd in lambdas){
  ridge_fit = glmnet(x_train,y_train,alpha = 0,lambda = lmbd)
  pred_y = predict(ridge_fit,x_test)
  R2 <- 1 - (sum((test_data[,yvar]-pred_y)^2)/sum((test_data[,yvar]-mean(test_data[,yvar]))^2))
  if (R2 > initrsq){
    print(paste("Lambda:",lmbd,"Test Adjusted R-squared :",round(R2,4)))
    initrsq = R2
  }
}

Example of lasso regression model

Lasso regression is a close cousin of ridge regression, in which the absolute values of the coefficients are penalized rather than their squares.
By doing so, we eliminate some insignificant variables, which gives a very compact representation, much like subset selection. The following implementation is almost the same as ridge regression, apart from the penalty being applied to the mod/absolute value of the coefficients:

>>> from sklearn.linear_model import Lasso
>>> alphas = [1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0]
>>> initrsq = 0
>>> print ("\nLasso Regression: Best Parameters\n")
>>> for alph in alphas:
...     lasso_reg = Lasso(alpha=alph)
...     lasso_reg.fit(x_train,y_train)
...     tr_rsqrd = lasso_reg.score(x_train,y_train)
...     ts_rsqrd = lasso_reg.score(x_test,y_test)
...     if ts_rsqrd > initrsq:
...         print ("Lambda: ",alph,"Train R-Squared value:",round(tr_rsqrd,5),"Test R-squared value:",round(ts_rsqrd,5))
...         initrsq = ts_rsqrd

Lasso regression produces almost the same results as ridge, but if we check the test R-squared values a bit carefully, lasso produces slightly lower values. The reason could be its robustness in reducing coefficients to zero and eliminating them from the analysis:

>>> ridge_reg = Ridge(alpha=0.001)
>>> ridge_reg.fit(x_train,y_train)
>>> print ("\nRidge Regression coefficient values of Alpha = 0.001\n")
>>> for i in range(11):
...     print (all_colnms[i],": ",ridge_reg.coef_[i])
>>> lasso_reg = Lasso(alpha=0.001)
>>> lasso_reg.fit(x_train,y_train)
>>> print ("\nLasso Regression coefficient values of Alpha = 0.001\n")
>>> for i in range(11):
...     print (all_colnms[i],": ",lasso_reg.coef_[i])

The results show the coefficient values of both methods: the coefficient of density has been set to 0 in lasso regression, whereas its value is -5.5672 in ridge regression; also, none of the coefficients in ridge regression are zero.

The R code for lasso regression on the wine quality data is shown as follows (the data processing steps are the same as for ridge regression; only the following section of the code changes):

# Lasso Regression
print(paste("Lasso Regression"))
lambdas = c(1e-4,1e-3,1e-2,0.1,0.5,1.0,5.0,10.0)
initrsq = 0
for (lmbd in lambdas){
  lasso_fit = glmnet(x_train,y_train,alpha = 1,lambda = lmbd)
  pred_y = predict(lasso_fit,x_test)
  R2 <- 1 - (sum((test_data[,yvar]-pred_y)^2)/sum((test_data[,yvar]-mean(test_data[,yvar]))^2))
  if (R2 > initrsq){
    print(paste("Lambda:",lmbd,"Test Adjusted R-squared :",round(R2,4)))
    initrsq = R2
  }
}

Regularization parameters in linear regression and ridge/lasso regression

Adjusted R-squared in linear regression, which always penalizes the addition of extra variables with little significance, is one way of regularizing in linear regression, but it adjusts to a unique fit of the model. In machine learning, by contrast, many parameters can be adjusted to regularize the overfitting problem; in the example of lasso/ridge regression, there are infinitely many values of the penalty parameter (λ) that can be applied to regularize the model in infinitely many ways. Overall, there are many similarities between the statistical way and the machine learning way of predicting patterns.

Summary

We have seen ridge regression and lasso regression with examples, and we have also looked at their regularization parameters.

Resources for Article:

Further resources on this subject:

Machine Learning Review [article]
Getting Started with Python and Machine Learning [article]
Machine learning in practice [article]

The Cloud and the DevOps Revolution

Packt
14 Aug 2017
12 min read
Cloud and DevOps are two of the most important trends to emerge in technology. The reasons are clear - it's all about the amount of data that needs to be processed and managed in the applications and websites we use every day. The amount of data being processed and handled is huge: every day, over a billion people visit Facebook, every hour 18,000 hours of video are uploaded to YouTube, and every second Google processes 40,000 search queries. Being able to handle such a staggering scale isn't easy. Through the use of Amazon Web Services (AWS), you will be able to build out the key components needed to succeed at minimum cost and effort. This is an extract from Effective DevOps on AWS.

Thinking in terms of cloud and not infrastructure

The day I discovered that noise can damage hard drives: December 2011, sometime between Christmas and New Year's Eve. I started to receive dozens of alerts from our monitoring system. Apparently, we had just lost connectivity to our European datacenter in Luxembourg. I rushed into the network operating center (NOC), hoping that it was only a small glitch in our monitoring system, maybe just a joke; after all, with so much redundancy, how can everything go offline? Unfortunately, when I got into the room, the big monitoring monitors were all red, not a good sign. This was just the beginning of a very long nightmare.

An electrician working in our datacenter had mistakenly triggered the fire alarm; within seconds, the fire suppression system set off and released its argonite on top of our server racks. Unfortunately, this kind of fire suppression system makes so much noise when it releases its gas that the sound wave instantly killed hundreds and hundreds of hard drives, effectively shutting down our only European facility. It took months for us to be back on our feet. Where is the cloud when you need it! As Charles Philips said it best: "Friends don't let friends build a datacenter."

Deploying your own hardware versus in the cloud

It wasn't long ago that tech companies small and large had to have a proper technical operations organization able to build out infrastructures. The process went a little bit like this:

Fly to the location you want to put your infrastructure in and tour the different datacenters and their facilities. Look at the floor considerations, power considerations, HVAC, fire prevention systems, physical security, and so on.
Shop for an internet provider; even though you are now talking about servers and a lot more bandwidth, the process is the same: you want to get internet connectivity for your servers.
Once that's done, it's time to get your hardware. Make the right decisions here, because you are probably going to spend a big portion of your company's money on buying servers, switches, routers, firewalls, storage, UPS (for when you have a power outage), KVM, network cables, the labeler so dear to every system administrator's heart, and a bunch of spare parts: hard drives, RAID controllers, memory, power cables, you name it.
At that point, once the hardware is bought and shipped to the datacenter location, you can rack everything, wire all the servers, and power everything on. Your network team can then kick in and start establishing connectivity to the new datacenter using various links, configuring the edge routers, switches, top-of-the-rack switches, KVM, and firewalls (sometimes). Your storage team is next and will provide the much-needed NAS or SAN; next comes your sysops team, who will image the servers, sometimes upgrade the BIOS, configure the hardware RAID, and finally put an OS on those servers.

Not only is this a full-time job for a big team, it also takes lots of time and money to even get there. As we will see, getting new servers up and running with AWS takes only minutes. In fact, more than just providing a server within minutes, we will soon see how to deploy and run a service in minutes, and just when you need it.

Cost analysis

From a cost standpoint, deploying in a cloud infrastructure such as AWS usually ends up being a lot cheaper than buying your own hardware. If you want to deploy your own hardware, you have to pay upfront for all the hardware (servers, network equipment) and sometimes for software licenses as well. In a cloud environment, you pay as you go. You can add and remove servers in no time. Also, if you take advantage of PaaS and SaaS applications, you usually end up saving even more money by lowering your operating costs, as you don't need as much staff to administrate your databases, storage, and so on. Most cloud providers, AWS included, also offer tiered pricing and volume discounts. As your service gets bigger and bigger, you end up paying less for each unit of storage, bandwidth, and so on.

Just-in-time infrastructure

As we just saw, when deploying in the cloud, you pay as you go. Most cloud companies use that to their advantage to scale their infrastructure up and down as the traffic to their sites changes. This ability to add and remove new servers and services in no time and on demand is one of the main differentiators of an effective cloud infrastructure. Here is a diagram from a presentation from 2015 that shows the annual traffic going to https://www.amazon.com/ (the online store):

© 2016, Amazon Web Services, Inc. or its affiliates. All rights reserved.

As you can see, with the holidays, the end of the year is a busy time for https://www.amazon.com/: their traffic triples. If they were hosting their service in an "old-fashioned" way, they would have only 24% of their infrastructure used on average every year, but thanks to being able to scale dynamically, they are able to provision only what they really need.

© 2016, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Here at Medium, we also see the benefits of having fast auto-scaling capabilities on a very regular basis. Very often, stories become viral, and the amount of traffic going to Medium can change drastically. On January 21st, 2015, to our surprise, the White House posted the transcript of the State of the Union minutes before President Obama started his speech: https://medium.com/@WhiteHouse/

As you can see in the following graph, thanks to being in the cloud and having auto-scaling capabilities, our platform was able to absorb the 5x instant spike of traffic that the announcement caused by doubling the number of servers our front service uses. Later, as the traffic started to drain naturally, we automatically removed some hosts from our fleet.
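As a rough illustration of how such elasticity is expressed in practice, the following sketch creates an Auto Scaling group with the AWS command-line tool (all the names and sizes here are illustrative, not taken from Medium's setup):

aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name front-service \
    --launch-configuration-name front-service-lc \
    --min-size 2 --max-size 20 \
    --vpc-zone-identifier subnet-12345678

With a scaling policy attached to the group, AWS adds instances (up to max-size) when traffic spikes and removes them (down to min-size) when it drains.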
The different layers of building a cloud

Cloud computing is often broken up into three different types of service:

Infrastructure as a Service (IaaS): This is the fundamental block on top of which everything cloud is built. It is usually a computing resource in a virtualized environment, offering a combination of processing power, memory, storage, and network. The most common IaaS entities you will find are virtual machines (VMs), network equipment such as load balancers or virtual Ethernet interfaces, and storage such as block devices. This layer is very close to the hardware and gives you the full flexibility that you would get deploying your software outside of a cloud. If you have any physical datacenter knowledge, it will mostly apply to this layer as well.
Platform as a Service (PaaS): This is where things start to get really interesting with the cloud. When building an application, you will likely need a certain number of common components, such as a data store, a queue, and so on. The PaaS layer provides a number of ready-to-use applications to help you build your own services without worrying about administrating and operating those third-party services, such as a database server.
Software as a Service (SaaS): This is the icing on the cake. Similarly to the PaaS layer, you get access to managed services, but this time those services are complete solutions dedicated to a certain purpose, such as management or monitoring tools.

When building an application, relying on those services makes a big difference when compared to a more traditional environment outside of a cloud. Another key element in succeeding when deploying or migrating to a new infrastructure is to adopt a DevOps mindset.

Deploying in AWS

AWS is at the forefront of cloud providers. Launched in 2006 with SQS and EC2, Amazon quickly became the biggest IaaS provider. They have the biggest infrastructure and the biggest ecosystem, and they constantly add new features and release new services. In 2015, they passed the mark of 1 million active customers. Over the last few years, they have managed to change people's mindset about the cloud, and deploying new services to the cloud is now the new normal. Using AWS's managed tools and services is a drastic way to improve your productivity and keep your team lean. Amazon continually listens to its customers' feedback and looks at market trends; therefore, as the DevOps movement started to become established, Amazon released a number of new services tailored toward implementing some of the DevOps best practices. We will also see how those services synergize with the DevOps culture.

How to take advantage of the AWS ecosystem

When you talk to application architects, there are usually two trains of thought. The first one is to stay as platform-agnostic as possible. The idea behind this is that if you aren't happy with AWS anymore, you can easily switch cloud providers or even build your own private cloud. The second train of thought is the complete opposite: the idea is that you are going to stick to AWS no matter what. It feels a bit extreme to think of it that way, but the reward is worth the risk, and more and more companies agree with that. That's also where I stand. When you build a product nowadays, the scarcity is always time and people. If you can outsource what is not your core business to a company that provides a similar service or technology, with support and expertise, and that you can just pay for on a SaaS model, then do so.
If, like me, you agree that using managed services is the way to go, then being a cloud architect is like playing with Lego. With Lego, you have lots of pieces of different shapes, sizes, and colors, and you assemble them to build your own MOC. Amazon services are like those Lego pieces. If you can picture your final product, then you can explore the different services and start combining them to build the supporting stack needed to quickly and efficiently build your product. Of course, in this case, the "if" is a big if, and unlike with Lego, understanding what each piece can do is a lot less visual and colorful.

How AWS synergizes with the DevOps culture

Having a DevOps culture is about rethinking how engineering teams work together, breaking down the developer and operations silos, and bringing in a new set of tools to implement some best practices. AWS helps accomplish this in many different ways. For some developers, the world of operations can be scary and confusing, but if you want better cooperation between engineers, it is important to expose every aspect of running a service to the entire engineering organization. As an operations engineer, you can't have a gatekeeper mentality toward developers; instead, it's better to make them comfortable accessing production and working on the different components of the platform. A good way to get started with that is the AWS console. While a bit overwhelming, it is still a much better experience for people not familiar with this world to navigate that web interface than to refer to constantly out-of-date documentation, or to use SSH and random plays to discover the topology and configuration of the service.

Of course, as your expertise grows, as your application becomes more and more complex, and as the need to operate it faster increases, the web interface starts showing some weaknesses. To get around that issue, AWS provides a very DevOps-friendly alternative: an API. Accessible through a command-line tool and a number of SDKs (which include Java, JavaScript, Python, .NET, PHP, Ruby, Go, and C++), the SDKs let you administrate and use the managed services.

Finally, AWS offers a number of DevOps tools. AWS has a source control service similar to GitHub called CodeCommit. For automation, in addition to allowing you to control everything via the SDKs, AWS provides the ability to create templates of your infrastructure via CloudFormation, but also a configuration management system called OpsWorks. It also knows how to scale fleets of servers up and down using Auto Scaling groups. For continuous delivery, AWS provides a service called CodePipeline, and for continuous deployment, a service called CodeDeploy. With regard to measuring everything, we will rely on CloudWatch, and later ElasticSearch / Kibana, to visualize metrics and logs. Finally, we will see how to use Docker via ECS, which will let us create containers to improve server density (we will be able to reduce VM consumption, as we will be able to colocate services together in one VM while still keeping fairly good isolation), improve the developer environment (as we will now be able to run something closer to the production environment), and improve testing time (as starting containers is a lot faster than starting virtual machines).
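To give a small, concrete taste of the API-driven workflow described above, here is a minimal sketch using the AWS command-line tool (the region and tag values are illustrative):

aws ec2 describe-instances \
    --region us-east-1 \
    --filters "Name=tag:Environment,Values=production" \
    --query "Reservations[].Instances[].InstanceId"

The same call is available in every SDK, which is what makes it possible to script, version, and review infrastructure changes just like application code.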


Getting the best out of OneDrive for Business

Packt
14 Aug 2017
15 min read
In this article by Prashant G Bhoyar and Martin Machado, authors of the book PowerShell for Office 365, you will learn about OneDrive, which is another key component of the digital workplace. Individuals can safely keep and sync content in the cloud, making it available anywhere, while businesses can manage and retain content securely. In this article, we'll go over common provisioning and management scenarios.

(For more resources related to this topic, see here.)

In the early days, SharePoint was positioned as a great replacement for file shares. SharePoint addressed some important pain points of file shares: versioning, the recycle bin, check in/check out, history/auditing, the web interface, and custom metadata features, to name a few. Fast forward to the present: SharePoint and other content management system products have effectively replaced file shares in the collaboration space. Yet file shares remain very relevant to personal storage, although you would hardly qualify OneDrive for Business as a file share (at least not one from 10 years ago). Officially defined as file-hosting products, OneDrive and OneDrive for Business still offer the convenience of the operating system integration of file shares while adopting important features from the CMS world. Recently, Microsoft has also rolled out OneDrive for Office Groups, making the case for small group collaboration through OneDrive.

Why start with SharePoint in an article on OneDrive, you ask? I am glad you did. At the time of writing this, there are a few differences between OneDrive and SharePoint. All the OneDrive administration commands are included within the SPO API. OneDrive's web interface is a trimmed-down SharePoint site, and you can use the same SharePoint CSOM/REST APIs to work with OneDrive. From an administrator's perspective, OneDrive can be thought of as a synchronization client (in charge of keeping data consistent between local copies and online storage) and a web interface (a branded and customized SharePoint site).

Will this be the case in the long run? At the moment, we are going through a transition period. From the writer's point of view, SharePoint will continue to provide infrastructure for OneDrive and other services. However, Microsoft is in an ongoing effort to provide one API for all its online services. Also, as the platform matures, the lines between OneDrive, SharePoint, Exchange, and other services seem to blur more and more. In the long run, it is quite possible that these products will merge or change in ways we have not thought of. With the maturity of the Microsoft Graph API (the promised API to access all your services), the internal implementation of the services will be less important for developers and administrators. In the Graph API, both OneDrive and SharePoint document libraries are referred to as 'drives' and the files or list items within them as 'driveitems'. This is an indication that even though change is certain, both feature sets will remain similar.

In this article, we will cover OneDrive administration, which can be divided into three different areas:

Feature configuration
Personal site management
Data migration

Feature configuration

The following are properties of the Set-SPOTenant command that can be used to configure the OneDrive user experience:

OneDriveStorageQuota: By default, OneDrive's storage quota is set to 1 TB. The policy value can be changed through the Set-SPOTenant command, and existing sites' quotas can be changed through the Set-SPOSite command.
This value is set in megabytes (1048576 for 1 TB) and will be capped by the user's assigned license. In the following example, we change the quota policy to 6 TB, but the value is effectively set at 5 TB, as that is the highest value allowed for standard licenses:

$quota = 6TB / 1024 / 1024
Set-SPOTenant -OneDriveStorageQuota $quota
Get-SPOTenant | Select OneDriveStorageQuota

OneDriveStorageQuota
--------------------
             5242880

Individual site quotas can be reviewed and updated using the Get-SPOSite and Set-SPOSite commands. In the following sample, note that after updating the quotas for the individual sites, we have to use Get-SPOSite again to see the updated values (changes to sites will not be reflected in local variables):

$mySites = Get-SPOSite -IncludePersonalSite $true -Filter { Url -like '/personal/' }
$mySites | Select StorageQuota, StorageUsageCurrent

StorageQuota StorageUsageCurrent
------------ -------------------
     1048576                   6
     1048576                   1
     5242880                  15

$quota = 3TB / 1024 / 1024
foreach ($site in $mySites) {
    Set-SPOSite -Identity $site -StorageQuota $quota
}
$mySites = Get-SPOSite -IncludePersonalSite $true -Filter { Url -like '/personal/' }
$mySites | Select StorageQuota

StorageQuota
------------
     3145728
     3145728
     3145728

NotifyOwnersWhenInvitationsAccepted: When set to true, the OneDrive owner will be notified when an external user accepts an invitation to access a file or folder.

NotifyOwnersWhenItemsReshared: When set to true, the OneDrive owner will be notified when a file or folder is shared by another user.

OrphanedPersonalSitesRetentionPeriod: When a user is deleted, their OneDrive will be retained for a default of 30 days; after that threshold, the site will be deleted (the value is in days, from 30 to 3650).

ProvisionSharedWithEveryoneFolder: If set to true, a public folder will be set up when a OneDrive site is provisioned. The 'Shared with Everyone' folder is not accessible through the OneDrive client, but it can be used through the browser and is accessible by all users.

SpecialCharactersStateInFileFolderNames: Allows the use of special characters in files and folders (applies to both SharePoint and OneDrive). Currently, the only special characters that can be allowed are # and %. Microsoft has announced that support for additional special characters will be rolled out soon.

Client synchronization: Synchronization can be restricted to machines in specific domains with the Set-SPOTenantSyncClientRestriction command.
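A minimal sketch applying several of these settings in one pass (the values and the domain GUID are illustrative; adjust them to your own policy):

# Tenant-wide OneDrive behavior; the booleans and retention period are example choices
Set-SPOTenant -NotifyOwnersWhenInvitationsAccepted $true `
              -NotifyOwnersWhenItemsReshared $true `
              -OrphanedPersonalSitesRetentionPeriod 60 `
              -ProvisionSharedWithEveryoneFolder $false `
              -SpecialCharactersStateInFileFolderNames Allowed

# Only allow the sync client from machines joined to the listed domains (the GUID is hypothetical)
Set-SPOTenantSyncClientRestriction -Enable -DomainGuids '786548DD-877B-4760-A749-6B1EFBC1190A'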
Personal site management

Historically, personal sites (or My Sites) have been a management problem. When planning a deployment, you have to consider your user base, the turnover in your organization, the internal policy for content storage, and many other factors. In Office 365, some of these factors have been addressed, but largely, the My Sites deployment (as well as any other large-scale site deployment) remains a management problem. With the introduction of quotas, you can cap both the storage and the resources allocated to a site. By default, My Sites get 1 TB of space; unfortunately, quotas cannot be set in the Request-SPOPersonalSite command, which is used for the provisioning of personal sites.

Another issue with personal sites is that it takes a few minutes to set them up. It is very common for an administrator to pre-provision personal sites for the organization. At the time of writing this, OneDrive is implemented as personal sites, which means that the scripts we will review also apply to provisioning OneDrive. This is a very common task for migrations to the cloud:

Request-SPOPersonalSite -UserEmails <String[]> [-NoWait <SwitchParameter>]

The Request-SPOPersonalSite command has only two parameters, yet its usage is worth documenting due to some common issues. If you are deploying for a small list of users, an inline array of strings will schedule the creation of the sites. It is worth noting that the command will not return errors if the users are not found or if the user count exceeds 200 items. In general, you will have to validate that the process has completed:

Request-SPOPersonalSite -UserEmails 'test2@mytest321.onmicrosoft.com', 'admin1@mytest321.onmicrosoft.com' -NoWait $true

It is very common that the list of users will be read from a file or a CSV input. In the following example, we parse a comma-separated list of e-mails using Split. Even though the documentation specifies an array of strings, this call will not work unless we transform the string array into an object array through the use of the Where command:

Request-SPOPersonalSite -UserEmails ('test2@mytest321.onmicrosoft.com,admin1@mytest321.onmicrosoft.com'.Split(' ,') | Where-Object {$true})

Another common scenario is to deploy personal sites for a list of users already in SharePoint Online. The following script will retrieve all users with a valid login (a login in the form of an e-mail). Note the use of the ExpandProperty parameter to return just the LoginName property of the users:

$users = Get-SPOUser -Site https://mytest321.sharepoint.com |
    Where-Object { $_.IsGroup -ne $true -and $_.LoginName -like '*@*.*' } |
    Select-Object -ExpandProperty LoginName

If the list is small, we can iterate over the list of users or schedule the provisioning in one call. It is safe to schedule a personal site for a user that already has one (it will be silently skipped), but there will be no warning when submitting over 200 requests:

# individual requests
$users | ForEach-Object { Request-SPOPersonalSite -UserEmails $_ }
# bulk
Request-SPOPersonalSite -UserEmails $users

If you are dealing with many users, you can create groups of 200 items instead and submit them in bulk:
In the following script, we add an additional site collection administrator to all existing OneDrives: $mySites = Get-SPOSite -IncludePersonalSite $true -Filter { Url -like '/personal/'} foreach ($site in $mySites) { Set-SPOUser -Site $site -LoginName admin@mytest321.onmicrosoft.com - IsSiteCollectionAdmin $true } Data migration The last topic concerning site collections is document migrations. All the content covered in this article also applies to SharePoint sites. There are three alternative methods to upload data in Office 365: The CSOM API The SPO API Office 365 Import Service Let's look at each one in detail. CSOM API Initially, the CSOM API was the only method available to upload documents to SharePoint Online. CSOM is a comprehensive API that is used for application development and administration. It is a great tool for a myriad scenarios, but it is not specialized for content migrations. When used for this purpose, we can go over the API throttling limits (Microsoft has purposely not put a specific number to this as it depends on multiple factors). Your scripts might get temporarily blocked (requests will get a 429 'Too Many Requests' HTTP error), and if the misuse continues for an extended period of time, your tenant might get blocked altogether (503 'Service Unavailable'). The tenant administrator would have to take action in this case. API throttling is put in place to guarantee platform health. The patterns and practices Throttling project shows how to work around this limitation for legitimate scenarios at https://github.com/SharePoint/PnP/tree/dev/Samples/Core.Throttling. Moreover, the bandwidth allocated for the CSOM API will allow you to upload approximately 1 Gb/hour only (depending on multiple factors such as the file size, the number of files, networking, and concurrent API usage), which makes it impractical for large content migrations. In the next sections, you will see faster and easier approaches to bulk migrations, yet the CSOM API remains relevant in this scenario. This is because at the time of writing this, it is the only method that allows metadata modification. It is also worth mentioning that CSOM changes are reflected immediately, whereas updates through the other methods will take some time to be effective due to the architecture of the process. In our experience doing content migrations, the bulk of the tasks are done with the SPO API, yet CSOM is better suited for last minute changes or ad-hoc requests. The following sample shows how to upload a file and set its metadata. This method will be used for small migrations or to set the file metadata: $siteUrl = "https://mytest321.sharepoint.com/personal/admin1"; $clientContext = New-Object Microsoft.SharePoint.Client.ClientContext($siteUrl) $credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($spoCreds.UserName, $spoCreds.Password) $clientContext.Credentials = $credentials $stream = [System.IO.File]::OpenRead('c:tempfileToMigrate.xml') $overwrite = $true $fileUrl = '/personal/admin1/Documents/file.xml' [Microsoft.SharePoint.Client.File]::SaveBinaryDirect($clientContext, $fileUrl, $stream, $overwrite) $listItem = $clientContext.Web.GetFileByServerRelativeUrl($fileUrl).ListItemAllFields $listItem["Title"] = 'Updated via script' $listItem.Update() $clientContext.ExecuteQuery() SPO Migration API The SPO API has a handful of commands to support the migration of content to SharePoint or OneDrive sites. 
To be able to access and manage OneDrives, administrators need to be site collection administrators of the OneDrive (remember that it is a SharePoint site). The SharePoint tenant administration site has an option to add a secondary administrator when sites are provisioned, but this setting will not apply to sites that have already been created. In the following script, we add an additional site collection administrator to all existing OneDrives:

$mySites = Get-SPOSite -IncludePersonalSite $true -Filter { Url -like '/personal/' }
foreach ($site in $mySites) {
    Set-SPOUser -Site $site -LoginName admin@mytest321.onmicrosoft.com -IsSiteCollectionAdmin $true
}

Data migration

The last topic concerning site collections is document migration. All the content covered here also applies to SharePoint sites. There are three alternative methods to upload data to Office 365:

The CSOM API
The SPO Migration API
The Office 365 Import Service

Let's look at each one in detail.

CSOM API

Initially, the CSOM API was the only method available to upload documents to SharePoint Online. CSOM is a comprehensive API that is used for application development and administration. It is a great tool for a myriad of scenarios, but it is not specialized for content migrations. When used for this purpose, we can go over the API throttling limits (Microsoft has purposely not put a specific number on this, as it depends on multiple factors). Your scripts might get temporarily blocked (requests will get a 429 'Too Many Requests' HTTP error), and if the misuse continues for an extended period of time, your tenant might get blocked altogether (503 'Service Unavailable'). The tenant administrator would have to take action in this case. API throttling is put in place to guarantee platform health. The Patterns and Practices Throttling project shows how to work around this limitation for legitimate scenarios at https://github.com/SharePoint/PnP/tree/dev/Samples/Core.Throttling.

Moreover, the bandwidth allocated for the CSOM API will only allow you to upload approximately 1 GB/hour (depending on multiple factors such as the file size, the number of files, networking, and concurrent API usage), which makes it impractical for large content migrations. In the next sections, you will see faster and easier approaches to bulk migrations, yet the CSOM API remains relevant in this scenario. This is because, at the time of writing this, it is the only method that allows metadata modification. It is also worth mentioning that CSOM changes are reflected immediately, whereas updates through the other methods will take some time to become effective, due to the architecture of the process. In our experience doing content migrations, the bulk of the tasks are done with the SPO API, yet CSOM is better suited for last-minute changes or ad hoc requests.

The following sample shows how to upload a file and set its metadata. This method will be used for small migrations or to set file metadata:

$siteUrl = "https://mytest321.sharepoint.com/personal/admin1"
$clientContext = New-Object Microsoft.SharePoint.Client.ClientContext($siteUrl)
$credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($spoCreds.UserName, $spoCreds.Password)
$clientContext.Credentials = $credentials
$stream = [System.IO.File]::OpenRead('c:\temp\fileToMigrate.xml')
$overwrite = $true
$fileUrl = '/personal/admin1/Documents/file.xml'
[Microsoft.SharePoint.Client.File]::SaveBinaryDirect($clientContext, $fileUrl, $stream, $overwrite)
$listItem = $clientContext.Web.GetFileByServerRelativeUrl($fileUrl).ListItemAllFields
$listItem["Title"] = 'Updated via script'
$listItem.Update()
$clientContext.ExecuteQuery()

SPO Migration API

The SPO API has a handful of commands to support the migration of content to SharePoint or OneDrive sites. The main advantage in this case is that the migration package is first uploaded to Azure Blob storage. The contents are encrypted while in the temporary storage and can be processed in parallel. Being able to take advantage of the enhanced bandwidth and parallel processing makes this approach necessary when dealing with hundreds of gigabytes or many different destinations (typically the case when migrating OneDrive content). The costs of transferring and storing your data are minimal when you consider that the upload speed increases ten-fold in comparison to the CSOM approach. With this approach, you can submit multiple packages and execute them in parallel. When first released, the platform allowed up to 16 concurrent migrations; however, this number has increased lately. As an administrator, you will have to monitor the state and results of each migration package. Let's look at a few commands that will help us in achieving this.

New-SPOMigrationPackage:

New-SPOMigrationPackage -OutputPackagePath <String> -SourceFilesPath <String> [-IgnoreHidden <SwitchParameter>] [-IncludeFileSharePermissions <SwitchParameter>] [-NoAzureADLookup <SwitchParameter>] [-NoLogFile <SwitchParameter>] [-ReplaceInvalidCharacters <SwitchParameter>] [-TargetDocumentLibraryPath <String>] [-TargetDocumentLibrarySubFolderPath <String>] [-TargetWebUrl <String>]

We begin by creating a migration package using New-SPOMigrationPackage. The command will create a package with the contents of a folder, and it includes options to match accounts by name, include file permissions, and upload to a specific subfolder of a library:

$sourceFolder = 'C:\mydocs'
$packageFolder = 'C:\temp\package1'
$targetWeb = 'https://mytest321-my.sharepoint.com/personal/admin1'
$targetLib = 'Documents'
New-SPOMigrationPackage -SourceFilesPath $sourceFolder -OutputPackagePath $packageFolder `
    -NoAzureADLookup

ConvertTo-SPOMigrationTargetedPackage: This command allows you to set the target website URL, library, and folder for the migration. In the following sample, we use the ParallelImport and PartitionSizeInBytes parameters to break up the migration into multiple packages. Breaking up the upload into multiple packages can significantly reduce the overall migration time:

$packages = ConvertTo-SPOMigrationTargetedPackage -ParallelImport -SourceFilesPath $sourceFolder `
    -SourcePackagePath $packageFolder -OutputPackagePath $finalPackage `
    -TargetWebUrl $targetWeb -TargetDocumentLibraryPath $targetLib `
    -TargetDocumentLibrarySubFolderPath 'migration3' `
    -Credentials $spoCreds -PartitionSizeInBytes 500MB
$packages

PackageDirectory FilesDirectory
---------------- --------------
1                C:\mydocs
2                C:\mydocs

Invoke-SPOMigrationEncryptUploadSubmit: The next step is to upload the packages. Invoke-SPOMigrationEncryptUploadSubmit will upload the contents of the package into Azure Blob storage and create a migration job:

$jobs = $packages | % { Invoke-SPOMigrationEncryptUploadSubmit `
    -SourceFilesPath $_.FilesDirectory.FullName -SourcePackagePath $_.PackageDirectory.FullName `
    -Credentials $spoCreds -TargetWebUrl $targetWeb }

Creating package for folder: C:\mydocs
Converting package for office 365: c:\temp\finalPackage

$jobs

JobId                                ReportingQueueUri
-----                                -----------------
f2b3e45c-e96d-4a9d-8148-dd563d4c9e1d https://sposn1ch1m016pr.queue.core.windows.net/...
78c40a16-c2de-4c29-b320-b81a38788c90 https://sposn1ch1m001pr.queue.core.windows.net/...

Get-SPOMigrationJobStatus: Get-SPOMigrationJobStatus will return the status of the active jobs.
This command can be used to monitor the status and wait until all the submitted jobs are completed:

# retrieve the job status individually
foreach ($job in $jobs) {
    Get-SPOMigrationJobStatus -TargetWebUrl $targetWeb -Credentials $spoCreds -JobId $job.JobId
}

None
Processing

In a real-world scenario, you can use the command without the JobId parameter to get an array of job statuses and wait until all are complete. Running jobs will have the 'Processing' state, and completed jobs have the 'None' status. Completed jobs are removed automatically, so the job status array is not guaranteed to have the same length on each call and will eventually be empty. In the following example, we wait until the number of active jobs is 15 or less before continuing with the script:

$jobs = Get-SPOMigrationJobStatus -TargetWebUrl $targetWeb
while ($jobs.Count -ge 15) {
    $active = $jobs | Where { $_.JobState -eq 'Processing' }
    Write-Host 'Too many jobs: ' $jobs.Count ' active: ' $active.Length ', pausing...'
    Start-Sleep 60
    $jobs = Get-SPOMigrationJobStatus -TargetWebUrl $targetWeb
}

Get-SPOMigrationJobProgress: The Get-SPOMigrationJobProgress command will return the result of each job; a log file is placed in the folder specified in SourcePackagePath. By default, the command will wait for the job to complete, unless the DontWaitForEndJob parameter is used:

foreach ($job in $jobs) {
    Get-SPOMigrationJobProgress -AzureQueueUri $job.ReportingQueueUri.AbsoluteUri `
        -Credentials $spoCreds -TargetWebUrl $targetWeb -JobIds $job.JobId -EncryptionParameters `
        $job.Encryption -DontWaitForEndJob
}

Total Job(s) Completed = 1, with Errors = 0, with Warnings = 1
Total Job(s) Completed = 1, with Errors = 0, with Warnings = 0

Remove-SPOMigrationJob: If needed, you can manually remove jobs with the Remove-SPOMigrationJob command:

$jobStatus = Get-SPOMigrationJobStatus -TargetWebUrl $targetWeb -Credentials $spoCreds -JobId $job.JobId
if ($jobStatus -eq 'None') {
    Write-Host 'Job completed:' $job.JobId
    Remove-SPOMigrationJob -JobId $job.JobId -TargetWebUrl $targetWeb -Credentials $spoCreds
}

Summary

OneDrive offers a compelling service for storing files across multiple devices and operating systems. OneDrive continues to evolve to target individuals and small collaboration groups. As an administrator, you can help your organization quickly migrate to this service and manage its use through the different scripting methodologies we reviewed.

Resources for Article:

Further resources on this subject:

Introducing PowerShell Remoting [article]
Installing/upgrading PowerShell [article]
Unleashing Your Development Skills with PowerShell [article]