How-To Tutorials


Comparing .NET products

Mark Price
06 Feb 2018
6 min read
This is an extract from the third edition of C# 7.1 and .NET Core 2.0 - Modern Cross-Platform Development by Mark Price.

Understanding the .NET Framework

Microsoft's .NET Framework is a development platform that includes a Common Language Runtime (CLR), which manages the execution of code, and a rich library of classes for building applications. Microsoft designed .NET Framework to have the possibility of being cross-platform, but Microsoft put its implementation effort into making it work best with Windows. Practically speaking, .NET Framework is Windows-only, and a legacy platform.

What are Mono and Xamarin?

Third parties developed a .NET implementation named the Mono project, which you can read more about here. Mono is cross-platform, but it fell well behind the official implementation of .NET Framework. It has found a niche as the foundation of the Xamarin mobile platform. Microsoft purchased Xamarin in 2016 and now gives away what used to be an expensive Xamarin extension for free with Visual Studio 2017. Microsoft renamed the Xamarin Studio development tool to Visual Studio for Mac and has given it the ability to create ASP.NET Core Web API services. Xamarin is targeted at mobile development and at building cloud services to support mobile apps.

What is .NET Core?

Today, we live in a truly cross-platform world. Modern mobile and cloud development has made Windows a much less important operating system, so Microsoft has been working on an effort to decouple .NET from its close ties with Windows. While rewriting .NET to be truly cross-platform, Microsoft has taken the opportunity to refactor it and remove major parts that are no longer considered core. The new product is branded as .NET Core, which includes a cross-platform implementation of the CLR known as CoreCLR and a streamlined library of classes known as CoreFX.

Scott Hunter, Microsoft Partner Director Program Manager for .NET, says, "Forty percent of our .NET Core customers are brand-new developers to the platform, which is what we want with .NET Core. We want to bring new people in."

The following table shows when important versions of .NET Core were released, and Microsoft's schedule for the next major release:

Version                                                 Released
.NET Core RC1                                           November 2015
.NET Core 1.0                                           June 2016
.NET Core 1.1                                           November 2016
.NET Core 1.0.4 and .NET Core 1.1.1                     March 2017
.NET Core 2.0                                           August 2017
.NET Core for UWP in Windows 10 Fall Creators Update    October 2017
.NET Core 2.1                                           Q1 2018

.NET Core is much smaller than the current version of .NET Framework because a lot has been removed. For example, Windows Forms and Windows Presentation Foundation (WPF) can be used to build graphical user interface (GUI) applications, but they are tightly bound to Windows, so they have been removed from .NET Core. The latest technology used to build Windows apps is the Universal Windows Platform (UWP), and UWP is built on a custom version of .NET Core. ASP.NET Web Forms and Windows Communication Foundation (WCF) are old web application and service technologies that fewer developers choose for new development projects today, so they have also been removed from .NET Core. Instead, developers prefer ASP.NET MVC and ASP.NET Web API. These two technologies have been refactored and combined into a new product that runs on .NET Core, named ASP.NET Core.

The Entity Framework (EF) 6 is an object-relational mapping technology for working with data stored in relational databases such as Oracle and Microsoft SQL Server. It has gained baggage over the years, so the cross-platform version has been slimmed down and named Entity Framework Core.

In addition to removing large pieces from .NET Framework to make .NET Core, Microsoft has componentized .NET Core into NuGet packages: small chunks of functionality that can be deployed independently. Microsoft's primary goal is not to make .NET Core smaller than .NET Framework. The goal is to componentize .NET Core to support modern technologies and to have fewer dependencies, so that deployment requires only the packages that your application needs.

What is .NET Standard?

The situation with .NET today is that there are three forked .NET platforms, all controlled by Microsoft:

.NET Framework
.NET Core
Xamarin

Each has different strengths and weaknesses because each is designed for different scenarios. This has led to the problem that a developer must learn three platforms, each with annoying quirks and limitations. So, Microsoft defined .NET Standard 2.0: a specification for a set of APIs that all .NET platforms must implement. You cannot install .NET Standard 2.0, in the same way that you cannot install HTML5. To use HTML5, you must install a web browser that implements the HTML5 specification. To use .NET Standard 2.0, you must install a .NET platform that implements the .NET Standard 2.0 specification. .NET Standard 2.0 is implemented by the latest versions of .NET Framework, .NET Core, and Xamarin.

.NET Standard 2.0 makes it much easier for developers to share code between any flavor of .NET. For .NET Core 2.0, it adds many of the missing APIs that developers need to port old code written for .NET Framework to the cross-platform .NET Core. However, some APIs are implemented but throw an exception to indicate to a developer that they should not actually be used. This is usually due to differences in the operating system on which you run .NET Core.

.NET Native

Another .NET initiative is .NET Native. This compiles C# code to native CPU instructions ahead-of-time (AoT), rather than using the CLR to compile intermediate language (IL) code just-in-time (JIT) to native code later. .NET Native improves execution speed and reduces the memory footprint of applications. It supports the following:

UWP apps for Windows 10, Windows 10 Mobile, Xbox One, HoloLens, and Internet of Things (IoT) devices such as Raspberry Pi
Server-side web development with ASP.NET Core
Console applications for use on the command line

Comparing different .NET tools

Technology       Feature set              Compiles to    Host OSes
.NET Framework   Both legacy and modern   IL code        Windows only
Xamarin          Mobile only              IL code        iOS, Android, Windows Mobile
.NET Core        Modern only              IL code        Windows, macOS, Linux
.NET Native      Modern only              Native code    Windows, macOS, Linux

Thanks for reading this extract from C# 7.1 and .NET Core 2.0 - Modern Cross-Platform Development! If you want to learn more about .NET, dive into Mark Price's book, or explore more .NET resources here.


How to Create a Neural Network in TensorFlow

Aaron Lazar
06 Feb 2018
8 min read
This article has been extracted from the book Principles of Data Science, authored by Sinan Ozdemir. With a unique approach that bridges the gap between mathematics and computer science, the book takes you through the entire data science pipeline, beginning with cleaning and preparing data and moving on to effective data mining strategies and techniques to help you get to grips with machine learning.

In this article, we're going to learn how to create a neural network whose goal will be to classify images. TensorFlow is an open-source machine learning module that is used primarily for its simplified deep learning and neural network abilities. I would like to take some time to introduce the module and solve a few quick problems using TensorFlow.

Let's begin with some imports:

from sklearn import datasets, metrics
import tensorflow as tf
import numpy as np
from sklearn.cross_validation import train_test_split
%matplotlib inline

Loading our iris dataset:

# Our data set of iris flowers
iris = datasets.load_iris()

# Load datasets and split them for training and testing
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

Creating the neural network:

# Specify that all features have real-value data
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]

optimizer = tf.train.GradientDescentOptimizer(learning_rate=.1)

# Build 3 layer DNN with 10, 20, 10 units respectively.
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            optimizer=optimizer,
                                            n_classes=3)

# Fit model.
classifier.fit(x=X_train, y=y_train, steps=2000)

Notice that our code really hasn't changed from the last segment. We still have our feature_columns from before, but now, instead of a linear classifier, we introduce a DNNClassifier, which stands for Deep Neural Network Classifier. This is TensorFlow's syntax for implementing a neural network. Let's take a closer look:

tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                               hidden_units=[10, 20, 10],
                               optimizer=optimizer,
                               n_classes=3)

We see that we are inputting the same feature_columns, n_classes, and optimizer, but see how we have a new parameter called hidden_units? This list represents the number of nodes to have in each layer between the input and the output layer. All in all, this neural network will have five layers:

The first layer will have four nodes, one for each of the iris feature variables. This layer is the input layer.
A hidden layer of 10 nodes.
A hidden layer of 20 nodes.
A hidden layer of 10 nodes.
The final layer will have three nodes, one for each possible outcome of the network. This is called our output layer.

Now that we've trained our model, let's evaluate it on our test set:

# Evaluate accuracy.
accuracy_score = classifier.evaluate(x=X_test, y=y_test)["accuracy"]
print('Accuracy: {0:f}'.format(accuracy_score))

Accuracy: 0.921053

Hmm, our neural network didn't do so well on this dataset, but perhaps that is because the network is a bit too complicated for such a simple dataset. Let's introduce a new dataset that has a bit more to it.

The MNIST dataset consists of over 50,000 handwritten digits (0-9), and the goal is to recognize the handwritten digits and output which digit is being written. TensorFlow has a built-in mechanism for downloading and loading these images.
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=False)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz

Notice that one of our inputs for downloading MNIST is called one_hot. This parameter either brings in the dataset's target variable (which is the digit itself) as a single number or as a dummy variable. For example, if the first digit were a 7, the target would be either:

7: if one_hot was false
0 0 0 0 0 0 0 1 0 0: if one_hot was true (notice that, starting from 0, the seventh index is a 1)

We will encode our target the former way, as this is what our TensorFlow neural network and our sklearn logistic regression will expect.
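To make the two encodings concrete, here is a minimal sketch, not from the book, that builds both representations of the digit 7 with plain NumPy:

import numpy as np

label = 7                      # the digit a person wrote
num_classes = 10               # digits 0-9

# Integer encoding (what one_hot=False gives us): just the label itself
integer_target = label

# One-hot encoding (what one_hot=True would give us): a length-10 vector
# with a 1 at index 7 and 0 everywhere else
one_hot_target = np.zeros(num_classes)
one_hot_target[label] = 1

print(integer_target)    # 7
print(one_hot_target)    # [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]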
The dataset is already split into a training and a test set, so let's create new variables to hold them:

x_mnist = mnist.train.images
y_mnist = mnist.train.labels.astype(int)

For the y_mnist variable, I specifically cast every target as an integer (by default they come in as floats) because otherwise TensorFlow would throw an error at us. Out of curiosity, let's take a look at a single image:

import matplotlib.pyplot as plt
plt.imshow(x_mnist[10].reshape(28, 28))

And hopefully our target variable matches at the 10th index as well:

y_mnist[10]
0

Excellent! Let's now take a peek at how big our dataset is:

x_mnist.shape
(55000, 784)

y_mnist.shape
(55000,)

Our training size, then, is 55,000 images and target variables. Let's fit a deep neural network to our images and see if it will be able to pick up on the patterns in our inputs:

# Specify that all features have real-value data
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=784)]

optimizer = tf.train.GradientDescentOptimizer(learning_rate=.1)

# Build 3 layer DNN with 10, 20, 10 units respectively.
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            optimizer=optimizer,
                                            n_classes=10)

# Fit model.
classifier.fit(x=x_mnist, y=y_mnist, steps=1000)  # Warning: this is veryyyyyyyy slow

This code is very similar to our previous segment using DNNClassifier; however, look how in our first line of code I have changed the number of columns to 784, while in the classifier itself I changed the number of output classes to 10. These are manual inputs that TensorFlow must be given to work. The preceding code runs very slowly. It is, little by little, adjusting itself in order to get the best possible performance from our training set. Of course, we know that the ultimate test here is testing our network on an unknown test set, which is also given to us by TensorFlow:

x_mnist_test = mnist.test.images
y_mnist_test = mnist.test.labels.astype(int)

x_mnist_test.shape
(10000, 784)

y_mnist_test.shape
(10000,)

So we have 10,000 images to test on; let's see how our network was able to adapt to the dataset:

# Evaluate accuracy.
accuracy_score = classifier.evaluate(x=x_mnist_test, y=y_mnist_test)["accuracy"]
print('Accuracy: {0:f}'.format(accuracy_score))

Accuracy: 0.920600

Not bad, 92% accuracy on our dataset. Let's take a second and compare this performance to a standard sklearn logistic regression:

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_mnist, y_mnist)  # Warning: this is slow

y_predicted = logreg.predict(x_mnist_test)  # predict on our test set, to avoid overfitting!

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_predicted, y_mnist_test)  # get our accuracy score

Accuracy: 0.91969

Success! Our neural network performed better than the standard logistic regression. This is likely because the network is attempting to find relationships between the pixels themselves and using these relationships to map them to the digit we are writing down. In logistic regression, the model assumes that every single input is independent of the others, and therefore it has a tough time finding relationships between them.

There are ways of making our neural network learn differently:

We could make our network wider, that is, increase the number of nodes in the hidden layers instead of having several layers of a smaller number of nodes:

# A wider network
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=784)]

optimizer = tf.train.GradientDescentOptimizer(learning_rate=.1)

# Build a DNN with a single, wide hidden layer of 1,500 units.
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[1500],
                                            optimizer=optimizer,
                                            n_classes=10)

# Fit model.
classifier.fit(x=x_mnist, y=y_mnist, steps=100)  # Warning: this is veryyyyyyyy slow

# Evaluate accuracy.
accuracy_score = classifier.evaluate(x=x_mnist_test, y=y_mnist_test)["accuracy"]
print('Accuracy: {0:f}'.format(accuracy_score))

Accuracy: 0.898400

We could increase our learning rate, forcing the network to attempt to converge on an answer faster. As mentioned before, we run the risk of the model skipping the answer entirely if we go down this route. It is usually better to stick with a smaller learning rate.

We can change the method of optimization. Gradient descent is very popular; however, there are other algorithms for doing so. One example is called the Adam optimizer. The difference is in the way they traverse the error function, and therefore the way they approach the optimization point. Different problems in different domains call for different optimizers (a minimal sketch of this swap is given at the end of this article).

There is no replacement for a good old-fashioned feature selection phase instead of attempting to let the network figure everything out for us. We can take the time to find relevant and meaningful features that will actually allow our network to find an answer quicker!

There you go! You've now learned how to build a neural net in TensorFlow! If you liked this tutorial and would like to learn more, head over and grab a copy of Principles of Data Science. If you want to take things a bit further and learn how to classify irises using multi-layer perceptrons, head over here.
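As a minimal sketch of the optimizer swap mentioned above, assuming the same tf.contrib.learn API used throughout this extract and an illustrative learning rate, only the optimizer line needs to change:

# Hypothetical variation, not from the book: use Adam instead of gradient descent
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=784)]
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)  # replaces GradientDescentOptimizer

classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            optimizer=optimizer,
                                            n_classes=10)

classifier.fit(x=x_mnist, y=y_mnist, steps=1000)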


Implementing fault-tolerance in Spark Streaming data processing applications with Apache Kafka

Pravin Dhandre
01 Feb 2018
16 min read
This article is an excerpt from a book written by Rajanarayanan Thottuvaikkatumana titled Apache Spark 2 for Beginners. The book is a developer's guide to developing large-scale and distributed data processing applications in a business environment.

Data processing is generally carried out in two ways: batch processing or stream processing. This article will help you learn how to process your data uninterruptedly and build in fault-tolerance as the data gets generated in real time.

Message queueing systems with publish-subscribe capability are generally used for processing messages. Traditional message queueing systems failed to perform because of the huge volume of messages to be processed per second for the needs of large-scale data processing applications. Kafka is a publish-subscribe messaging system used by many IoT applications to process a huge number of messages. The following capabilities of Kafka made it one of the most widely used messaging systems:

Extremely fast: Kafka can process huge amounts of data by handling reading and writing in short intervals of time from many application clients
Highly scalable: Kafka is designed to scale up and scale out to form a cluster using commodity hardware
Persists a huge number of messages: messages reaching Kafka topics are persisted into secondary storage, while at the same time Kafka handles the huge number of messages flowing through it

The following are some of the important elements of Kafka, and are terms to be understood before proceeding further:

Producer: the real source of the messages, such as a weather sensor or a mobile phone network
Broker: the Kafka cluster, which receives and persists the messages published to its topics by various producers
Consumer: the data processing applications subscribed to the Kafka topics that consume the messages published to those topics

The same log event processing application use case discussed in the preceding section is used again here to elucidate the usage of Kafka with Spark Streaming. Instead of collecting the log event messages from the TCP socket, here the Spark Streaming data processing application will act as a consumer of a Kafka topic and will consume the messages published to the topic.

The Spark Streaming data processing application uses version 0.8.2.2 of Kafka as the message broker, and the assumption is that the reader has already installed Kafka, at least in standalone mode. The following activities are to be performed to make sure that Kafka is ready to process the messages produced by the producers and that the Spark Streaming data processing application can consume those messages:

Start the Zookeeper that comes with the Kafka installation.
Start the Kafka server.
Create a topic for the producers to send messages to.
Pick a Kafka producer and start publishing log event messages to the newly created topic.
Use the Spark Streaming data processing application to process the log events published to the newly created topic.
Starting Zookeeper and Kafka

The following scripts are run from separate terminal windows in order to start Zookeeper and the Kafka broker, and to create the required Kafka topics:

$ cd $KAFKA_HOME
$ $KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties
[2016-07-24 09:01:30,196] INFO binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NIOServerCnxnFactory)

$ $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties
[2016-07-24 09:05:06,381] INFO 0 successfully elected as leader (kafka.server.ZookeeperLeaderElector)
[2016-07-24 09:05:06,455] INFO [Kafka Server 0], started (kafka.server.KafkaServer)

$ $KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic sfb
Created topic "sfb".

$ $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic sfb

The Kafka message producer can be any application capable of publishing messages to Kafka topics. Here, the kafka-console-producer that comes with Kafka is used as the producer of choice. Once the producer starts running, whatever is typed into its console window will be treated as a message that is published to the chosen Kafka topic. The Kafka topic is given as a command-line argument when starting the kafka-console-producer.

The submission of the Spark Streaming data processing application that consumes the log event messages produced by the Kafka producer is slightly different from the application covered in the preceding section. Here, many Kafka jar files are required for the data processing. Since they are not part of the Spark infrastructure, they have to be submitted to the Spark cluster. The following jar files are required for the successful running of this application:

$KAFKA_HOME/libs/kafka-clients-0.8.2.2.jar
$KAFKA_HOME/libs/kafka_2.11-0.8.2.2.jar
$KAFKA_HOME/libs/metrics-core-2.2.0.jar
$KAFKA_HOME/libs/zkclient-0.3.jar
Code/Scala/lib/spark-streaming-kafka-0-8_2.11-2.0.0-preview.jar
Code/Python/lib/spark-streaming-kafka-0-8_2.11-2.0.0-preview.jar

In the preceding list of jar files, the Maven repository coordinate for spark-streaming-kafka-0-8_2.11-2.0.0-preview.jar is "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0-preview". This particular jar file has to be downloaded and placed in the lib folder of the directory structure given in Figure 4. It is used in the submit.sh and submitPy.sh scripts, which submit the application to the Spark cluster. The download URL for this jar file is given in the reference section of this chapter.

In the submit.sh and submitPy.sh files, the last few lines contain a conditional statement looking for a second parameter value of 1 to identify this application and ship the required jar files to the Spark cluster.
Implementing the application in Scala

The following code snippet is the Scala code for the log event processing application that processes the messages produced by the Kafka producer. The use case of this application is the same as the one discussed in the preceding section concerning windowing operations:

/**
The following program can be compiled and run using SBT.
Wrapper scripts have been provided with this.
The following script can be run to compile the code:
./compile.sh
The following script can be used to run this application in Spark. The second command line argument of value 1 is very important.
This is to flag the shipping of the Kafka jar files to the Spark cluster.
./submit.sh com.packtpub.sfb.KafkaStreamingApps 1
**/
package com.packtpub.sfb

import java.util.HashMap
import org.apache.spark.streaming._
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.streaming.kafka._
import org.apache.kafka.clients.producer.{ProducerConfig, KafkaProducer, ProducerRecord}

object KafkaStreamingApps {
  def main(args: Array[String]) {
    // Log level settings
    LogSettings.setLogLevels()
    // Variables used for creating the Kafka stream
    // The quorum of Zookeeper hosts
    val zooKeeperQuorum = "localhost"
    // Message group name
    val messageGroup = "sfb-consumer-group"
    // Kafka topics list separated by commas if there are multiple topics to be listened on
    val topics = "sfb"
    // Number of threads per topic
    val numThreads = 1
    // Create the Spark Session and the Spark context
    val spark = SparkSession
      .builder
      .appName(getClass.getSimpleName)
      .getOrCreate()
    // Get the Spark context from the Spark session for creating the streaming context
    val sc = spark.sparkContext
    // Create the streaming context
    val ssc = new StreamingContext(sc, Seconds(10))
    // Set the checkpoint directory for saving the data to recover when there is a crash
    ssc.checkpoint("/tmp")
    // Create the map of topic names
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    // Create the Kafka stream
    val appLogLines = KafkaUtils.createStream(ssc, zooKeeperQuorum, messageGroup, topicMap).map(_._2)
    // Count each log message line containing the word ERROR
    val errorLines = appLogLines.filter(line => line.contains("ERROR"))
    // Print the lines containing the error
    errorLines.print()
    // Count the number of messages by the windows and print them
    errorLines.countByWindow(Seconds(30), Seconds(10)).print()
    // Start the streaming
    ssc.start()
    // Wait till the application is terminated
    ssc.awaitTermination()
  }
}

Compared to the Scala code in the preceding section, the major difference is in the way the stream is created.

Implementing the application in Python

The following code snippet is the Python code for the log event processing application that processes the messages produced by the Kafka producer.
The use case of this application is also the same as the one discussed in the preceding section concerning windowing operations:

# The following script can be used to run this application in Spark
# ./submitPy.sh KafkaStreamingApps.py 1

from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    # Create the Spark context
    sc = SparkContext(appName="PythonStreamingApp")
    # Necessary log4j logging level settings are done
    log4j = sc._jvm.org.apache.log4j
    log4j.LogManager.getRootLogger().setLevel(log4j.Level.WARN)
    # Create the Spark Streaming Context with a 10 second batch interval
    ssc = StreamingContext(sc, 10)
    # Set the checkpoint directory for saving the data to recover when there is a crash
    ssc.checkpoint("tmp")
    # The quorum of Zookeeper hosts
    zooKeeperQuorum = "localhost"
    # Message group name
    messageGroup = "sfb-consumer-group"
    # Kafka topics list separated by commas if there are multiple topics to be listened on
    topics = "sfb"
    # Number of threads per topic
    numThreads = 1
    # Create a Kafka DStream
    kafkaStream = KafkaUtils.createStream(ssc, zooKeeperQuorum, messageGroup, {topics: numThreads})
    # Create the Kafka stream
    appLogLines = kafkaStream.map(lambda x: x[1])
    # Count each log message line containing the word ERROR
    errorLines = appLogLines.filter(lambda appLogLine: "ERROR" in appLogLine)
    # Print the first ten elements of each RDD generated in this DStream to the console
    errorLines.pprint()
    errorLines.countByWindow(30, 10).pprint()
    # Start the streaming
    ssc.start()
    # Wait till the application is terminated
    ssc.awaitTermination()

The following commands are run in a terminal window to run the Scala application:

$ cd Scala
$ ./submit.sh com.packtpub.sfb.KafkaStreamingApps 1

The following commands are run in a terminal window to run the Python application:

$ cd Python
$ ./submitPy.sh KafkaStreamingApps.py 1

When both of the preceding programs are running, whatever log event messages are typed into the console window of the Kafka console producer, invoked using the following command and inputs, will be processed by the application. The outputs of this program will be very similar to the ones given in the preceding section:

$ $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic sfb
[Fri Dec 20 01:46:23 2015] [ERROR] [client 1.2.3.4.5.6] Directory index forbidden by rule: /home/raj/
[Fri Dec 20 01:46:23 2015] [WARN] [client 1.2.3.4.5.6] Directory index forbidden by rule: /home/raj/
[Fri Dec 20 01:54:34 2015] [ERROR] [client 1.2.3.4.5.6] Directory index forbidden by rule: /apache/web/test

Spark provides two approaches to processing Kafka streams. The first is the receiver-based approach that was discussed previously, and the second is the direct approach. In the direct approach, Spark Streaming uses the capabilities of Kafka just like any other Kafka topic consumer: it polls for messages in the specific topic and partition by the offset number of the messages. Depending on the batch interval of the Spark Streaming data processing application, it picks up a certain number of offsets from the Kafka cluster, and this range of offsets is processed as a batch. This is highly efficient and ideal for processing messages with a requirement for exactly-once processing. This method also reduces the Spark Streaming library's need to do additional work to implement the exactly-once semantics of message processing and delegates that responsibility to Kafka. The programming constructs of this approach are slightly different in the APIs used for the data processing; consult the appropriate reference material for the details.
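The code in this extract uses only the receiver-based approach. As a minimal sketch of the direct approach, reusing the ssc streaming context and topic name from the earlier Python application and assuming a broker at localhost:9092 purely for illustration, a direct Kafka stream could be created roughly like this:

from pyspark.streaming.kafka import KafkaUtils

# Assumed values for illustration only
kafkaParams = {"metadata.broker.list": "localhost:9092"}
topics = ["sfb"]

# Create a direct Kafka stream without receivers; offsets are tracked by Spark itself
directKafkaStream = KafkaUtils.createDirectStream(ssc, topics, kafkaParams)

# As before, the message value is the second element of each (key, value) pair
appLogLines = directKafkaStream.map(lambda x: x[1])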
The preceding sections introduced the concepts of the Spark Streaming library and discussed some real-world use cases. From a deployment perspective, there is a big difference between Spark data processing applications developed to process static batch data and those developed to process dynamic stream data. The availability of a data processing application that processes a stream of data must be constant. In other words, such applications should not have components that are single points of failure. The following section discusses this topic.

Spark Streaming jobs in production

When a Spark Streaming application is processing incoming data, it is very important to have uninterrupted data processing capability so that all the data being ingested is processed. In business-critical streaming applications, missing even one piece of data can have a huge business impact. To deal with such situations, it is important to avoid single points of failure in the application infrastructure.

From a Spark Streaming application perspective, it is good to understand how the underlying components in the ecosystem are laid out so that the appropriate measures can be taken to avoid single points of failure. A Spark Streaming application deployed in a cluster such as Hadoop YARN, Mesos, or Spark Standalone mode has two main components, very similar to any other type of Spark application:

Spark driver: this contains the application code written by the user
Executors: the executors that execute the jobs submitted by the Spark driver

But the executors have an additional component called a receiver that receives the data being ingested as a stream and saves it as blocks of data in memory. When one receiver is receiving the data and forming the data blocks, they are replicated to another executor for fault-tolerance. In other words, in-memory replication of the data blocks is done onto a different executor. At the end of every batch interval, these data blocks are combined to form a DStream and sent out for further processing downstream.

Figure 1 depicts the components working together in a Spark Streaming application infrastructure deployed in a cluster. In Figure 1, there are two executors. The receiver component is deliberately not displayed in the second executor to show that it is not using a receiver and instead just collects the replicated data blocks from the other executor. But when needed, such as on the failure of the first executor, the receiver in the second executor can start functioning.

Implementing fault-tolerance in Spark Streaming data processing applications

The Spark Streaming data processing application infrastructure has many moving parts. Failures can happen to any one of them, resulting in the interruption of data processing. Typically, failures happen to the Spark driver or the executors. When an executor fails, since the replication of data happens on a regular basis, the task of receiving the data stream is taken over by the executor on which the data was being replicated. There is, however, a situation in which, when an executor fails, all the data that is unprocessed will be lost.
To circumvent this problem, the data blocks can be persisted into HDFS or Amazon S3 in the form of write-ahead logs. When the Spark driver fails, the driver program is stopped, all the executors lose connection, and they stop functioning. This is the most dangerous situation. To deal with it, some configuration and code changes are necessary.

The Spark driver has to be configured to have an automatic driver restart, which is supported by the cluster managers. This includes a change in the Spark job submission method to use the cluster mode in whichever cluster manager is being used. When a restart of the driver happens, to start from the place where it crashed, a checkpointing mechanism has to be implemented in the driver program. This has already been done in the code samples that are used. The following lines of code do that job:

ssc = StreamingContext(sc, 10)
ssc.checkpoint("tmp")

From an application coding perspective, the way the StreamingContext is created is slightly different. Instead of creating a new StreamingContext every time, the factory method getOrCreate of the StreamingContext is to be used with a function, as shown in the following code segment. If that is done, when the driver is restarted, the factory method will check the checkpoint directory to see whether an earlier StreamingContext was in use and, if found in the checkpoint data, it is recreated. Otherwise, a new StreamingContext is created. The following code snippet gives the definition of a function that can be used with the getOrCreate factory method of the StreamingContext. As mentioned earlier, a detailed treatment of these aspects is beyond the scope of this book:

/**
* The following function has to be used when the code is being restructured to have checkpointing and driver recovery
* The way it should be used is to use the StreamingContext.getOrCreate with this function and do a start of that
*/
def sscCreateFn(): StreamingContext = {
  // Variables used for creating the Kafka stream
  // The quorum of Zookeeper hosts
  val zooKeeperQuorum = "localhost"
  // Message group name
  val messageGroup = "sfb-consumer-group"
  // Kafka topics list separated by commas if there are multiple topics to be listened on
  val topics = "sfb"
  // Number of threads per topic
  val numThreads = 1
  // Create the Spark Session and the Spark context
  val spark = SparkSession
    .builder
    .appName(getClass.getSimpleName)
    .getOrCreate()
  // Get the Spark context from the Spark session for creating the streaming context
  val sc = spark.sparkContext
  // Create the streaming context
  val ssc = new StreamingContext(sc, Seconds(10))
  // Create the map of topic names
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  // Create the Kafka stream
  val appLogLines = KafkaUtils.createStream(ssc, zooKeeperQuorum, messageGroup, topicMap).map(_._2)
  // Count each log message line containing the word ERROR
  val errorLines = appLogLines.filter(line => line.contains("ERROR"))
  // Print the lines containing the error
  errorLines.print()
  // Count the number of messages by the windows and print them
  errorLines.countByWindow(Seconds(30), Seconds(10)).print()
  // Set the checkpoint directory for saving the data to recover when there is a crash
  ssc.checkpoint("/tmp")
  // Return the streaming context
  ssc
}
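The Scala snippet above only defines the factory function. As a minimal sketch of how the factory method is meant to be wired up, shown here in Python against the earlier PySpark example with an illustrative checkpoint directory name, the driver obtains its context through StreamingContext.getOrCreate instead of constructing it directly:

from pyspark.streaming import StreamingContext

checkpoint_dir = "/tmp/sfb-checkpoint"   # illustrative path

def create_streaming_context():
    # Build the StreamingContext, wire up the Kafka stream and set the checkpoint
    # directory, exactly as in the earlier Python example, then return the context.
    # The SparkContext sc is assumed to have been created earlier in the driver.
    ssc = StreamingContext(sc, 10)
    ssc.checkpoint(checkpoint_dir)
    # ... create the Kafka DStream and the processing logic here ...
    return ssc

# Reuse the checkpointed context after a driver restart, or create a fresh one
ssc = StreamingContext.getOrCreate(checkpoint_dir, create_streaming_context)
ssc.start()
ssc.awaitTermination()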
At the data source level, it is a good idea to build in parallelism for faster data processing and, depending on the source of the data, this can be accomplished in different ways. Kafka inherently supports partitioning at the topic level, and that kind of scaling-out mechanism supports a good amount of parallelism. As a consumer of Kafka topics, the Spark Streaming data processing application can have multiple receivers by creating multiple streams, and the data generated by those streams can be combined by the union operation on the Kafka streams (a minimal sketch of this pattern is given at the end of this article).

The production deployment of Spark Streaming data processing applications should be done purely based on the type of application that is being used. Some of the guidelines given previously are introductory and conceptual in nature. There is no silver-bullet approach to solving production deployment problems; they have to evolve along with the application development.

To summarize, we looked at the production deployment of Spark Streaming data processing applications and the possible ways of implementing fault-tolerance in Spark Streaming data processing applications using Kafka.

To explore more critical and equally important Spark tools such as Spark GraphX, Spark MLlib, and DataFrames, do check out Apache Spark 2 for Beginners to develop efficient large-scale applications with Apache Spark.
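As a minimal sketch of the multiple-receiver pattern mentioned earlier, reusing the variable names from the earlier PySpark example and an assumed receiver count of three purely for illustration, several Kafka streams can be created and combined with a union before processing:

# Illustrative only: create several receiver-based Kafka streams and union them
num_receivers = 3
kafka_streams = [KafkaUtils.createStream(ssc, zooKeeperQuorum, messageGroup,
                                         {topics: numThreads})
                 for _ in range(num_receivers)]

# Combine the streams so the downstream processing sees a single DStream
unified_stream = ssc.union(*kafka_streams)
appLogLines = unified_stream.map(lambda x: x[1])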


How to run Spark in Mesos

Sunith Shetty
31 Jan 2018
6 min read
This article is an excerpt from a book written by Muhammad Asif Abbasi titled Learning Apache Spark 2. In this book, you will learn how to perform big data analytics using Spark Streaming, machine learning techniques, and more. In the article given below, you will learn how to operate Spark with the Mesos cluster manager.

What is Mesos?

Mesos is an open source cluster manager that started as a UC Berkeley research project in 2008 and is quite widely used by a number of organizations. Spark supports Mesos, and Matei Zaharia gave a keynote at MesosCon in June 2016. Here is a link to the YouTube video of the keynote.

Before you start

If you haven't installed Mesos previously, the getting started page on the Apache website gives a good walkthrough of installing Mesos on Windows, macOS, and Linux. Follow the URL https://mesos.apache.org/getting-started/. Once installed, you need to start up Mesos on your cluster.

Starting the Mesos master:

./bin/mesos-master.sh --ip=[MasterIP] --work_dir=/var/lib/mesos

Start the Mesos agents on all your worker nodes:

./bin/mesos-agent.sh --master=[MasterIP]:5050 --work_dir=/var/lib/mesos

Make sure Mesos is up and running with all your relevant worker nodes configured:

http://[MasterIP]:5050

Make sure that the Spark binary packages are available and accessible by Mesos. They can be placed on a Hadoop-accessible URI, for example:

HTTP via http://
S3 via s3n://
HDFS via hdfs://

You can also install Spark in the same location on all the Mesos slaves, and configure spark.mesos.executor.home to point to that location.

Running in Mesos

Mesos can have single or multiple masters, which means the master URL differs when submitting an application from Spark via Mesos:

Single master: mesos://sparkmaster:5050
Multiple masters (using Zookeeper): mesos://zk://master1:2181,master2:2181/mesos

Modes of operation in Mesos

Mesos supports both the client and cluster modes of operation.

Client mode

Before running in client mode, you need to perform a couple of configurations in spark-env.sh:

export MESOS_NATIVE_JAVA_LIBRARY=<path to libmesos.so [Linux] or libmesos.dylib [macOS]>
export SPARK_EXECUTOR_URI=<URI of the zipped Spark distribution uploaded to an accessible location, e.g. HTTP, HDFS, S3>

Set spark.executor.uri to the URI of the zipped Spark distribution uploaded to an accessible location, e.g. HTTP, HDFS, S3.

Batch applications

For batch applications, in your application program you need to pass the Mesos URL as the master when creating your Spark context. As an example:

val sparkConf = new SparkConf()
  .setMaster("mesos://mesosmaster:5050")
  .setAppName("Batch Application")
  .set("spark.executor.uri", "Location of Spark binaries (HTTP, S3, or HDFS)")
val sc = new SparkContext(sparkConf)

If you are using spark-submit, you can configure the URI in the conf/spark-defaults.conf file using spark.executor.uri.

Interactive applications

When you are running one of the provided Spark shells for interactive querying, you can pass the master argument, for example:

./bin/spark-shell --master mesos://mesosmaster:5050

Cluster mode

Just as in YARN, you can run Spark on Mesos in cluster mode, which means the driver is launched inside the cluster and the client can disconnect after submitting the application and get results from the Mesos Web UI.

Steps to use the cluster mode:

Start the MesosClusterDispatcher in your cluster:

./sbin/start-mesos-dispatcher.sh --master mesos://mesosmaster:5050

This will generally start the dispatcher at port 7077.
From the client, submit a job to the Mesos cluster with spark-submit, specifying the dispatcher URL. For example:

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://dispatcher:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 2G \
  --total-executor-cores 10 \
  s3n://path/to/examples.jar

Similar to Spark, Mesos has lots of properties that can be set to optimize processing. You should refer to the Spark configuration page (http://spark.apache.org/docs/latest/configuration.html) for more information.

Mesos run modes

Spark can run on Mesos in two modes:

Coarse-grained (default mode): Spark will acquire a long-running Mesos task on each machine. This offers a much lower cost of startup, but the resources will continue to be allocated to Spark for the complete duration of the application.
Fine-grained (deprecated): The fine-grained mode is deprecated; in this case, one Mesos task is created per Spark task. The benefit of this is that each application receives cores as per its requirements, but the initial bootstrapping might act as a deterrent for interactive applications.

Key Spark on Mesos configuration properties

While Spark has a number of properties that can be configured to optimize Spark processing, some of these properties are specific to Mesos. We'll look at a few of those key properties here; a minimal sketch of setting some of them from application code is given at the end of this article.

Property name                           Meaning / default value
spark.mesos.coarse                      Setting it to true (the default value) will run Mesos in coarse-grained mode. Setting it to false will run it in fine-grained mode.
spark.mesos.extra.cores                 This is more of an advertisement than an allocation, in order to improve parallelism. An executor will pretend that it has extra cores, resulting in the driver sending it more work. Default = 0.
spark.mesos.mesosExecutor.cores         Only works in fine-grained mode. This specifies how many cores should be given to each Mesos executor.
spark.mesos.executor.home               Identifies the directory of the Spark installation for the executors in Mesos. As discussed, you can specify this using spark.executor.uri as well; however, if you have not specified that, you can specify it using this property.
spark.mesos.executor.memoryOverhead     The amount of memory (in MB) to be allocated per executor.
spark.mesos.uris                        A comma-separated list of URIs to be downloaded when the driver or executor is launched by Mesos.
spark.mesos.principal                   The name of the principal used by Spark to authenticate itself with Mesos.

You can find other configuration properties on the Spark documentation page (http://spark.apache.org/docs/latest/running-on-mesos.html#spark-properties).

To summarize, the objective of this article was to get you started with running Spark on Mesos. To know more about Spark SQL, Spark Streaming, and machine learning with Spark, you can refer to the book Learning Apache Spark 2.
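As a minimal, illustrative sketch (not from the book) of how a few of these Mesos-specific properties might be set from application code in PySpark; the master URL, application name, and binary location below are placeholders:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("mesos://mesosmaster:5050")                               # single Mesos master
        .setAppName("BatchApplicationOnMesos")
        .set("spark.executor.uri", "hdfs://path/to/spark-binaries.tgz")      # where executors fetch Spark
        .set("spark.mesos.coarse", "true"))                                  # coarse-grained mode (the default)

sc = SparkContext(conf=conf)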


Getting Started with Data Storytelling

Aaron Lazar
28 Jan 2018
11 min read
This article has been taken from the book Principles of Data Science, written by Sinan Ozdemir. It aims to practically introduce you to the different ways in which you can communicate or visualize your data to tell stories effectively.

Communication matters

Being able to conduct experiments and manipulate data in a coding language is not enough to conduct practical and applied data science. This is because data science is, generally, only as good as how it is used in practice. For instance, a medical data scientist might be able to predict the chance of a tourist contracting malaria in developing countries with >98% accuracy; however, if these results are published in a poorly marketed journal and online mentions of the study are minimal, their groundbreaking results, which could potentially prevent deaths, would never see the true light of day. For this reason, communication of results through data storytelling is arguably as important as the results themselves.

A famous example of poor management of the distribution of results is the case of Gregor Mendel. Mendel is widely recognized as one of the founders of modern genetics. However, his results (including data and charts) were not well adopted until after his death. Mendel even sent them to Charles Darwin, who largely ignored Mendel's papers, which were written in unknown Moravian journals.

Generally, there are two ways of presenting results: verbal and visual. Of course, both the verbal and visual forms of communication can be broken down into dozens of subcategories, including slide decks, charts, journal papers, and even university lectures. However, we can find common elements of data presentation that can make anyone in the field more aware and effective in their communication skills. Let's dive right into effective (and ineffective) forms of communication, starting with visuals. We'll look at five basic types of graphs: scatter plots, line graphs, bar charts, histograms, and box plots.

Scatter plots

A scatter plot is probably one of the simplest graphs to create. It is made by creating two quantitative axes and using data points to represent observations. The main goal of a scatter plot is to highlight relationships between two variables and, if possible, reveal a correlation. For example, we can look at two variables: average hours of TV watched in a day and a 0-100 scale of work performance (0 being very poor performance and 100 being excellent performance). The goal here is to find a relationship (if it exists) between watching TV and average work performance. The following code simulates a survey of a few people, in which they revealed the amount of television they watched, on average, in a day against a company-standard work performance metric:

import pandas as pd

hours_tv_watched = [0, 0, 0, 1, 1.3, 1.4, 2, 2.1, 2.6, 3.2, 4.1, 4.4, 4.4, 5]

This line of code creates 14 sample survey results of people answering the question of how many hours of TV they watch in a day.

work_performance = [87, 89, 92, 90, 82, 80, 77, 80, 76, 85, 80, 75, 73, 72]

This line of code creates 14 new sample survey results of the same people being rated on their work performance on a scale from 0 to 100. For example, the first person watched 0 hours of TV a day and was rated 87/100 on their work, while the last person watched, on average, 5 hours of TV a day and was rated 72/100:
df = pd.DataFrame({'hours_tv_watched': hours_tv_watched, 'work_performance': work_performance})

Here, we are creating a DataFrame in order to ease our exploratory data analysis and make it easier to make a scatter plot:

df.plot(x='hours_tv_watched', y='work_performance', kind='scatter')

Now, we are actually making our scatter plot. In the resulting plot, we can see that our axes represent the number of hours of TV watched in a day and the person's work performance metric. Each point on a scatter plot represents a single observation (in this case a person), and its location is the result of where the observation stands on each variable. This scatter plot does seem to show a relationship, which implies that as we watch more TV in the day, it seems to affect our work performance. Of course, as we are now experts in statistics from the last two chapters, we know that this might not be causational. A scatter plot may only work to reveal a correlation or an association between two variables, but not a causation. Advanced statistical tests, such as the ones we saw in Chapter 8, Advanced Statistics, might work to reveal causation. Later on in this chapter, we will see the damaging effects that trusting correlation might have.

Line graphs

Line graphs are, perhaps, one of the most widely used graphs in data communication. A line graph simply uses lines to connect data points and usually represents time on the x axis. Line graphs are a popular way to show changes in variables over time. The line graph, like the scatter plot, is used to plot quantitative variables.

As a great example, many of us wonder about the possible links between what we see on TV and our behavior in the world. A friend of mine once took this thought to an extreme: he wondered if he could find a relationship between the TV show The X-Files and the amount of UFO sightings in the U.S. He then found the number of sightings of UFOs per year and plotted them over time. He then added a quick graphic to ensure that readers would be able to identify the point in time when The X-Files premiered. It appears to be clear that right after 1993, the year of The X-Files premiere, the number of UFO sightings started to climb drastically. This graphic, albeit light-hearted, is an excellent example of a simple line graph. We are told what each axis measures, we can quickly see a general trend in the data, and we can identify with the author's intent, which is to show a relationship between the number of UFO sightings and The X-Files premiere.

On the other hand, a less impressive line chart attempts to highlight the change in the price of gas by plotting three points in time. At first glance, it is not much different from the previous graph: we have time on the bottom x axis and a quantitative value on the vertical y axis. The (not so) subtle difference here is that the three points are equally spaced out on the x axis; however, if we read their actual time indications, they are not equally spaced out in time. A year separates the first two points, whereas a mere 7 days separates the last two points.
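Unlike the other chart types in this article, the line graph discussion has no accompanying code. As a minimal sketch, with made-up yearly sighting counts used purely for illustration, a pandas line plot with time on the x axis could look like this:

import pandas as pd
import matplotlib.pyplot as plt

# Made-up yearly counts, for illustration only
years = [1990, 1991, 1992, 1993, 1994, 1995, 1996]
sightings = [120, 130, 125, 160, 240, 310, 360]

ufo = pd.DataFrame({'year': years, 'sightings': sightings})
ufo.plot(x='year', y='sightings', kind='line')    # time on the x axis, counts on the y axis
plt.axvline(x=1993, color='r', linestyle='--')    # mark the premiere year on the plot
plt.xlabel('Year')
plt.ylabel('Reported UFO sightings')
plt.show()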
Bar charts

We generally turn to bar charts when trying to compare variables across different groups. For example, we can plot the number of countries per continent using a bar chart. Note how the x axis does not represent a quantitative variable; in fact, when using a bar chart, the x axis is generally a categorical variable, while the y axis is quantitative. Note that, for this code, I am using the World Health Organization's report on alcohol consumption around the world by country:

import matplotlib.pyplot as plt

drinks = pd.read_csv('data/drinks.csv')
drinks.continent.value_counts().plot(kind='bar', title='Countries per Continent')
plt.xlabel('Continent')
plt.ylabel('Count')

The resulting graph shows us a count of the number of countries in each continent. We can see the continent code at the bottom of the bars, and the bar height represents the number of countries we have in each continent. For example, we see that Africa has the most countries represented in our survey, while South America has the least. In addition to the count of countries, we can also plot the average beer servings per continent using a bar chart, as shown:

drinks.groupby('continent').beer_servings.mean().plot(kind='bar')

Note how a scatter plot or a line graph would not be able to support this data because they can only handle quantitative variables; bar graphs have the ability to demonstrate categorical values. We can also use bar charts to graph variables that change over time, like a line graph.

Histograms

Histograms show the frequency distribution of a single quantitative variable by splitting up the data, by range, into equidistant bins and plotting the raw count of observations in each bin. A histogram is effectively a bar chart where the x axis is a bin (subrange) of values and the y axis is a count. As an example, I will import a store's daily number of unique customers, as shown:

rossmann_sales = pd.read_csv('data/rossmann.csv')
rossmann_sales.head()

Note how we have data for multiple stores (see the first Store column). Let's subset this data for only the first store, as shown:

first_rossmann_sales = rossmann_sales[rossmann_sales['Store']==1]

Now, let's plot a histogram of the first store's customer count:

first_rossmann_sales['Customers'].hist(bins=20)
plt.xlabel('Customer Bins')
plt.ylabel('Count')

The x axis is now categorical in the sense that each category is a selected range of values; for example, 600-620 customers would potentially be a category. The y axis, like a bar chart, plots the number of observations in each category. In this graph, for example, one might take away the fact that, most of the time, the number of customers on any given day falls between 500 and 700. Altogether, histograms are used to visualize the distribution of values that a quantitative variable can take on.

Box plots

Box plots are also used to show a distribution of values. They are created by plotting the five-number summary, as follows:

The minimum value
The first quartile (the number that separates the 25% lowest values from the rest)
The median
The third quartile (the number that separates the 25% highest values from the rest)
The maximum value

In Pandas, when we create box plots, the red line denotes the median, the top of the box (or the right side if it is horizontal) is the third quartile, and the bottom (left) part of the box is the first quartile. The following is a series of box plots showing the distribution of beer consumption according to continents:

drinks.boxplot(column='beer_servings', by='continent')

Now, we can clearly see the distribution of beer consumption across the seven continents and how they differ. Africa and Asia have a much lower median beer consumption than Europe or North America.
Box plots also have the added bonus of being able to show outliers much better than a histogram. This is because the minimum and maximum are parts of the box plot. Getting back to the customer data, let's look at the same store's customer numbers, but using a box plot:

first_rossmann_sales.boxplot(column='Customers', vert=False)

This is the exact same data as plotted earlier in the histogram; however, now it is shown as a box plot. For the purpose of comparison, I will show you both graphs one after the other. Note how the x axis for each graph is the same, ranging from 0 to 1,200. The box plot is much quicker at giving us a center of the data (the red line is the median), while the histogram works much better in showing us how spread out the data is and where people's biggest bins are. For example, the histogram reveals that there is a very large bin of zero people. This means that for a little over 150 days of data, there were zero customers.

Note that we can get the exact numbers to construct a box plot using the describe feature in Pandas, as shown:

first_rossmann_sales['Customers'].describe()

min       0.000000
25%     463.000000
50%     529.000000
75%     598.750000
max    1130.000000

There we have it! We just learned data storytelling through various techniques like scatter plots, line graphs, bar charts, histograms, and box plots. Now you've got the power to be creative in the way you tell tales of your data! If you found our article useful, you can check out Principles of Data Science for more interesting data science tips and techniques.


How to build a Gaussian Mixture Model

Gebin George
27 Jan 2018
8 min read
This article is an excerpt from a book authored by Osvaldo Martin titled Bayesian Analysis with Python. The book will help you implement Bayesian analysis in your application and will guide you in solving complex statistical problems using Python.

This article teaches you how to build an end-to-end Gaussian mixture model with a practical example. The general idea when building a finite mixture model is that we have a certain number of subpopulations, each one represented by some distribution, and we have data points that belong to those distributions, but we do not know to which distribution each point belongs. Thus, we need to assign the points properly. We can do that by building a hierarchical model. At the top level of the model, we have a random variable, often referred to as a latent variable, which is a variable that is not really observable. The function of this latent variable is to specify to which component distribution a particular observation is assigned. That is, the latent variable decides which component distribution we are going to use to model a given data point. In the literature, people often use the letter z to indicate latent variables.

Let us start building mixture models with a very simple example. We have a dataset that we want to describe as being composed of three Gaussians:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pymc3 as pm

clusters = 3
n_cluster = [90, 50, 75]
n_total = sum(n_cluster)
means = [9, 21, 35]
std_devs = [2, 2, 2]
mix = np.random.normal(np.repeat(means, n_cluster), np.repeat(std_devs, n_cluster))
sns.kdeplot(np.array(mix))
plt.xlabel('$x$', fontsize=14)

In many real situations, when we wish to build models, it is often easier, more effective, and more productive to begin with simpler models and then add complexity, even if we know from the beginning that we need something more complex. This approach has several advantages, such as getting familiar with the data and problem, developing intuition, and avoiding choking ourselves with complex models/code that are difficult to debug.

So, we are going to begin by supposing that we know that our data can be described using three Gaussians (or, in general, k Gaussians), maybe because we have enough previous experimental or theoretical knowledge to reasonably assume this, or maybe we came to that conclusion by eyeballing the data. We are also going to assume we know the mean and standard deviation of each Gaussian. Given these assumptions, the problem is reduced to assigning each point to one of the three possible known Gaussians. There are many methods to solve this task. We, of course, are going to take the Bayesian track and build a probabilistic model.

To develop our model, we can get ideas from the coin-flipping problem. Remember that we had two possible outcomes and we used the Bernoulli distribution to describe them. Since we did not know the probability of getting heads or tails, we used a beta prior distribution. Our current problem with the Gaussian mixtures is similar, except that we now have k-Gaussian outcomes. The generalization of the Bernoulli distribution to k outcomes is the categorical distribution, and the generalization of the beta distribution is the Dirichlet distribution. This distribution may look a little bit weird at first because it lives in the simplex, which is like an n-dimensional triangle: a 1-simplex is a line, a 2-simplex is a triangle, a 3-simplex is a tetrahedron, and so on. Why a simplex? Intuitively, because the output of this distribution is a k-length vector, whose elements are restricted to be positive and sum up to one.

To understand how the Dirichlet generalizes the beta, let us first refresh a couple of features of the beta distribution. We use the beta for 2-outcome problems, one with probability p and the other 1-p. In this sense, we can think that the beta returns a two-element vector, [p, 1-p]. Of course, in practice, we omit 1-p because it is fully determined by p. Another feature of the beta distribution is that it is parameterized using two scalars, α and β. How do these features compare to the Dirichlet distribution? Let us think of the simplest Dirichlet distribution, one we could use to model a three-outcome problem. We get a Dirichlet distribution that returns a three-element vector [p, q, r], where r = 1 - (p + q). We could use three scalars to parameterize such a Dirichlet and we might call them α, β, and γ; however, this does not scale well to higher dimensions, so we just use a vector named α with length k, where k is the number of outcomes. Note that we can think of the beta and Dirichlet as distributions over probabilities. To get an idea about this distribution, pay attention to the figure in the book (the output of code written by Thomas Boggs, with just a few minor tweaks) and try to relate each triangular subplot to a beta distribution with similar parameters. You can find the code in the accompanying text; also check the Keep reading sections for details.
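As a minimal sketch (not part of the book's code) of the "vector of probabilities" intuition described above, drawing a few samples from a Dirichlet with NumPy shows that each sample is a length-k vector of positive values that sum to one:

import numpy as np

alpha = np.ones(3)                        # a symmetric Dirichlet over 3 outcomes
samples = np.random.dirichlet(alpha, size=5)

print(samples)                            # each row is a point on the 2-simplex
print(samples.sum(axis=1))                # every row sums to 1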
Why a simplex? Intuitively, because the output of this distribution is a k-length vector whose elements are restricted to be positive and sum up to one. To understand how the Dirichlet generalizes the beta, let us first refresh a couple of features of the beta distribution. We use the beta for 2-outcome problems, one with probability p and the other 1-p. In this sense we can think that the beta returns a two-element vector, [p, 1-p]. Of course, in practice, we omit 1-p because it is fully determined by p. Another feature of the beta distribution is that it is parameterized using two scalars, α and β. How do these features compare to the Dirichlet distribution? Let us think of the simplest Dirichlet distribution, one we could use to model a three-outcome problem. We get a Dirichlet distribution that returns a three-element vector [p, q, r], where r = 1 - (p + q). We could use three scalars to parameterize such a Dirichlet and we may call them α, β, and γ; however, this does not scale well to higher dimensions, so we just use a vector named α with length k, where k is the number of outcomes. Note that we can think of the beta and Dirichlet as distributions over probabilities.

To get an idea about this distribution, pay attention to the following figure and try to relate each triangular subplot to a beta distribution with similar parameters. The preceding figure is the output of code written by Thomas Boggs with just a few minor tweaks. You can find the code in the accompanying text; also check the Keep reading sections for details.

Now that we have a better grasp of the Dirichlet distribution, we have all the elements to build our mixture model. One way to visualize it is as a k-sided coin-flip model on top of a Gaussian estimation model. Of course, instead of k-sided coins we use the categorical distribution. The rounded-corner box indicates that we have k Gaussian likelihoods (with their corresponding priors), and the categorical variables decide which of them we use to describe a given data point. Remember, we are assuming we know the means and standard deviations of the Gaussians; we just need to assign each data point to one Gaussian. One detail of the following model is that we have used two samplers, Metropolis and ElemwiseCategorical; the latter is specially designed to sample discrete variables:

import pymc3 as pm

with pm.Model() as model_kg:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    category = pm.Categorical('category', p=p, shape=n_total)
    means = pm.math.constant([10, 20, 35])
    y = pm.Normal('y', mu=means[category], sd=2, observed=mix)
    step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters))
    step2 = pm.Metropolis(vars=[p])
    trace_kg = pm.sample(10000, step=[step1, step2])

chain_kg = trace_kg[1000:]
varnames_kg = ['p']
pm.traceplot(chain_kg, varnames_kg)
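The excerpt stops at inspecting p, but it is sometimes handy to turn the sampled latent categories into hard cluster assignments. The following is a rough sketch, not taken from the book, and it assumes the chain_kg trace behaves like the PyMC3 MultiTrace produced above:

import numpy as np
from scipy import stats

# One sampled assignment vector per draw, one column per data point.
cat_samples = chain_kg['category']              # shape: (n_draws, n_total)

# Take the most frequent component for every point as its hard label.
hard_labels = stats.mode(cat_samples, axis=0).mode.ravel()
print(hard_labels[:20])

# Empirical mixture weights implied by the hard labels, to compare against p.
print(np.bincount(hard_labels, minlength=clusters) / n_total)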
Now that we know the skeleton of a Gaussian mixture model, we are going to add a complexity layer and estimate the parameters of the Gaussians. We are going to assume three different means and a single shared standard deviation. As usual, the model translates easily to the PyMC3 syntax:

with pm.Model() as model_ug:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    category = pm.Categorical('category', p=p, shape=n_total)
    means = pm.Normal('means', mu=[10, 20, 35], sd=2, shape=clusters)
    sd = pm.HalfCauchy('sd', 5)
    y = pm.Normal('y', mu=means[category], sd=sd, observed=mix)
    step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters))
    step2 = pm.Metropolis(vars=[means, sd, p])
    trace_ug = pm.sample(10000, step=[step1, step2])

Now we explore the trace we got:

chain = trace_ug[1000:]
varnames = ['means', 'sd', 'p']
pm.traceplot(chain, varnames)

And a tabulated summary of the inference:

pm.df_summary(chain, varnames)

              mean        sd  mc_error    hpd_2.5   hpd_97.5
means__0  21.053935  0.310447  0.012280  20.495889  21.735211
means__1  35.291631  0.246817  0.008159  34.831048  35.781825
means__2   8.956950  0.235121  0.005993   8.516094   9.429345
sd         2.156459  0.107277  0.002710   1.948067   2.368482
p__0       0.235553  0.030201  0.000793   0.179247   0.297747
p__1       0.349896  0.033905  0.000957   0.281977   0.412592
p__2       0.347436  0.032414  0.000942   0.286669   0.410189

Now we are going to do a posterior predictive check to see what our model learned from the data:

ppc = pm.sample_ppc(chain, 50, model_ug)
for i in ppc['y']:
    sns.kdeplot(i, alpha=0.1, color='b')
sns.kdeplot(np.array(mix), lw=2, color='k')
plt.xlabel('$x$', fontsize=14)

Notice how the uncertainty, represented by the lighter blue lines, is smaller for the smaller and larger values of x and is higher around the central Gaussian. This makes intuitive sense, since the regions of higher uncertainty correspond to the regions where the Gaussians overlap and hence it is harder to tell whether a point belongs to one or the other Gaussian. I agree that this is a very simple problem and not that much of a challenge, but it is a problem that contributes to our intuition and a model that can easily be applied or extended to more complex problems.

We saw how to build a Gaussian mixture model using a very basic example, which can be extended to solve more complex problems. If you enjoyed this excerpt, check out the book Bayesian Analysis with Python to understand the Bayesian framework and solve complex statistical problems using Python.

How to integrate SharePoint with SQL Server Reporting Services

Kunal Chaudhari
27 Jan 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Dinesh Priyankara and Robert C. Cain, titled SQL Server 2016 Reporting Services Cookbook.This book will help you get up and running with the latest enhancements and advanced query and reporting feature in SQL Server 2016.[/box] Today we will learn the steps to integrate SharePoint in the SQL Server Reporting services. We will create a Reporting Services SharePoint application, and set it up in a way that we are able to view reports when they are uploaded to SharePoint. Getting ready For this, all you'll need is a SharePoint instance you can work with. Do make sure you have an administrative access to the SharePoint site. If you have an Azure account, free or paid, you could set up a test instance of SharePoint and use it to follow the instructions in this article. Note the setup of such an Azure instance is outside the scope of this article. In this article, we assume you are using an on premise SharePoint installation. How to do it… Open the SharePoint 2016 Central Administration web page. Click on Manage service applications under the Application Management area: 3. The Service Applications tab now appears at the top of the page. Click on the New menu: 4. In the menu, find and click on the option for SQL Server Reporting Services Service Application: 5. You'll now need to fill out the information for the service application. Start at the top by giving it a good name, here we are using SSRS_SharePoint. 6. Presumably this is a new install, so you'll have to take the Create new application pool option. Give it an appropriate name; in this example, we used SSRS_SharePoint_Pool. 7. Select a security account to run under. Here we selected an account set up by our Active Directory administrator, which has permissions to SQL Server where SSRS is installed. 8. Enter the name of the server which has SQL Server 2016 Reporting Services installed. In this example, our machine is ACSrv. 9. By default, SharePoint will create a name for the database that includes a GUID (a long string of letters and numbers). You should absolutely rename this to eliminate the GUID, but ensure the database name will be unique. In this example, we used ReportingService_SharePoint. 10. Review the information so that it resembles the following figure, but don't hit OK quite yet as there are few more pieces of information to fill out. Scroll down in the dialog to continue: 11. After the database name, you'll need to indicate the authentication method. Assuming the credentials you entered for the security account came from your Active Directory administrator, you can take the default of Windows authentication. 12. Place a check mark beside the instance of SharePoint to associate this SSRS application with. Here there is only one, SharePoint – 80. 13. Click OK to continue. Assuming all goes well, you should see the following confirmation dialog. If so, click OK to proceed: 14. Now that SharePoint is configured, you'll now need to provide additional information to SQL Server. That is the purpose of this final screen, Provision Subscriptions and Alerts. Select the Download Script button, and save the generated SQL file: 15. Pass the SQL file to a database administrator to execute, or open it in SSMS and execute it yourself, assuming you have administrative rights on the SQL Server. SharePoint uses the concept of Service Applications to manage items which run under the hood of SharePoint. 
SQL Server Reporting Services is one such service application. By integrating it as a service application, end users can upload, modify, and view SSRS reports right within SharePoint.

We began by generating a new Service Application and picking Reporting Services from the list. We then needed to let SharePoint know which SQL Server would be used to host the database, as well as have a copy of Reporting Services for SharePoint installed. In addition, we also needed to provide security credentials for SharePoint to use to communicate with SQL Server.

As the final step, we needed to configure SQL Server to work with SharePoint. This was the purpose of the Provision Subscriptions and Alerts screen. Note there is an option to fill out a user name and credential; clicking OK would then have immediately executed scripts against the target SQL Server. In most mid- to large-size corporations, however, there will be controls in place to prevent this type of thing. Most companies will require a DBA to review scripts, or at the very least you'll want to keep a copy of the script in your source control system to be able to track what changes were made to a SQL Server. Hence, we suggest taking the action laid out in this article, namely downloading the script and executing it manually in SQL Server Management Studio.

To test your setup, we suggest creating a new report with embedded data sources and datasets. Upload that report to the server, and attempt to execute it; it should display correctly if your install went well.

If you enjoyed this excerpt, check out the book SQL Server 2016 Reporting Services Cookbook to know more about handling security and configuring email with SharePoint using Reporting Services.


What are discriminative and generative models and when to use which?

Gebin George
26 Jan 2018
5 min read
[box type="note" align="" class="" width=""]Our article is a book excerpt from Bayesian Analysis with Python written Osvaldo Martin. This book covers the bayesian framework and the fundamental concepts of bayesian analysis in detail. It will help you solve complex statistical problems by leveraging the power of bayesian framework and Python.[/box] From this article you will explore the fundamentals and implementation of two strong machine learning models - discriminative and generative models. We have also included examples to help you understand the difference between these models and how they operate. In general cases, we try to directly compute p(|), that is, the probability of a given class knowing, which is some feature we measured to members of that class. In other words, we try to directly model the mapping from the independent variables to the dependent ones and then use a threshold to turn the (continuous) computed probability into a boundary that allows us to assign classes.This approach is not unique. One alternative is to model first p(|), that is, the distribution of  for each class, and then assign the classes. This kind of model is called a generative classifier because we are creating a model from which we can generate samples from each class. On the contrary, logistic regression is a type of discriminative classifier since it tries to classify by discriminating classes but we cannot generate examples from each class. We are not going to go into much detail here about generative models for classification, but we are going to see one example that illustrates the core of this type of model for classification. We are going to do it for two classes and only one feature, exactly as the first model we built in this chapter, using the same data. Following is a PyMC3 implementation of a generative classifier. From the code, you can see that now the boundary decision is defined as the average between both estimated Gaussian means. This is the correct boundary decision when the distributions are normal and their standard deviations are equal. These are the assumptions made by a model known as linear discriminant analysis (LDA). Despite its name, the LDA model is generative: with pm.Model() as lda:     mus = pm.Normal('mus', mu=0, sd=10, shape=2)    sigmas = pm.Uniform('sigmas', 0, 10)  setosa = pm.Normal('setosa', mu=mus[0], sd=sigmas[0], observed=x_0[:50])      versicolor = pm.Normal('setosa', mu=mus[1], sd=sigmas[1], observed=x_0[50:])      bd = pm.Deterministic('bd', (mus[0]+mus[1])/2)    start = pm.find_MAP() step = pm.NUTS() trace = pm.sample(5000, step, start) Now we are going to plot a figure showing the two classes (setosa = 0 and versicolor = 1) against the values for sepal length, and also the boundary decision as a red line and the 95% HPD interval for it as a semitransparent red band. As you may have noticed, the preceding figure is pretty similar to the one we plotted at the beginning of this chapter. Also check the values of the boundary decision in the following summary: pm.df_summary(trace_lda): mean sd mc_error hpd_2.5 hpd_97.5 mus__0 5.01 0.06 8.16e-04 4.88 5.13 mus__1 5.93 0.06 6.28e-04 5.81 6.06 sigma 0.45 0.03 1.52e-03 0.38 0.51 bd 5.47 0.05 5.36e-04 5.38 5.56 Both the LDA model and the logistic regression gave similar results: The linear discriminant model can be extended to more than one feature by modeling the classes as multivariate Gaussians. 
The linear discriminant model can be extended to more than one feature by modeling the classes as multivariate Gaussians. Also, it is possible to relax the assumption of the classes sharing a common variance (or common covariance matrices when working with more than one feature). This leads to a model known as quadratic discriminant analysis (QDA), since now the decision boundary is not linear but quadratic.

In general, an LDA or QDA model will work better than a logistic regression when the features we are using are more or less Gaussian distributed, and the logistic regression will perform better in the opposite case. One advantage of the generative model for classification is that it may be easier or more natural to incorporate prior information; for example, we may have information about the mean and variance of the data to incorporate in the model.

It is important to note that the boundary decisions of LDA and QDA are known in closed form and hence they are usually used in such a way. To use an LDA for two classes and one feature, we just need to compute the mean of each distribution and average those two values, and we get the boundary decision. Notice that in the preceding model we did just that, but in a more Bayesian way: we estimated the parameters of the two Gaussians and then plugged those estimates into a formula. Where do such formulae come from? Well, without entering into details, to obtain that formula we must assume that the data is Gaussian distributed, and hence such a formula will only work if the data does not deviate drastically from normality. Of course, we may hit a problem where we want to relax the normality assumption, such as, for example, using a Student's t-distribution (or a multivariate Student's t-distribution, or something else). In such a case, we can no longer use the closed form for the LDA (or QDA); nevertheless, we can still compute a decision boundary numerically using PyMC3.

To sum up, we saw the basic idea behind generative and discriminative models and their practical use cases. If you enjoyed this excerpt, check out the book Bayesian Analysis with Python to solve complex statistical problems with the Bayesian framework and Python.


Working with Kibana in Elasticsearch 5.x

Savia Lobo
26 Jan 2018
9 min read
[box type="note" align="" class="" width=""]Below given post is a book excerpt from Mastering Elasticsearch 5.x written by  Bharvi Dixit. This book introduces you to the new features of Elasticsearch 5.[/box] The following article showcases Kibana, a tool belongs to the Elastic Stack, and used for visualization and exploration of data residing in Elasticsearch. One can install Kibana and start to explore Elasticsearch indices in minutes — no code, no additional infrastructure required. If you have been using an older version of Kibana, you will notice that it has transformed altogether in terms of functionality. Note: This URL has all the latest changes done in Kibana 5.0: https://www.elastic.co/guide/en/kibana/current/breaking-changes- 5.0.html. Installing Kibana Similar to other Elastic Stack tools, you can visit the following URL to download Kibana 5.0.0, as per your operating system distribution: https://www.elastic.co/downloads/past-releases/kibana-5-0-0 An example of downloading and installing Kibana from the Debian package. First of all, download the package: https://artifacts.elastic.co/downloads/kibana/kibana-5.0.0-amd64.deb Then install it using the following command: sudo dpkg -i kibana-5.0.0-amd64.deb Kibana configuration Once installed, you can find the Kibana configuration file, kibana.yml, inside the/etc/kibana/ directory. All the settings related to Kibana are done only in this file. There is a big list of configuration options available inside the Kibana settings which you can learn about here: https://www.elastic.co/guide/en/kibana/current/settings.html. Starting Kibana Kibana can be started using the following command and it will be started on port 5601 bounded on localhost by default: sudo service kibana start Exploring and visualizing data on Kibana Now all the components of Elastic Stack are installed and configured, we can start exploring the awesomeness of Kibana visualizations. Kibana 5.x is supported on almost all of the latest major web browsers, including Internet Explorer 11+. To load Kibana, you just need to type localhost:5601 in your web browser. You will see different options available in the left panel of the screen, as shown in following figure: These different options are used for the following purposes: Discover: Used for data exploration where you get the access of each field along with a default time. Visualize: Used for creating visualizations of the data in your Elasticsearch indices. You can then build dashboards that display related visualizations. Dashboard: Used to display a collection of saved visualizations. Timelion: A time series data visualizer that enables you to combine totally independent data sources within a single visualization. It is based on simple expression  language. Management: A place where you perform your runtime configuration of Kibana, including both the initial setup and ongoing configuration of index patterns, advanced settings that tweak the behaviors of Kibana itself and saved objects. Dev Tools: Contains the console which is based on the Sense plugin and allows you to write Elasticsearch commands in one tab and see the responses of those commands in the other tab. 
Understanding the Kibana Management screen

The Management screen has three tabs available:

Index Patterns: For selecting and configuring index names.
Saved Objects: Where all of your saved visualizations, searches, and dashboards are located.
Advanced Settings: Contains advanced settings of Kibana.

As you can see on the Management screen, the very first tab is for Index Patterns. Kibana asks you to configure an index pattern so that it can load all the mappings and settings from the defined index. It defaults to logstash-*; you can add as many index patterns or absolute index names as you want and can select them while creating a visualization. Since we already have an index available with the logstash-* pattern, when you click on the Time-field name drop-down list, you will find that it shows two fields, @timestamp and received_at, which are of the date type, as shown in the following screenshot.

We will select the @timestamp field and hit the Create button. As soon as you do, Kibana loads all the mappings from our Logstash index. In addition, you can see three labels: blue (for marking this index as the default), yellow (for reloading the mappings; this is needed if you have updated the mapping after selecting the index pattern), and red (for deleting this index pattern altogether from Kibana).

The second tab on the Management screen is about Saved Objects, which contains all of your saved visualizations, searches, and dashboards, as you can see in the following screenshot. Please note that you can see here the dashboards and visualizations imported from Metricbeat, which we did a while ago. The third option is for Advanced Settings, and you should not play with the settings shown on this page if you are not aware of the tweaking effects.

Discovering data on Kibana

When you move to the Discover page of Kibana, you will see a screen similar to the following.

Setting the time range and auto-refresh interval

Please note that Kibana by default loads the data of the last 15 minutes, which you can change by clicking on the clock sign in the top-right corner of the screen and selecting the desired time range. One more thing to look out for is that, after clicking on this clock sign, apart from the time-based settings, you will see one more option in the top corner with the name Auto-refresh. This setting tells Kibana how often it needs to query Elasticsearch. When you click on this setting, you will get the option to either turn off auto-refresh completely or select the desired time interval.

Adding fields for exploration and using the search panel

As you can see in the following screenshot, all your fields are available inside your index. On the Discover screen, by default Kibana shows the timestamp and _source field, but you can add your selected fields from the left panel by moving the cursor over them and then clicking Add. Similarly, if you want to remove a field from the column, just move the cursor to the field's name on the column heading and click on the cross icon. In addition, Kibana also provides you with a search panel in which you can write queries. For example, in the following screenshot, I have searched for the logstash keyword inside the syslog_message field.
When you hit the search button, the search text gets highlighted inside the rendered responses.

Exploring more options on the Visualization page

On Kibana, you will see lots of small arrow signs to open or collapse sections and settings. You will see one of these arrows in the bottom-left corner of the time series histogram. When you click on this arrow, the histogram gets hidden and you get to see a panel which contains multiple properties: Table, which contains the histogram data in tabular format; Request, which contains the actual JSON query sent to Elasticsearch; Response, which contains the JSON response returned from Elasticsearch; and Statistics, which shows the query execution time and the number of hits matching the query.

Using the Dashboard screen to create/load dashboards

When you click on the Dashboard panel, you first get a blank screen with some options, such as New for creating a dashboard and Open to open an existing dashboard, along with some more options. If you are creating a dashboard from scratch, you will have to add the built visualizations onto it and then save it using some name. But since we already have a dashboard available which we imported using Metricbeat, we will click Open. Please note that if you do not have Apache installed on your system, selecting the first option, Metricbeat – Apache HTTPD server status, will load a blank dashboard. You can select any other title; for example, selecting the second option will load a populated dashboard.

Editing an existing visualization

When you move the cursor over the visualizations presented on the dashboard, you will notice that a pencil sign appears. When you click on that pencil sign, it opens that particular visualization inside the visualization editor panel. Here you can edit the properties and either override the same visualization or save it using some other name. Please note that if you want to create a visualization from scratch, just click on the Visualize option on the left-hand side and it will guide you through the steps of creating the visualization. Kibana provides almost 10 types of visualizations. To get the details about working with each type of visualization, please follow the official documentation of Kibana at this link: https://www.elastic.co/guide/en/kibana/master/createvis.html

Using Sense

Inside the Dev Tools option, you can find the console for Kibana, which was previously known as the Sense editor. This is one of the most wonderful tools to help you speed up the learning curve of Elasticsearch, since it provides auto-suggestions for all the endpoints and queries. You will see that the Kibana Console is divided into two parts; the left part is where you write your queries/requests, and after clicking the green arrow, the response from Elasticsearch is rendered inside the right-hand panel.

To summarize, we explained how to work with the Kibana tool in Elasticsearch 5.x. We covered the installation of Kibana and its configuration, and then moved on to exploring and visualizing data using Kibana.
If you enjoyed this excerpt, and want to understand how you can scale your Elasticsearch cluster to contextualize it and improve its performance, check out the book Mastering Elasticsearch 5.x.


How to execute a search query in ElasticSearch

Sugandha Lahoti
25 Jan 2018
9 min read
[box type="note" align="" class="" width=""]This post is an excerpt from a book authored by Alberto Paro, titled Elasticsearch 5.x Cookbook. It has over 170 advance recipes to search, analyze, deploy, manage, and monitor data effectively with Elasticsearch 5.x[/box] In this article we see how to execute and view a search operation in ElasticSearch. Elasticsearch was born as a search engine. It’s main purpose is to process queries and give results. In this article, we'll see that a search in Elasticsearch is not only limited to matching documents, but it can also calculate additional information required to improve the search quality. All the codes in this article are available on PacktPub or GitHub. These are the scripts to initialize all the required data. Getting ready You will need an up-and-running Elasticsearch installation. To execute curl via a command line, you will also need to install curl for your operating system. To correctly execute the following commands you will need an index populated with the chapter_05/populate_query.sh script available in the online code. The mapping used in all the article queries and searches is the following: { "mappings": { "test-type": { "properties": { "pos": { "type": "integer", "store": "yes" }, "uuid": { "store": "yes", "type": "keyword" }, "parsedtext": { "term_vector": "with_positions_offsets", "store": "yes", "type": "text" }, "name": { "term_vector": "with_positions_offsets", "store": "yes", "fielddata": true, "type": "text", "fields": { "raw": { "type": "keyword" } } }, "title": { "term_vector": "with_positions_offsets", "store": "yes", "type": "text", "fielddata": true, "fields": { "raw": { "type": "keyword" } } } } }, "test-type2": { "_parent": { "type": "test-type" } } } } How to do it To execute the search and view the results, we will perform the following steps: From the command line, we can execute a search as follows: curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{"query":{"match_all":{}}}' In this case, we have used a match_all query that means return all the documents.    If everything works, the command will return the following: { "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 1.0, "hits" : [ { "_index" : "test-index", "_type" : "test-type", "_id" : "1", "_score" : 1.0, "_source" : {"position": 1, "parsedtext": "Joe Testere nice guy", "name": "Joe Tester", "uuid": "11111"} }, { "_index" : "test-index", "_type" : "test-type", "_id" : "2", "_score" : 1.0, "_source" : {"position": 2, "parsedtext": "Bill Testere nice guy", "name": "Bill Baloney", "uuid": "22222"} }, { "_index" : "test-index", "_type" : "test-type", "_id" : "3", "_score" : 1.0, "_source" : {"position": 3, "parsedtext": "Bill is notn nice guy", "name": "Bill Clinton", "uuid": "33333"} } ] } }    These results contain a lot of information: took is the milliseconds of time required to execute the query. time_out indicates whether a timeout occurred during the search. This is related to the timeout parameter of the search. If a timeout occurs, you will get partial or no results. _shards is the status of shards divided into: total, which is the number of shards. successful, which is the number of shards in which the query was successful. failed, which is the number of shards in which the query failed, because some error or exception occurred during the query. 
hits are the results, which are composed of the following:

total is the number of documents that match the query.
max_score is the match score of the first document. It is usually one if no match scoring was computed, for example in sorting or filtering.
hits is a list of result documents.

The resulting document has a lot of fields that are always available and others that depend on search parameters. The most important fields are as follows:

_index: The index that contains the document.
_type: The type of the document.
_id: The ID of the document.
_source: The document source (this is the default field returned, but it can be disabled).
_score: The query score of the document.
sort: If the document is sorted, the values used for sorting.
highlight: Highlighted segments, if highlighting was requested.
fields: Some fields can be retrieved without needing to fetch the whole source object.

How it works

The HTTP method to execute a search is GET (although POST also works); the REST endpoints are as follows:

http://<server>/_search
http://<server>/<index_name(s)>/_search
http://<server>/<index_name(s)>/<type_name(s)>/_search

Note: Not all HTTP clients allow you to send data via a GET call, so the best practice, if you need to send body data, is to use the POST call.

Multiple indices and types are comma separated. If an index or a type is defined, the search is limited only to them. One or more aliases can be used as index names. The core query is usually contained in the body of the GET/POST call, but a lot of options can also be expressed as URI query parameters, such as the following (a short client-side example follows after this list):

q: The query string for simple string queries, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=uuid:11111'

df: The default field to be used within the query, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?df=uuid&q=11111'

from (the default value is 0): The start index of the hits.
size (the default value is 10): The number of hits to be returned.
analyzer: The default analyzer to be used.
default_operator (the default value is OR): This can be set to AND or OR.
explain: This allows the user to return information about how the score is calculated, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=parsedtext:joe&explain=true'

stored_fields: This allows the user to define fields that must be returned, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=parsedtext:joe&stored_fields=name'

sort (the default value is score): This allows the user to change the order of the documents. Sort is ascending by default; if you need to change the order, add desc to the field, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?sort=name.raw:desc'

timeout (not active by default): This defines the timeout for the search. Elasticsearch tries to collect results until the timeout. If a timeout is fired, all the hits accumulated so far are returned.
search_type: This defines the search strategy. A reference is available in the online Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html
track_scores (the default value is false): If true, this tracks the score and allows it to be returned with the hits. It is used in conjunction with sort, because sorting by default prevents the return of a match score.
pretty (the default value is false): If true, the results will be pretty printed.
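Any HTTP client can pass these URI parameters, not just curl. As a small illustration (not from the book), here is how a few of the parameters listed above might be sent with Python's requests library against the sample test-index:

import requests

# Equivalent of appending ?q=...&size=...&sort=... to the _search endpoint.
params = {
    "q": "parsedtext:joe",     # simple query-string search
    "size": 5,                 # return at most five hits
    "sort": "name.raw:desc",   # sort on the keyword sub-field from the mapping
}
resp = requests.get("http://127.0.0.1:9200/test-index/test-type/_search", params=params)
resp.raise_for_status()

result = resp.json()
print("total hits:", result["hits"]["total"])
for hit in result["hits"]["hits"]:
    print(hit["_id"], hit["_source"].get("name"))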
Generally, the query, contained in the body of the search, is a JSON object. The body of the search is the core of Elasticsearch's search functionality; the list of search capabilities extends with every release. For the current version (5.x) of Elasticsearch, the available parameters are as follows:

query: This contains the query to be executed. Later in this chapter, we will see how to create different kinds of queries to cover several scenarios.
from: This allows the user to control pagination. The from parameter defines the start position of the hits to be returned (default 0) and size (default 10).
Note: The pagination is applied to the currently returned search results. Firing the same query can bring different results if a lot of records have the same score or a new document is ingested. If you need to process all the result documents without repetition, you need to execute scan or scroll queries.
sort: This allows the user to change the order of the matched documents.
post_filter: This allows the user to filter out the query results without affecting the aggregation count. It's usually used for filtering by facet values.
_source: This allows the user to control the returned source. It can be disabled (false), partially returned (obj.*), or use multiple exclude/include rules. This functionality can be used instead of fields to return values (for complete coverage of this, take a look at the online Elasticsearch reference at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-source-filtering.html).
fielddata_fields: This allows the user to return a field data representation of the field.
stored_fields: This controls the fields to be returned.
Note: Returning only the required fields reduces the network and memory usage, improving performance. The suggested way to retrieve custom fields is to use the _source filtering function, because it doesn't need to use Elasticsearch's extra resources.
aggregations/aggs: These control the aggregation layer analytics. These will be discussed in the next chapter.
index_boost: This allows the user to define a per-index boost value. It is used to increase/decrease the score of results in boosted indices.
highlighting: This allows the user to define fields and settings to be used for calculating a query abstract.
version (the default value is false): This adds the version of a document to the results.
rescore: This allows the user to define an extra query to be used in the score to improve the quality of the results. The rescore query is executed on the hits that match the first query and filter.
min_score: If this is given, all the result documents that have a score lower than this value are rejected.
explain: This returns information on how the TF/IDF score is calculated for a particular document.
script_fields: This defines a script that computes extra fields via scripting to be returned with a hit.
suggest: If given a query and a field, this returns the most significant terms related to this query. This parameter allows the user to implement Google-like did-you-mean functionality.
search_type: This defines how Elasticsearch should process a query.
scroll: This controls scrolling in scroll/scan queries. The scroll allows the user to have an Elasticsearch equivalent of a DBMS cursor.
_name: This allows the user to return, for every hit, the names of the queries it matches. It's very useful if you have a Boolean query and you want the names of the matched queries.
search_after: This allows the user to skip results, using the most efficient way of scrolling.
preference: This allows the user to select which shard(s) to use for executing the query.

We saw how to execute a search in Elasticsearch and also learned how it works. To know more about how to perform other operations in Elasticsearch, check out the book Elasticsearch 5.x Cookbook.

Implementing Principal Component Analysis with R

Amey Varangaonkar
24 Jan 2018
6 min read
[box type="note" align="" class="" width=""]The following article is an excerpt taken from the book Mastering Text Mining with R, written by Ashish Kumar and Avinash Paul. This book gives a comprehensive view of the text mining process and how you can leverage the power of R to analyze textual data and get unique insights out of it.[/box] In this article, we aim to explain the concept of dimensionality reduction, or variable reduction, using Principal Component Analysis. Principal Component Analysis (PCA) reveals the internal structure of a dataset in a way that best explains the variance within the data. PCA identifies patterns to reduce the dimensions of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space into a smaller subset to decrease computational cost. PCA helps in computing new features, which are called principal components; these principal components are uncorrelated linear combinations of the original features projected in the direction of higher variability. The important point is to map the set of features into a matrix, M, and compute the eigenvalues and eigenvectors. Eigenvectors provide simpler solutions to problems that can be modeled using linear transformations along axes by stretching, compressing, or flipping. Eigenvalues provide the length and magnitude of eigenvectors where such transformations occur. Eigenvectors with greater eigenvalues are selected in the new feature space because they enclose more information than eigenvectors with lower eigenvalues for a data distribution. The first principle component has the greatest possible variance, that is, the largest eigenvalues compared with the next principal component uncorrelated, relative to the first PC. The nth PC is the linear combination of the maximum variance that is uncorrelated with all previous PCs. PCA comprises of the following steps: Compute the n-dimensional mean of the given dataset. Compute the covariance matrix of the features. Compute the eigenvectors and eigenvalues of the covariance matrix. Rank/sort the eigenvectors by descending eigenvalue. Choose x eigenvectors with the largest eigenvalues. Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space. PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. This methodology also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis. Using R for PCA R also has two inbuilt functions for accomplishing PCA: prcomp() and princomp(). These two functions expect the dataset to be organized with variables in columns and observations in rows and has a structure like a data frame. They also return the new data in the form of a data frame, and the principal components are given in columns. prcomp() and princomp() are similar functions used for accomplishing PCA; they have a slightly different implementation for computing PCA. Internally, the princomp() function performs PCA using eigenvectors. The prcomp() function uses a similar technique known as singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function. Each function returns a list whose class is prcomp() or princomp(). 
The information returned and the terminology are summarized in the following table. Here's a list of the functions available in different R packages for performing PCA:

PCA(): FactoMineR package
acp(): amap package
prcomp(): stats package
princomp(): stats package
dudi.pca(): ade4 package
pcaMethods: This package from Bioconductor has various convenient methods to compute PCA

Understanding the FactoMineR package

FactoMineR is an R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions provided in the package deal not only with quantitative data but also with categorical data. Apart from PCA, correspondence and multiple correspondence analyses can also be performed using this package:

library(FactoMineR)
data <- replicate(10, rnorm(1000))
result.pca = PCA(data[,1:9], scale.unit=TRUE, graph=T)
print(result.pca)

The analysis was performed on 1,000 individuals, described by nine variables. The results are available in the following objects: the eigenvalues, the percentage of variance, and the cumulative percentage of variance.

Amap package

amap is another package in the R environment that provides tools for clustering and PCA. It is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which does PCA on a data frame. This function is akin to princomp() and prcomp(), except that it has a slightly different graphic representation. For more intricate details, refer to the CRAN-R resource page: https://cran.r-project.org/web/packages/amap/amap.pdf

library(amap)
acp(data, center=TRUE, reduce=TRUE)

Additionally, weight vectors can also be provided as an argument. We can perform a robust PCA by using the acpgen function in the amap package:

acpgen(data, h1, h2, center=TRUE, reduce=TRUE, kernel="gaussien")
K(u, kernel="gaussien")
W(x, h, D=NULL, kernel="gaussien")
acprob(x, h, center=TRUE, reduce=TRUE, kernel="gaussien")

Proportion of variance

We look to construct components and to choose from them the minimum number of components which explains the variance of the data with high confidence. R has a prcomp() function in the base package to estimate principal components. Let's learn how to use this function to estimate the components and the proportion of variance they explain:

pca_base <- prcomp(data)
print(pca_base)

The pca_base object contains the standard deviations and rotations of the vectors. Rotations are also known as the principal components of the data. Let's find out the proportion of variance each component explains:

pr_variance <- (pca_base$sdev^2 / sum(pca_base$sdev^2)) * 100
pr_variance
 [1] 11.678126 11.301480 10.846161 10.482861 10.176036  9.605907  9.498072
 [8]  9.218186  8.762572  8.430598

pr_variance signifies the proportion of variance explained by each component in descending order of magnitude. Let's calculate the cumulative proportion of variance for the components:

cumsum(pr_variance)
 [1]  11.67813  22.97961  33.82577  44.30863  54.48467  64.09057  73.58864
 [8]  82.80683  91.56940 100.00000

Components 1-8 explain about 82% of the variance in the data.

Scree plot

If you wish to plot the variances against the number of components, you can use the screeplot function on the fitted model:

screeplot(pca_base)

To summarize, we saw how fairly easy it is to implement PCA using the rich functionality offered by different R packages. If this article has caught your interest, make sure to check out Mastering Text Mining with R, which contains many interesting techniques for text mining and natural language processing using R.


Implement Named Entity Recognition (NER) using OpenNLP and Java

Pravin Dhandre
22 Jan 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Richard M. Reese and Jennifer L. Reese titled Java for Data Science. This book provides in-depth understanding of important tools and proven techniques used across data science projects in a Java environment.[/box] In this article, we are going to show Java implementation of Information Extraction (IE) task to identify what the document is all about. From this task you will know how to enhance search retrieval and boost the ranking of your document in the search results. To begin with, let's understand what Named Entity Recognition (NER) is all about. It is  referred to as classifying elements of a document or a text such as finding people, location and things. Given a text segment, we may want to identify all the names of people present. However, this is not always easy because a name such as Rob may also be used as a verb. In this section, we will demonstrate how to use OpenNLP's TokenNameFinderModel class to find names and locations in text. While there are other entities we may want to find, this example will demonstrate the basics of the technique. We begin with names. Most names occur within a single line. We do not want to use multiple lines because an entity such as a state might inadvertently be identified incorrectly. Consider the following sentences: Jim headed north. Dakota headed south. If we ignored the period, then the state of North Dakota might be identified as a location, when in fact it is not present. Using OpenNLP to perform NER We start our example with a try-catch block to handle exceptions. OpenNLP uses models that have been trained on different sets of data. In this example, the en-token.bin and enner-person.bin files contain the models for the tokenization of English text and for English name elements, respectively. These files can be downloaded fromhttp://opennlp.sourceforge.net/models-1.5/. However, the IO stream used here is standard Java: try (InputStream tokenStream = new FileInputStream(new File("en-token.bin")); InputStream personModelStream = new FileInputStream( new File("en-ner-person.bin"));) { ... } catch (Exception ex) { // Handle exceptions } An instance of the TokenizerModel class is initialized using the token stream. This instance is then used to create the actual TokenizerME tokenizer. We will use this instance to tokenize our sentence: TokenizerModel tm = new TokenizerModel(tokenStream); TokenizerME tokenizer = new TokenizerME(tm); The TokenNameFinderModel class is used to hold a model for name entities. It is initialized using the person model stream. An instance of the NameFinderME class is created using this model since we are looking for names: TokenNameFinderModel tnfm = new TokenNameFinderModel(personModelStream); NameFinderME nf = new NameFinderME(tnfm); To demonstrate the process, we will use the following sentence. We then convert it to a series of tokens using the tokenizer and tokenizer method: String sentence = "Mrs. Wilson went to Mary's house for dinner."; String[] tokens = tokenizer.tokenize(sentence); The Span class holds information regarding the positions of entities. The find method will return the position information, as shown here: Span[] spans = nf.find(tokens); This array holds information about person entities found in the sentence. We then display this information as shown here: for (int i = 0; i < spans.length; i++) { out.println(spans[i] + " - " + tokens[spans[i].getStart()]); } The output for this sequence is as follows. 
Notice that it identifies the last name of Mrs. Wilson but not the "Mrs.":

[1..2) person - Wilson
[4..5) person - Mary

Once these entities have been extracted, we can use them for specialized analysis.

Identifying location entities

We can also find other types of entities, such as dates and locations. In the following example, we find locations in a sentence. It is very similar to the previous person example, except that an en-ner-location.bin file is used for the model:

try (InputStream tokenStream = new FileInputStream("en-token.bin");
     InputStream locationModelStream = new FileInputStream(new File("en-ner-location.bin"));) {
    TokenizerModel tm = new TokenizerModel(tokenStream);
    TokenizerME tokenizer = new TokenizerME(tm);
    TokenNameFinderModel tnfm = new TokenNameFinderModel(locationModelStream);
    NameFinderME nf = new NameFinderME(tnfm);
    sentence = "Enid is located north of Oklahoma City.";
    String tokens[] = tokenizer.tokenize(sentence);
    Span spans[] = nf.find(tokens);
    for (int i = 0; i < spans.length; i++) {
        out.println(spans[i] + " - " + tokens[spans[i].getStart()]);
    }
} catch (Exception ex) {
    // Handle exceptions
}

With the sentence defined previously, the model was only able to find the second city, as shown here. This is likely due to the confusion that arises with the name Enid, which is both the name of a city and a person's name:

[5..7) location - Oklahoma

Suppose we use the following sentence:

sentence = "Pond Creek is located north of Oklahoma City.";

Then we get this output:

[1..2) location - Creek
[6..8) location - Oklahoma

Unfortunately, it has missed the town of Pond Creek. NER is a useful tool for many applications, but like many techniques, it is not always foolproof. The accuracy of the NER approach presented, and of many of the other NLP examples, will vary depending on factors such as the accuracy of the model, the language being used, and the type of entity.

With this, we have successfully learned one of the core tasks of natural language processing using Java and Apache OpenNLP. To know what else you can do with Java in the exciting domain of data science, check out the book Java for Data Science.


Why has Vue.js become so popular?

Amit Kothari
19 Jan 2018
5 min read
The JavaScript ecosystem is full of choices, with many good web development frameworks and libraries to choose from. One of these frameworks is Vue.js, which is gaining a lot of popularity these days. In this post, we'll explore why you should use Vue.js, and what makes it an attractive option for your next web project. For the latest Vue.js eBooks and videos, visit our Vue.js page.

What is Vue.js?

Vue.js is a JavaScript framework for building web interfaces. Vue has been gaining a lot of popularity recently. It ranks number one among the 5 web development tools that will matter in 2018. If you take a look at its GitHub page you can see just how popular it has become – the community has grown at an impressive rate.

As a modern web framework, Vue ticks a lot of boxes. It uses a virtual DOM for better performance. A virtual DOM is an abstraction of the real DOM; this means it is lightweight and faster to work with. Vue is also reactive and declarative. This is useful because declarative rendering allows you to create visual elements that update automatically based on state/data changes. One of the most exciting things about Vue is that it supports the component-based approach to building web applications. Its single-file components, which are independent and loosely coupled, allow better reuse and faster development. It's a tool that can significantly impact how you do things.

What are the benefits of using Vue.js?

Every modern web framework has strong benefits – if they didn't, no one would use them after all. But here are some of the reasons why Vue.js is a good web framework that can help you tackle many of today's development challenges. Check out this post to know more on how to install and use Vue.js for web development.

Good documentation. One of the things that is important when starting with a new framework is its documentation. Vue.js documentation is very well maintained; it includes a simple but comprehensive guide and well-documented APIs.

Learning curve. Another thing to look for when picking a new framework is the learning curve involved. Compared to many other frameworks, Vue's concepts and APIs are much simpler and easier to understand. Also, it is built on top of classic web technologies like JavaScript, HTML, and CSS. This results in a much gentler learning curve. Unlike other frameworks, which require further knowledge of different technologies (Angular requires TypeScript, for example, and React uses JSX), with Vue we can build a sophisticated app by using HTML-based templates, plain JavaScript, and CSS.

Less opinionated, more flexible. Vue is also pretty flexible compared to other popular web frameworks. The core library focuses on the 'view' part, using a modular approach that allows you to pick your own solution for other issues. While we can use other libraries for things like state management and routing, Vue offers officially supported companion libraries, which are kept up to date with the core library. This includes Vuex, which is an Elm, Flux, and Redux inspired state management solution, and vue-router, Vue's official routing library, which is powerful and incredibly easy to use with Vue.js. But because Vue is so flexible, if you wanted to use Redux instead of Vuex, you can do just that. Vue even supports JSX and TypeScript. And if you like taking a CSS-in-JS approach, many other popular libraries also support Vue.

Performance. One of the main reasons many teams are using Vue is because of its performance.
Vue is small, and even with minimal optimization effort it performs better than many other frameworks. This is largely due to its lightweight virtual DOM implementation. Check out the JavaScript frameworks performance benchmark for a useful performance comparison.

Tools. Along with a number of companion libraries, Vue also offers really good tools that provide a great development experience. Vue-CLI is Vue's command line tool. Simple yet powerful, it provides different templates, allows project customization, and makes starting a new Vue project incredibly easy. Vue also provides its own dev tools for Chrome (vue-devtools), which allow you to inspect the component tree and Vuex state, view events, and even time travel. This makes the debugging process pretty easy. Vue also supports hot reload. Hot reload is great because instead of needing to reload a whole page, it allows you to simply reload only the updated component while maintaining the app's current state.

Community. No framework can succeed without community support and, as we've seen already, Vue has a very active and constantly growing community. The framework has already been adopted by many big companies, and its growth is only going to continue. While it is a great option for web development, Vue is also collaborating with Weex, a platform for building cross-platform mobile apps. Weex is backed by the Alibaba group, which is one of the largest e-commerce businesses in the world. Although Weex is not as mature as other app frameworks like React Native, it does allow you to build a UI with Vue, which can be rendered natively on iOS and Android.

Vue.js offers plenty of benefits. It performs well and is very easy to learn. However, it is of course important to pick the right tool for the job, and one framework may work better than another based on the project requirements and personal preferences. With this in mind, it's worth comparing Vue.js with other frameworks. Are you considering using Vue.js? Do you already use it? Tell us about your experience! You can get started with building your first Vue.js 2 web application from this post.

GitLab's new DevOps solution

Erik Kappelman
17 Jan 2018
5 min read
Can it be real? The complete DevOps toolchain integrated into one tool, one UI, and one process? GitLab seems to think so. GitLab has already made huge strides in terms of centralizing the DevOps process into a single tool. Up until now, most of the focus has been on creating a seamless development system, and operations have not been as important. What's new is the extension of the tool to include the operations side of DevOps as well as the development side.

Let's talk a little bit about what DevOps is in order to fully appreciate the advances offered by GitLab. DevOps is basically a holistic approach to software development, quality assurance, and operations. While each of these elements of software creation is distinct, they are all heavily reliant on the others to be effective. The DevOps approach is to acknowledge this interdependence and then leverage it to increase productivity and enhance the final user experience. Two of the most talked about elements of DevOps are continuous integration and continuous deployment.

Continuous integration and deployment

Continuous integration and deployment are aimed at continuously integrating changes to a codebase, potentially from multiple sources, and then continuously deploying these changes into production. These practices require a fairly sophisticated automation and testing framework to be really effective. There are plenty of tools for one or the other, but the idea behind GitLab is essentially that if you can drive both of these processes from the same UI, they become that much more efficient. GitLab has shown this to be true.

There is also the human side to consider: deciding what tasks need to be performed, assigning those tasks to developers, and monitoring their progress. GitLab offers tools that help streamline this process as well. You can track issues and create issue boards to organize workflow, and these boards can be sliced in a number of different ways so that most imaginable organizational needs can be met.

Monitoring and delivery

So far, we've seen that DevOps is about bringing everything together into a smooth process, and GitLab wants that process to occur in one place. GitLab can help you from planning to deployment and everywhere in between. But GitLab isn't satisfied with stopping at deployment, and they shouldn't be. When we think about the three legs of DevOps (development, operations, and quality assurance and testing), what I've said about GitLab so far really only applies to the development leg. This is an unfortunately common problem with DevOps tools and organizational strategies: they seem to cater to developers and basically no one else. Maybe devs complain the most, I don't know.

GitLab has basically solved the DevOps problems between planning and deployment and, naturally, wants to move on to the monitoring and delivery of applications. This is a really exciting direction. After all, software is ultimately about making things happen. Sometimes it's easy to lose sight of this and focus only on the tools that make the software. It is tempting to view software development as inherently important, but it's really not; it's a process of making things for people to use. If you get too far away from that truth, things can get sticky. I think this is part of the reason the Ops side of DevOps is often overlooked. Operations is concerned with managing the software out there in the wild.
This includes dealing with network and hardware considerations as well as end users. GitLab wants operations to take place in the same UI as development. And why not? It's the same application, isn't it? And beyond technical performance, what about how users are interacting with the application? If the application is somehow monetized, why shouldn't that information also be available in the same UI as everything else having to do with the application? Again, it's still the same application.

One tool to rule them all

If you take a minute to step back and appreciate the vision behind GitLab's current direction, I think you can see why this is so exciting. If GitLab succeeds in the long term at extending its reach into every element of an application's lifecycle, including user interactions, productivity could skyrocket. This idea isn't really new; 'one tool to rule them all' isn't even that imaginative a concept. It's just that no one has ever really created this one tool. I believe we are about to enter, or have already entered, a DevOps space race. GitLab is comfortably leading the pack, but they will need to keep working hard if they want it to stay that way. I believe we will be getting the one tool to rule them all, and I believe it is going to be soon. The way things are looking, GitLab is going to be the one to bring it to us, but only time will tell.

Erik Kappelman wears many hats, including blogger, developer, data consultant, economist, and transportation planner. He lives in Helena, Montana, and works for the Department of Transportation as a transportation demand modeler.

Running Parallel Data Operations using Java Streams

Pravin Dhandre
15 Jan 2018
8 min read
This article is an excerpt from a book co-authored by Richard M. Reese and Jennifer L. Reese, titled Java for Data Science. The book provides an in-depth understanding of important tools and techniques used across data science projects in a Java environment.

This article shows how Java 8 streams and lambda expressions can be used to solve complex, math-intensive problems on larger datasets. You will explore short demonstrations of matrix multiplication and map-reduce using Java 8.

The release of Java 8 came with a number of important enhancements to the language. The two enhancements of interest to us are lambda expressions and streams. A lambda expression is essentially an anonymous function that adds a functional programming dimension to Java. The concept of streams, as introduced in Java 8, does not refer to IO streams. Instead, you can think of a stream as a sequence of objects that can be generated and manipulated using a fluent style of programming. This style will be demonstrated shortly.

As with most APIs, programmers must be careful to consider the actual execution performance of their code using realistic test cases and environments. If not used properly, streams may not actually provide performance improvements. In particular, parallel streams, if not crafted carefully, can produce incorrect results.

We will start with a quick introduction to lambda expressions and streams. If you are familiar with these concepts, you may want to skip over the next section.

Understanding Java 8 lambda expressions and streams

A lambda expression can be expressed in several different forms. The following illustrates a simple lambda expression, where the symbol -> is the lambda operator. It takes some value, e, and returns the value multiplied by two. There is nothing special about the name e; any valid Java variable name can be used:

e -> 2 * e

It can also be expressed in other forms, such as the following:

(int e) -> 2 * e
(double e) -> 2 * e
(int e) -> { return 2 * e; }

The form used depends on the intended type of e. Lambda expressions are frequently used as arguments to a method, as we will see shortly.

A stream can be created using a number of techniques. In the following example, a stream is created from an array. The IntStream interface is a type of stream that uses integers. The Arrays class' stream method converts an array into a stream:

IntStream stream = Arrays.stream(numbers);

We can then apply various stream methods to perform an operation. In the following statement, the forEach method will simply display each integer in the stream:

stream.forEach(e -> out.printf("%d ", e));

There are a variety of stream methods that can be applied to a stream. In the following example, the mapToDouble method will take an integer, multiply it by 2, and then return it as a double. The forEach method will then display these values:

stream
    .mapToDouble(e -> 2 * e)
    .forEach(e -> out.printf("%.4f ", e));

The cascading of method invocations is referred to as fluent programming.
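The snippets above are fragments. A minimal, self-contained sketch that puts them together might look like the following; the numbers array and the static import of System.out are assumptions, since the excerpt does not show the surrounding setup:

import java.util.Arrays;
import java.util.stream.IntStream;
import static java.lang.System.out;

public class LambdaStreamDemo {
    public static void main(String[] args) {
        // The numbers array is an assumption; the excerpt does not define it.
        int[] numbers = {1, 2, 3, 4, 5};

        // Create a stream from the array and print each element.
        IntStream stream = Arrays.stream(numbers);
        stream.forEach(e -> out.printf("%d ", e));
        out.println();

        // A stream can only be consumed once, so create a fresh one,
        // double each value as a double, and print the results.
        Arrays.stream(numbers)
              .mapToDouble(e -> 2 * e)
              .forEach(e -> out.printf("%.4f ", e));
        out.println();
    }
}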
Using Java 8 to perform matrix multiplication

Here, we will illustrate how streams can be used to perform matrix multiplication. The definitions of the A, B, and C matrices are the same as declared in the Implementing basic matrix operations section. They are duplicated here for your convenience:

double A[][] = {
    {0.1950, 0.0311},
    {0.3588, 0.2203},
    {0.1716, 0.5931},
    {0.2105, 0.3242}};
double B[][] = {
    {0.0502, 0.9823, 0.9472},
    {0.5732, 0.2694, 0.916}};
double C[][] = new double[n][p];

The following sequence is a stream implementation of matrix multiplication. A detailed explanation of the code follows:

C = Arrays.stream(A)
    .parallel()
    .map(AMatrixRow -> IntStream.range(0, B[0].length)
        .mapToDouble(i -> IntStream.range(0, B.length)
            .mapToDouble(j -> AMatrixRow[j] * B[j][i])
            .sum()
        ).toArray()
    ).toArray(double[][]::new);

Arrays.stream(A) produces a stream of double vectors representing the 4 rows of the A matrix, and the map method, shown as follows, transforms each of these rows into a row of the result. The range method returns a stream of integers running from its first argument (inclusive) to its second argument (exclusive):

.map(AMatrixRow -> IntStream.range(0, B[0].length)

The variable i is generated by the first range method and runs over the columns of the B matrix (3). The variable j is generated by the second range method and runs over the rows of the B matrix (2). At the heart of the statement is the matrix multiplication, where the sum method adds up the products:

.mapToDouble(j -> AMatrixRow[j] * B[j][i])
.sum()

The last part of the expression creates the two-dimensional array for the C matrix. The operator ::new is called a method reference and is a shorter way of invoking the new operator to create a new object:

).toArray()).toArray(double[][]::new);

The displayResult method is as follows:

public void displayResult() {
    out.println("Result");
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            out.printf("%.4f ", C[i][j]);
        }
        out.println();
    }
}

The output of this sequence follows:

Result
0.0276 0.1999 0.2132
0.1443 0.4118 0.5417
0.3486 0.3283 0.7058
0.1964 0.2941 0.4964

Using Java 8 to perform map-reduce

In this section, we will use Java 8 streams to perform a map-reduce operation. In this example, we will use a Stream of Book objects. We will then demonstrate how to use the Java 8 reduce and average methods to get our total page count and average page count. Rather than begin with a text file, as we did in the Hadoop example, we have created a Book class with title, author, and page-count fields; a minimal sketch of such a class is shown below.
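The excerpt never shows the Book class itself. A minimal sketch consistent with the calls that follow might look like this; the pgCnt field name is taken from the map step later in the article, while the title and author field names are assumptions:

public class Book {
    // Field names title and author are assumptions; pgCnt matches the
    // b.pgCnt access used in the map step below.
    public String title;
    public String author;
    public int pgCnt;

    public Book(String title, String author, int pgCnt) {
        this.title = title;
        this.author = author;
        this.pgCnt = pgCnt;
    }
}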
In the main method of the driver class, we have created new instances of Book and added them to an ArrayList called books. We have also created a double variable, average, to hold our average, and initialized our variable totalPg to zero:

ArrayList<Book> books = new ArrayList<>();
double average;
int totalPg = 0;

books.add(new Book("Moby Dick", "Herman Melville", 822));
books.add(new Book("Charlotte's Web", "E.B. White", 189));
books.add(new Book("The Grapes of Wrath", "John Steinbeck", 212));
books.add(new Book("Jane Eyre", "Charlotte Bronte", 299));
books.add(new Book("A Tale of Two Cities", "Charles Dickens", 673));
books.add(new Book("War and Peace", "Leo Tolstoy", 1032));
books.add(new Book("The Great Gatsby", "F. Scott Fitzgerald", 275));

Next, we perform a map and reduce operation to calculate the total number of pages in our set of books. To accomplish this in a parallel manner, we use the stream and parallel methods. We then use the map method with a lambda expression to accumulate all of the page counts from each Book object. Finally, we use the reduce method to merge our page counts into one final value, which is assigned to totalPg:

totalPg = books
    .stream()
    .parallel()
    .map((b) -> b.pgCnt)
    .reduce(totalPg, (accumulator, _item) -> {
        out.println(accumulator + " " + _item);
        return accumulator + _item;
    });

Notice that in the preceding reduce method we have chosen to print out information about the reduction operation's cumulative value and individual items. The accumulator represents the aggregation of our page counts. The _item represents the individual task within the map-reduce process undergoing reduction at any given moment. In the output that follows, we will first see the accumulator value stay at zero as each individual book item is processed. Gradually, the accumulator value increases. The final operation is the reduction of the values 1223 and 2279. The sum of these two numbers is 3502, the total page count for all of our books:

0 822
0 189
0 299
0 673
0 212
299 673
0 1032
0 275
1032 275
972 1307
189 212
822 401
1223 2279

Next, we will add code to calculate the average page count of our set of books. We multiply our totalPg value, determined using map-reduce, by 1.0 to prevent truncation when we divide by the integer returned by the size method. We then print out the average:

average = 1.0 * totalPg / books.size();
out.printf("Average Page Count: %.4f\n", average);

Our output is as follows:

Average Page Count: 500.2857

We could also have used Java 8 streams to calculate the average directly using the map method. Add the following code to the main method. We use parallelStream with our map method to simultaneously get the page count for each of our books. We then use mapToDouble to ensure our data is of the correct type to calculate our average. Finally, we use the average and getAsDouble methods to calculate our average page count:

average = books
    .parallelStream()
    .map(b -> b.pgCnt)
    .mapToDouble(s -> s)
    .average()
    .getAsDouble();
out.printf("Average Page Count: %.4f\n", average);

Then we print out our average. Our output, identical to our previous example, is as follows:

Average Page Count: 500.2857

The techniques above leverage Java 8's capabilities within the map-reduce framework to solve numeric problems. This type of process can also be applied to other types of data, including text-based data. The true benefit is seen when these processes handle extremely large datasets with a significant reduction in processing time. To learn about various other mathematical and parallel techniques in Java for building a complete data analysis application, you may read through the book Java for Data Science for a more integrated approach.
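Finally, for readers who want to run the map-reduce example end to end, here is a self-contained driver sketch. The class name and imports are assumptions, the diagnostic printout inside reduce is omitted, and the logic otherwise mirrors the snippets above, using the hypothetical Book class sketched earlier:

import java.util.ArrayList;
import static java.lang.System.out;

public class BookMapReduceDemo {
    public static void main(String[] args) {
        ArrayList<Book> books = new ArrayList<>();

        books.add(new Book("Moby Dick", "Herman Melville", 822));
        books.add(new Book("Charlotte's Web", "E.B. White", 189));
        books.add(new Book("The Grapes of Wrath", "John Steinbeck", 212));
        books.add(new Book("Jane Eyre", "Charlotte Bronte", 299));
        books.add(new Book("A Tale of Two Cities", "Charles Dickens", 673));
        books.add(new Book("War and Peace", "Leo Tolstoy", 1032));
        books.add(new Book("The Great Gatsby", "F. Scott Fitzgerald", 275));

        // Map-reduce: accumulate the page counts into a single total.
        int totalPg = books.stream()
                           .parallel()
                           .map(b -> b.pgCnt)
                           .reduce(0, (accumulator, item) -> accumulator + item);
        out.println("Total Page Count: " + totalPg);

        // Compute the average page count directly from a parallel stream.
        double average = books.parallelStream()
                              .mapToDouble(b -> b.pgCnt)
                              .average()
                              .getAsDouble();
        out.printf("Average Page Count: %.4f%n", average);
    }
}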