Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials

7018 Articles
article-image-combine-data-files-within-ibm-spss-modeler
Amey Varangaonkar
22 Feb 2018
6 min read
Save for later

How to combine data files within IBM SPSS Modeler

Amey Varangaonkar
22 Feb 2018
6 min read
[box type="note" align="" class="" width=""]The following extract is taken from the book IBM SPSS Modeler Essentials, written by Keith McCormick and Jesus Salcedo. SPSS Modeler is one of the popularly used enterprise tools for data mining and predictive analytics. [/box] In this article, we will explore how SPSS Modeler can be effectively used to combine different file types for efficient data modeling. In many organizations, different pieces of information for the same individuals are held in separate locations. To be able to analyze such information within Modeler, the data files must be combined into one single file. The Merge node joins two or more data sources so that information held for an individual in different locations can be analyzed collectively. The following diagram shows how the Merge node can be used to combine two separate data files that contain different types of information: Like the Append node, the Merge node is found in the Record Ops palette. This node takes multiple data sources and creates a single source containing all or some of the input fields. Let's go through an example of how to use the Merge node to combine data files: Open the Merge stream. The Merge stream contains the files we previously appended, as well as the main data file we were working with in earlier chapters. 2. Place a Merge node from the Record Ops palette on the canvas. 3. Connect the last Reclassify node to the Merge node. 4. Connect the Filter node to the Merge node. [box type="info" align="" class="" width=""]Like the Append node, the order in which data sources are connected to the Merge node impacts the order in which the sources are displayed. The fields of the first source connected to the Merge node will appear first, followed by the fields of the second source connected to the Merge node, and so on.[/box] 5. Connect the Merge node to a Table node: 6. Edit the Merge node. Since the Merge node can cope with a variety of different situations, the Merge tab allows you to specify the merging method. There are four methods for merging: Order: It joins the first record in the first dataset with the first record in the second dataset, and so on. If any of the datasets run out of records, no further output records are produced. This method can be dangerous if there happens to be any cases that are missing from a file, or if files have been sorted differently. Keys: It is the most commonly used method, used when records that have the same value in the field(s) defined as the key are merged. If multiple records contain the same value on the key field, all possible merges are returned. Condition: It joins records from files that meet a specified condition. Ranked condition: It specifies whether each row pairing in the primary dataset and all secondary datasets are to be merged; use the ranking expression to sort any multiple matches from low to high order. Let's combine these files. To do this: Set Merge Method to Keys. Fields contained in all input sources appear in the Possible keys list. To identify one of more fields as the key field(s), move the selected field into the Keys for merge list. In our case, there are two fields that appear in both files, ID and Year. 2. Select ID in the Possible keys list and place it into the Keys for merge list: There are five major methods of merging using a key field: Include only matching records (inner join) merges only complete records, that is, records; that are available in all datasets. Include matching and non-matching records (full outer join) merges records that appear in any of the datasets; that is, the incomplete records are still retained. The undefined value ($null$) is added to the missing fields and included in the output. Include matching and selected non-matching records (partial outerjoin) performs left and right outer-joins. All records from the specified file are retained, along with only those records from the other file(s) that match records in the specified file on the key field(s). The Select... button allows you to designate which file is to contribute incomplete records. Include records in first dataset not matching any others (anti-join) provides an easy way of identifying records in a dataset that do not have records with the same key values in any of the other datasets involved in the merge. This option only retains records from the dataset that match with no other records. Combine duplicate key fields is the final option in this dialog, and it deals with the problem of duplicate field names (one from each dataset) when key fields are used. This option ensures that there is only one output field with a given name, and this is enabled by default. The Filter tab The Filter tab lists the data sources involved in the merge, and the ordering of the sources determines the field ordering of the merged data. Here, you can rename and remove fields. Earlier, we saw that the field Year appeared in both datasets; here we can remove one version of this field (we could also rename one version of the field to keep both): Click on the arrow next to the second Year field: The second Year field will no longer appear in the combined data file. The Optimization tab The Optimization tab provides two options that allow you to merge data more efficiently when one input dataset is significantly larger than the other datasets, or when the data is already presorted by all or some of the key fields that you are using to merge: Click OK. Run the Table: All of these files have now been combined. The resulting table should have 44 fields and 143,531 records. We saw how the Merge node is used to join data files that contain different information for the same records. If you found this post useful, make sure to check out IBM SPSS Modeler Essentials for more information on leveraging SPSS Modeler to get faster and efficient insights from your data.  
Read more
  • 0
  • 0
  • 11713

article-image-how-to-perform-numeric-metric-aggregations-with-elasticsearch
Pravin Dhandre
22 Feb 2018
7 min read
Save for later

How to perform Numeric Metric Aggregations with Elasticsearch

Pravin Dhandre
22 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book Learning Elastic Stack 6.0 written by Pranav Shukla and Sharath Kumar M N . This book provides detailed coverage on fundamentals of each components of Elastic Stack, making it easy to search, analyze and visualize data across different sources in real-time.[/box] Today, we are going to demonstrate how to run numeric and statistical queries such as summation, average, count and various similar metric aggregations on Elastic Stack to serve a better analytics engine on your dataset. Metric aggregations   Metric aggregations work with numeric data, computing one or more aggregate metrics within the given context. The context could be a query, filter, or no query to include the whole index/type. Metric aggregations can also be nested inside other bucket aggregations. In this case, these metrics will be computed for each bucket in the bucket aggregations. We will start with simple metric aggregations without nesting them inside bucket aggregations. When we learn about bucket aggregations later in the chapter, we will also learn how to use metric aggregations inside bucket aggregations. We will learn about the following metric aggregations: Sum, average, min, and max aggregations Stats and extended stats aggregations Cardinality aggregation Let us learn about them one by one. Sum, average, min, and max aggregations Finding the sum of a field, the minimum value for a field, the maximum value for a field, or an average, are very common operations. For the people who are familiar with SQL, the query to find the sum would look like the following: SELECT sum(downloadTotal) FROM usageReport; The preceding query will calculate the sum of the downloadTotal field across all records in the table. This requires going through all records of the table or all records in the given context and adding the values of the given fields. In Elasticsearch, a similar query can be written using the sum aggregation. Let us understand the sum aggregation first. Sum aggregation Here is how to write a simple sum aggregation: GET bigginsight/_search { "aggregations": { 1 "download_sum": { 2 "sum": { 3 "field": "downloadTotal" 4 } } }, "size": 0 5 } The aggs or aggregations element at the top level should wrap any aggregation. Give a name to the aggregation; here we are doing the sum aggregation on the downloadTotal field and hence the name we chose is download_sum. You can name it anything. This field will be useful while looking up this particular aggregation's result in the response. We are doing a sum aggregation, hence the sum element. We want to do term aggregation on the downloadTotal field. Specify size = 0 to prevent raw search results from being returned. We just want aggregation results and not the search results in this case. Since we haven't specified any top level query elements, it matches all documents. We do not want any raw documents (or search hits) in the result. The response should look like the following: { "took": 92, ... "hits": { "total": 242836, 1 "max_score": 0, "hits": [] }, "aggregations": { 2 "download_sum": { 3 "value": 2197438700 4 } } } Let us understand the key aspects of the response. The key parts are numbered 1, 2, 3, and so on, and are explained in the following points: The hits.total element shows the number of documents that were considered or were in the context of the query. If there was no additional query or filter specified, it will include all documents in the type or index. Just like the request, this response is wrapped inside aggregations to indicate as Such. The response of the aggregation requested by us was named download_sum, hence we get our response from the sum aggregation inside an element with the same name. The actual value after applying the sum aggregation. The average, min, and max aggregations are very similar. Let's look at them briefly. Average aggregation The average aggregation finds an average across all documents in the querying context: GET bigginsight/_search { "aggregations": { "download_average": { 1 "avg": { 2 "field": "downloadTotal" } } }, "size": 0 } The only notable differences from the sum aggregation are as follows: We chose a different name, download_average, to make it apparent that the aggregation is trying to compute the average. The type of aggregation that we are doing is avg instead of the sum aggregation that we were doing earlier. The response structure is identical but the value field will now represent the average of the requested field. The min and max aggregations are the exactly same. Min aggregation Here is how we will find the minimum value of the downloadTotal field in the entire index/type: GET bigginsight/_search { "aggregations": { "download_min": { "min": { "field": "downloadTotal" } } }, "size": 0 } Let's finally look at max aggregation also. Max aggregation Here is how we will find the maximum value of the downloadTotal field in the entire index/type: GET bigginsight/_search { "aggregations": { "download_max": { "max": { "field": "downloadTotal" } } }, "size": 0 } These aggregations were really simple. Now let's look at some more advanced yet simple stats and extended stats aggregations. Stats and extended stats aggregations These aggregations compute some common statistics in a single request without having to issue multiple requests. This saves resources on the Elasticsearch side as well because the statistics are computed in a single pass rather than being requested multiple times. The client code also becomes simpler if you are interested in more than one of these statistics. Let's look at the stats aggregation first. Stats aggregation The stats aggregation computes the sum, average, min, max, and count of documents in a single pass: GET bigginsight/_search { "aggregations": { "download_stats": { "stats": { "field": "downloadTotal" } } }, "size": 0 } The structure of the stats request is the same as the other metric aggregations we have seen so far, so nothing special is going on here. The response should look like the following: { "took": 4, ..., "hits": { "total": 242836, "max_score": 0, "hits": [] }, "aggregations": { "download_stats": { "count": 242835, "min": 0, "max": 241213, "avg": 9049.102065188297, "sum": 2197438700 } } } As you can see, the response with the download_stats element contains count, min, max, average, and sum; everything is included in the same response. This is very handy as it reduces the overhead of multiple requests and also simplifies the client code. Let us look at the extended stats aggregation. Extended stats Aggregation The extended stats aggregation returns a few more statistics in addition to the ones returned by the stats aggregation: GET bigginsight/_search { "aggregations": { "download_estats": { "extended_stats": { "field": "downloadTotal" } } }, "size": 0 } The response looks like the following: { "took": 15, "timed_out": false, ..., "hits": { "total": 242836, "max_score": 0, "hits": [] }, "aggregations": { "download_estats": { "count": 242835, "min": 0, "max": 241213, "avg": 9049.102065188297, "sum": 2197438700, "sum_of_squares": 133545882701698, "variance": 468058704.9782911, "std_deviation": 21634.664429528162, "std_deviation_bounds": { "upper": 52318.43092424462, "lower": -34220.22679386803 } } } } It also returns the sum of squares, variance, standard deviation, and standard deviation Bounds. Cardinality aggregation Finding the count of unique elements can be done with the cardinality aggregation. It is similar to finding the result of a query such as the following: select count(*) from (select distinct username from usageReport) u; Finding the cardinality or the number of unique values for a specific field is a very common requirement. If you have click-stream from the different visitors on your website, you may want to find out how many unique visitors you got in a given day, week, or month. Let us understand how we find out the count of unique users for which we have network traffic data: GET bigginsight/_search { "aggregations": { "unique_visitors": { "cardinality": { "field": "username" } } }, "size": 0 } The cardinality aggregation response is just like the other metric aggregations: { "took": 110, ..., "hits": { "total": 242836, "max_score": 0, "hits": [] }, "aggregations": { "unique_visitors": { "value": 79 } } } To summarize, we learned how to perform numerous metric aggregations on numeric datasets and easily deploy elasticsearch in building powerful analytics application. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 to examine the fundamentals of Elastic Stack in detail and start developing solutions for problems like logging, site search, app search, metrics and more.      
Read more
  • 0
  • 0
  • 23309

article-image-build-data-streams-ibm-spss-modeler
Amey Varangaonkar
22 Feb 2018
7 min read
Save for later

How to build data streams in IBM SPSS Modeler

Amey Varangaonkar
22 Feb 2018
7 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book IBM SPSS Modeler Essentials, co-authored by Keith McCormick and Jesus Salcedo. This book gives you a quick overview of the fundamental concepts of data mining and how to put them to practical use with the help of SPSS Modeler.[/box] SPSS Modeler allows users to mine data visually on the stream canvas. This means that you will not be writing code for your data mining projects; instead you will be placing nodes on the stream canvas. Remember that nodes represent operations to be carried out on the data. So once nodes have been placed on the stream canvas, they need to be linked together to form a stream. A stream represents the flow of data going through a number of operations (nodes). The following diagram is an example of nodes on the canvas, as well as a stream: Given that you will spend a lot of time building streams, in this section you will learn the most efficient ways of manipulating nodes to create a stream. Mouse buttons When building streams, mouse buttons are used extensively so that nodes can be brought onto the canvas, connected, edited, and so on. When building streams within Modeler, mouse buttons are used in the following ways: The left button is used for selecting, placing, and positioning nodes on the stream Canvas The right button is used for invoking context (pop-up) menus that allow for editing, connecting, renaming, deleting, and running nodes The middle button (optional) is used for connecting and disconnecting nodes Adding nodes To begin a new stream, a node from the Sources palette needs to be placed on the stream canvas. There are three ways to add nodes to a stream from a palette: Method one: Click on palette and then on stream: Click on the Sources palette. Click on the Var. File node. This will cause the icon to be highlighted. Move the cursor over the stream canvas. Click anywhere in the stream canvas. A copy of the selected node now appears on the stream canvas. This node represents the action of reading data into Modeler from a delimited text data file. If you wish to move the node within the stream canvas, select it by clicking on the node, and while holding the left mouse button down, drag the node to a new position. Method two: Drag and drop: Now go back to the Sources palette. Click on the Statistics File node and drag and drop this node onto the canvas. The Statistics File node represents the action of reading data into Modeler from an IBM SPSS Statistics data file. Method three: Double-click: Go back to the Sources palette one more time. Double click on the Database node. The Database node represents the action of reading data into Modeler from an ODBC compliant database. Editing nodes Once a node has been brought onto the stream canvas, typically at this point you will want to edit the node so that you can specify which fields, cases, or files you want the node to apply to. There are two ways to edit a node. Method one: Right-click on a node: Right-click on the Var. File node: Notice that there are many things you can do within this context menu. You can edit, add comments, copy, delete a node, connect nodes, and so on. Most often you will probably either edit the node or connect nodes. Method two: Double-click on a node: Double-click on the Var. File node. This bypasses the context menu we saw previously, and goes directly into the node itself so we can edit it. Deleting nodes There will be times when you will want to delete a node that you have on the stream canvas. There are two ways to delete a node. Method one: Right-click on a node: Right-click on the Database File node. Select Delete. The node is now deleted. Method two: Use the Delete button from the keyboard: Click on the Statistics File node. Click on the Delete button on the keyboard. Building a stream When two or more nodes have been placed on the stream canvas, they need to be connected to produce a stream. This can be thought of as representing the flow of data through the nodes. To demonstrate this, we will place a Table node on the stream canvas next to the Var. File node. The Table node presents data in a table format. Click the Output palette to activate it. Click on the Table node. Place this node to the right of the Var. File node by clicking in the stream canvas: At this point, we now have two nodes on the stream canvas, however, we technically do not have a stream because the nodes are not speaking to each other (that is, they are not connected). Connecting nodes In order for nodes to work together, they must be connected. Connecting nodes allows you to bring data into Modeler, explore the data, manipulate the data (to either clean it up or create additional fields), build a model, evaluate the model, and ultimately score the data. There are three main ways to connect nodes to form a stream that is, double-clicking, using the middle mouse button, or manually: Method one: Double-click. The simplest way to form a stream is to double-click on nodes on a palette. This method automatically connects the new node to the currently selected node on the stream canvas: Select the Var. File node that is on the stream canvas Double-click the Table node from the Output palette This action automatically connects the Table node to the existing Var. File node, and a connecting arrow appears between the nodes. The head of the arrow indicates the direction of the data flow. Method two: Manually. To manually connect two nodes: Bring a Table node onto the canvas. Right-click on the Var. File node. Select Connect from the context menu. Click the Table node. Method three: Middle mouse button. To use the middle mouse button: Bring a Table node onto the canvas. Use the middle mouse button to click on the Var. File node. While holding the middle mouse button down, drag the cursor over to the Table node. Release the middle mouse button. Deleting connections When you know that you are no longer going to use a node, you can delete it. Often, though, you may not want to delete a node; instead you might want to delete a connection. Deleting a node completely gets rid of the node. Deleting a connection allows you to keep a node with all the edits you have done, but for now the unconnected node will not be part of the stream. Nodes can be disconnected in several ways: Method one: Delete the connecting arrow: Right-click on the connecting arrow. Click Delete Connection. Method two: Right-click on a node: Right-click on one of the nodes that has a connection. Select Disconnect from the Context menu. Method three: Double-clicking: Double-click with the middle mouse button on a node that has a connection. All connections to this node will be severed, but the connections to neighboring nodes will be intact. Thus, we saw it’s fairly easy to build and manage data streams in SPSS Modeler. If you found the above excerpt useful, make sure to check out our book IBM SPSS Modeler Essentials for more tips and tricks on effective data mining.    
Read more
  • 0
  • 0
  • 5473

article-image-implementing-gradient-descent-algorithm-to-solve-optimization-problems
Sunith Shetty
22 Feb 2018
7 min read
Save for later

Implementing gradient descent algorithm to solve optimization problems

Sunith Shetty
22 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Rajdeep Dua and Manpreet Singh Ghotra titled Neural Network Programming with Tensorflow. In this book, you will learn to leverage the power of TensorFlow to train neural networks of varying complexities, without any hassle.[/box] Today we will focus on the gradient descent algorithm and its different variants. We will take a simple example of linear regression to solve the optimization problem. Gradient descent is the most successful optimization algorithm. As mentioned earlier, it is used to do weights updates in a neural network so that we minimize the loss function. Let's now talk about an important neural network method called backpropagation, in which we firstly propagate forward and calculate the dot product of inputs with their corresponding weights, and then apply an activation function to the sum of products which transforms the input to an output and adds non linearities to the model, which enables the model to learn almost any arbitrary functional mappings. Later, we back propagate in the neural network, carrying error terms and updating weights values using gradient descent, as shown in the following graph: Different variants of gradient descent Standard gradient descent, also known as batch gradient descent, will calculate the gradient of the whole dataset but will perform only one update. Therefore, it can be quite slow and tough to control for datasets which are extremely large and don't fit in the memory. Let's now look at algorithms that can solve this problem. Stochastic gradient descent (SGD) performs parameter updates on each training example, whereas mini batch performs an update with n number of training examples in each batch. The issue with SGD is that, due to the frequent updates and fluctuations, it eventually complicates the convergence to the accurate minimum and will keep exceeding due to regular fluctuations. Mini-batch gradient descent comes to the rescue here, which reduces the variance in the parameter update, leading to a much better and stable convergence. SGD and mini-batch are used interchangeably. Overall problems with gradient descent include choosing a proper learning rate so that we avoid slow convergence at small values, or divergence at larger values and applying the same learning rate to all parameter updates wherein if the data is sparse we might not want to update all of them to the same extent. Lastly, is dealing with saddle points. Algorithms to optimize gradient descent We will now be looking at various methods for optimizing gradient descent in order to calculate different learning rates for each parameter, calculate momentum, and prevent decaying learning rates. To solve the problem of high variance oscillation of the SGD, a method called momentum was discovered; this accelerates the SGD by navigating along the appropriate direction and softening the oscillations in irrelevant directions. Basically, it adds a fraction of the update vector of the past step to the current update vector. Momentum value is usually set to .9. Momentum leads to a faster and stable convergence with reduced oscillations. Nesterov accelerated gradient explains that as we reach the minima, that is, the lowest point on the curve, momentum is quite high and it doesn't know to slow down at that point due to the large momentum which could cause it to miss the minima entirely and continue moving up. Nesterov proposed that we first make a long jump based on the previous momentum, then calculate the gradient and then make a correction which results in a parameter update. Now, this update prevents us to go too fast and not miss the minima, and makes it more responsive to changes. Adagrad allows the learning rate to adapt based on the parameters. Therefore, it performs large updates for infrequent parameters and small updates for frequent parameters. Therefore, it is very well-suited for dealing with sparse data. The main flaw is that its learning rate is always decreasing and decaying. Problems with decaying learning rates are solved using AdaDelta. AdaDelta solves the problem of decreasing learning rate in AdaGrad. In AdaGrad, the learning rate is computed as one divided by the sum of square roots. At each stage, we add another square root to the sum, which causes the denominator to decrease constantly. Now, instead of summing all prior square roots, it uses a sliding window which allows the sum to decrease. Adaptive Moment Estimation (Adam) computes adaptive learning rates for each parameter. Like AdaDelta, Adam not only stores the decaying average of past squared gradients but additionally stores the momentum change for each parameter. Adam works well in practice and is one of the most used optimization methods today. The following two images (image credit: Alec Radford) show the optimization behavior of optimization algorithms described earlier. We see their behavior on the contours of a loss surface over time. Adagrad, RMsprop, and Adadelta almost quickly head off in the right direction and converge fast, whereas momentum and NAG are headed off-track. NAG is soon able to correct its course due to its improved responsiveness by looking ahead and going to the minimum. The second image displays the behavior of the algorithms at a saddle point. SGD, Momentum, and NAG find it challenging to break symmetry, but slowly they manage to escape the saddle point, whereas Adagrad, Adadelta, and RMsprop head down the negative slope, as can seen from the following image: Which optimizer to choose In the case that the input data is sparse or if we want fast convergence while training complex neural networks, we get the best results using adaptive learning rate methods. We also don't need to tune the learning rate. For most cases, Adam is usually a good choice. Optimization with an example Let's take an example of linear regression, where we try to find the best fit for a straight line through a number of data points by minimizing the squares of the distance from the line to each data point. This is why we call it least squares regression. Essentially, we are formulating the problem as an optimization problem, where we are trying to minimize a loss function. Let's set up input data and look at the scatter plot: #  input  data xData  =  np.arange(100,  step=.1) yData  =  xData  +  20  *  np.sin(xData/10) Define the data size and batch size: #  define  the  data  size  and  batch  size nSamples  =  1000 batchSize  =  100 We will need to resize the data to meet the TensorFlow input format, as follows: #  resize  input  for  tensorflow xData  =  np.reshape(xData,  (nSamples,  1)) yData  =  np.reshape(yData,  (nSamples,  1)) The following scope initializes the weights and bias, and describes the linear model and loss function: with tf.variable_scope("linear-regression-pipeline"): W  =  tf.get_variable("weights",  (1,1), initializer=tf.random_normal_initializer()) b  =  tf.get_variable("bias",   (1,  ), initializer=tf.constant_initializer(0.0)) # model yPred  =  tf.matmul(X,  W)  +  b # loss  function loss  =  tf.reduce_sum((y  -  yPred)**2/nSamples) We then set optimizers for minimizing the loss: # set the optimizer #optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss) #optimizer = tf.train.AdamOptimizer(learning_rate=.001).minimize(loss) #optimizer = tf.train.AdadeltaOptimizer(learning_rate=.001).minimize(loss) #optimizer = tf.train.AdagradOptimizer(learning_rate=.001).minimize(loss) #optimizer = tf.train.MomentumOptimizer(learning_rate=.001, momentum=0.9).minimize(loss) #optimizer = tf.train.FtrlOptimizer(learning_rate=.001).minimize(loss) optimizer = tf.train.RMSPropOptimizer(learning_rate=.001).minimize(loss) We then select the mini batch and run the optimizers errors = [] with tf.Session() as sess: # init variables sess.run(tf.global_variables_initializer()) for _ in range(1000): # select mini batch indices = np.random.choice(nSamples, batchSize) xBatch, yBatch = xData[indices], yData[indices] # run optimizer _, lossVal = sess.run([optimizer, loss], feed_dict={X: xBatch, y: yBatch}) errors.append(lossVal) plt.plot([np.mean(errors[i-50:i]) for i in range(len(errors))]) plt.show() plt.savefig("errors.png") The output of the preceding code is as follows: We also get a sliding curve, as follows: We learned optimization is a complicated subject and a lot depends on the nature and size of our data. Also, optimization depends on weight matrices. A lot of these optimizers are trained and tuned for tasks like image classification or predictions. However, for custom or new use cases, we need to perform trial and error to determine the best solution. To know more about how to build and optimize neural networks using TensorFlow, do checkout this book Neural Network Programming with Tensorflow.  
Read more
  • 0
  • 0
  • 78900

article-image-working-with-kafka-streams
Amarabha Banerjee
22 Feb 2018
6 min read
Save for later

Working with Kafka Streams

Amarabha Banerjee
22 Feb 2018
6 min read
This article is a book excerpt from Apache Kafka 1.0 Cookbook written by Raúl Estrada. This book will simplify real-time data processing by leveraging Apache Kafka 1.0. In today’s tutorial we are going to discuss how to work with Apache Kafka Streams efficiently. In the data world, a stream is linked to the most important abstractions. A stream depicts a continuously updating and unbounded process. Here, unbounded means unlimited size. By definition, a stream is a fault-tolerant, replayable, and ordered sequence of immutable data records. A data record is defined as a key-value pair. Before we proceed, some concepts need to be defined: Stream processing application: Any program that utilizes the Kafka streams library is known as a stream processing application. Processor topology: This is a topology that defines the computational logic of the data processing that a stream processing application requires to be performed. A topology is a graph of stream processors (nodes) connected by streams (edges).  There are two ways to define a topology: Via the low-level processor API Via the Kafka streams DSL Stream processor: This is a node present in the processor topology. It represents a processing step in a topology and is used to transform data in streams. The standard operations—filter, join, map, and aggregations—are examples of stream processors available in Kafka streams. Windowing: Sometimes, data records are divided into time buckets by a stream processor to window the stream by time. This is usually required for aggregation and join operations. Join: When two or more streams are merged based on the keys of their data records, a new stream is generated. The operation that generates this new stream is called a join. A join over record streams is usually required to be performed on a windowing basis. Aggregation: A new stream is generated by combining multiple input records into a single output record, by taking one input stream. The operation that creates this new stream is known as aggregation. Examples of aggregations are sums and counts. Setting up the project This recipe sets the project to use Kafka streams in the Treu application project. Getting ready The project generated in the first four chapters is needed. How to do it Open the build.gradle file on the Treu project generated in Chapter 4, Message Enrichment, and add these lines: apply plugin: 'java' apply plugin: 'application' sourceCompatibility = '1.8' mainClassName = 'treu.StreamingApp' repositories { mavenCentral() } version = '0.1.0' dependencies { compile 'org.apache.kafka:kafka-clients:1.0.0' compile 'org.apache.kafka:kafka-streams:1.0.0' compile 'org.apache.avro:avro:1.7.7' } jar { manifest { attributes 'Main-Class': mainClassName } from { configurations.compile.collect { it.isDirectory() ? it : zipTree(it) } } { exclude "META-INF/*.SF" exclude "META-INF/*.DSA" exclude "META-INF/*.RSA" } } To rebuild the app, from the project root directory, run this command: $ gradle jar The output is something like: ... BUILD SUCCESSFUL Total time: 24.234 secs As the next step, create a file called StreamingApp.java in the src/main/java/treu directory with the following contents: package treu; import org.apache.kafka.streams.StreamsBuilder; import org.apache.kafka.streams.Topology; import org.apache.kafka.streams.KafkaStreams; import org.apache.kafka.streams.StreamsConfig; import org.apache.kafka.streams.kstream.KStream; import java.util.Properties; public class StreamingApp { public static void main(String[] args) throws Exception { Properties props = new Properties(); props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streaming_app_id");// 1 props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); //2 StreamsConfig config = new StreamsConfig(props); // 3 StreamsBuilder builder = new StreamsBuilder(); //4 Topology topology = builder.build(); KafkaStreams streams = new KafkaStreams(topology, config); KStream<String, String> simpleFirstStream = builder.stream("src-topic"); //5 KStream<String, String> upperCasedStream = simpleFirstStream.mapValues(String::toUpperCase); //6 upperCasedStream.to("out-topic"); //7 System.out.println("Streaming App Started"); streams.start(); Thread.sleep(30000); //8 System.out.println("Shutting down the Streaming App"); streams.close(); } } How it works Follow the comments in the code: In line //1, the APPLICATION_ID_CONFIG is an identifier for the app inside the broker In line //2, the BOOTSTRAP_SERVERS_CONFIG specifies the broker to use In line //3, the StreamsConfig object is created, it is built with the properties specified In line //4, the StreamsBuilder object is created, it is used to build a topology In line //5, when KStream is created, the input topic is specified In line //6, another KStream is created with the contents of the src-topic but in uppercase In line //7, the uppercase stream should write the output to out-topic In line //8, the application will run for 30 seconds Running the streaming application In the previous recipe, the first version of the streaming app was coded. Now, in this recipe, everything is compiled and executed. Getting ready The execution of the previous recipe of this chapter is needed. How to do it The streaming app doesn't receive arguments from the command line: To build the project, from the treu directory, run the following command: $ gradle jar If everything is OK, the output should be: ... BUILD SUCCESSFUL Total time: … To run the project, we have four different command-line windows. The following diagram shows what the arrangement of command-line windows should look like: In the first command-line Terminal, run the control center: $ <confluent-path>/bin/confluent start In the second command-line Terminal, create the two topics needed: $ bin/kafka-topics --create --topic src-topic --zookeeper localhost:2181 --partitions 1 --replication-factor 1 $ bin/kafka-topics --create --topic out-topic --zookeeper localhost:2181 --partitions 1 --replication-factor 1 In that command-line Terminal, start the producer: $ bin/kafka-console-producer --broker-list localhost:9092 --topic src-topic This window is where the input messages are typed. In the third command-line Terminal, start a consumer script listening to outtopic: $ bin/kafka-console-consumer --bootstrap-server localhost:9092 -- from-beginning --topic out-topic In the fourth command-line Terminal, start up the processing application. Go the project root directory (where the Gradle jar command was executed) and run: $ java -jar ./build/libs/treu-0.1.0.jar localhost:9092 Go to the second command-line Terminal (console-producer) and send the following three messages (remember to press Enter between messages and execute each one in just one line): $> Hello [Enter] $> Kafka [Enter] $> Streams [Enter] The messages typed in console-producer should appear uppercase in the outtopic console consumer window: > HELLO > KAFKA > STREAMS We discussed about the Apache Kafka streams and how to get up and running with it. If you liked this post, be sure to check out Apache Kafka 1.0 Cookbook which consists of more useful recipes to work with Apache Kafka installation.  
Read more
  • 0
  • 0
  • 17880

article-image-fat-2018-conference-session-2-summary-interpretability-explainability
Savia Lobo
22 Feb 2018
5 min read
Save for later

FAT* 2018 Conference Session 2 Summary: Interpretability and Explainability

Savia Lobo
22 Feb 2018
5 min read
This session of the FAT* 2018 is about interpretability and explainability in machine learning models. With the advances in Deep learning, machine learning models have become more accurate. However, with accuracy and advancements, it is a tough task to keep the models highly explainable. This means, these models may appear as black boxes to business users, who utilize them without knowing what lies within. Thus, it is equally important to make ML models interpretable and explainable, which can be beneficial and essential for understanding ML models and to have a ‘behind the scenes’ knowledge of what’s happening within them. This understanding can be highly essential for heavily regulated industries like Finance, Medicine, Defence and so on. The Conference on Fairness, Accountability, and Transparency (FAT), which would be held on the 23rd and 24th of February, 2018 is a multi-disciplinary conference that brings together researchers and practitioners interested in fairness, accountability, and transparency in socio-technical systems. The FAT 2018 conference will witness 17 research papers, 6 tutorials, and 2 keynote presentations from leading experts in the field. This article covers research papers pertaining to the 2nd session that is dedicated to Interpretability and Explainability of machine-learned decisions. If you’ve missed our summary of the 1st session on Online Discrimination and Privacy, visit the article link for a catch up. Paper 1: Meaningful Information and the Right to Explanation This paper addresses an active debate in policy, industry, academia, and the media about whether and to what extent Europe’s new General Data Protection Regulation (GDPR) grants individuals a “right to explanation” of automated decisions. The paper explores two major papers, European Union Regulations on Algorithmic Decision Making and a “Right to Explanation” by Goodman and Flaxman (2017) Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation by Wachter et al. (2017) This paper demonstrates that the specified framework is built on incorrect legal and technical assumptions. In addition to responding to the existing scholarly contributions, the article articulates a positive conception of the right to explanation, located in the text and purpose of the GDPR. The authors take a position that the right should be interpreted functionally, flexibly, and should, at a minimum, enable a data subject to exercise his or her rights under the GDPR and human rights law. Key takeaways: The first paper by Goodman and Flaxman states that GDPR creates a "right to explanation" but without any argument. The second paper is in response to the first paper, where Watcher et al. have published an extensive critique, arguing against the existence of such a right. The current paper, on the other hand, is partially concerned with responding to the arguments of Watcher et al. Paper 2: Interpretable Active Learning The paper tries to highlight how due to complex and opaque ML models, the process of active learning has also become opaque. Not much has been known about what specific trends and patterns, the active learning strategy may be exploring. The paper expands on explaining about LIME (Local Interpretable Model-agnostic Explanations framework) to provide explanations for active learning recommendations. The authors, Richard Phillips, Kyu Hyun Chang, and Sorelle A. Friedler, demonstrate uses of LIME in generating locally faithful explanations for an active learning strategy. Further, the paper shows how these explanations can be used to understand how different models and datasets explore a problem space over time. Key takeaways: The paper demonstrates how active learning choices can be made more interpretable to non-experts. It also discusses techniques that make active learning interpretable to expert labelers, so that queries and query batches can be explained and the uncertainty bias can be tracked via interpretable clusters. It showcases per-query explanations of uncertainty to develop a system that allows experts to choose whether to label a query. This will allow them to incorporate domain knowledge and their own interests into the labeling process. It introduces a quantified notion of uncertainty bias, the idea that an algorithm may be less certain about its decisions on some data clusters than others. Paper 3: Interventions over Predictions: Reframing the Ethical Debate for Actuarial Risk Assessment Actuarial risk assessments might be unduly perceived as a neutral way to counteract implicit bias and increase the fairness of decisions made within the criminal justice system, from pretrial release to sentencing, parole, and probation. However, recently, these assessments have come under increased scrutiny, as critics claim that the statistical techniques underlying them might reproduce existing patterns of discrimination and historical biases that are reflected in the data. The paper proposes that machine learning should not be used for prediction, but rather to surface covariates that are fed into a causal model for understanding the social, structural and psychological drivers of crime. The authors, Chelsea Barabas, Madars Virza, Karthik Dinakar, Joichi Ito (MIT), Jonathan Zittrain (Harvard),  propose an alternative application of machine learning and causal inference away from predicting risk scores to risk mitigation. Key takeaways: The paper gives a brief overview of how risk assessments have evolved from a tool used solely for prediction to one that is diagnostic at its core. The paper places a debate around risk assessment in a broader context. One can get a fuller understanding of the way these actuarial tools have evolved to achieve a varied set of social and institutional agendas. It argues for a shift away from predictive technologies, towards diagnostic methods that will help in understanding the criminogenic effects of the criminal justice system itself, as well as evaluate the effectiveness of interventions designed to interrupt cycles of crime. It proposes that risk assessments, when viewed as a diagnostic tool, can be used to understand the underlying social, economic and psychological drivers of crime. The authors also posit that causal inference offers the best framework for pursuing the goals to achieve a fair and ethical risk assessment tool.
Read more
  • 0
  • 0
  • 29623
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at €18.99/month. Cancel anytime
article-image-ipv6-unix-domain-sockets-and-network-interfaces
Packt
21 Feb 2018
38 min read
Save for later

IPv6, Unix Domain Sockets, and Network Interfaces

Packt
21 Feb 2018
38 min read
 In this article, given by Pradeeban Kathiravelu, author of the book Python Network Programming Cookbook - Second Edition, we will cover the following topics: Forwarding a local port to a remote host Pinging hosts on the network with ICMP Waiting for a remote network service Enumerating interfaces on your machine Finding the IP address for a specific interface on your machine Finding whether an interface is up on your machine Detecting inactive machines on your network Performing a basic IPC using connected sockets (socketpair) Performing IPC using Unix domain sockets Finding out if your Python supports IPv6 sockets Extracting an IPv6 prefix from an IPv6 address Writing an IPv6 echo client/server (For more resources related to this topic, see here.) This article extends the use of Python's socket library with a few third-party libraries. It also discusses some advanced techniques, for example, the asynchronous asyncore module from the Python standard library. This article also touches upon various protocols, ranging from an ICMP ping to an IPv6 client/server. In this article, a few useful Python third-party modules have been introduced by some example recipes. For example, the network packet capture library, Scapy, is well known among Python network programmers. A few recipes have been dedicated to explore the IPv6 utilities in Python including an IPv6 client/server. Some other recipes cover Unix domain sockets. Forwarding a local port to a remote host Sometimes, you may need to create a local port forwarder that will redirect all traffic from a local port to a particular remote host. This might be useful to enable proxy users to browse a certain site while preventing them from browsing some others. How to do it... Let us create a local port forwarding script that will redirect all traffic received at port 8800 to the Google home page (http://www.google.com). We can pass the local and remote host as well as port number to this script. For the sake of simplicity, let's only specify the local port number as we are aware that the web server runs on port 80. Listing 3.1 shows a port forwarding example, as follows: #!/usr/bin/env python # This program is optimized for Python 2.7.12 and Python 3.5.2. # It may run on any other version with/without modifications. import argparse LOCAL_SERVER_HOST = 'localhost' REMOTE_SERVER_HOST = 'www.google.com' BUFSIZE = 4096 import asyncore import socket class PortForwarder(asyncore.dispatcher): def __init__(self, ip, port, remoteip,remoteport,backlog=5): asyncore.dispatcher.__init__(self) self.remoteip=remoteip self.remoteport=remoteport self.create_socket(socket.AF_INET,socket.SOCK_STREAM) self.set_reuse_addr() self.bind((ip,port)) self.listen(backlog) def handle_accept(self): conn, addr = self.accept() print ("Connected to:",addr) Sender(Receiver(conn),self.remoteip,self.remoteport) class Receiver(asyncore.dispatcher): def __init__(self,conn): asyncore.dispatcher.__init__(self,conn) self.from_remote_buffer='' self.to_remote_buffer='' self.sender=None def handle_connect(self): pass def handle_read(self): read = self.recv(BUFSIZE) self.from_remote_buffer += read def writable(self): return (len(self.to_remote_buffer) > 0) def handle_write(self): sent = self.send(self.to_remote_buffer) self.to_remote_buffer = self.to_remote_buffer[sent:] def handle_close(self): self.close() if self.sender: self.sender.close() class Sender(asyncore.dispatcher): def __init__(self, receiver, remoteaddr,remoteport): asyncore.dispatcher.__init__(self) self.receiver=receiver receiver.sender=self self.create_socket(socket.AF_INET, socket.SOCK_STREAM) self.connect((remoteaddr, remoteport)) def handle_connect(self): pass def handle_read(self): read = self.recv(BUFSIZE) self.receiver.to_remote_buffer += read def writable(self): return (len(self.receiver.from_remote_buffer) > 0) def handle_write(self): sent = self.send(self.receiver.from_remote_buffer) self.receiver.from_remote_buffer = self.receiver.from_remote_buffer[sent:] def handle_close(self): self.close() self.receiver.close() if __name__ == "__main__": parser = argparse.ArgumentParser(description='Stackless Socket Server Example') parser.add_argument('--local-host', action="store", dest="local_host", default=LOCAL_SERVER_HOST) parser.add_argument('--local-port', action="store", dest="local_port", type=int, required=True) parser.add_argument('--remote-host', action="store", dest="remote_host", default=REMOTE_SERVER_HOST) parser.add_argument('--remote-port', action="store", dest="remote_port", type=int, default=80) given_args = parser.parse_args() local_host, remote_host = given_args.local_host, given_args.remote_host local_port, remote_port = given_args.local_port, given_args.remote_port print ("Starting port forwarding local %s:%s => remote %s:%s" % (local_host, local_port, remote_host, remote_port)) PortForwarder(local_host, local_port, remote_host, remote_port) asyncore.loop() If you run this script, it will show the following output: $ python 3_1_port_forwarding.py --local-port=8800 Starting port forwarding local localhost:8800 => remote www.google.com:80 Now, open your browser and visit http://localhost:8800. This will take you to the Google home page and the script will print something similar to the following command: ('Connected to:', ('127.0.0.1', 37236)) The following screenshot shows the forwarding a local port to a remote host: How it works... We created a port forwarding class, PortForwarder subclassed, from asyncore.dispatcher, which wraps around the socket object. It provides a few additional helpful functions when certain events occur, for example, when the connection is successful or a client is connected to a server socket. You have the choice of overriding the set of methods defined in this class. In our case, we only override the handle_accept() method. Two other classes have been derived from asyncore.dispatcher. The receiver class handles the incoming client requests and the sender class takes this receiver instance and processes the sent data to the clients. As you can see, these two classes override the handle_read(), handle_write(), and writeable() methods to facilitate the bi-directional communication between the remote host and local client. In summary, the PortForwarder class takes the incoming client request in a local socket and passes this to the sender class instance, which in turn uses the receiver class instance to initiate a bi-directional communication with a remote server in the specified port. Pinging hosts on the network with ICMP An ICMP ping is the most common type of network scanning you have ever encountered. It is very easy to open a command-line prompt or terminal and type ping www.google.com. How difficult is that from inside a Python program? This recipe shows you an example of a Python ping. Getting ready You need the superuser or administrator privilege to run this recipe on your machine. How to do it... You can lazily write a Python script that calls the system ping command-line tool, as follows: import subprocess import shlex command_line = "ping -c 1 www.google.com" args = shlex.split(command_line) try: subprocess.check_call(args,stdout=subprocess.PIPE, stderr=subprocess.PIPE) print ("Google web server is up!") except subprocess.CalledProcessError: print ("Failed to get ping.") However, in many circumstances, the system's ping executable may not be available or may be inaccessible. In this case, we need a pure Python script to do that ping. Note that this script needs to be run as a superuser or administrator. Listing 3.2 shows the ICMP ping, as follows: #!/usr/bin/env python # This program is optimized for Python 3.5.2. # Instructions to make it run with Python 2.7.x is as follows. # It may run on any other version with/without modifications. import os import argparse import socket import struct import select import time ICMP_ECHO_REQUEST = 8 # Platform specific DEFAULT_TIMEOUT = 2 DEFAULT_COUNT = 4 class Pinger(object): """ Pings to a host -- the Pythonic way""" def __init__(self, target_host, count=DEFAULT_COUNT, timeout=DEFAULT_TIMEOUT): self.target_host = target_host self.count = count self.timeout = timeout def do_checksum(self, source_string): """ Verify the packet integritity """ sum = 0 max_count = (len(source_string)/2)*2 count = 0 while count < max_count: # To make this program run with Python 2.7.x: # val = ord(source_string[count + 1])*256 + ord(source_string[count]) # ### uncomment the preceding line, and comment out the following line. val = source_string[count + 1]*256 + source_string[count] # In Python 3, indexing a bytes object returns an integer. # Hence, ord() is redundant. sum = sum + val sum = sum & 0xffffffff count = count + 2 if max_count<len(source_string): sum = sum + ord(source_string[len(source_string) - 1]) sum = sum & 0xffffffff sum = (sum >> 16) + (sum & 0xffff) sum = sum + (sum >> 16) answer = ~sum answer = answer & 0xffff answer = answer >> 8 | (answer << 8 & 0xff00) return answer def receive_pong(self, sock, ID, timeout): """ Receive ping from the socket. """ time_remaining = timeout while True: start_time = time.time() readable = select.select([sock], [], [], time_remaining) time_spent = (time.time() - start_time) if readable[0] == []: # Timeout return time_received = time.time() recv_packet, addr = sock.recvfrom(1024) icmp_header = recv_packet[20:28] type, code, checksum, packet_ID, sequence = struct.unpack( "bbHHh", icmp_header ) if packet_ID == ID: bytes_In_double = struct.calcsize("d") time_sent = struct.unpack("d", recv_packet[28:28 + bytes_In_double])[0] return time_received - time_sent time_remaining = time_remaining - time_spent if time_remaining <= 0: return We need a send_ping() method that will send the data of a ping request to the target host. Also, this will call the do_checksum() method for checking the integrity of the ping data, as follows: def send_ping(self, sock, ID): """ Send ping to the target host """ target_addr = socket.gethostbyname(self.target_host) my_checksum = 0 # Create a dummy heder with a 0 checksum. header = struct.pack("bbHHh", ICMP_ECHO_REQUEST, 0, my_checksum, ID, 1) bytes_In_double = struct.calcsize("d") data = (192 - bytes_In_double) * "Q" data = struct.pack("d", time.time()) + bytes(data.encode('utf-8')) # Get the checksum on the data and the dummy header. my_checksum = self.do_checksum(header + data) header = struct.pack( "bbHHh", ICMP_ECHO_REQUEST, 0, socket.htons(my_checksum), ID, 1 ) packet = header + data sock.sendto(packet, (target_addr, 1)) Let us define another method called ping_once() that makes a single ping call to the target host. It creates a raw ICMP socket by passing the ICMP protocol to socket(). The exception handling code takes care if the script is not run by a superuser or if any other socket error occurs. Let's take a look at the following code: def ping_once(self): """ Returns the delay (in seconds) or none on timeout. """ icmp = socket.getprotobyname("icmp") try: sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp) except socket.error as e: if e.errno == 1: # Not superuser, so operation not permitted e.msg += "ICMP messages can only be sent from root user processes" raise socket.error(e.msg) except Exception as e: print ("Exception: %s" %(e)) my_ID = os.getpid() & 0xFFFF self.send_ping(sock, my_ID) delay = self.receive_pong(sock, my_ID, self.timeout) sock.close() return delay The main executive method of this class is ping(). It runs a for loop inside which the ping_once() method is called count times and receives a delay in the ping response in seconds. If no delay is returned, that means the ping has failed. Let's take a look at the following code: def ping(self): """ Run the ping process """ for i in range(self.count): print ("Ping to %s..." % self.target_host,) try: delay = self.ping_once() except socket.gaierror as e: print ("Ping failed. (socket error: '%s')" % e[1]) break if delay == None: print ("Ping failed. (timeout within %ssec.)" % self.timeout) else: delay = delay * 1000 print ("Get pong in %0.4fms" % delay) if __name__ == '__main__': parser = argparse.ArgumentParser(description='Python ping') parser.add_argument('--target-host', action="store", dest="target_host", required=True) given_args = parser.parse_args() target_host = given_args.target_host pinger = Pinger(target_host=target_host) pinger.ping() This script shows the following output. This has been run with the superuser privilege: $ sudo python 3_2_ping_remote_host.py --target-host=www.google.com Ping to www.google.com... Get pong in 27.0808ms Ping to www.google.com... Get pong in 17.3445ms Ping to www.google.com... Get pong in 33.3586ms Ping to www.google.com... Get pong in 32.3212ms How it works... A Pinger class has been constructed to define a few useful methods. The class initializes with a few user-defined or default inputs, which are as follows: target_host: This is the target host to ping count: This is how many times to do the ping timeout: This is the value that determines when to end an unfinished ping operation The send_ping() method gets the DNS hostname of the target host and creates an ICMP_ECHO_REQUEST packet using the struct module. It is necessary to check the data integrity of the method using the do_checksum() method. It takes the source string and manipulates it to produce a proper checksum. On the receiving end, the receive_pong() method waits for a response until the timeout occurs or receives the response. It captures the ICMP response header and then compares the packet ID and calculates the delay in the request and response cycle. Waiting for a remote network service Sometimes, during the recovery of a network service, it might be useful to run a script to check when the server is online again. How to do it... We can write a client that will wait for a particular network service forever or for a timeout. In this example, by default, we would like to check when a web server is up in localhost. If you specified some other remote host or port, that information will be used instead. Listing 3.3 shows waiting for a remote network service, as follows: #!/usr/bin/env python # This program is optimized for Python 2.7.12 and Python 3.5.2. # It may run on any other version with/without modifications. import argparse import socket import errno from time import time as now DEFAULT_TIMEOUT = 120 DEFAULT_SERVER_HOST = 'localhost' DEFAULT_SERVER_PORT = 80 class NetServiceChecker(object): """ Wait for a network service to come online""" def __init__(self, host, port, timeout=DEFAULT_TIMEOUT): self.host = host self.port = port self.timeout = timeout self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) def end_wait(self): self.sock.close() def check(self): """ Check the service """ if self.timeout: end_time = now() + self.timeout while True: try: if self.timeout: next_timeout = end_time - now() if next_timeout < 0: return False else: print ("setting socket next timeout %ss" %round(next_timeout)) self.sock.settimeout(next_timeout) self.sock.connect((self.host, self.port)) # handle exceptions except socket.timeout as err: if self.timeout: return False except socket.error as err: print ("Exception: %s" %err) else: # if all goes well self.end_wait() return True if __name__ == '__main__': parser = argparse.ArgumentParser(description='Wait for Network Service') parser.add_argument('--host', action="store", dest="host", default=DEFAULT_SERVER_HOST) parser.add_argument('--port', action="store", dest="port", type=int, default=DEFAULT_SERVER_PORT) parser.add_argument('--timeout', action="store", dest="timeout", type=int, default=DEFAULT_TIMEOUT) given_args = parser.parse_args() host, port, timeout = given_args.host, given_args.port, given_args.timeout service_checker = NetServiceChecker(host, port, timeout=timeout) print ("Checking for network service %s:%s ..." %(host, port)) if service_checker.check(): print ("Service is available again!") If a web server is running on your machine, this script will show the following output: $ python 3_3_wait_for_remote_service.py Waiting for network service localhost:80 ... setting socket next timeout 120.0s Service is available again! If you do not have a web server already running in your computer, make sure to install one such as Apache 2 Web Server: $ sudo apt install apache2 Now, stop the Apache process: $ sudo /etc/init.d/apache2 stop It will print the following message while stopping the service. [ ok ] Stopping apache2 (via systemctl): apache2.service. Run this script, and start Apache again. $ sudo /etc/init.d/apache2 start[ ok ] Starting apache2 (via systemctl): apache2.service. The output pattern will be different. On my machine, the following output pattern was found: Exception: [Errno 103] Software caused connection abort setting socket next timeout 119.0s Exception: [Errno 111] Connection refused setting socket next timeout 119.0s Exception: [Errno 103] Software caused connection abort setting socket next timeout 119.0s Exception: [Errno 111] Connection refused setting socket next timeout 119.0s And finally when Apache2 is up again, the following log is printed: Service is available again! The following screenshot shows the waiting for an active Apache web server process: How it works... The preceding script uses the argparse module to take the user input and process the hostname, port, and timeout, that is how long our script will wait for the desired network service. It launches an instance of the NetServiceChecker class and calls the check() method. This method calculates the final end time of waiting and uses the socket's settimeout() method to control each round's end time, that is next_timeout. It then uses the socket's connect() method to test if the desired network service is available until the socket timeout occurs. This method also catches the socket timeout error and checks the socket timeout against the timeout values given by the user. Enumerating interfaces on your machine If you need to list the network interfaces present on your machine, it is not very complicated in Python. There are a couple of third-party libraries out there that can do this job in a few lines. However, let's see how this is done using a pure socket call. Getting ready You need to run this recipe on a Linux box. To get the list of available interfaces, you can execute the following command: $ /sbin/ifconfig How to do it... Listing 3.4 shows how to list the networking interfaces, as follows: #!/usr/bin/env python # Python Network Programming Cookbook, Second Edition -- Article - 3 # This program is optimized for Python 2.7.12 and Python 3.5.2. # It may run on any other version with/without modifications. import sys import socket import fcntl import struct import array SIOCGIFCONF = 0x8912 #from C library sockios.h STUCT_SIZE_32 = 32 STUCT_SIZE_64 = 40 PLATFORM_32_MAX_NUMBER = 2**32 DEFAULT_INTERFACES = 8 def list_interfaces(): interfaces = [] max_interfaces = DEFAULT_INTERFACES is_64bits = sys.maxsize > PLATFORM_32_MAX_NUMBER struct_size = STUCT_SIZE_64 if is_64bits else STUCT_SIZE_32 sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM) while True: bytes = max_interfaces * struct_size interface_names = array.array('B', b' ' * bytes) sock_info = fcntl.ioctl( sock.fileno(), SIOCGIFCONF, struct.pack('iL', bytes, interface_names.buffer_info()[0]) ) outbytes = struct.unpack('iL', sock_info)[0] if outbytes == bytes: max_interfaces *= 2 else: break namestr = interface_names.tostring() for i in range(0, outbytes, struct_size): interfaces.append((namestr[i:i+16].split(b' ', 1)[0]).decode('ascii', 'ignore')) return interfaces if __name__ == '__main__': interfaces = list_interfaces() print ("This machine has %s network interfaces: %s." %(len(interfaces), interfaces)) The preceding script will list the network interfaces, as shown in the following output: $ python 3_4_list_network_interfaces.py This machine has 2 network interfaces: ['lo', 'wlo1']. How it works... This recipe code uses a low-level socket feature to find out the interfaces present on the system. The single list_interfaces()method creates a socket object and finds the network interface information from manipulating this object. It does so by making a call to the fnctl module's ioctl() method. The fnctl module interfaces with some Unix routines, for example, fnctl(). This interface performs an I/O control operation on the underlying file descriptor socket, which is obtained by calling the fileno() method of the socket object. The additional parameter of the ioctl() method includes the SIOCGIFADDR constant defined in the C socket library and a data structure produced by the struct module's pack() function. The memory address specified by a data structure is modified as a result of the ioctl() call. In this case, the interface_names variable holds this information. After unpacking the sock_info return value of the ioctl() call, the number of network interfaces is increased twice if the size of the data suggests it. This is done in a while loop to discover all interfaces if our initial interface count assumption is not correct. The names of interfaces are extracted from the string format of the interface_names variable. It reads specific fields of that variable and appends the values in the interfaces' list. At the end of the list_interfaces() function, this is returned. Finding the IP address for a specific interface on your machine Finding the IP address of a particular network interface may be needed from your Python network application. Getting ready This recipe is prepared exclusively for a Linux box. There are some Python modules specially designed to bring similar functionalities on Windows and Mac platforms. For example, see http://sourceforge.net/projects/pywin32/ for Windows-specific implementation. How to do it... You can use the fnctl module to query the IP address on your machine. Listing 3.5 shows us how to find the IP address for a specific interface on your machine, as follows: #!/usr/bin/env python # This program is optimized for Python 2.7.12 and Python 3.5.2. # It may run on any other version with/without modifications. import argparse import sys import socket import fcntl import struct import array def get_ip_address(ifname): s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM) return socket.inet_ntoa(fcntl.ioctl( s.fileno(), 0x8915, # SIOCGIFADDR struct.pack(b'256s', bytes(ifname[:15], 'utf-8')) )[20:24]) if __name__ == '__main__': parser = argparse.ArgumentParser(description='Python networking utils') parser.add_argument('--ifname', action="store", dest="ifname", required=True) given_args = parser.parse_args() ifname = given_args.ifname print ("Interface [%s] --> IP: %s" %(ifname, get_ip_address(ifname))) The output of this script is shown in one line, as follows: $ python 3_5_get_interface_ip_address.py --ifname=lo Interface [lo] --> IP: 127.0.0.1 In the preceding execution, make sure to use an existing interface, as printed in the previous recipe. In my computer, I got the output previously for3_4_list_network_interfaces.py: This machine has 2 network interfaces: ['lo', 'wlo1']. If you use a non-existing interface, an error will be printed. For example, I do not have eth0 interface right now.So the output is, $ python3 3_5_get_interface_ip_address.py --ifname=eth0 Traceback (most recent call last): File "3_5_get_interface_ip_address.py", line 27, in <module> print ("Interface [%s] --> IP: %s" %(ifname, get_ip_address(ifname))) File "3_5_get_interface_ip_address.py", line 19, in get_ip_address struct.pack(b'256s', bytes(ifname[:15], 'utf-8')) OSError: [Errno 19] No such device How it works... This recipe is similar to the previous one. The preceding script takes a command-line argument: the name of the network interface whose IP address is to be known. The get_ip_address() function creates a socket object and calls the fnctl.ioctl() function to query on that object about IP information. Note that the socket.inet_ntoa() function converts the binary data to a human-readable string in a dotted format as we are familiar with it. Finding whether an interface is up on your machine If you have multiple network interfaces on your machine, before doing any work on a particular interface, you would like to know the status of that network interface, for example, if the interface is actually up. This makes sure that you route your command to active interfaces. Getting ready This recipe is written for a Linux machine. So, this script will not run on a Windows or Mac host. In this recipe, we use nmap, a famous network scanning tool. You can find more about nmap from its website http://nmap.org/. Install nmap in your computer. For Debian-based system, the command is: $ sudo apt-get install nmap You also need the python-nmap module to run this recipe. This can be installed by pip,  as follows: $ pip install python-nmap How to do it... We can create a socket object and get the IP address of that interface. Then, we can use any of the scanning techniques to probe the interface status. Listing 3.6 shows the detect network interface status, as follows: #!/usr/bin/env python # This program is optimized for Python 2.7.12 and Python 3.5.2. # It may run on any other version with/without modifications. import argparse import socket import struct import fcntl import nmap SAMPLE_PORTS = '21-23' def get_interface_status(ifname): sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM) ip_address = socket.inet_ntoa(fcntl.ioctl( sock.fileno(), 0x8915, #SIOCGIFADDR, C socket library sockios.h struct.pack(b'256s', bytes(ifname[:15], 'utf-8')) )[20:24]) nm = nmap.PortScanner() nm.scan(ip_address, SAMPLE_PORTS) return nm[ip_address].state() if __name__ == '__main__': parser = argparse.ArgumentParser(description='Python networking utils') parser.add_argument('--ifname', action="store", dest="ifname", required=True) given_args = parser.parse_args() ifname = given_args.ifname print ("Interface [%s] is: %s" %(ifname, get_interface_status(ifname))) If you run this script to inquire the status of the eth0 status, it will show something similar to the following output: $ python 3_6_find_network_interface_status.py --ifname=lo Interface [lo] is: up How it works... The recipe takes the interface's name from the command line and passes it to the get_interface_status() function. This function finds the IP address of that interface by manipulating a UDP socket object. This recipe needs the nmap third-party module. We can install that PyPI using the pip install command. The nmap scanning instance, nm, has been created by calling PortScanner(). An initial scan to a local IP address gives us the status of the associated network interface. Detecting inactive machines on your network If you have been given a list of IP addresses of a few machines on your network and you are asked to write a script to find out which hosts are inactive periodically, you would want to create a network scanner type program without installing anything on the target host computers. Getting ready This recipe requires installing the Scapy library (> 2.2), which can be obtained at http://www.secdev.org/projects/scapy/files/scapy-latest.zip. At the time of writing, the default Scapy release works with Python 2, and does not support Python 3. You may download the Scapy for Python 3 from https://pypi.python.org/pypi/scapy-python3/0.20 How to do it... We can use Scapy, a mature network-analyzing, third-party library, to launch an ICMP scan. Since we would like to do it periodically, we need Python's sched module to schedule the scanning tasks. Listing 3.7 shows us how to detect inactive machines, as follows: #!/usr/bin/env python # This program is optimized for Python 2.7.12 and Python 3.5.2. # It may run on any other version with/without modifications. # Requires scapy-2.2.0 or higher for Python 2.7. # Visit: http://www.secdev.org/projects/scapy/files/scapy-latest.zip # As of now, requires a separate bundle for Python 3.x. # Download it from: https://pypi.python.org/pypi/scapy-python3/0.20 import argparse import time import sched from scapy.all import sr, srp, IP, UDP, ICMP, TCP, ARP, Ether RUN_FREQUENCY = 10 scheduler = sched.scheduler(time.time, time.sleep) def detect_inactive_hosts(scan_hosts): """ Scans the network to find scan_hosts are live or dead scan_hosts can be like 10.0.2.2-4 to cover range. See Scapy docs for specifying targets. """ global scheduler scheduler.enter(RUN_FREQUENCY, 1, detect_inactive_hosts, (scan_hosts, )) inactive_hosts = [] try: ans, unans = sr(IP(dst=scan_hosts)/ICMP(), retry=0, timeout=1) ans.summary(lambda r : r.sprintf("%IP.src% is alive")) for inactive in unans: print ("%s is inactive" %inactive.dst) inactive_hosts.append(inactive.dst) print ("Total %d hosts are inactive" %(len(inactive_hosts))) except KeyboardInterrupt: exit(0) if __name__ == "__main__": parser = argparse.ArgumentParser(description='Python networking utils') parser.add_argument('--scan-hosts', action="store", dest="scan_hosts", required=True) given_args = parser.parse_args() scan_hosts = given_args.scan_hosts scheduler.enter(1, 1, detect_inactive_hosts, (scan_hosts, )) scheduler.run() The output of this script will be something like the following command: $ sudo python 3_7_detect_inactive_machines.py --scan-hosts=10.0.2.2-4 Begin emission: *.Finished to send 3 packets. . Received 6 packets, got 1 answers, remaining 2 packets 10.0.2.2 is alive 10.0.2.4 is inactive 10.0.2.3 is inactive Total 2 hosts are inactive Begin emission: *.Finished to send 3 packets. Received 3 packets, got 1 answers, remaining 2 packets 10.0.2.2 is alive 10.0.2.4 is inactive 10.0.2.3 is inactive Total 2 hosts are inactive How it works... The preceding script first takes a list of network hosts, scan_hosts, from the command line. It then creates a schedule to launch the detect_inactive_hosts() function after a one-second delay. The target function takes the scan_hosts argument and calls Scapy's sr() function. This function schedules itself to rerun after every 10 seconds by calling the schedule.enter() function once again. This way, we run this scanning task periodically. Scapy's sr() scanning function takes an IP, protocol and some scan-control information. In this case, the IP() method passes scan_hosts as the destination hosts to scan, and the protocol is specified as ICMP. This can also be TCP or UDP. We do not specify a retry and one-second timeout to run this script faster. However, you can experiment with the options that suit you. The scanning sr()function returns the hosts that answer and those that don't as a tuple. We check the hosts that don't answer, build a list, and print that information. Performing a basic IPC using connected sockets (socketpair) Sometimes, two scripts need to communicate some information between themselves via two processes. In Unix/Linux, there's a concept of connected socket, of socketpair. We can experiment with this here. Getting ready This recipe is designed for a Unix/Linux host. Windows/Mac is not suitable for running this one. How to do it... We use a test_socketpair() function to wrap a few lines that test the socket's socketpair() function. List 3.8 shows an example of socketpair, as follows: #!/usr/bin/env python # This program is optimized for Python 3.5.2. # It may run on any other version with/without modifications. # To make it run on Python 2.7.x, needs some changes due to API differences. # Follow the comments inline to make the program work with Python 2. import socket import os BUFSIZE = 1024 def test_socketpair(): """ Test Unix socketpair""" parent, child = socket.socketpair() pid = os.fork() try: if pid: print ("@Parent, sending message...") child.close() parent.sendall(bytes("Hello from parent!", 'utf-8')) # Comment out the preceding line and uncomment the following line for Python 2.7. # parent.sendall("Hello from parent!") response = parent.recv(BUFSIZE) print ("Response from child:", response) parent.close() else: print ("@Child, waiting for message from parent") parent.close() message = child.recv(BUFSIZE) print ("Message from parent:", message) child.sendall(bytes("Hello from child!!", 'utf-8')) # Comment out the preceding line and uncomment the following line for Python 2.7. # child.sendall("Hello from child!!") child.close() except Exception as err: print ("Error: %s" %err) if __name__ == '__main__': test_socketpair() The output from the preceding script is as follows: $ python 3_8_ipc_using_socketpairs.py @Parent, sending message... @Child, waiting for message from parent Message from parent: b'Hello from parent!' Response from child: b'Hello from child!!' How it works... The socket.socketpair() function simply returns two connected socket objects. In our case, we can say that one is a parent and another is a child. We fork another process via a os.fork() call. This returns the process ID of the parent. In each process, the other process' socket is closed first and then a message is exchanged via a sendall() method call on the process's socket. The try-except block prints any error in case of any kind of exception. Performing IPC using Unix domain sockets Unix domain sockets (UDS) are sometimes used as a convenient way to communicate between two processes. As in Unix, everything is conceptually a file. If you need an example of such an IPC action, this can be useful. How to do it... We launch a UDS server that binds to a filesystem path, and a UDS client uses the same path to communicate with the server. Listing 3.9a shows a Unix domain socket server, as follows: #!/usr/bin/env python # This program is optimized for Python 2.7.12 and Python 3.5.2. # It may run on any other version with/without modifications. import socket import os import time SERVER_PATH = "/tmp/python_unix_socket_server" def run_unix_domain_socket_server(): if os.path.exists(SERVER_PATH): os.remove( SERVER_PATH ) print ("starting unix domain socket server.") server = socket.socket( socket.AF_UNIX, socket.SOCK_DGRAM ) server.bind(SERVER_PATH) print ("Listening on path: %s" %SERVER_PATH) while True: datagram = server.recv( 1024 ) if not datagram: break else: print ("-" * 20) print (datagram) if "DONE" == datagram: break print ("-" * 20) print ("Server is shutting down now...") server.close() os.remove(SERVER_PATH) print ("Server shutdown and path removed.") if __name__ == '__main__': run_unix_domain_socket_server() Listing 3.9b shows a UDS client, as follows: #!/usr/bin/env python # Python Network Programming Cookbook, Second Edition -- Article - 3 # This program is optimized for Python 3.5.2. # It may run on any other version with/without modifications. # To make it run on Python 2.7.x, needs some changes due to API differences. # Follow the comments inline to make the program work with Python 2. import socket import sys SERVER_PATH = "/tmp/python_unix_socket_server" def run_unix_domain_socket_client(): """ Run "a Unix domain socket client """ sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) # Connect the socket to the path where the server is listening server_address = SERVER_PATH print ("connecting to %s" % server_address) try: sock.connect(server_address) except socket.error as msg: print (msg) sys.exit(1) try: message = "This is the message. This will be echoed back!" print ("Sending [%s]" %message) sock.sendall(bytes(message, 'utf-8')) # Comment out the preceding line and uncomment the bfollowing line for Python 2.7. # sock.sendall(message) amount_received = 0 amount_expected = len(message) while amount_received < amount_expected: data = sock.recv(16) amount_received += len(data) print ("Received [%s]" % data) finally: print ("Closing client") sock.close() if __name__ == '__main__': run_unix_domain_socket_client() The server output is as follows: $ python 3_9a_unix_domain_socket_server.py starting unix domain socket server. Listening on path: /tmp/python_unix_socket_server -------------------- This is the message. This will be echoed back! The client output is as follows: $ python 3_9b_unix_domain_socket_client.py connecting to /tmp/python_unix_socket_server Sending [This is the message. This will be echoed back!] How it works... A common path is defined for a UDS client/server to interact. Both the client and server use the same path to connect and listen to. In a server code, we remove the path if it exists from the previous run of this script. It then creates a Unix datagram socket and binds it to the specified path. It then listens for incoming connections. In the data processing loop, it uses the recv() method to get data from the client and prints that information on screen. The client-side code simply opens a Unix datagram socket and connects to the shared server address. It sends a message to the server using sendall(). It then waits for the message to be echoed back to itself and prints that message. Finding out if your Python supports IPv6 sockets IP version 6 or IPv6 is increasingly adopted by the industry to build newer applications. In case you would like to write an IPv6 application, the first thing you'd like to know is if your machine supports IPv6. This can be done from the Linux/Unix command line, as follows: $ cat /proc/net/if_inet6 00000000000000000000000000000001 01 80 10 80 lo fe80000000000000642a57c2e51932a2 03 40 20 80 wlo1 From your Python script, you can also check if the IPv6 support is present on your machine, and Python is installed with that support. Getting ready For this recipe, use pip to install a Python third-party library, netifaces, as follows: $ pip install netifaces How to do it... We can use a third-party library, netifaces, to find out if there is IPv6 support on your machine. We can call the interfaces() function from this library to list all interfaces present in the system. Listing 3.10 shows the Python IPv6 support checker, as follows: #!/usr/bin/env python # This program is optimized for Python 2.7.12 and Python 3.5.2. # It may run on any other version with/without modifications. # This program depends on Python module netifaces => 0.8 import socket import argparse import netifaces as ni def inspect_ipv6_support(): """ Find the ipv6 address""" print ("IPV6 support built into Python: %s" %socket.has_ipv6) ipv6_addr = {} for interface in ni.interfaces(): all_addresses = ni.ifaddresses(interface) print ("Interface %s:" %interface) for family,addrs in all_addresses.items(): fam_name = ni.address_families[family] print (' Address family: %s' % fam_name) for addr in addrs: if fam_name == 'AF_INET6': ipv6_addr[interface] = addr['addr'] print (' Address : %s' % addr['addr']) nmask = addr.get('netmask', None) if nmask: print (' Netmask : %s' % nmask) bcast = addr.get('broadcast', None) if bcast: print (' Broadcast: %s' % bcast) if ipv6_addr: print ("Found IPv6 address: %s" %ipv6_addr) else: print ("No IPv6 interface found!") if __name__ == '__main__': inspect_ipv6_support() The output from this script will be as follows: $ python 3_10_check_ipv6_support.py IPV6 support built into Python: True Interface lo: Address family: AF_PACKET Address : 00:00:00:00:00:00 Address family: AF_INET Address : 127.0.0.1 Netmask : 255.0.0.0 Address family: AF_INET6 Address : ::1 Netmask : ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff/128 Interface enp2s0: Address family: AF_PACKET Address : 9c:5c:8e:26:a2:48 Broadcast: ff:ff:ff:ff:ff:ff Address family: AF_INET Address : 130.104.228.90 Netmask : 255.255.255.128 Broadcast: 130.104.228.127 Address family: AF_INET6 Address : 2001:6a8:308f:2:88bc:e3ec:ace4:3afb Netmask : ffff:ffff:ffff:ffff::/64 Address : 2001:6a8:308f:2:5bef:e3e6:82f8:8cca Netmask : ffff:ffff:ffff:ffff::/64 Address : fe80::66a0:7a3f:f8e9:8c03%enp2s0 Netmask : ffff:ffff:ffff:ffff::/64 Interface wlp1s0: Address family: AF_PACKET Address : c8:ff:28:90:17:d1 Broadcast: ff:ff:ff:ff:ff:ff Found IPv6 address: {'lo': '::1', 'enp2s0': 'fe80::66a0:7a3f:f8e9:8c03%enp2s0'} How it works... The IPv6 support checker function, inspect_ipv6_support(), first checks if Python is built with IPv6 using socket.has_ipv6. Next, we call the interfaces() function from the netifaces module. This gives us the list of all interfaces. If we call the ifaddresses() method by passing a network interface to it, we can get all the IP addresses of this interface. We then extract various IP-related information, such as protocol family, address, netmask, and broadcast address. Then, the address of a network interface has been added to the IPv6_address dictionary if its protocol family matches AF_INET6. Extracting an IPv6 prefix from an IPv6 address In your IPv6 application, you need to dig out the IPv6 address for getting the prefix information. Note that the upper 64-bits of an IPv6 address are represented from a global routing prefix plus a subnet ID, as defined in RFC 3513. A general prefix (for example, /48) holds a short prefix based on which a number of longer, more specific prefixes (for example, /64) can be defined. A Python script can be very helpful in generating the prefix information. How to do it... We can use the netifaces and netaddr third-party libraries to find out the IPv6 prefix information for a given IPv6 address. Make sure to have netifaces and netaddr installed in your system. $ pip install netaddr The program is as follows: #!/usr/bin/env python # This program is optimized for Python 2.7.12 and Python 3.5.2. # It may run on any other version with/without modifications. # This program depends on Python modules netifaces and netaddr. import socket import netifaces as ni import netaddr as na def extract_ipv6_info(): """ Extracts IPv6 information""" print ("IPv6 support built into Python: %s" %socket.has_ipv6) for interface in ni.interfaces(): all_addresses = ni.ifaddresses(interface) print ("Interface %s:" %interface) for family,addrs in all_addresses.items(): fam_name = ni.address_families[family] for addr in addrs: if fam_name == 'AF_INET6': addr = addr['addr'] has_eth_string = addr.split("%eth") if has_eth_string: addr = addr.split("%eth")[0] try: print (" IP Address: %s" %na.IPNetwork(addr)) print (" IP Version: %s" %na.IPNetwork(addr).version) print (" IP Prefix length: %s" %na.IPNetwork(addr).prefixlen) print (" Network: %s" %na.IPNetwork(addr).network) print (" Broadcast: %s" %na.IPNetwork(addr).broadcast) except Exception as e: print ("Skip Non-IPv6 Interface") if __name__ == '__main__': extract_ipv6_info() The output from this script is as follows: $ python 3_11_extract_ipv6_prefix.py IPv6 support built into Python: True Interface lo: IP Address: ::1/128 IP Version: 6 IP Prefix length: 128 Network: ::1 Broadcast: ::1 Interface enp2s0: IP Address: 2001:6a8:308f:2:88bc:e3ec:ace4:3afb/128 IP Version: 6 IP Prefix length: 128 Network: 2001:6a8:308f:2:88bc:e3ec:ace4:3afb Broadcast: 2001:6a8:308f:2:88bc:e3ec:ace4:3afb IP Address: 2001:6a8:308f:2:5bef:e3e6:82f8:8cca/128 IP Version: 6 IP Prefix length: 128 Network: 2001:6a8:308f:2:5bef:e3e6:82f8:8cca Broadcast: 2001:6a8:308f:2:5bef:e3e6:82f8:8cca Skip Non-IPv6 Interface Interface wlp1s0: How it works... Python's netifaces module gives us the network interface IPv6 address. It uses the interfaces() and ifaddresses() functions for doing this. The netaddr module is particularly helpful to manipulate a network address. It has a IPNetwork() class that provides us with an address, IPv4 or IPv6, and computes the prefix, network, and broadcast addresses. Here, we find this information class instance's version, prefixlen, and network and broadcast attributes. Writing an IPv6 echo client/server You need to write an IPv6 compliant server or client and wonder what could be the differences between an IPv6 compliant server or client and its IPv4 counterpart. How to do it... We use the same approach as writing an echo client/server using IPv6. The only major difference is how the socket is created using IPv6 information. Listing 12a shows an IPv6 echo server, as follows: #!/usr/bin/env python # This program is optimized for Python 2.7.12 and Python 3.5.2. # It may run on any other version with/without modifications. import argparse import socket import sys HOST = 'localhost' def echo_server(port, host=HOST): """Echo server using IPv6 """ for result in socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM, 0, socket.AI_PASSIVE): af, socktype, proto, canonname, sa = result try: sock = socket.socket(af, socktype, proto) except socket.error as err: print ("Error: %s" %err) try: sock.bind(sa) sock.listen(1) print ("Server lisenting on %s:%s" %(host, port)) except socket.error as msg: sock.close() continue break sys.exit(1) conn, addr = sock.accept() print ('Connected to', addr) while True: data = conn.recv(1024) print ("Received data from the client: [%s]" %data) if not data: break conn.send(data) print ("Sent data echoed back to the client: [%s]" %data) conn.close() if __name__ == '__main__': parser = argparse.ArgumentParser(description='IPv6 Socket Server Example') parser.add_argument('--port', action="store", dest="port", type=int, required=True) given_args = parser.parse_args() port = given_args.port echo_server(port) Listing 12b shows an IPv6 echo client, as follows: #!/usr/bin/env python # Python Network Programming Cookbook, Second Edition -- Article - 3 # This program is optimized for Python 2.7.12 and Python 3.5.2. # It may run on any other version with/without modifications. import argparse import socket import sys HOST = 'localhost' BUFSIZE = 1024 def ipv6_echo_client(port, host=HOST): for res in socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) except socket.error as err: print ("Error:%s" %err) try: sock.connect(sa) except socket.error as msg: sock.close() continue if sock is None: print ('Failed to open socket!') sys.exit(1) msg = "Hello from ipv6 client" print ("Send data to server: %s" %msg) sock.send(bytes(msg.encode('utf-8'))) while True: data = sock.recv(BUFSIZE) print ('Received from server', repr(data)) if not data: break sock.close() if __name__ == '__main__': parser = argparse.ArgumentParser(description='IPv6 socket client example') parser.add_argument('--port', action="store", dest="port", type=int, required=True) given_args = parser.parse_args() port = given_args.port ipv6_echo_client(port) The server output is as follows: $ python 3_12a_ipv6_echo_server.py --port=8800 Server lisenting on localhost:8800 ('Connected to', ('127.0.0.1', 56958)) Received data from the client: [Hello from ipv6 client] Sent data echoed back to the client: [Hello from ipv6 client] The client output is as follows: $ python 3_12b_ipv6_echo_client.py --port=8800 Send data to server: Hello from ipv6 client ('Received from server', "'Hello from ipv6 client'") The following screenshot indicates the server and client output: How it works... The IPv6 echo server first determines its IPv6 information by calling socket.getaddrinfo(). Notice that we passed the AF_UNSPEC protocol for creating a TCP socket. The resulting information is a tuple of five values. We use three of them, address family, socket type, and protocol, to create a server socket. Then, this socket is bound with the socket address from the previous tuple. It then listens to the incoming connections and accepts them. After a connection is made, it receives data from the client and echoes it back. On the client-side code, we create an IPv6-compliant client socket instance and send the data using the send() method of that instance. When the data is echoed back, the recv() method is used to get it back. Summary In this article, the author has tried to explain certain recipes that explains the various IPv6 utilities in Python including an IPv6 client/server. Also some other protocols like ICMP ping and their working is touched upon throroughly. Scapy is explained so as to give a even better understanding about its popularity amongst the network Python programmers. Resources for Article: Further resources on this subject: Introduction to Network Security [article] Getting Started with Cisco UCS and Virtual Networking [article] Revisiting Linux Network Basics [article]
Read more
  • 0
  • 0
  • 22745

article-image-debugging-yournet
Packt
21 Feb 2018
11 min read
Save for later

Debugging Your.Net

Packt
21 Feb 2018
11 min read
In this article,Ya nnick Lefebvre, author of Wordpress Plugin Development Cookbook - Second Edition, we will cover the following recipes: Creating a new shortcode with parameters Managing multiple sets of user settings from a single admin page WordPress shortcodes are a simple, yet powerful tool that can be used to automate the insertion of code into web pages. For example, a shortcode could be used to automate the insertion of videos from a third-party platform that is not supported natively by WordPress, or embed content from a popular web site. By following the two code samples found in this article, you will learn how to create a WordPress plugin that defines your own shortcode to be able to quickly embed Twitter feeds on a web site. You will also learn how to create an administration configuration panel to be able to create a set of configurations that can be referenced when using your newly-created shortcode. Creating a new shortcode with parameters While simple shortcodes already provide a lot of potential to output complex content to a page by entering a few characters in the post editor, shortcodes become even more useful when they are coupled with parameters that will be passed to their associated processing function. Using this technique, it becomes very easy to create a shortcode that accelerates the insertion of external content in WordPress posts or pages by only needing to specify the shortcode and the unique identifier of the source element to be displayed. We will illustrate this concept in this recipe by creating a shortcode that will be used to quickly add Twitter feeds to posts or pages. How to do it... Navigate to the WordPress plugin directory of your development installation. Create a new directory called ch2-twitter-embed. Navigate to this directory and create a new text file called ch2-twitter-embed.php. Open the new file in a code editor and add an appropriate header at the top of the plugin file, naming the plugin Chapter 2 - Twitter Embed. Add the following line of code to declare a new shortcode and specify the name of the function that should be called when the shortcode is found in posts or pages: add_shortcode( 'twitterfeed', 'ch2te_twitter_embed_shortcode' ); Add the following code section to provide an implementation for the                     ch2te_twitter_embed_shortcode function:function ch2te_twitter_embed_shortcode( $atts ) { extract( shortcode_atts( array( 'user_name' => 'ylefebvre' ), $atts ) ); if ( !empty( $user_name ) ) { $output = '<a class="twitter-timeline" href="'; $output .= esc_url( 'https://twitter.com/' . $user_name ); $output .= '">Tweets by ' . esc_html( $user_name ); $output .= '</a><script async '; $output .= 'src="//platform.twitter.com/widgets.js"'; $output .= ' charset="utf-8"></script>'; } else { $output = ''; } return $output; } Save and close the plugin file. Log in to the administration page of your development WordPress installation. Click on Plugins in the left-hand navigation menu. Activate your new plugin. Create a new page and use the shortcode [twitterfeed user_name='WordPress'] in the page editor, where WordPress is the Twitter username of the feed to display: Save and view the page to see that the shortcode was replaced by an embedded Twitter feed on your site. Edit the page and remove the user_name parameter and its associated value, only leaving the core [twitterfeed] shortcode in the post and Save. Refresh the page and see that the feed is still being displayed but now shows tweets from another account. How it works... When shortcodes are used with parameters, these extra pieces of data are sent to the associated processing function in the $atts parameter variable. By using a combination of the standard PHP extract and WordPress-specific shortcode_atts functions, our plugin is able to parse the data sent to the shortcode and create an array of identifiers and values that are subsequently transformed into PHP variables that we can use in the rest of our shortcode implementation function. In this specific example, we expect a single variable to be used, called user_name, which will be stored in a PHP variable called $user_name. If the user enters the shortcode without any parameter, a default value of ylefebvre will be assigned to the username variable to ensure that the plugin still works. Since we are going to accept user input in this code, we also verify that the user did not provide an empty string and we use the esc_html and esc_url functions to remove any potentially harmful HTML characters from the input string and make sure that the link destination URL is valid. Once we have access to the twitter username, we can put together the required HTML code that will embed a Twitter feed in our page and display the selected user's tweets. While this example only has one argument, it is possible to define multiple parameters for a shortcode. Managing multiple sets of user settings from a single admin page Throughout this article, you have learned how to create configuration pages to manage single sets of configuration options for our plugins. In some cases, only being able to specify a single set of options will not be enough. For example, looking back at the Twitter embed shortcode plugin that was created, a single configuration panel would only allow users to specify one set of options, such as the desired twitter feed dimensions or the number of tweets to display. A more flexible solution would be to allow users to specify multiple sets of configuration options, which could then be called up by using an extra shortcode parameter (for example, [twitterfeed user_name="WordPress" option_id="2"]). While the first thought that might cross your mind to configure such a plugin is to create a multi-level menu item with submenus to store a number of different settings, this method would produce a very awkward interface for users to navigate. A better way is to use a single panel but give the user a way to select between multiple sets of options to be modified. In this recipe, you will learn how to enhance the previously created Twitter feed shortcode plugin to be able to control the embedded feed size and number of tweets to display from the plugin configuration panel and to give users the ability to specify multiple display sizes. Getting ready You should have already followed the Creating a new shortcode with parameters recipe in the article to have a starting point for this recipe. Alternatively, you can get the resulting code (Chapter 2/ch2-twitter-embed/ch2-twitter-embed.php) from the downloaded code bundle. How to do it... Navigate to the ch2-twitter-embed folder of the WordPress plugin directory of your development installation. Open the ch2-twitter-embed.php file in a text editor. Add the following lines of code to implement an activation callback to initialize plugin options when it is installed or upgraded: ch2te_twitter_embed_shortcode function:function ch2te_twitter_embed_shortcode( $atts ) { extract( shortcode_atts( array( 'user_name' => 'ylefebvre' ), $atts ) ); if ( !empty( $user_name ) ) { $output = '<a class="twitter-timeline" href="'; $output .= esc_url( 'https://twitter.com/' . $user_name ); $output .= '">Tweets by ' . esc_html( $user_name ); $output .= '</a><script async '; $output .= 'src="//platform.twitter.com/widgets.js"'; $output .= ' charset="utf-8"></script>'; } else { $output = ''; } return $output; } Insert the following code segment to register a function to be called when the administration menu is put together. When this happens, the callback function adds an item to the Settings menu and specifies the function to be called to render the configuration page: ch2te_twitter_embed_shortcode function:function ch2te_twitter_embed_shortcode( $atts ) { extract( shortcode_atts( array( 'user_name' => 'ylefebvre' ), $atts ) ); if ( !empty( $user_name ) ) { $output = '<a class="twitter-timeline" href="'; $output .= esc_url( 'https://twitter.com/' . $user_name ); $output .= '">Tweets by ' . esc_html( $user_name ); $output .= '</a><script async '; $output .= 'src="//platform.twitter.com/widgets.js"'; $output .= ' charset="utf-8"></script>'; } else { $output = ''; } return $output; } Add the following code to implement the configuration page rendering function: ch2te_twitter_embed_shortcode function:function ch2te_twitter_embed_shortcode( $atts ) { extract( shortcode_atts( array( 'user_name' => 'ylefebvre' ), $atts ) ); if ( !empty( $user_name ) ) { $output = '<a class="twitter-timeline" href="'; $output .= esc_url( 'https://twitter.com/' . $user_name ); $output .= '">Tweets by ' . esc_html( $user_name ); $output .= '</a><script async '; $output .= 'src="//platform.twitter.com/widgets.js"'; $output .= ' charset="utf-8"></script>'; } else { $output = ''; } return $output; } Add the following block of code to register a function that will process user options when submitted to the site: add_action( 'admin_init', 'ch2te_admin_init' ); function ch2te_admin_init() { add_action( 'admin_post_save_ch2te_options', 'process_ch2te_options' ); } Add the following code to implement the process_ch2te_options function, declared in the previous block of code, and to declare a utility function used to clean the redirection path: function ch2te_twitter_embed_shortcode( $atts ) { extract( shortcode_atts( array( 'user_name' => 'ylefebvre', 'option_id' => '1' ), $atts ) ); if ( intval( $option_id ) < 1 || intval( $option_id ) > 5 ) { $option_id = 1; } $options = ch2te_get_options( $option_id ); if ( !empty( $user_name ) ) { $output = '<a class="twitter-timeline" href="'; $output .= esc_url( 'https://twitter.com/' . $user_name ); $output .= '" data-width="' . $options['width'] . '" '; $output .= 'data-tweet-limit="' . $options['number_of_tweets']; $output .= '">' . 'Tweets by ' . esc_html( $user_name ); $output .= '</a><script async '; $output .= 'src="//platform.twitter.com/widgets.js"'; $output .= ' charset="utf-8"></script>'; } else { $output = ''; } return $output; } Find the ch2te_twitter_embed_shortcode function and modify it as follows to accept the new option_id parameter and load the plugin options to produce the desired output. The changes are identified in bold within the recipe: function ch2te_twitter_embed_shortcode( $atts ) { extract( shortcode_atts( array( 'user_name' => 'ylefebvre', 'option_id' => '1' ), $atts ) ); if ( intval( $option_id ) < 1 || intval( $option_id ) > 5 ) { $option_id = 1; } $options = ch2te_get_options( $option_id ); if ( !empty( $user_name ) ) { $output = '<a class="twitter-timeline" href="'; $output .= esc_url( 'https://twitter.com/' . $user_name ); $output .= '" data-width="' . $options['width'] . '" '; $output .= 'data-tweet-limit="' . $options['number_of_tweets']; $output .= '">' . 'Tweets by ' . esc_html( $user_name ); $output .= '</a><script async '; $output .= 'src="//platform.twitter.com/widgets.js"'; $output .= ' charset="utf-8"></script>'; } else { $output = ''; } return $output; } Save and close the plugin file. Deactivate and then Activate the Chapter 2 - Twitter Embed plugin from the administration interface to execute its activation function and create default settings. Navigate to the Settings menu and select the Twitter Embed submenu item to see the newly created configuration panel with the first set of options being displayed and more sets of options accessible through the drop-down list shown at the top of the page. To select the set of options to be used, add the parameter option_id to the shortcode used to display a Twitter feed, as follows: [twitterfeed user_name="WordPress" option_id="1"] How it works... This recipe shows how we can leverage options arrays to create multiple sets of options simply by creating the name of the options array on the fly. Instead of having a specific option name in the first parameter of the get_option function call, we create a string with an option ID. This ID is sent through as a URL parameter on the configuration page and as a hidden text field when processing the form data. On initialization, the plugin only creates a single set of options, which is probably enough for most casual users of the plugin. Doing so will avoid cluttering the site database with useless options. When the user requests to view one of the empty option sets, the plugin creates a new set of options right before rendering the options page. The rest of the code is very similar to the other examples that we saw in this article, since the way to access the array elements remains the same. Summary In this article, the author has explained about the entire process of how to create a new shortcode with parameters and how to manage multiple sets of user settings from a single admin page.   Resources for Article:   Further resources on this subject:Introduction to WordPress Plugin [article] Wordpress: Buddypress Courseware [article] Anatomy of a WordPress Plugin [article]
Read more
  • 0
  • 0
  • 13152

article-image-getting-started-raspberry-pi
Packt
21 Feb 2018
7 min read
Save for later

Getting Started on the Raspberry Pi

Packt
21 Feb 2018
7 min read
 In this article, by Soham Chetan Kamani, author of the book Full Stack Web Development with Raspberry Pi 3, we will cover the marvel of the Raspberry Pi, however, doesn’t end here. It’s extreme portability means we can now do things which were not previously possible with traditional desktop computers. The GPIO pins give us easy access to interface with external devices. This allows the Pi to act as a bridge between embedded electronics and sensors, and the power that linux gives us. In essence, we can run any code in our favorite programming language (which can run on linux), and interface it directly to outside hardware quickly and easily. Once we couple this with the wireless networking capabilities introduced in the Raspberry Pi 3, we gain the ability to make applications that would not have been feasible to make before this device existed.and Scar de Courcier, authors of Windows Forensics Cookbook The Raspberry Pi has become hugely popular as a portable computer, and for good reason. When it comes to what you can do with this tiny piece of technology, the sky’s the limit. Back in the day, computers used to be the size of entire neighborhood blocks, and only large corporations doing expensive research could afford them. After that we went on to embrace personal computers, which were still a bit expensive, but, for the most part, could be bought by the common man. This brings us to where we are today, where we can buy a fully functioning Linux computer, which is as big as a credit card, for under 30$. It is truly a huge leap in making computers available to anyone and everyone. (For more resources related to this topic, see here.)  Web development and portable computing have come a long way. A few years ago we couldn’t dream of making a rich, interactive, and performant application which runs on the browser. Today, not only can we do that, but also do it all in the palm of our hands (quite literally). When we think of developing an application that uses databases, application servers, sockets, and cloud APIs, the picture that normally comes to mind is that of many server racks sitting in a huge room. In this book however, we are going to implement all of that using only the Raspberry Pi. In this article, we will go through the concept of the internet of things, and discuss how web development on the Raspberry Pi can help us get there. Following this, we will also learn how to set up our Raspberry Pi and access it from our computer. We will cover the following topics: The internet of things Our application Setting up Raspberry Pi Remote access The Internet of things (IOT) The web has until today been a network of computers exchanging data. The limitation of this was that it was a closed loop. People could send and receive data from other people via their computers, but rarely much else. The internet of things, in contrast, is a network of devices or sensors that connect the outside world to the internet. Superficially, nothing is different: the internet is still a network of computers. What has changed, is that now, these computers are collecting and uploading data from things instead of people. This now allows anyone who is connected to obtain information that is not collected by a human. The internet of things as a concept has been around for a long time, but it is only now that almost anyone can connect a sensor or device to the cloud, and the IOT revolution was hugely enabled by the advent of portable computing, which was led by the Raspberry Pi.  A brief look at our application Throughout this book, we are going to go through different components and aspects of web development and embedded systems. These are all going to be held together by our central goal of making an entire web application capable of sensing and displaying the surrounding temperature and humidity. In order to make a properly functioning system, we have to first build out the individual parts. More difficult still, is making sure all the parts work well together. Keeping this in mind, let's take a look at the different components of our technology stack, and the problems that each of them solve : The sensor interface - Perception The sensor is what connects our otherwise isolated application to the outside world. The sensor will be connected to the GPIO pins of the Raspberry pi. We can interface with the sensor through various different native libraries. This is the starting point of our data. It is where all the data that is used by our application is created. If you think about it, every other component of our technology stack only exists to manage, manipulate, and display the data collected from the sensor. The database - Persistence "Data" is the term we give to raw information, which is information that we cannot easily aggregate or understand. Without a way to store and meaningfully process and retrieve this data, it will always remain "data" and never "information", which is what we actually want. If we just hook up a sensor and display whatever data it reads, we are missing out on a lot of additional information. Let's take the example of temperature: What if we wanted to find out how the temperature was changing over time? What if we wanted to find the maximum and minimum temperatures for a particular day, or a particular week, or even within a custom duration of time? What if we wanted to see temperature variation across locations? There is no way we could do any of this with only the sensor. We also need some sort of persistence and structure to our data, and this is exactly what the database provides for us. If we structure our data correctly, getting the answers to the above questions is just a matter of a simple database query. The user interface - Presentation The user interface is the layer which connects our application to the end user. One of the most challenging aspects of software development is to make information meaningful and understandable to regular users of our application. The UI layer serves exactly this purpose: it takes relevant information and shows it in such a way that it is easily understandable to humans. How do we achieve such a level of understandability with such a large amount of data? We use visual aids: like colors, charts and diagrams (just like how the diagrams in this book make its information easier to understand). An important thing for any developer to understand is that your end user actually doesn't care about any of the the back-end stuff. The only thing that matters to them is a good experience. Of course, all the 0ther components serve to make the users experience better, but it's really the user facing interface that leaves the first impression, and that's why it's so important to do it well. The application server - Middleware This layer consists of the actual server side code we are going to write to get the application running. It is also called "middleware". In addition to being in the exact center of the architecture diagram, this layer also acts as the controller and middle-man for the other layers. The HTML pages that form the UI are served through this layer. All the database queries that we were talking about earlier are made here. The code that runs in this layer is responsible for retrieving the sensor readings from our external pins and storing the data in our database. Summary We are just warming up! In this article we got a brief introduction to the concept of the internet of things. We then went on to look at an overview of what we were going to build throughout the rest of this book, and saw how the Raspberry Pi can help us get there. Resources for Article:   Further resources on this subject: Clusters, Parallel Computing, and Raspberry Pi – A Brief Background [article] Setting up your Raspberry Pi [article] Welcome to JavaScript in the full stack [article]
Read more
  • 0
  • 0
  • 10346

article-image-classify-emails-using-deep-neural-networks-generating-tf-idf
Savia Lobo
21 Feb 2018
9 min read
Save for later

How to classify emails using deep neural networks after generating TF-IDF

Savia Lobo
21 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Natural Language Processing with Python Cookbook written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti. This book will teach you how to efficiently use NLTK and implement text classification, identify parts of speech, tag words, and more. You will also learn how to analyze sentence structures and master lexical analysis, syntactic and semantic analysis, pragmatic analysis, and application of deep learning techniques.[/box] In this article, you will learn how to use deep neural networks to classify emails into one of the 20 pre-trained categories based on the words present in each email. This is a simple model to start with understanding the subject of deep learning and its applications on NLP. Getting ready The 20 newsgroups dataset from scikit-learn have been utilized to illustrate the concept. Number of observations/emails considered for analysis are 18,846 (train observations - 11,314 and test observations - 7,532) and its corresponding classes/categories are 20, which are shown in the following: >>> from sklearn.datasets import fetch_20newsgroups >>> newsgroups_train = fetch_20newsgroups(subset='train') >>> newsgroups_test = fetch_20newsgroups(subset='test') >>> x_train = newsgroups_train.data >>> x_test = newsgroups_test.data >>> y_train = newsgroups_train.target >>> y_test = newsgroups_test.target >>> print ("List of all 20 categories:") >>> print (newsgroups_train.target_names) >>> print ("n") >>> print ("Sample Email:") >>> print (x_train[0]) >>> print ("Sample Target Category:") >>> print (y_train[0]) >>> print (newsgroups_train.target_names[y_train[0]]) In the following screenshot, a sample first data observation and target class category has been shown. From the first observation or email we can infer that the email is talking about a two-door sports car, which we can classify manually into autos category which is 8. Note: Target value is 7 due to the indexing starts from 0), which is validating our understanding with actual target class 7. How to do it… Using NLP techniques, we have pre-processed the data for obtaining finalized word vectors to map with final outcomes spam or ham. Major steps involved are:    Pre-processing.    Removal of punctuations.    Word tokenization.    Converting words into lowercase.    Stop word removal.    Keeping words of length of at least 3.    Stemming words.    POS tagging.    Lemmatization of words: TF-IDF vector conversion. Deep learning model training and testing. Model evaluation and results discussion. How it works... The NLTK package has been utilized for all the pre-processing steps, as it consists of all the necessary NLP functionality under one single roof: # Used for pre-processing data >>> import nltk >>> from nltk.corpus import stopwords >>> from nltk.stem import WordNetLemmatizer >>> import string >>> import pandas as pd >>> from nltk import pos_tag >>> from nltk.stem import PorterStemmer The function written (pre-processing) consists of all the steps for convenience. However, we will be explaining all the steps in each section: >>> def preprocessing(text): The following line of the code splits the word and checks each character to see if it contains any standard punctuations, if so it will be replaced with a blank or else it just don't replace with blank: ... text2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in text]).split()) The following code tokenizes the sentences into words based on whitespaces and puts them together as a list for applying further steps: ... tokens = [word for sent in nltk.sent_tokenize(text2) for word in nltk.word_tokenize(sent)] Converting all the cases (upper, lower and proper) into lower case reduces duplicates in corpus: ... tokens = [word.lower() for word in tokens] As mentioned earlier, Stop words are the words that do not carry much of weight in understanding the sentence; they are used for connecting words and so on. We have removed them with the following line of code: ... stopwds = stopwords.words('english') ... tokens = [token for token in tokens if token not in stopwds] Keeping only the words with length greater than 3 in the following code for removing small words which hardly consists of much of a meaning to carry; ... tokens = [word for word in tokens if len(word)>=3] Stemming applied on the words using Porter stemmer which stems the extra suffixes from the words: ... stemmer = PorterStemmer() ... tokens = [stemmer.stem(word) for word in tokens] POS tagging is a prerequisite for lemmatization, based on whether word is noun or verb or and so on. it will reduce it to the root word ... tagged_corpus = pos_tag(tokens) pos_tag function returns the part of speed in four formats for Noun and six formats for verb. NN - (noun, common, singular), NNP - (noun, proper, singular), NNPS - (noun, proper, plural), NNS - (noun, common, plural), VB - (verb, base form), VBD - (verb, past tense), VBG - (verb, present participle), VBN - (verb, past participle), VBP - (verb, present tense, not 3rd person singular), VBZ - (verb, present tense, third person singular) ... Noun_tags = ['NN','NNP','NNPS','NNS'] ... Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ'] ... lemmatizer = WordNetLemmatizer() The following function, prat_lemmatize, has been created only for the reasons of mismatch between the pos_tag function and intake values of lemmatize function. If the tag for any word falls under the respective noun or verb tags category, n or v will be applied accordingly in lemmatize function: ... def prat_lemmatize(token,tag): ...      if tag in Noun_tags: ...          return lemmatizer.lemmatize(token,'n') ...      elif tag in Verb_tags: ...          return lemmatizer.lemmatize(token,'v') ...      else: ...          return lemmatizer.lemmatize(token,'n') After performing tokenization and applied all the various operations, we need to join it back to form stings and the following function performs the same: ... pre_proc_text =   " ".join([prat_lemmatize(token,tag) for token,tag in tagged_corpus]) ... return pre_proc_text Applying pre-processing on train and test data: >>> x_train_preprocessed = [] >>> for i in x_train: ... x_train_preprocessed.append(preprocessing(i)) >>> x_test_preprocessed = [] >>> for i in x_test: ... x_test_preprocessed.append(preprocessing(i)) # building TFIDF vectorizer >>> from sklearn.feature_extraction.text import TfidfVectorizer >>> vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english', max_features= 10000,strip_accents='unicode', norm='l2') >>> x_train_2 = vectorizer.fit_transform(x_train_preprocessed).todense() >>> x_test_2 = vectorizer.transform(x_test_preprocessed).todense() After the pre-processing step has been completed, processed TF-IDF vectors have to be sent to the following deep learning code: # Deep Learning modules >>> import numpy as np >>> from keras.models import Sequential >>> from keras.layers.core import Dense, Dropout, Activation >>> from keras.optimizers import Adadelta,Adam,RMSprop >>> from keras.utils import np_utils The following image produces the output after firing up the preceding Keras code. Keras has been installed on Theano, which eventually works on Python. A GPU with 6 GB memory has been installed with additional libraries (CuDNN and CNMeM) for four to five times faster execution, with a choking of around 20% memory; hence only 80% memory out of 6 GB is available; The following code explains the central part of the deep learning model. The code is self- explanatory, with the number of classes considered 20, batch size 64, and number of epochs to train, 20: # Definition hyper parameters >>> np.random.seed(1337) >>> nb_classes = 20 >>> batch_size = 64 >>> nb_epochs = 20 The following code converts the 20 categories into one-hot encoding vectors in which 20 columns are created and the values against the respective classes are given as 1. All other classes are given as 0: >>> Y_train = np_utils.to_categorical(y_train, nb_classes) In the following building blocks of Keras code, three hidden layers (1000, 500, and 50 neurons in each layer respectively) are used, with dropout as 50% for each layer with Adam as an optimizer: #Deep Layer Model building in Keras #del model >>> model = Sequential() >>> model.add(Dense(1000,input_shape= (10000,))) >>> model.add(Activation('relu')) >>> model.add(Dropout(0.5)) >>> model.add(Dense(500)) >>> model.add(Activation('relu')) >>> model.add(Dropout(0.5)) >>> model.add(Dense(50)) >>> model.add(Activation('relu')) >>> model.add(Dropout(0.5)) >>> model.add(Dense(nb_classes)) >>> model.add(Activation('softmax')) >>> model.compile(loss='categorical_crossentropy', optimizer='adam') >>> print (model.summary()) The architecture is shown as follows and describes the flow of the data from a start of 10,000 as input. Then there are 1000, 500, 50, and 20 neurons to classify the given email into one of the 20 categories: The model is trained as per the given metrics: # Model Training >>> model.fit(x_train_2, Y_train, batch_size=batch_size, epochs=nb_epochs,verbose=1) The model has been fitted with 20 epochs, in which each epoch took about 2 seconds. The loss has been minimized from 1.9281 to 0.0241. By using CPU hardware, the time required for training each epoch may increase as a GPU massively parallelizes the computation with thousands of threads/cores: Finally, predictions are made on the train and test datasets to determine the accuracy, precision, and recall values: #Model Prediction >>> y_train_predclass = model.predict_classes(x_train_2,batch_size=batch_size) >>> y_test_predclass = model.predict_classes(x_test_2,batch_size=batch_size) >>> from sklearn.metrics import accuracy_score,classification_report >>> print ("nnDeep Neural Network - Train accuracy:"),(round(accuracy_score( y_train, y_train_predclass),3)) >>> print ("nDeep Neural Network - Test accuracy:"),(round(accuracy_score( y_test,y_test_predclass),3)) >>> print ("nDeep Neural Network - Train Classification Report") >>> print (classification_report(y_train,y_train_predclass)) >>> print ("nDeep Neural Network - Test Classification Report") >>> print (classification_report(y_test,y_test_predclass)) It appears that the classifier is giving a good 99.9% accuracy on the train dataset and 80.7% on the test dataset. We learned the classification of emails using DNNs(Deep Neural Networks) after generating TF-IDF. If you found this post useful, do check out this book Natural Language Processing with Python Cookbook  to further analyze sentence structures and application of various deep learning techniques.    
Read more
  • 0
  • 0
  • 14680
article-image-handle-missing-data-ibm-spss-modeler
Amey Varangaonkar
21 Feb 2018
8 min read
Save for later

How to handle missing data in IBM SPSS Modeler

Amey Varangaonkar
21 Feb 2018
8 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book IBM SPSS Modeler Essentials written by Keith McCormick and Jesus Salcedo. This book gets you up and running with the fundamentals of SPSS Modeler, a premium tool for data mining and predictive analytics.[/box] In today’s tutorial we will demonstrate how easy it is to work with missing values in a dataset using the SPSS Modeler. Missing data is different than other topics in data modeling that you cannot choose to ignore . This is because failing to make a choice just means you are using the default option for a procedure, which most of the time is not optimal. In fact, it is important to remember that every model deals with missing data in a certain way, and some modeling techniques handle missing data better than others. In SPSS Modeler, there are four types of missing data: Type of missing data Definition $Null$ value Applies only to numeric fields. This is a cell that is empty or has an illegal value White space Applies only to string fields. This is a cell that is empty or has spaces. Empty string Applies only to string fields. This is a cell that is empty. Empty string is a subset of white space Blank value This is predefined code, and it applies to any type of field The first step in dealing with missing data is to assess the type and amount of missing data for each field. Consider whether there is a pattern as to why data might be missing. This can help determine if missing values could have affected responses. Only then can we decide how to handle it. There are two problems associated with missing data, and these affect the quantity and quality of the data: Missing data reduces sample size (quantity) Responders may be different from non-responders (quality—there could be biased results) Ways to address missing data There are three ways to address missing data: Remove fields Remove cases Impute missing values It can be necessary at times to remove fields with a large proportion of missing values. The easiest way to remove fields is to use a Filter node (discussed later in the book), however you can also use the Data Audit node to do this. [box type="info" align="" class="" width=""]Note that in some cases missing data can be predictive of behavior, so it is important to assess the importance of a variable before removing a field.[/box] In some situations, it may be necessary to remove cases instead of fields. For example, you may be developing a predictive model to predict customers' purchasing behavior and you simply do not have enough information concerning new customers. The easiest way to remove cases would be to use a Select node (discussed in the next chapter); however, you can also use the Data Audit node to do this. Imputing missing values implies replacing values for fields. However, some people do not estimate values for categorical fields because it does not seem right. In general, it is easier to estimate missing values for numeric fields, such as age, where often analysts will use the mean, median, or mode. [box type="info" align="" class="" width=""]Note that it is not a good idea to estimate missing data if you are missing a large percentage of information for that field, because estimates will not be accurate. Typically, we try not to impute more than 5% of values.[/box] To close out of the Data Audit node: Click OK to return to the stream canvas Defining missing values in the Type node When working with missing data, the first thing you need to do is define the missing data so that Modeler knows there is missing data, otherwise Modeler will think that the missing data is another value for a field (which, in some situations, it is, as in our dataset, but quite often this is not the case). Although the Data Audit node provides a report of missing values, blank values need to be defined within a Type node (or Type tab of a source node) in order for these to be identified by the Data Audit node. The Type tab (or node) is the only place where users can define missing values (Missing column). [box type="info" align="" class="" width=""]Note that in the Type node, blank values and $null$ values are not shown; however, empty strings and white space are depicted by "" or " ".[/box] To define blank values: Edit the Var.File node. Click on the Types tab. Click on the Missing cell for the field Region. Select Specify in the Missing column. Click Define blanks. Selecting Define blanks chooses Null and White space (remember, Empty String is a subset of White space, so it is also selected), and in this way these types of missing data are specified. To specify a predefined code, or a blank value, you can add each individual value to a separate cell in the Missing values area, or you can enter a range of numeric values if they are consecutive. 6. Type "Not applicable" in the first Missing values cell. 7. Hit Enter: We have now specified that "Not applicable" is a code for missing data for the field Region. 8. Click OK. In our dataset, we will only define one field as having missing data. 9. Click on the Clear Values button. 10. Click on the Read Values button: The asterisk indicates that missing values have been defined for the field Region. Now Not applicable is no longer considered a valid value for the field Region, but it will still be shown in graphs and other output. However, models will now treat the category Not applicable as a missing value. 11. Click OK. Imputing missing values with the Data Audit node As we have seen, the Data Audit node allows you to identify missing values so that you can get a sense of how much missing data you have. However, the Data Audit node also allows you to remove fields or cases that have missing data, as well as providing several options for data imputation: Rerun the Data Audit node. Note that the field Region only has 15,774 valid cases now, because we have correctly identified that the Not applicable category was a predefined code for missing data. 2. Click on the Quality tab. We are not going to impute any missing values in this example because it is not necessary, but we are going to show you some of the options, since these will be useful in other situations. To impute missing values you first need to specify when you want to impute missing values. For example: 3. Click in the Impute when cell for the field Region. 4. Select the Blank & Null Values. Now you need to specify how the missing values will be imputed. 5. Click in the Impute Method cell for the field Region. 6. Select Specify. In this dialog box, you can specify which imputation method you want to use, and once you have chosen a method, you can then further specify details about the imputation. There are several imputation methods: Fixed uses the same value for all cases. This fixed value can be a constant, the mode, the mean, or a midpoint of the range (the options will vary depending on the measurement level of the field). Random uses a random (different) value based on a normal or uniform distribution. This allows for there to be variation for the field with imputed values. Expression allows you to create your own equation to specify missing values. Algorithm uses a value predicted by a C&R Tree model. We are not going to impute any values now so click Cancel. If we had selected an imputation method, we would then: Click on the field Region to select it. Click on the Generate menu. The Generate menu of the Data Audit node allows you to remove fields, remove cases, or impute missing values, explained as follows: Missing Values Filter Node: This removes fields with too much missing data, or keeps fields with missing data so that you can investigate them further Missing Values Select Node: This removes cases with missing data, or keeps cases with missing data so that you can further investigate Missing Values SuperNode: This imputes missing values: If we were going to impute values, we would then click Missing Values SuperNode. In this way you can impute missing values using SPSS Modeler, and it makes your analysis a lot more easier. If you found our post useful, make sure to check out our book IBM SPSS Modeler Essentials, for more information on data mining and generating hidden insights using the popular SPSS Modeler tool.  
Read more
  • 0
  • 0
  • 15133

article-image-some-basic-concepts-theano
Packt
21 Feb 2018
13 min read
Save for later

Some Basic Concepts of Theano

Packt
21 Feb 2018
13 min read
 In this article by Christopher Bourez, the author of the book Deep Learning with Theano, presents Theano as a compute engine, and the basics for symbolic computing with Theano. Symbolic computing consists in building graphs of operations that will be optimized later on for a specific architecture, using the computation libraries available for this architecture. (For more resources related to this topic, see here.) Although this article might sound far from practical application. Theano may be defined as a library for scientific computing; it has been available since 2007 and is particularly suited for deep learning. Two important features are at the core of any deep learning library: tensor operations, and the capability to run the code on CPU or GPU indifferently. These two features enable us to work with massive amount of multi-dimensional data. Moreover, Theano proposes automatic differentiation, a very useful feature to solve a wider range of numeric optimizations than deep learning problems. The content of the article covers the following points: Theano install and loading Tensors and algebra Symbolic programming Need for tensor Usually, input data is represented with multi-dimensional arrays: Images have three dimensions: The number of channels, the width and height of the image Sounds and times series have one dimension: The time length Natural language sequences can be represented by two dimensional arrays: The time length and the alphabet length or the vocabulary length In Theano, multi-dimensional arrays are implemented with an abstraction class, named tensor, with many more transformations available than traditional arrays in a computer language like Python. At each stage of a neural net, computations such as matrix multiplications involve multiple operations on these multi-dimensional arrays. Classical arrays in programming languages do not have enough built-in functionalities to address well and fastly multi-dimensional computations and manipulations. Computations on multi-dimensional arrays have known a long history of optimizations, with tons of libraries and hardwares. One of the most important gains in speed has been permitted by the massive parallel architecture of the Graphical Computation Unit (GPU), with computation ability on a large number of cores, from a few hundreds to a few thousands. Compared to the traditional CPU, for example a quadricore, 12-core or 32-core engine, the gain with GPU can range from a 5x to a 100x times speedup, even if part of the code is still being executed on the CPU (data loading, GPU piloting, result outputing). The main bottleneck with the use of GPU is usually the transfer of data between the memory of the CPU and the memory of the GPU, but still, when well programmed, the use of GPU helps bring a significant increase in speed of an order of magnitude. Getting results in days rather than months, or hours rather than days, is an undeniable benefit for experimentation. Theano engine has been designed to address these two challenges of multi-dimensional array and architecture abstraction from the beginning. There is another undeniable benefit of Theano for scientific computation: the automatic differentiation of functions of multi-dimensional arrays, a well-suited feature for model parameter inference via objective function minimization. Such a feature facilitates the experimentation by releasing the pain to compute derivatives, which might not be so complicated, but prone to many errors. Installing and loading Theano Conda package and environment manager The easiest way to install Theano is to use conda, a cross-platform package and environment manager. If conda is not already installed on your operating system, the fastest way to install conda is to download the miniconda installer from https://conda.io/miniconda.html. For example, for conda under Linux 64 bit and Python 2.7: wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh chmod +x Miniconda2-latest-Linux-x86_64.sh bash ./Miniconda2-latest-Linux-x86_64.sh   Conda enables to create new environments in which versions of Python (2 or 3) and the installed packages may differ. The conda root environment uses the same version of Python as the version installed on your system on which you installed conda. Install and run Theano on CPU Last, let’s install Theano: conda install theano Run a python session and try the following commands to check your configuration: >>> from theano import theano   >>> theano.config.device   'cpu'   >>> theano.config.floatX   'float64'   >>> print(theano.config) The last command prints all the configuration of Theano. The theano.config object contains keys to many configuration options. To infer the configuration options, Theano looks first at ~/.theanorc file, then at any environment variables available, which override the former options, last at the variable set in the code, that are first in the order of precedence: >>> theano.config.floatX='float32' Some of the properties might be read-only and cannot be changed in the code, but floatX property, that sets the default floating point precision for floats, is among properties that can be changed directly in the code. It is advised to use float32 since GPU have a long history without float64, float64 execution speed on GPU is slower, sometimes much slower (2x to 32x on latest generation Pascal hardware), and that float32 precision is enough in practice. GPU drivers and libraries Theano enables the use of GPU (graphic computation units), the units usually used to compute the graphics to display on the computer screen. To have Theano work on the GPU as well, a GPU backend library is required on your system. CUDA library (for NVIDIA GPU cards only) is the main choice for GPU computations. There exists also the OpenCL standard, which is opensource, but far less developed, and much more experimental and rudimentary on Theano. Most of the scientific computations still occur on NVIDIA cards today. If you have a NVIDIA GPU card, download CUDA from the NVIDIA website at https://developer.nvidia.com/cuda-downloads and install it. The installer will install the lastest version of the gpu drivers first if they are not already installed. It will install the CUDA library in /usr/local/cuda directory. Install the cuDNN library, a library by NVIDIA also, that offers faster implementations of some operations for the GPU To install it, I usually copy /usr/local/cuda directory to a new directory /usr/local/cuda-{CUDA_VERSION}-cudnn-{CUDNN_VERSION} so that I can choose the version of CUDA and cuDNN, depending on the deep learning technology I use, and its compatibility. In your .bashrc profile, add the following line to set $PATH and $LD_LIBRARY_PATH variables: export PATH=/usr/local/cuda-8.0-cudnn-5.1/bin:$PATH export LD_LIBRARY_PATH=/usr/local/cuda-8.0-cudnn-5.1/lib64::/usr/local/cuda-8.0-cudnn-5.1/lib:$LD_LIBRARY_PATH Install and run Theano on GPU N-dimensional GPU arrays have been implemented in Python under 6 different GPU library (Theano/CudaNdarray,PyCUDA/ GPUArray,CUDAMAT/ CUDAMatrix, PYOPENCL/GPUArray, Clyther, Copperhead), are a subset of NumPy.ndarray. Libgpuarray is a backend library to have them in a common interface with the same property. To install libgpuarray with conda: conda install pygpu To run Theano in GPU mode, you need to configure the config.device variable before execution since it is a read-only variable once the code is run. With the environment variable THEANO_FLAGS: THEANO_FLAGS="device=cuda,floatX=float32" python >>> import theano Using cuDNN version 5110 on context None Mapped name None to device cuda: Tesla K80 (0000:83:00.0) >>> theano.config.device 'gpu' >>> theano.config.floatX 'float32' The first return shows that GPU device has been correctly detected, and specifies which GPU it uses. By default, Theano activates CNMeM, a faster CUDA memory allocator, an initial preallocation can be specified with gpuarra.preallocate option. At the end, my launch command will be: THEANO_FLAGS="device=cuda,floatX=float32,gpuarray.preallocate=0.8" python >>> from theano import theano Using cuDNN version 5110 on context None Preallocating 9151/11439 Mb (0.800000) on cuda Mapped name None to device cuda: Tesla K80 (0000:83:00.0)   The first line confirms that cuDNN is active, the second confirms memory preallocation. The third line gives the default context name (that is None when the flag device=cuda is set) and the model of the GPU used, while the default context name for the CPU will always be cpu. It is possible to specify a different GPU than the first one, setting the device to cuda0, cuda1,... for multi-GPU computers. It is also possible to run a program on multiple GPU in parallel or in sequence (when the memory of one GPU is not sufficient), in particular when training very deep neural nets. In this case, the context flag contexts=dev0->cuda0;dev1->cuda1;dev2->cuda2;dev3->cuda3 activates multiple GPU instead of one, and designate the context name to each GPU device to be used in the code. For example, on a 4-GPU instance: THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1;dev2->cuda2;dev3->cuda3,floatX=float32,gpuarray.preallocate=0.8" python >>> import theano Using cuDNN version 5110 on context None Preallocating 9177/11471 Mb (0.800000) on cuda0 Mapped name dev0 to device cuda0: Tesla K80 (0000:83:00.0) Using cuDNN version 5110 on context dev1 Preallocating 9177/11471 Mb (0.800000) on cuda1 Mapped name dev1 to device cuda1: Tesla K80 (0000:84:00.0) Using cuDNN version 5110 on context dev2 Preallocating 9177/11471 Mb (0.800000) on cuda2 Mapped name dev2 to device cuda2: Tesla K80 (0000:87:00.0) Using cuDNN version 5110 on context dev3 Preallocating 9177/11471 Mb (0.800000) on cuda3 Mapped name dev3 to device cuda3: Tesla K80 (0000:88:00.0)   To assign computations to a specific GPU in this multi-GPU setting, the names we choose dev0, dev1, dev2, and dev3 have been mapped to each device (cuda0, cuda1, cuda2, cuda3). This name mapping enables to write codes that are independent of the underlying GPU assignments and libraries (CUDA or other). To keep the current configuration flags active at every Python session or execution without using environment variables, save your configuration in the ~/.theanorc file as: [global] floatX = float32 device = cuda0 [gpuarray] preallocate = 1 Now, you can simply run python command. You are now all set. Tensors In Python, some scientific libraries such as NumPy provide multi-dimensional arrays. Theano doesn't replace Numpy but works in concert with it. In particular, NumPy is used for the initialization of tensors. To perform the computation on CPU and GPU indifferently, variables are symbolic and represented by the tensor class, an abstraction, and writing numerical expressions consists in building a computation graph of Variable nodes and Apply nodes. Depending on the platform on which the computation graph will be compiled, tensors are replaced either: By a TensorType variable, which data has to be on CPU By a GpuArrayType variable, which data has to be on GPU That way, the code can be written indifferently of the platform where it will be executed. Here are a few tensor objects: Object class Number of dimensions Example theano.tensor.scalar 0-dimensional array 1, 2.5 theano.tensor.vector 1-dimensional array [0,3,20] theano.tensor.matrix 2-dimensional array [[2,3][1,5]] theano.tensor.tensor3 3-dimensional array [[[2,3][1,5]],[[1,2],[3,4]]] Playing with these Theano objects in the Python shell gives a better idea: >>> import theano.tensor as T   >>> T.scalar() <TensorType(float32, scalar)>   >>> T.iscalar() <TensorType(int32, scalar)>   >>> T.fscalar() <TensorType(float32, scalar)>   >>> T.dscalar() <TensorType(float64, scalar)> With a i, l, f, d letter in front of the object name, you initiate a tensor of a given type, integer32, integer64, floats32 or float64. For real-valued (floating point) data, it is advised to use the direct form T.scalar() instead of the f or d variants since the direct form will use your current configuration for floats: >>> theano.config.floatX = 'float64'   >>> T.scalar() <TensorType(float64, scalar)>   >>> T.fscalar() <TensorType(float32, scalar)>   >>> theano.config.floatX = 'float32'   >>> T.scalar() <TensorType(float32, scalar)> Symbolic variables either: Play the role of placeholders, as a starting point to build your graph of numerical operations (such as addition, multiplication): they receive the flow of the incoming data during the evaluation, once the graph has been compiled Represent intermediate or output results Symbolic variables and operations are both part of a computation graph that will be compiled either towards CPU or GPU for fast execution. Let's write a first computation graph consisting in a simple addition: >>> x = T.matrix('x')   >>> y = T.matrix('y')   >>> z = x + y   >>> theano.pp(z) '(x + y)'   >>> z.eval({x: [[1, 2], [1, 3]], y: [[1, 0], [3, 4]]}) array([[ 2., 2.],        [ 4., 7.]], dtype=float32) At first place, two symbolic variables, or Variable nodes are created, with names x and y, and an addition operation, an Apply node, is applied between both of them, to create a new symbolic variable, z, in the computation graph. The pretty print function pp prints the expression represented by Theano symbolic variables. Eval evaluates the value of the output variable z, when the first two variables x and y are initialized with two numerical 2-dimensional arrays. The following example explicit the difference between the variables x and y, and their names x and y: >>> a = T.matrix()   >>> b = T.matrix()   >>> theano.pp(a + b) '(<TensorType(float32, matrix)> + <TensorType(float32, matrix)>)' Without names, it is more complicated to trace the nodes in a large graph. When printing the computation graph, names significantly helps diagnose problems, while variables are only used to handle the objects in the graph: >>> x = T.matrix('x')   >>> x = x + x   >>> theano.pp(x) '(x + x)' Here the original symbolic variable, named x, does not change and stays part of the computation graph. x + x creates a new symbolic variable we assign to the Python variable x. Note also, that with names, the plural form initializes multiple tensors at the same time: >>> x, y, z = T.matrices('x', 'y', 'z') Now, let's have a look at the different functions to display the graph. Summary Thus, this article helps us to give a brief idea on how to download and install Theano on various platforms along with the packages such as NumPy and SciPy. Resources for Article:   Further resources on this subject: Introduction to Deep Learning [article] Getting Started with Deep Learning [article] Practical Applications of Deep Learning [article]
Read more
  • 0
  • 0
  • 5151

article-image-vmware-vsphere-storage-datastores-snapshots
Packt
21 Feb 2018
9 min read
Save for later

VMware vSphere storage, datastores, snapshots

Packt
21 Feb 2018
9 min read
VMware vSphere storage, datastores, snapshotsIn this article, byAbhilash G B, author of the book,VMware vSphere 6.5 CookBook - Third Edition, we will cover the following:Managing VMFS volumes detected as snapshotsCreating NFSv4.1 datastores with Kerberos authenticationEnabling storage I/O control (For more resources related to this topic, see here.) IntroductionStorage is an integral part of any infrastructure. It is used to store the files backing your virtual machines. The most common way to refer to a type of storage presented to a VMware environment is based on the protocol used and the connection type. NFS are storage solutions that can leverage the existing TCP/IP network infrastructure. Hence, they are referred to as IP-based storage. Storage IO Control (SIOC) is one of the mechanisms to use ensure a fair share of storage bandwidth allocation to all Virtual Machines running on shared storage, regardless of the ESXi host the Virtual Machines are running on. Managing VMFS volumes detected as snapshotsSome environments maintain copies of the production LUNs as a backup, by replicating them. These replicas are exact copies of the LUNs that were already presented to the ESXi hosts. If for any reason a replicated LUN is presented to an ESXi host, then the host will not mount the VMFS volume on the LUN. This is a precaution to prevent data corruption. ESXi identifies each VMFS volume using its signature denoted by aUniversally Unique Identifier (UUID). The UUID is generated when the volume is first created or resignatured and is stored in the LVM header of the VMFS volume. When an ESXi host scans for new LUN ;devices and VMFS volumes on it, it compares the physical device ID (NAA ID) of the LUN with the device ID (NAA ID) value stored in the VMFS volumes LVM header. If it finds a mismatch, then it flags the volume as a snapshot volume.Volumes detected as snapshots are not mounted by default. There are two ways to mount such volumes/datastore:Mount by Keeping the Existing Signature Intact - This is used when you are attempting to temporarily mount the snapshot volume on an ESXi that doesn't see the original volume. If you were to attempt mounting the VMFS volume by keeping the existing signature and if the host sees the original volume, then you will not be allowed to mount the volume and will be warned about the presence of another VMFS volume with the same UUID:Mount by generating a new VMFS Signature - This has to be used if you are mounting a clone or a snapshot of an existing VMFS datastore to the same host/s. The process of assigning a new signature will not only update the LVM header with the newly generated UUID, but all the Physical Device ID (NAA ID) of the snapshot LUN. Here, the VMFS volume/datastore will be renamed by prefixing the wordsnap followed by a random number and the name of the original datastore: Getting readyMake sure that the original datastore and its LUN is no longer seen by the ESXi host the snapshot is being mounted to. How to do it...The following procedure will help mount a VMFS volume from a LUN detected as a snapshot:Log in to the vCenter Server using the vSphere Web Client and use the key combination Ctrl+Alt+2 to switch to the Host and Clusters view.Right click on the ESXi host the snapshot LUN is mapped to and go to Storage | New Datastore.On the New Datastore wizard, select VMFS as the filesystem type and click Next to continue.On the Name and Device selection screen, select the LUN detected as a snaphsot and click Next to continue:On the Mount Option screen, choose to either mount by assigning a new signature or by keeping the existing signature, and click Next to continue:On the Ready to Complete screen, review the setting and click Finish to initiate the operation. Creating NFSv4.1 datastores with Kerberos authenticationVMware introduced support for NFS 4.1 with vSphere 6.0. The vSphere 6.5 added several enhancements:It now supports AES encryptionSupport for IP version 6Support Kerberos's integrity checking mechanismHere, we will learn how to create NFS 4.1 datastores. Although the procedure is similar to NFSv3, there are a few additional steps that needs to be performed. Getting readyFor Kerberos authentication to work, you need to make sure that the ESXi hosts and the NFS Server is joined to the Active Directory domainCreate a new or select an existing AD user for NFS Kerberos authenticationConfigure the NFS Server/Share to Allow access to the AD user chosen for NFS Kerberos authentication How to do it...The following procedure will help you mount an NFS datasture using the NFSv4.1 client with Kerberos authentication enabled:Log in to the vCenter Server using the vSphere Web Client and use the key combination Ctrl+Alt+2 to switch to the Host and Clusters view, select the desired ESXi host and navigate to it  Configure | System | Authentication Services section and supply the credentials of the Active Directory user that was chosen for NFS Kerberon Authentication:Right-click on the desired ESXi host and go to Storage | New Datastore to bring-up the Add Storage wizard.On the New Datastore wizard, select the Type as NFS and click Next to continue.On the Select NFS version screen, select NFS 4.1 and click Next to continue. Keep in mind that it is not recommended to mount an NFS Export using both NFS3 and NFS4.1 client. On the Name and Configuration screen, supply a Name for the Datastore, the NFS export's folder path and NFS Server's IP Address or FQDN. You can also choose to mount the share as ready-only if desired:On the Configure Kerberos Authentication screen, check the Enable Kerberos-based authentication box and choose the type of authentication required and click Next to continue:On the Ready to Complete screen review the settings and click Finish to mount the NFS export. Enabling storage I/O controlThe use of disk shares will work just fine as long as the datastore is seen by a single ESXi host. Unfortunately, that is not a common case. Datastores are often shared among multiple ESXi hosts. When datastores are shared, you bring in more than one local host scheduler into the process of balancing the I/O among the virtual machines. However, these lost host schedules cannot talk to each other and their visibility is limited to the ESXi hosts they are running on. This easily contributes to a serious problem called thenoisy neighbor situation. The job of SIOC is to enable some form of communication between local host schedulers so that I/O can be balanced between virtual machines running on separate hosts.  How to do it...The following procedure will help you enable SIOC on a datastore:Connect to the vCenter Server using the Web Client and switch to the Storage view using the key combination Ctrl+Alt+4.Right-click on the desired datastore and go to Configure Storage I/O Control:On the Configure Storage I/O Control window, select the checkbox Enable Storage I/O Control, set a custom congestion threshold (only if needed) and click OK to confirm the settings: With the Virtual Machine selected from the inventory, navigate to its Configure | General tab and review its datastore capability settings to ensure that SIOC is enabled:  How it works...As mentioned earlier, SIOC enables communication between these local host schedulers so that I/O can be balanced between virtual machines running on separate hosts. It does so by maintaining a shared file in the datastore that all hosts can read/write/update. When SIOC is enabled on a datastore, it starts monitoring the device latency on the LUN backing the datastore. If the latency crosses the threshold, it throttles the LUN's queue depth on each of the ESXi hosts in an attempt to distribute a fair share of access to the LUN for all the Virtual Machines issuing the I/O.The local scheduler on each of the ESXi hosts maintains an iostats file to keep its companion hosts aware of the device I/O statistics observed on the LUN. The file is placed in a directory (naa.xxxxxxxxx) on the same datastore.For example, if there are six virtual machines running on three different ESXi hosts, accessing a shared LUN. Among the six VMs, four of them have a normal share value of 1000 and the remaining two have high (2000) disk share value sets on them. These virtual machines have only a single VMDK attached to them. VM-C on host ESX-02 is issuing a large number of I/O operations. Since that is the only VM accessing the shared LUN from that host, it gets the entire queue's bandwidth. This can induce latency on the I/O operations performed by the other VMs: ESX-01 and ESX-03. If the SIOC detects the latency value to be greater than the dynamic threshold, then it will start throttling the queue depth: The throttled DQLEN for a VM is calculated as follows:DQLEN for the VM = (VM's Percent of Shares) of (Queue Depth)Example: 12.5 % of 64 → (12.5 * 64)/100 = 8The throttled DQLEN per host is calculated as follows:DQLEN of the Host = Sum of the DQLEN of the VMs on itExample: VM-A (8) + VM-B(16) = 24The following diagram shows the effect of SIOC throttling the queue depth: SummaryIn this article we learnt, how to mount a VMFS volume from a LUN detected as a snapshot, how to mount an NFS datasture using the NFSv4.1 client with Kerberos authentication enabled, and how to enable SIOC on a datastore. Resources for Article:   Further resources on this subject: Essentials of VMware vSphere [article] Working with VMware Infrastructure [article] Network Virtualization and vSphere [article]
Read more
  • 0
  • 0
  • 37975
article-image-play-functions
Packt
21 Feb 2018
6 min read
Save for later

Play With Functions

Packt
21 Feb 2018
6 min read
This article by Igor Wojda and Marcin Moskala, authors of the book Android Development with Kotlin, introduces functions in Kotlin, together with different ways of calling functions. (For more resources related to this topic, see here.) Single-expression functions During typical programming, many functions contain only one expression. Here is example of this kind of function: fun square(x: Int): Int { return x * x } Or another one, which can be often found in Android projects. It is pattern used in Activity, to define methods that are just getting text from some view or providing some other data from view to allow presenter to get them: fun getEmail(): String { return emailView.text.toString() } Both functions are defined to return result of single expression. In first example it is result of x * x multiplication, and in second one it is result of expression emailView.text.toString(). This kind of functions are used all around Android projects. Here are some common use-cases: Extracting some small operations Using polymorphism to provide values specific to class Functions that are only creating some object Functions that are passing data between architecture layers (like in preceding example Activity is passing data from view to presenter) Functional programming style functions that base on recurrence Such functions are often used, so Kotlin has notation for this kind of them. When a function returns a single expression, then curly braces and body of the function can be omitted. We specify expression directly using equality character. Functions defined this way are called single-expression functions. Let's update our square function, and define it as a single-expression function: As we can see, single expression function have expression body instead of block body. This notation is shorter, but whole body needs to be just a single expression. In single-expression functions, declaring return type is optional, because it can be inferred by the compiler from type of expression. This is why we can simplify use square function, and define it this way: fun square(x: Int) = x * x There are many places inside Android application where we can utilize single expression functions. Let's consider RecyclerView adapter that is providing layout ID and creating ViewHolder: class AddressAdapter : ItemAdapter<AddressAdapter.ViewHolder>() { override fun getLayoutId() = R.layout.choose_address_view override fun onCreateViewHolder(itemView: View) = ViewHolder(itemView) // Rest of methods } In the following example, we achieve high readability thanks to single expression function. Single-expression functions are also very popular in the functional world. Single expression function notation is also well-pairing with when structure. Here example of their connection, used to get specific data from object according to key (use-case from big Kotlin project): fun valueFromBooking(key: String, booking: Booking?) = when(key) { // 1 "patient.nin" -> booking?.patient?.nin "patient.email" -> booking?.patient?.email "patient.phone" -> booking?.patient?.phone "comment" -> booking?.comment else -> null } We don't need a type, because it is inferred from when expression. Another common Android example is that we can combine when expression with activity method onOptionsItemSelected that handles top bar menu clicks: override fun onOptionsItemSelected(item: MenuItem): Boolean = when { item.itemId == android.R.id.home -> { onBackPressed() true } else -> super.onOptionsItemSelected(item) } As we can see, single expression functions can make our code more concise and improved readability. Single-expression functions are commonly used in Kotlin Android projects and they are really popular for functional programming. As an example. Let's suppose that we need to filter all the odd values from following list: val list = listOf(1, 2, 3, 4, 5) We will use following helper function that returns true if argument is odd otherwise it returns false: fun isOdd(i: Int) = i % 2 == 1 In imperative programming style, we should specify steps of processing, which are connected to execution process (iterate through list, check if value is odd, add value to one list if it's odd). Here is implementation of this functionality, that is typical for imperative style: var oddList = emptyList<Int>() for(i in list) { if(isOdd(i)) { newList += i } } In declarative programming style, the way of thinking about code is different - we should think what is the required result and simply use functions that will give us this result. Kotlin stdlib provides lot of functions supporting declarative programming style. Here is how we could implement the same functionality using one of them, called filter: var oddList = list.filter(::isOdd) filter is function that leaves only elements that are true according to predicate. Here function isOdd is used as an predicate. Different ways of calling a function Sometimes we need to call function and provide only selected arguments. In Java we could create multiple overloads of the same method, but this solution have there are some limitations. First problem is that number of possible method permutations is growing very quickly (2n) making them very difficult to maintain. Second problem is that overloads must be distinguishable from each other, so compiler will know which overload to call, so when method defines few parameters with the same type we can't define all possible overloads. That's why in Java we often need to pass multiple null values to a method: // Java printValue("abc", null, null, "!"); Multiple null parameters provide boilerplate and greatly decrease method readability. In Kotlin there is no such problem, because Kotlin has feature called default arguments and named argument syntax. Default arguments values Default arguments are mostly known from C++, which is one of the oldest languages supporting it. Default argument provides a value for a parameter in case it is not provided during method call. Each function parameter can have default value. It might be any value that is matching specified type including null. This way we can simply define functions that can be called in multiple ways We can use this function the same way as normal function (function without default argument values) by providing values for each parameter (all arguments): printValue("str", true, "","") // Prints: (str) Thanks to default argument values, we can call a function by providing arguments only for parameters without default values: printValue("str") // Prints: (str) We can also provide all parameters without default values, and only some that have a default value: printValue("str", false) // Prints: str Named arguments syntax Sometimes we want only to pass value for last argument. Let's suppose that we define want to define value for suffix, but not for prefix and inBracket (which are defined before suffix). Normally we would have to provide values for all previous parameters including the default parameter values: printValue("str", true, true, "!") // Prints: (str) By using named argument syntax, we can pass specific argument using argument name: printValue("str", suffix = "!") // Prints: (str)! We can also use named argument syntax together with classic call. The only restriction is when we start using named syntax we cannot use classic one for next arguments we are serving: printValue("str", true, "") printValue("str", true, prefix = "") printValue("str", inBracket = true, prefix = "") Summary In this article, we learned about single expression functions as a type of defining functions in application development. We also briefly explained Resources for Article:   Further resources on this subject: Getting started with Android Development [article] Android Game Development with Unity3D [article] Kotlin Basics [article]
Read more
  • 0
  • 0
  • 30011

article-image-create-conversational-assistant-chatbot-using-python
Savia Lobo
21 Feb 2018
5 min read
Save for later

How to create a conversational assistant or chatbot using Python

Savia Lobo
21 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Natural Language Processing with Python Cookbook written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti. This book includes unique recipes to teach various aspects of performing Natural Language Processing with NLTK—the leading Python platform for the task.[/box] Today we will learn to create a conversational assistant or chatbot using Python programming language. Conversational assistants or chatbots are not very new. One of the foremost of this kind is ELIZA, which was created in the early 1960s and is worth exploring. In order to successfully build a conversational engine, it should take care of the following things: 1. Understand the target audience 2. Understand the natural language in which communication happens.  3. Understand the intent of the user 4. Come up with responses that can answer the user and give further clues NLTK has a module, nltk.chat, which simplifies building these engines by providing a generic framework. Let's see the available engines in NLTK: Engines Modules Eliza nltk.chat.eliza Python module Iesha nltk.chat.iesha Python module Rude nltk.chat.rudep ython module Suntsu Suntsu nltk.chat.suntsu module Zen nltk.chat.zen module In order to interact with these engines we can just load these modules in our Python program and invoke the demo() function. This recipe will show us how to use built-in engines and also write our own simple conversational engine using the framework provided by the nltk.chat module. Getting ready You should have Python installed, along with the nltk library. Having an understanding of regular expressions also helps. How to do it...    Open atom editor (or your favorite programming editor).    Create a new file called Conversational.py.    Type the following source code:    Save the file.    Run the program using the Python interpreter.    You will see the following output: How it works... Let's try to understand what we are trying to achieve here. import nltk This instruction imports the nltk library into the current program. def builtinEngines(whichOne): This instruction defines a new function called builtinEngines that takes a string parameter, whichOne: if whichOne == 'eliza': nltk.chat.eliza.demo() elif whichOne == 'iesha': nltk.chat.iesha.demo() elif whichOne == 'rude': nltk.chat.rude.demo() elif whichOne == 'suntsu': nltk.chat.suntsu.demo() elif whichOne == 'zen': nltk.chat.zen.demo() else: print("unknown built-in chat engine {}".format(whichOne)) These if, elif, else instructions are typical branching instructions that decide which chat engine's demo() function is to be invoked depending on the argument that is present in the whichOne variable. When the user passes an unknown engine name, it displays a message to the user that it's not aware of this engine. It's a good practice to handle all known and unknown cases also; it makes our programs more robust in handling unknown situations def myEngine():. This instruction defines a new function called myEngine(); this function does not take any parameters. chatpairs = ( (r"(.*?)Stock price(.*)", ("Today stock price is 100", "I am unable to find out the stock price.")), (r"(.*?)not well(.*)", ("Oh, take care. May be you should visit a doctor", "Did you take some medicine ?")), (r"(.*?)raining(.*)", ("Its monsoon season, what more do you expect ?", "Yes, its good for farmers")), (r"How(.*?)health(.*)", ("I am always healthy.", "I am a program, super healthy!")), (r".*", ("I am good. How are you today ?", "What brings you here ?")) ) This is a single instruction where we are defining a nested tuple data structure and assigning it to chat pairs. Let's pay close attention to the data structure: We are defining a tuple of tuples Each subtuple consists of two elements: The first member is a regular expression (this is the user's question in regex format) The second member of the tuple is another set of tuples (these are the answers) def chat(): print("!"*80) print(" >> my Engine << ") print("Talk to the program using normal english") print("="*80) print("Enter 'quit' when done") chatbot = nltk.chat.util.Chat(chatpairs, nltk.chat.util.reflections) chatbot.converse() We are defining a subfunction called chat()inside the myEngine() function. This is permitted in Python. This chat() function displays some information to the user on the screen and calls the nltk built-in nltk.chat.util.Chat() class with the chatpairs variable. It passes nltk.chat.util.reflections as the second argument. Finally we call the chatbot.converse() function on the object that's created using the chat() class. chat() This instruction calls the chat() function, which shows a prompt on the screen and accepts the user's requests. It shows responses according to the regular expressions that we have built before: if   name    == '  main  ': for engine in ['eliza', 'iesha', 'rude', 'suntsu', 'zen']: print("=== demo of {} ===".format(engine)) builtinEngines(engine) print() myEngine() These instructions will be called when the program is invoked as a standalone program (not using import). They do these two things: Invoke the built-in engines one after another (so that we can experience them) Once all the five built-in engines are excited, they call our myEngine(), where our customer engine comes into play We have learned to create a chatbot of our own using the easiest programming language ‘Python’. To know more about how to efficiently use NLTK and implement text classification, identify parts of speech, tag words, etc check out Natural Language Processing with Python Cookbook.
Read more
  • 0
  • 0
  • 50958
Modal Close icon
Modal Close icon